Elixir may not be the most popular language for web scraping, but its ability to handle parallel tasks efficiently makes it a great candidate for large-scale projects.
In this step-by-step tutorial, you'll learn how to perform web scraping in Elixir with Crawly, from basic data extraction to handling pagination and avoiding blocks.
Can You Scrape Websites with Elixir?
Yes, Elixir is a solid choice for web scraping, particularly at scale. Its process distribution model and built-in concurrency make it well-suited for scraping large numbers of pages simultaneously without the overhead typical of other languages.
That said, Elixir isn't the most popular language for web scraping.
Python web scraping is much more common due to its extensive ecosystem. JavaScript with Node.js is also popular in this domain.
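To make the concurrency claim concrete, here's a minimal sketch that fetches several pages in parallel with Task.async_stream from Elixir's standard library. The fake_fetch/1 function and the example.com URLs are hypothetical stand-ins for real HTTP requests, so the script runs without network access:

```elixir
defmodule ConcurrencySketch do
  # Hypothetical stand-in for an HTTP request: sleeps briefly, then
  # returns a fake body so the example runs without network access.
  def fake_fetch(url) do
    Process.sleep(100)
    {url, "<html>#{url}</html>"}
  end

  # Fetch all URLs concurrently, at most `max` at a time.
  def fetch_all(urls, max \\ 4) do
    urls
    |> Task.async_stream(&fake_fetch/1, max_concurrency: max, timeout: 5_000)
    |> Enum.map(fn {:ok, result} -> result end)
  end
end

urls = for n <- 1..8, do: "https://example.com/page/#{n}"

# :timer.tc/1 returns the elapsed time in microseconds plus the result
{micros, results} = :timer.tc(fn -> ConcurrencySketch.fetch_all(urls) end)

IO.puts("Fetched #{length(results)} pages in about #{div(micros, 1000)} ms")
```

Each fake request runs in its own lightweight process, so the total time should be well below the sum of the individual delays.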
Prerequisites
Set up your Elixir environment for web scraping.
Set Up an Elixir Project
Before getting started, make sure you have Elixir installed on your computer. Otherwise, visit the Elixir installation page and follow the instructions for your OS. On Windows, you'll first have to install Erlang.
Then, use the mix new command to initialize an Elixir project called elixir_scraper:
mix new elixir_scraper --sup
Great! Your elixir_scraper folder will now contain a blank Elixir project, ready for web scraping.
Install the Tools
To perform web scraping in Elixir, you'll need the following two libraries:
- Crawly: A Scrapy-like framework for crawling sites and exporting structured data from their pages. The library acts as a pipeline that executes tasks one at a time on the scraped items.
- Floki: A simple, yet complete, HTML parser to select nodes via CSS selectors and retrieve data from them.
Add them to your project's dependencies by making sure the mix.exs file contains the latest version of each library:
defp deps do
[
{:crawly, "~> 0.17.2"},
{:floki, "~> 0.38.2"}
]
end
After this operation, your mix.exs will look as follows:
defmodule ElixirScraper.MixProject do
use Mix.Project
def project do
[
app: :elixir_scraper,
version: "0.1.0",
elixir: "~> 1.16",
start_permanent: Mix.env() == :prod,
deps: deps()
]
end
# Run "mix help compile.app" to learn about applications.
def application do
[
extra_applications: [:logger],
mod: {ElixirScraper.Application, []}
]
end
# Run "mix help deps" to learn about dependencies.
defp deps do
[
{:crawly, "~> 0.17.2"},
{:floki, "~> 0.38.2"}
]
end
end
Install the libraries with this command:
mix deps.get
Well done! Your Elixir web scraping project is ready!
Tutorial: How to Do Web Scraping with Elixir
Doing web scraping using Elixir involves the four steps below:
- Create a Crawly spider for the target site.
- Set up Crawly to visit the desired webpage.
- Use Floki to parse the HTML content and populate some Crawly items.
- Configure Crawly to export the scraped items to CSV.
The target site is ScrapingCourse.com, a demo e-commerce website featuring a paginated product list. The goal of the Elixir Crawly scraper you're about to create is to extract all product data from each page.
Time to write some code!
Step 1: Create Your Spider
In Crawly, a spider is an Elixir module that defines a specific site's scraping process. It specifies how to extract structured data from pages and follow links for crawling. Specifically, it's a module that implements the Crawly.Spider behaviour and must contain the following functions:
- base_url(): Returns the base URL of the target site. Crawly uses it to filter out requests unrelated to the specified website.
- init(): Returns a list of URLs the crawler should start from.
- parse_item(): Defines the parsing logic to convert scraped data into Crawly items, which can be exported in different formats. It also specifies how to make the subsequent requests to crawl the site.
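If behaviours are new to you, the sketch below shows the general idea with a hypothetical MiniSpider behaviour that mirrors the three callbacks. It is not Crawly.Spider itself, and the return shapes are simplified assumptions made only for illustration:

```elixir
defmodule MiniSpider do
  # Hypothetical behaviour mirroring the three callbacks described above.
  # It is NOT Crawly.Spider; it only shows how a behaviour contract works.
  @callback base_url() :: String.t()
  @callback init() :: keyword()
  @callback parse_item(response :: map()) :: map()
end

defmodule ExampleSpider do
  @behaviour MiniSpider

  @impl MiniSpider
  def base_url(), do: "https://www.scrapingcourse.com/ecommerce/"

  @impl MiniSpider
  def init(), do: [start_urls: [base_url()]]

  @impl MiniSpider
  def parse_item(response) do
    # A real spider would parse response.body here
    %{items: [%{size: byte_size(response.body)}], requests: []}
  end
end

IO.inspect(ExampleSpider.init())
IO.inspect(ExampleSpider.parse_item(%{body: "<html></html>"}))
```

The @impl annotations let the compiler warn you when a function doesn't match any callback declared by the behaviour.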
Launch the command below to set up a new Crawly spider:
mix crawly.gen.spider --filepath ./lib/scrapingcourse_spider.ex --spidername ScrapingcourseSpider
The first time you run this command, be patient as it'll take a while. The reason is that it'll initialize all the Crawly files needed to run the project.
This will create a scrapingcourse_spider.ex file in the /lib folder containing the ScrapingcourseSpider module below:
defmodule ScrapingcourseSpider do
use Crawly.Spider
@impl Crawly.Spider
def base_url(), do: "https://books.toscrape.com/"
@impl Crawly.Spider
def init() do
[start_urls: ["https://books.toscrape.com/index.html"]]
end
@impl Crawly.Spider
@doc """
Extract items and requests to follow from the given response
"""
def parse_item(response) do
# Extract item field from the response here. Usually it's done this way:
# {:ok, document} = Floki.parse_document(response.body)
# item = %{
# title: document |> Floki.find("title") |> Floki.text(),
# url: response.request_url
# }
extracted_items = []
# Extract requests to follow from the response. Don't forget that you should
# supply request objects here. Usually it's done via
#
# urls = document |> Floki.find(".pagination a") |> Floki.attribute("href")
# Don't forget that you need absolute urls
# requests = Crawly.Utils.requests_from_urls(urls)
next_requests = []
%Crawly.ParsedItem{items: extracted_items, requests: next_requests}
end
end
Update base_url() so that it returns the base URL of the target site:
@impl Crawly.Spider
def base_url(), do: "https://www.scrapingcourse.com/ecommerce/"
Clean out the comments in parse_item() and empty the start_urls list returned by init(). This will be your starting ScrapingcourseSpider module:
defmodule ScrapingcourseSpider do
use Crawly.Spider
@impl Crawly.Spider
def base_url(), do: "https://www.scrapingcourse.com/ecommerce/"
@impl Crawly.Spider
def init() do
[start_urls: []]
end
@impl Crawly.Spider
@doc """
Extract items and requests to follow from the given response
"""
def parse_item(response) do
extracted_items = []
next_requests = []
%Crawly.ParsedItem{items: extracted_items, requests: next_requests}
end
end
To run it, launch the command below:
iex -S mix run -e "Crawly.Engine.start_spider(ScrapingcourseSpider)"
On Windows, iex must be iex.bat. So the command becomes:
iex.bat -S mix run -e "Crawly.Engine.start_spider(ScrapingcourseSpider)"
If it all went as planned, that will generate this output:
[debug] Opening/checking dynamic spiders storage
[debug] Using the following folder to load extra spiders: ./spiders
[debug] Could not load spiders: %MatchError{term: {:error, :enoent}}
[debug] Starting data storage
[debug] Starting the manager for ScrapingcourseSpider
[debug] Starting requests storage worker for ScrapingcourseSpider...
[debug] Started 4 workers for ScrapingcourseSpider
Perfect, get ready to start scraping some data!
Step 2: Connect to the Target Page
To instruct the Crawly spider to visit the target page, add it to the start_urls list returned by init():
@impl Crawly.Spider
def init() do
[start_urls: ["https://www.scrapingcourse.com/ecommerce/"]]
end
The Elixir web scraping script will now perform a request to the desired page. To verify that, retrieve the source HTML of the page and log it in parse_item():
# extract the HTML from the target page and log it
html = response.body
Logger.info("HTML of the target page:\n#{html}")
The complete scrapingcourse_spider.ex will be:
defmodule ScrapingcourseSpider do
use Crawly.Spider
# Logger.info/1 is a macro, so Logger must be required before it is called
require Logger
@impl Crawly.Spider
def base_url(), do: "https://www.scrapingcourse.com/ecommerce/"
@impl Crawly.Spider
def init() do
[start_urls: ["https://www.scrapingcourse.com/ecommerce/"]]
end
@impl Crawly.Spider
@doc """
Extract items and requests to follow from the given response
"""
def parse_item(response) do
# extract the HTML from the target page and log it
html = response.body
Logger.info("HTML of the target page:\n#{html}")
extracted_items = []
next_requests = []
%Crawly.ParsedItem{items: extracted_items, requests: next_requests}
end
end
Run it, and it'll now print:
<!DOCTYPE html>
<html lang="en-US">
<head>
<!--- ... --->
<title>Ecommerce Test Site to Learn Web Scraping – ScrapingCourse.com</title>
<!--- ... --->
</head>
<body class="home archive ...">
<p class="woocommerce-result-count">Showing 1–16 of 188 results</p>
<ul class="products columns-4">
<!--- ... --->
</ul>
</body>
</html>
Wonderful! Your Elixir scraping script now connects to the target page. It's time to parse its HTML content and extract some data from it.
Step 3: Extract Specific Data from the Scraped Page
This step aims to define a CSS selector strategy to select the HTML product nodes and retrieve data from them.
Open the target page in your browser and inspect a product HTML element in the DevTools:
Take a look at the HTML code and note that you can select all products with this CSS selector:
li.product
li is the HTML tag, while product is the element's class attribute.
Given a product element, the useful information to extract is:
- The URL is in the a.woocommerce-LoopProduct-link node.
- The image is in the img.attachment-woocommerce_thumbnail node.
- The name is in the h2.woocommerce-loop-product__title node.
- The price is in the span.price nodes.
Put that knowledge into practice by implementing the following parsing logic in parse_item().
First, pass the response body to Floki to parse the HTML.
# parse the response HTML body
{:ok, document} = Floki.parse_document(response.body)
Then, select all the HTML product elements on the page and iterate to convert them to Crawly items:
product_items =
document
|> Floki.find("li.product")
|> Enum.map(fn x ->
%{
url: Floki.find(x, "a.woocommerce-LoopProduct-link") |> Floki.attribute("href") |> Floki.text(),
name: Floki.find(x, "h2.woocommerce-loop-product__title") |> Floki.text(),
image: Floki.find(x, "img.attachment-woocommerce_thumbnail") |> Floki.attribute("src") |> Floki.text(),
price: Floki.find(x, "span.price") |> Floki.text(),
}
end)
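To see what these Floki calls return, you can try them on a small fragment in an iex -S mix session inside the project, where Floki is available. The HTML below is hand-written for illustration and uses example.com URLs; it is not the site's actual markup:

```elixir
# Run inside `iex -S mix` in the elixir_scraper project so Floki is available.
# The HTML below is a hand-written fragment, not the site's actual markup.
html = """
<ul class="products">
  <li class="product">
    <a class="woocommerce-LoopProduct-link" href="https://example.com/p/1">
      <img class="attachment-woocommerce_thumbnail" src="https://example.com/1.jpg">
      <h2 class="woocommerce-loop-product__title">Sample Hoodie</h2>
      <span class="price">$10.00</span>
    </a>
  </li>
</ul>
"""

{:ok, document} = Floki.parse_document(html)
[product] = Floki.find(document, "li.product")

# Floki.attribute/2 returns a list of values, one per matched node
hrefs = product |> Floki.find("a.woocommerce-LoopProduct-link") |> Floki.attribute("href")

# Floki.text/1 returns the text content of the matched nodes as one string
name = product |> Floki.find("h2.woocommerce-loop-product__title") |> Floki.text()

IO.inspect(hrefs)
IO.inspect(name)
```

Because Floki.attribute/2 yields a list, the spider pipes it into Floki.text() to end up with a plain string for each field.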
Crawly automatically passes the parsed items returned by the parse_item() function to the next task in the pipeline.
Thus, make sure to add them to the return object as below:
%Crawly.ParsedItem{items: product_items}
Put it all together, and you'll get this ScrapingcourseSpider module:
defmodule ScrapingcourseSpider do
use Crawly.Spider
@impl Crawly.Spider
def base_url(), do: "https://www.scrapingcourse.com/ecommerce/"
@impl Crawly.Spider
def init() do
[start_urls: ["https://www.scrapingcourse.com/ecommerce/"]]
end
@impl Crawly.Spider
def parse_item(response) do
# parse the response HTML body
{:ok, document} = Floki.parse_document(response.body)
# select all product elements on the page
# and convert them to scraped items
product_items =
document
|> Floki.find("li.product")
|> Enum.map(fn x ->
%{
url: Floki.find(x, "a.woocommerce-LoopProduct-link") |> Floki.attribute("href") |> Floki.text(),
name: Floki.find(x, "h2.woocommerce-loop-product__title") |> Floki.text(),
image: Floki.find(x, "img.attachment-woocommerce_thumbnail") |> Floki.attribute("src") |> Floki.text(),
price: Floki.find(x, "span.price") |> Floki.text(),
}
end)
%Crawly.ParsedItem{items: product_items}
end
end
Execute it, and it'll automatically log the scraped items as in these lines:
[debug] Stored item: %{name: "Abominable Hoodie", image: "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg", url: "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/", price: "$69.00"}
# omitted for brevity...
[debug] Stored item: %{name: "Artemis Running Short", image: "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wsh04-black_main-324x324.jpg", url: "https://www.scrapingcourse.com/ecommerce/product/artemis-running-short/", price: "$45.00"}
As you can see, the scraped items contain the desired data. Mission achieved!
Step 4: Convert Scraped Data Into a CSV File
Crawly supports the CSV and JSON export formats out of the box. To export the scraped items to CSV, you need to configure the task in the pipeline. Add a config folder to your project, then create the config.exs configuration file inside it:
import Config
config :crawly,
middlewares: [],
pipelines: [
{Crawly.Pipelines.CSVEncoder, fields: [:url, :name, :image, :price]},
{Crawly.Pipelines.WriteToFile, extension: "csv", folder: "output"}
]
Crawly.Pipelines.CSVEncoder instructs Crawly to convert the scraped items to CSV format. Bear in mind that you must specify the item attributes you want to appear in the CSV file in the fields attribute. Use Crawly.Pipelines.WriteToFile to set the output folder and file extension.
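To picture why the fields list matters, here's a simplified stand-in for a CSV encoding step written in plain Elixir. The CsvSketch module is hypothetical and is not Crawly's implementation; it only shows that the listed fields decide which values end up in the row and in what order:

```elixir
defmodule CsvSketch do
  # Simplified stand-in for a CSV encoding step: picks the listed fields
  # from an item, in order, and joins them into one quoted CSV line.
  # This illustrates the idea only; it is not Crawly's implementation.
  def encode(item, fields) do
    fields
    |> Enum.map(fn field -> item |> Map.get(field, "") |> to_string() |> quote_value() end)
    |> Enum.join(",")
  end

  # Wrap a value in double quotes, doubling any embedded quotes.
  defp quote_value(value) do
    "\"" <> String.replace(value, "\"", "\"\"") <> "\""
  end
end

item = %{name: "Abominable Hoodie", price: "$69.00", url: "https://example.com/p"}

# Only the fields listed here appear in the output line
IO.puts(CsvSketch.encode(item, [:url, :name, :price]))
```

Any item key left out of the fields list is simply skipped in this sketch.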
Create a blank output folder in your project's root and run the web scraping Elixir spider:
iex -S mix run -e "Crawly.Engine.start_spider(ScrapingcourseSpider)"
Wait for the script to complete, and a CSV file will appear in the output directory.
Open it, and you'll see one row per scraped product with its url, name, image, and price values.
Et voilà! You just performed web scraping in Elixir with Crawly!
Advanced Web Scraping Techniques with Elixir
Now that you know the basics, it's time to explore advanced Elixir web scraping techniques.
Scrape and Get Data from Paginated Pages
The current output only involves the product data from the home page. However, the target site has many pages. You need to perform web crawling to scrape them all and retrieve all products. If you're unfamiliar with that, read our guide on web crawling vs web scraping.
Crawly makes it easy to implement web crawling. All you have to do is pass the requests to the page you want to visit next to the requests field of the object returned by parse_item().
The web scraping Elixir tool will add each request to the queue and visit the page only if it hasn't been visited yet. In detail, it'll apply the parse_item() function to each page, thereby scraping new products.
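Conceptually, skipping pages that were already requested boils down to keeping a set of seen URLs next to the queue. The QueueSketch module below is a hypothetical illustration of that idea in plain Elixir, not a description of Crawly's internals:

```elixir
defmodule QueueSketch do
  # Adds only unseen URLs to the queue and records them as seen.
  # Returns the updated {queue, seen} pair.
  def enqueue({queue, seen}, urls) do
    Enum.reduce(urls, {queue, seen}, fn url, {q, s} ->
      if MapSet.member?(s, url) do
        {q, s}
      else
        {q ++ [url], MapSet.put(s, url)}
      end
    end)
  end
end

state = {[], MapSet.new()}
state = QueueSketch.enqueue(state, ["/page/1/", "/page/2/"])

# Page 2 links back to page 1 and forward to page 3
{queue, _seen} = QueueSketch.enqueue(state, ["/page/1/", "/page/3/"])

IO.inspect(queue)
# prints ["/page/1/", "/page/2/", "/page/3/"]
```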
Thus, inspect the pagination element on the page to learn how to extract the URLs from it:
Note that you can select each link element with the CSS selector below:
a.page-numbers
Get their destination URLs from the href attribute and convert them to Crawly requests with the request_from_url() utility:
# find the URLs of the next pages to visit and
# convert them to Crawly requests
next_requests =
document
|> Floki.find("a.page-numbers")
|> Floki.attribute("href")
|> Enum.map(
fn url -> Crawly.Utils.request_from_url(url)
end)
Next, pass them to the return object:
%Crawly.ParsedItem{items: product_items, requests: next_requests}
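As the comment in the generated spider notes, requests need absolute URLs. If a site's pagination links were relative, one possible way to resolve them is URI.merge from the standard library. The hrefs below are hypothetical examples, not values taken from the target page:

```elixir
base = "https://www.scrapingcourse.com/ecommerce/"

# Hypothetical hrefs: one relative to the site root, one already absolute
hrefs = [
  "/ecommerce/page/2/",
  "https://www.scrapingcourse.com/ecommerce/page/3/"
]

# URI.merge/2 resolves each href against the base URL
absolute_urls =
  Enum.map(hrefs, fn href -> base |> URI.merge(href) |> URI.to_string() end)

IO.inspect(absolute_urls)
```

Already-absolute URLs pass through unchanged, so the same step can be applied to every link before building requests.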
Before launching the spider, update the global configurations in config.exs. Use the concurrent_requests_per_domain option to make it scrape only one page at a time, and closespider_itemcount to make it stop after scraping at least 50 items:
import Config
config :crawly,
concurrent_requests_per_domain: 1,
closespider_itemcount: 50,
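middlewares: [
# keep requests on the target domain and skip URLs that were already requested
Crawly.Middlewares.DomainFilter,
Crawly.Middlewares.UniqueRequest
],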
pipelines: [
{Crawly.Pipelines.CSVEncoder, fields: [:url, :name, :image, :price]},
{Crawly.Pipelines.WriteToFile, extension: "csv", folder: "output"}
]
Set concurrent_requests_per_domain to control how many parallel requests Crawly makes per domain. If you don't set it, Crawly falls back to its internal default of 4, which matches the "Started 4 workers" line in the earlier log. Increase this value for faster crawls on sites that allow it, but keep it low to avoid triggering rate limits.
Perfect! Run the spider again:
iex -S mix run -e "Crawly.Engine.start_spider(ScrapingcourseSpider)"
This time, the script will go through many pagination pages and scrape them all. The final result will be a CSV file storing the products from every page the spider visited.
Amazing! You just learned how to perform Elixir web crawling!
Avoid Getting Blocked When Scraping with Elixir
The biggest challenge when doing web scraping with Elixir is getting blocked. Many sites know how valuable their data is, even if publicly available on their pages. So, they adopt anti-bot technologies to detect and block automated scripts. Those solutions can stop your spider.
Two tips for performing web scraping without getting blocked are to set a real-world User-Agent and use a proxy to protect your IP. You can set custom User-Agents globally in Crawly with the [Middlewares.UserAgent](https://hexdocs.pm/crawly/Crawly.Middlewares.UserAgent.html#run/3) middleware in config.exs:
import Config
config :crawly,
# ...
middlewares: [
# other middlewares...
{
Crawly.Middlewares.UserAgent, user_agents: [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/121.0",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.2210.121"
]
}
],
pipelines: [
#...
]
Crawly will now automatically rotate over the User-Agent strings in the user_agents list.
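One way to picture per-request User-Agent selection is a helper that picks an entry from the list each time headers are built. The build_headers function below is hypothetical, and choosing at random is an assumption made for the sketch rather than a statement about how the middleware selects values:

```elixir
user_agents = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/121.0"
]

# Hypothetical helper: build request headers with one User-Agent chosen per call
build_headers = fn -> [{"User-Agent", Enum.random(user_agents)}] end

# Each call may yield a different User-Agent from the list
for _ <- 1..3, do: IO.inspect(build_headers.())
```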
Configuring a proxy server depends instead on the Crawly.Middlewares.RequestOptions middleware. Get the URL of a free proxy from a site like Free Proxy List and then use it in the configuration file as follows:
import Config
config :crawly,
# ...
middlewares: [
# other middlewares...
{Crawly.Middlewares.RequestOptions, [proxy: {"231.32.4.13", 3671}]}
],
pipelines: [
#...
]
The chosen proxy server will probably no longer work by the time you follow this tutorial. The reason is that free proxies are short-lived and unreliable. Plus, they're data-greedy and only good for learning purposes!
Don't forget that these two tips are just baby steps to bypass anti-bot measures. Advanced solutions like Cloudflare can still detect your Elixir web-scraping script as a bot. For example, try to get the HTML of the Antibot Challenge page:
defmodule AntibotSpider do
use Crawly.Spider
# Logger.info/1 is a macro, so Logger must be required before it is called
require Logger
@impl Crawly.Spider
def base_url(), do: "https://www.scrapingcourse.com/"
@impl Crawly.Spider
def init() do
[start_urls: ["https://www.scrapingcourse.com/antibot-challenge"]]
end
@impl Crawly.Spider
@doc """
Extract items and requests to follow from the given response
"""
def parse_item(response) do
# extract the HTML from the target page and log it
html = response.body
Logger.info(html)
%Crawly.ParsedItem{items: []}
end
end
The Crawly spider will receive the following 403 error page:
<!DOCTYPE html>
<!-- omitted for brevity -->
<head>
<title>Attention Required! | Cloudflare</title>
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<!-- omitted for brevity -->
How to avoid that? With ZenRows! Not only does this service offer the best anti-bot toolkit, but it also rotates your User Agent, adds IP rotation, and more.
Try the power of ZenRows with Crawly by scraping the Antibot Challenge page that blocked you earlier. Sign up and go to the Universal Scraper API Playground. Paste the target URL in the link box and activate Adaptive Stealth Mode.
Next, format your Elixir request as shown:
defmodule AntibotSpider do
use Crawly.Spider
@impl Crawly.Spider
def base_url(), do: "https://api.zenrows.com"
@impl Crawly.Spider
def init() do
[start_urls: ["https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.scrapingcourse.com%2Fantibot-challenge&mode=auto"]]
end
@impl Crawly.Spider
def parse_item(response) do
# log the HTML returned by ZenRows
IO.puts(response.body)
%Crawly.ParsedItem{items: [], requests: []}
end
end
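The long start URL in the spider is easier to read when built from its parts. As a small sketch, URI.encode_query from the standard library percent-encodes the target URL so it can be passed as a query parameter. YOUR_ZENROWS_API_KEY is a placeholder, as in the spider:

```elixir
# Build the API URL from its parts; URI.encode_query/1 percent-encodes
# the target URL so it can travel safely as a query parameter.
# YOUR_ZENROWS_API_KEY is a placeholder for your own key.
params = [
  apikey: "YOUR_ZENROWS_API_KEY",
  url: "https://www.scrapingcourse.com/antibot-challenge",
  mode: "auto"
]

api_url = "https://api.zenrows.com/v1/?" <> URI.encode_query(params)

IO.puts(api_url)
```

A keyword list is used instead of a map so the parameters keep the order in which they are written.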
Run your spider again, and this time it'll print the source HTML of the Antibot Challenge page instead of the Cloudflare error page.
Wow! You just integrated ZenRows into the Crawly Elixir library.
Scraping JavaScript-Rendered Pages with Elixir
Crawly comes with Splash integration through the fetcher configuration option. This allows the scraping of pages that require JavaScript for rendering or data retrieval. Follow the instructions below to use Crawly with Splash to scrape the Infinite Scrolling demo page.
That page loads new data via JavaScript as the user scrolls down. Thus, it's a great example of a dynamic-content page that requires a headless browser.
Make sure you have Docker installed on your machine. Then, download the Splash image using this command and run it on port 8050 using the command below it:
docker pull scrapinghub/splash
docker run -it -p 8050:8050 --rm scrapinghub/splash
For a complete tutorial on how to set up Splash, follow our guide on Scrapy Splash.
Configure Crawly to use Splash for fetching and rendering data by setting the following fetcher option in config.exs:
import Config
config :crawly,
fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html", wait: 3]},
middlewares: [
# ...
],
pipelines: [
# ...
]
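Splash's render.html endpoint takes the page to render and the wait time as query parameters. The sketch below builds such a URL by hand, which can be handy for checking in a browser or with curl that the container responds. With the fetcher configured, Crawly issues the Splash requests for you, so this step is only for illustration:

```elixir
# Values taken from the fetcher configuration and the demo page
splash_base = "http://localhost:8050/render.html"
target = "https://scrapingclub.com/exercise/list_infinite_scroll/"

# Encode the target URL and the wait time (in seconds) as query parameters
render_url = splash_base <> "?" <> URI.encode_query(url: target, wait: 3)

IO.puts(render_url)
```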
Create an infinite_scrolling_spider.ex spider containing the InfiniteScrollingSpider module below:
defmodule InfiniteScrollingSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://scrapingclub.com/"

  @impl Crawly.Spider
  def init() do
    [start_urls: ["https://scrapingclub.com/exercise/list_infinite_scroll/"]]
  end

  @impl Crawly.Spider
  @doc """
  Extract items and requests to follow from the given response
  """
  def parse_item(response) do
    {:ok, document} = Floki.parse_document(response.body)

    product_items =
      document
      |> Floki.find(".post")
      |> Enum.map(fn x ->
        %{
          url: Floki.find(x, "h4 a") |> Floki.attribute("href") |> Floki.text(),
          name: Floki.find(x, "h4") |> Floki.text(),
          image: Floki.find(x, "img") |> Floki.attribute("src") |> Floki.text(),
          price: Floki.find(x, "h5") |> Floki.text()
        }
      end)

    %Crawly.ParsedItem{items: product_items}
  end
end
Execute your new spider:
iex -S mix run -e "Crawly.Engine.start_spider(InfiniteScrollingSpider)"
That will log:
[debug] Stored item: %{name: "Short Dress", image: "/static/img/90008-E.jpg", url: "/exercise/list_basic_detail/90008-E/", price: "$24.99"}
# omitted for brevity...
[debug] Stored item: %{name: "Fitted Dress", image: "/static/img/94766-A.jpg", url: "/exercise/list_basic_detail/94766-A/", price: "$34.99"}
Wonderful! You're now an Elixir web scraping master!
Conclusion
Elixir's concurrency model and fault tolerance make it a strong choice for large-scale scraping pipelines. Tasks that would overwhelm a single-threaded scraper run efficiently across Elixir's lightweight processes. You've now seen how to build a full scraping pipeline with Elixir's Crawly, handle pagination, set custom headers and proxies, and render JavaScript-heavy pages.
The one challenge no scraper can fully solve on its own is advanced anti-bot protection. When you get blocked, ZenRows handles it for you, automatically rotating IPs, bypassing CAPTCHAs, and rendering JavaScript without any local browser overhead.