Want to experiment with a new approach to building a web scraping script? Good idea! How about using a purely functional language such as Haskell? The language's scripting capabilities and concise syntax make it a great option for the job.
This tutorial will guide you through building a complete Haskell web scraping script using scalpel and webdriver.
Let's dive in!
Why Use Haskell for Web Scraping?
When it comes to web scraping, most developers go for Python or JavaScript due to their popularity and community support.
But while Haskell may not be the best language for web scraping, it's still an interesting choice for at least three reasons:
- It provides a different development experience from imperative scripting languages.
- It has an extremely concise syntax that relies on easy-to-understand pure functions.
- It comes with complete libraries for HTML parsing and browser automation.
Thanks to its rich ecosystem and popularity among functional languages, web scraping in Haskell is worth a go. See for yourself: let's walk through a step-by-step process of building a scraper.
Prerequisites
Prepare your Haskell environment for web scraping with the Scalpel package.
Install Haskell
To set up a Haskell project with external dependencies, you need GHC and Stack. GHC is the Haskell compiler, while Stack is the Haskell package manager. The recommended method to install them both is to use GHCup.
On Windows, launch GHCup with the following PowerShell command:
Set-ExecutionPolicy Bypass -Scope Process -Force;[System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; try { Invoke-Command -ScriptBlock ([ScriptBlock]::Create((Invoke-WebRequest https://www.haskell.org/ghcup/sh/bootstrap-haskell.ps1 -UseBasicParsing))) -ArgumentList $true } catch { Write-Error $_ }
On macOS and Linux, execute:
curl --proto '=https' --tlsv1.2 -sSf https://get-ghcup.haskell.org | sh
During the setup process, you'll have to answer a few questions. Make sure to answer "Y" (Yes) when asked whether you want to install Stack. For the other questions, the default answers are fine.
The installation process may take several minutes, so be patient.
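Once the installer finishes, you can verify the toolchain from a new terminal with the standard version commands (your version numbers will differ):
ghc --version
stack --version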
Great! You now have everything you need to create a Haskell web scraping project.
Create Your Haskell Project
Run the stack new command to initialize a new Haskell project called web-scraper:
stack new web-scraper
Keep in mind that the project name must follow the Cabal naming convention. Otherwise, you'll get the following error:
Expected a package name acceptable to Cabal, but got: <your_project_name>
An acceptable package name comprises an alphanumeric 'word'; or two or more
such words, with the words separated by a hyphen/minus character ('-'). A word
cannot be comprised only of the characters '0' to '9'.
An alphanumeric character is one in one of the Unicode Letter categories
(Lu (uppercase), Ll (lowercase), Lt (titlecase), Lm (modifier), or Lo (other))
or Number categories (Nd (decimal), Nl (letter), or No (other)).
Good! The web-scraper folder will now contain your Haskell project. Open it in your Haskell IDE; Visual Studio Code with the Haskell extension will do.
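The project follows Stack's default template, so its layout should look roughly like this (a few extra files, such as a README and license, may also be present):
web-scraper/
├── app/
│   └── Main.hs
├── src/
│   └── Lib.hs
├── test/
│   └── Spec.hs
├── package.yaml
└── stack.yaml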
Take a look at the Main.hs file inside /app:
module Main (main) where
import Lib
main :: IO ()
main = someFunc
This defines a Main module whose main function calls someFunc from the imported Lib module. If you look at the Lib.hs file inside /src, you'll see that someFunc simply prints "someFunc".
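For reference, Lib.hs as generated by the default template looks like this:
module Lib
    ( someFunc
    ) where

someFunc :: IO ()
someFunc = putStrLn "someFunc"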
Launch the command below in the project folder to run your Haskell application:
stack run
It may have to download a few extra dependencies the first time, so be patient.
The result in the terminal will be:
someFunc
Well done! You're ready to get started with web scraping in Haskell.
How to Do Web Scraping With Haskell
Follow this step-by-step section to learn how to build a web scraper in Haskell. The target site will be ScrapeMe.
Retrieve all product data from this site with web scraping in Haskell!
Step 1: Scrape by Requesting Your Target Page
The easiest way to perform web scraping with Haskell is to use Scalpel. This library provides both an HTTP client to retrieve HTML pages and HTML parsers to parse them. Add it to your project's dependencies by appending it to the dependencies section of package.yaml:
dependencies:
# ...
- scalpel
Now, run the stack command below to build your project and install Scalpel:
stack build
Next, import it in Main.hs:
import Text.HTML.Scalpel
Call the scrapeURL function to get the HTML document associated with the target URL. This also executes a Scraper function on that content. To extract the raw HTML, use the htmls function applied to anySelector:
{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.Scalpel
main :: IO ()
main = do
-- retrieve the HTML content from the URL
-- and print it
htmlCode <- scrapeURL "https://scrapeme.live/shop/" $ htmls anySelector
maybe printError printHtml htmlCode
where
printError = putStrLn "Could not connect to the specified URL"
printHtml = mapM_ putStrLn
This snippet also enables the OverloadedStrings language extension required by Scalpel. Note that you need to use maybe, as scrapeURL returns an optional value.
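If the maybe combinator is unfamiliar, the call above is equivalent to this explicit pattern match (shown only for clarity; it isn't part of the script):
-- equivalent to the maybe call above
case htmlCode of
  Just html -> mapM_ putStrLn html
  Nothing   -> putStrLn "Could not connect to the specified URL"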
Run your web scraping Haskell script, and it'll print:
<!doctype html>
<html lang="en-GB">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=2.0">
<link rel="profile" href="http://gmpg.org/xfn/11">
<link rel="pingback" href="https://scrapeme.live/xmlrpc.php">
<title>Products -- ScrapeMe</title>
<!-- Omitted for brevity... -->
Awesome! Your web scraper connects to the target page as desired. It's time to see how to extract some data from it.
Step 2: Extract Data From One Element
To scrape an HTML element on a page, select it on the DOM and then apply the data extraction logic. Define a proper HTML node selection strategy by inspecting the HTML code of the target page.
Open the target page of your script in the browser, right-click on a product HTML node, and select "Inspect." This will open the following DevTools section:
Expand the HTML code and note how each product node is a li element with a product class.
Inside a product node, you'll see the following elements:
- An <a> node containing the URL of the product.
- An <img> node displaying the product image.
- An <h2> node with the product name.
- A <span> node with a price class storing the product price.
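In simplified form (some wrapper elements and class names omitted), a product node looks roughly like this:
<li class="product">
  <a href="https://scrapeme.live/shop/Bulbasaur/">
    <img src="https://scrapeme.live/wp-content/uploads/2018/08/001-350x350.png" />
    <h2>Bulbasaur</h2>
    <span class="price">£63.00</span>
  </a>
</li>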
You now have everything you need to implement web scraping in Haskell. Define a new scrapeProduct function and break it down into a few inner functions. Use the chroot function from Scalpel to select the product node and apply the data parsing logic to it:
scrapeProduct :: IO (Maybe String)
-- connect to the target page and apply the scraping function
scrapeProduct = scrapeURL "https://scrapeme.live/shop/" product
where
product :: Scraper String String
-- select the product HTML element to scrape
product = chroot ("li" @: [hasClass "product"]) productData
productData :: Scraper String String
productData = do
-- data extraction logic
url <- attr "href" "a"
image <- attr "src" "img"
name <- text "h2"
price <- text $ "span" @: [hasClass "price"]
-- return the scraped data as a string
return $ "URL: " ++ url ++ ", Image: " ++ image ++ ", Name: " ++ name ++ ", Price: " ++ price
The text function returns the string content contained in the element. attr returns the value of the specified HTML attribute from the given node.
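Scalpel also ships plural variants of these primitives, which return every match instead of just the first one. As a quick sketch (not needed for this tutorial), you could define a Scraper like this:
-- a sketch: collect every product name and link URL on the page in one pass
allNamesAndLinks :: Scraper String ([String], [String])
allNamesAndLinks = do
  names <- texts "h2"       -- texts returns all matching texts, not just the first
  links <- attrs "href" "a" -- attrs does the same for attribute values
  return (names, links)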
Call the scrapeProduct function inside main:
main :: IO ()
main = do
-- call the product scraping function
product <- scrapeProduct
case product of
Just x -> print x
Nothing -> putStrLn "Did not find the desired element"
Your Main.hs file will now contain:
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where
import Text.HTML.Scalpel
scrapeProduct :: IO (Maybe String)
-- connect to the target page and apply the scraping function
scrapeProduct = scrapeURL "https://scrapeme.live/shop/" product
where
product :: Scraper String String
-- select the product HTML element to scrape
product = chroot ("li" @: [hasClass "product"]) productData
productData :: Scraper String String
productData = do
-- data extraction logic
url <- attr "href" "a"
image <- attr "src" "img"
name <- text "h2"
price <- text $ "span" @: [hasClass "price"]
-- return the scraped data as a string
return $ "URL: " ++ url ++ ", Image: " ++ image ++ ", Name: " ++ name ++ ", Price: " ++ price
main :: IO ()
main = do
-- call the product scraping function
product <- scrapeProduct
case product of
Just x -> print x
Nothing -> putStrLn "Did not find the desired element"
Run your Haskell web scraping script, and it'll produce this output:
"URL: https://scrapeme.live/shop/Bulbasaur/, Image: https://scrapeme.live/wp-content/uploads/2018/08/001-350x350.png, Name: Bulbasaur, Price: £63.00"
Amazing! The scraping logic returns the desired data. Now, let's learn how to scrape all the products on the page.
Step 3: Extract Data From All Elements
The target webpage contains several products. To keep track of the data extracted from each of them, define a Product type:
data Product = Product
{ url :: String
, image :: String
, name :: String
, price :: String
} deriving Show
Then, extend the scrapeProduct function so that it targets all product nodes and returns a list of Product values. Rename it to scrapeProducts and use chroots instead of chroot to select multiple nodes on the page:
scrapeProducts :: IO (Maybe [Product])
-- connect to the target page and apply the scraping function
scrapeProducts = scrapeURL "https://scrapeme.live/shop/" products
where
products :: Scraper String [Product]
-- select all product HTML elements on the page
products = chroots ("li" @: [hasClass "product"]) productData
productData :: Scraper String Product
productData = do
-- data extraction logic
url <- attr "href" "a"
image <- attr "src" "img"
name <- text "h2"
price <- text $ "span" @: [hasClass "price"]
-- return the scraped data as a Product
return $ Product url image name price
Here’s what the Main.hs scraping script will look like:
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where
import Text.HTML.Scalpel
-- custom type to represent the data
-- contained in a product HTML element
data Product = Product
{ url :: String
, image :: String
, name :: String
, price :: String
} deriving Show
scrapeProducts :: IO (Maybe [Product])
-- connect to the target page and apply the scraping function
scrapeProducts = scrapeURL "https://scrapeme.live/shop/" products
where
products :: Scraper String [Product]
-- select all product HTML elements on the page
products = chroots ("li" @: [hasClass "product"]) productData
productData :: Scraper String Product
productData = do
-- data extraction logic
url <- attr "href" "a"
image <- attr "src" "img"
name <- text "h2"
price <- text $ "span" @: [hasClass "price"]
-- return the scraped data as a Product
return $ Product url image name price
main :: IO ()
main = do
-- call the product scraping function
products <- scrapeProducts
case products of
Just x -> print x
Nothing -> putStrLn "Did not find the desired elements"
Run it, and it'll print:
[Product {url = "https://scrapeme.live/shop/Bulbasaur/", image = "https://scrapeme.live/wp-content/uploads/2018/08/001-350x350.png", name = "Bulbasaur", price = "£63.00"},
-- omitted for brevity...
Product {url = "https://scrapeme.live/shop/Pidgey/", image = "https://scrapeme.live/wp-content/uploads/2018/08/016-350x350.png", name = "Pidgey", price = "£159.00"}]
That's it! The printed objects match the products on the page.
Step 4: Export Your Data Into a CSV File
The simplest way to save the scraped data to a CSV file is to use cassava. This package is a popular Haskell utility for parsing and encoding comma-separated values. Add cassava, along with bytestring (needed to write the encoded output to a file), to the dependencies section of package.yaml:
dependencies:
# ...
- cassava
- bytestring
Reload the dependencies of your project with:
stack build
Then, import the required libraries:
import Data.Csv
import qualified Data.ByteString.Lazy as BL
To export data to CSV, cassava requires you to define some type class instances for your data. As you already have a custom Product type, you can speed up the process through instance derivation via GHC generics.
First, enable generic derivation with the following extension:
{-# LANGUAGE DeriveGeneric #-}
Then, import GHC.Generics:
import GHC.Generics
Add Generic to the deriving clause of the Product type:
data Product = Product
{ url :: String
, image :: String
, name :: String
, price :: String
} deriving (Show, Generic)
You can now define the required cassava instances. Thanks to the Generic instance, cassava derives them automatically, and the resulting CSV header will follow the record's field order (url, image, name, price):
instance ToNamedRecord Product
instance DefaultOrdered Product
Use them to export the collected data stored in products to CSV with encodeDefaultOrderedByName:
main :: IO ()
main = do
-- call the product scraping function
products <- scrapeProducts
case products of
Just x -> BL.writeFile "products.csv" $ encodeDefaultOrderedByName x
Nothing -> putStrLn "Did not find the desired elements."
Put it all together, and you'll get:
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE DeriveGeneric #-}
module Main (main) where
import Text.HTML.Scalpel
import Data.Csv
import qualified Data.ByteString.Lazy as BL
import GHC.Generics
-- custom type to represent the data
-- contained in a product HTML element
data Product = Product
{ url :: String
, image :: String
, name :: String
, price :: String
} deriving (Show, Generic)
-- custom types required to export data to CSV
instance ToNamedRecord Product
instance DefaultOrdered Product
scrapeProducts :: IO (Maybe [Product])
-- connect to the target page and apply the scraping function
scrapeProducts = scrapeURL "https://scrapeme.live/shop/" products
where
products :: Scraper String [Product]
-- select all product HTML elements on the page
products = chroots ("li" @: [hasClass "product"]) productData
productData :: Scraper String Product
productData = do
-- data extraction logic
url <- attr "href" "a"
image <- attr "src" "img"
name <- text "h2"
price <- text $ "span" @: [hasClass "price"]
-- return the scraped data as a Product
return $ Product url image name price
main :: IO ()
main = do
-- call the product scraping function
products <- scrapeProducts
case products of
-- export the scraped data to CSV
Just x -> BL.writeFile "products.csv" $ encodeDefaultOrderedByName x
Nothing -> putStrLn "Did not find the desired elements."
Launch the Haskell web scraping script:
stack run
Wait for the script execution to end, and a products.csv file will appear in the project's folder. Open it, and you'll see the following:
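The exact values depend on the live catalog, but the file will look roughly like this:
url,image,name,price
https://scrapeme.live/shop/Bulbasaur/,https://scrapeme.live/wp-content/uploads/2018/08/001-350x350.png,Bulbasaur,£63.00
...
https://scrapeme.live/shop/Pidgey/,https://scrapeme.live/wp-content/uploads/2018/08/016-350x350.png,Pidgey,£159.00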
Et voilà! You've just performed web scraping with Haskell.
Haskell for Advanced Web Scraping
Now that you know the basics, it's time to dive into more advanced web scraping Haskell techniques.
Scrape Multiple Pages With Haskell
Currently, the Haskell scraping script retrieves data from the products on a single page. However, the target site consists of several web pages. To scrape them all, you need to go through each with web crawling.
Learn more about the differences between web crawling vs. web scraping.
Crawling a site involves discovering all its pages and visiting them by following their links. To avoid hitting a page twice, the task requires support data structures and custom logic.
Web crawling in Haskell is possible but complex and error-prone.
However, there's a smarter approach: take a look at the URLs of the pagination pages on the site. These all have the following format:
https://scrapeme.live/shop/page/<page>/
Change the definition of the scrapeProducts function so that it accepts the URL of the page as input, and rename it to scrapeProductsPage:
scrapeProductsPage :: String -> IO (Maybe [Product])
scrapeProductsPage pageUrl = scrapeURL pageUrl products
where
-- ...
Next, define a new scrapeProducts function. This will call scrapeProductsPage on the pagination pages and concatenate the results:
scrapeProducts :: IO (Maybe [Product])
scrapeProducts = do
-- scrape products from each page and concatenate the results
productLists <- mapM scrapeProductsPage [ "https://scrapeme.live/shop/page/" ++ show page | page <- [1..5] ]
return $ concat <$> sequence productLists
Here, sequence turns the list of Maybe results into a single Maybe containing all the page results (it yields Nothing if any page fails), and concat flattens them into one list of products. This is just a sample script: visiting only five pages is enough to achieve the goal, but in production, make sure to visit all pages on your target site.
Integrate the crawling logic into your web scraping Haskell script, and you'll get:
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE DeriveGeneric #-}
module Main (main) where
import Text.HTML.Scalpel
import Data.Csv
import qualified Data.ByteString.Lazy as BL
import GHC.Generics
-- custom type to represent the data
-- contained in a product HTML element
data Product = Product
{ url :: String
, image :: String
, name :: String
, price :: String
} deriving (Show, Generic)
-- custom types required to export data to CSV
instance ToNamedRecord Product
instance DefaultOrdered Product
scrapeProductsPage :: String -> IO (Maybe [Product])
-- connect to the target page and apply the scraping function
scrapeProductsPage pageUrl = scrapeURL pageUrl products
where
products :: Scraper String [Product]
-- select all product HTML elements on the page
products = chroots ("li" @: [hasClass "product"]) productData
productData :: Scraper String Product
productData = do
-- data extraction logic
url <- attr "href" "a"
image <- attr "src" "img"
name <- text "h2"
price <- text $ "span" @: [hasClass "price"]
-- return the scraped data as a Product
return $ Product url image name price
scrapeProducts :: IO (Maybe [Product])
scrapeProducts = do
-- scrape products from each page and concatenate the results
productLists <- mapM scrapeProductsPage [ "https://scrapeme.live/shop/page/" ++ show page | page <- [1..5] ]
return $ concat <$> sequence productLists
main :: IO ()
main = do
-- call the product scraping function
products <- scrapeProducts
case products of
-- export the scraped data to CSV
Just x -> BL.writeFile "products.csv" $ encodeDefaultOrderedByName x
Nothing -> putStrLn "Did not find the desired elements."
Run the scraper again:
stack run
This time, it'll retrieve data from 5 product pagination pages. The output CSV file will then contain many more records:
Congrats! You've just learned how to perform web crawling and scraping in Haskell.
Avoid Getting Blocked When Scraping With Haskell
Companies are aware of how valuable their data is. That's why they don't want to give it up for free, even when it's publicly available on their site.
More and more sites are adopting anti-bot technologies to protect data and improve user experience. These systems can detect and block automated scripts, such as your Haskell scraper.
The anti-bot systems pose one of the biggest challenges for Haskell web scraping. While it may not be easy, the right workaround will let you do web scraping without getting blocked. Two effective ways of eluding less sophisticated anti-bots are:
- Setting a real User-Agent header.
- Configuring a proxy to hide your IP.
Adopt them with the instructions below!
Scalpel doesn't natively support user agent and proxy customization. What you can do instead is configure http-client (Network.HTTP.Client), the HTTP library Scalpel uses behind the scenes. Get the URL of a proxy server from a site like Free Proxy List and a User Agent string of a real browser. Then, use them to customize the HTTP client:
{-# LANGUAGE NamedFieldPuns #-}
{-# LANGUAGE OverloadedStrings #-}
import Data.Default (def)
import qualified Network.HTTP.Client as HTTP
import qualified Network.HTTP.Client.TLS as HTTP
import qualified Network.HTTP.Types.Header as HTTP
import System.Environment
import Text.HTML.Scalpel
-- create a new manager based on the default TLS manager
-- with a custom user agent and proxy configuration
managerSettings :: HTTP.ManagerSettings
managerSettings =
HTTP.tlsManagerSettings
{ HTTP.managerModifyRequest = \req -> do
req' <- HTTP.managerModifyRequest HTTP.tlsManagerSettings req
return $
req'
{
-- custom user agent
HTTP.requestHeaders =
(HTTP.hUserAgent, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36")
: HTTP.requestHeaders req'
-- custom proxy
, HTTP.proxy = Just $ HTTP.Proxy "218.85.21.58" 8080
}
}
main :: IO ()
main = do
-- perform an HTTP request to the target page
-- via the customized HTTP client
manager <- Just <$> HTTP.newManager managerSettings
htmlCode <- scrapeURLWithConfig (def {manager}) "https://scrapeme.live/shop/" $ htmls anySelector
maybe printError printHtml htmlCode
where
printError = putStrLn "ERROR: Could not connect to the specified URL"
printHtml = mapM_ putStrLn
The http://218.85.21.58:8080 proxy server will no longer work by the time you read this. Free proxies are short-lived, unreliable, and data-greedy. They’re only suitable for learning purposes, so never use them in production!
For the code above to work, you'll also need to add the following dependencies to package.yaml:
dependencies:
# ...
- http-types
- data-default
- http-client
- http-client-tls
Those two tips may be enough to bypass simple anti-bot measures. But what about more advanced and sophisticated solutions such as Cloudflare?
Check what happens when running the above script against this Cloudflare-protected page:
https://www.g2.com/products/notion/reviews
The result will be the HTML of the following 403 Forbidden page:
<!DOCTYPE html>
<html class="no-js" lang="en-US">
<head>
<title>Attention Required! | Cloudflare</title>
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge" />
<meta name="robots" content="noindex, nofollow" />
<!-- omitted for brevity -->
Unsurprisingly, Cloudflare detected your script as a bot. However, don't give up. What you need is a web scraping API, such as ZenRows. This next-generation tool supports User Agent and IP rotation and comes with the best anti-bot toolkit.
Let's see how to boost your Haskell scraping script with ZenRows. Sign up for free to receive your first 1,000 credits and then reach the Request Builder page:
Let's use the G2.com page mentioned earlier as the destination:
- Paste the target URL (https://www.g2.com/products/notion/reviews) into the "URL to Scrape" input.
- Enable the "JS Rendering" mode (User Agent rotation and the AI-powered anti-bot toolkit are always included by default).
- Toggle the "Premium Proxy" check to get rotating IPs.
- Select "cURL" and then the "API" mode to get the ZenRows API URL to call in your script.
Use the generated URL in scrapeURL:
{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.Scalpel
main :: IO ()
main = do
-- retrieve the HTML content from the URL
-- and print it
htmlCode <- scrapeURL "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com%2Fproducts%2Fnotion%2Freviews&js_render=true&premium_proxy=true" $ htmls anySelector
maybe printError printHtml htmlCode
where
printError = putStrLn "Could not connect to the specified URL"
printHtml = mapM_ putStrLn
Launch the above script. This time, it'll return the HTML of the target Cloudflare-protected page as desired:
<!DOCTYPE html>
<head>
<meta charset="utf-8" />
<link href="https://www.g2.com/assets/favicon-fdacc4208a68e8ae57a80bf869d155829f2400fa7dd128b9c9e60f07795c4915.ico" rel="shortcut icon" type="image/x-icon" />
<title>Notion Reviews 2024: Details, Pricing, & Features | G2</title>
<!-- omitted for brevity ... -->
Nice one! That's how easy it is to use ZenRows for web scraping in Haskell.
Use a Headless Browser With Haskell
Scalpel is a Haskell package that can only deal with static content web pages. If your target pages use JavaScript to dynamically load or render data, you need a different solution.
You must use a tool that renders pages in a controllable browser instance. One of the most popular headless browser libraries is Selenium. Haskell isn't a language officially supported by the project, but there's a community-driven port called webdriver.
Add it to your project's dependencies. As the library works with Text values rather than plain Strings, you'll also need the text package:
dependencies:
# ...
- text
- webdriver
webdriver is an old-fashioned library that still relies on Selenium 2. Download the server .jar executable and put it in your project's directory. You'll require Java to run it.
You'll also need the executable of the driver matching the version of the browser you want to control. Here, we're going to use Chrome. Download the right version of ChromeDriver and copy the executable to the same folder as the Selenium 2 server.
Execute the Selenium 2 server with the command below. By default, it listens on port 4444, which is where webdriver's defaultConfig expects it:
java --add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.nio=ALL-UNNAMED --add-opens java.base/sun.nio.ch=ALL-UNNAMED -jar selenium-server-standalone-2.53.1.jar
Let's target a new page to showcase Selenium's capabilities in Haskell. The Infinite Scrolling demo uses JavaScript for rendering purposes and is a perfect example of a dynamic content page. It loads new products in the browser as the user scrolls down:
Use webdriver to scrape data from it in Haskell:
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where
import qualified Data.Text as T
import Test.WebDriver
data Product = Product
{ name :: String,
price :: String
}
deriving (Show)
extractProduct :: Element -> WD Product
extractProduct productElement = do
-- select the name and price elements
nameElement <- findElemFrom productElement $ ByCSS "h4"
priceElement <- findElemFrom productElement $ ByCSS "h5"
-- extract the text content of name and price elements
name <- getText nameElement
price <- getText priceElement
-- create a Product object
return $ Product (T.unpack name) (T.unpack price)
scrapeProducts :: IO [Product]
scrapeProducts = runSession chromeConfig $ do
-- visit the target site
openPage "https://scrapingclub.com/exercise/list_infinite_scroll/"
-- select the product elements
productElements <- findElems $ ByCSS ".post"
-- iterate over the product elements and apply
-- the scraping logic on each of them
products <- traverse extractProduct productElements
-- close selenium
closeSession
return products
-- configure the Chrome instance to control
chromeConfig :: WDConfig
chromeConfig = useBrowser chrome defaultConfig
main :: IO ()
main = do
  -- scrape the products and print them
  products <- scrapeProducts
  mapM_ print products
Run this script:
stack run
It'll produce:
Product {name = "Short Dress", price = "$24.99"}
Product {name = "Patterned Slacks", price = "$29.99"}
Product {name = "Short Chiffon Dress", price = "$49.99"}
Product {name = "Off-the-shoulder Dress", price = "$59.99"}
Product {name = "V-neck Top", price = "$24.99"}
Product {name = "Short Chiffon Dress", price = "$49.99"}
Product {name = "V-neck Top", price = "$24.99"}
Product {name = "V-neck Top", price = "$24.99"}
Product {name = "Short Lace Dress", price = "$59.99"}
Product {name = "Fitted Dress", price = "$34.99"}
Congrats! You're now a Haskell web scraping champion.
Conclusion
This guided tutorial walked you through the process of performing web scraping in Haskell. You learned the fundamentals and then explored more complex aspects and techniques.
Haskell is one of the most popular purely functional programming languages, and it boasts a large ecosystem of libraries. scalpel provides an API to perform web scraping and crawling in Haskell on static pages. For sites using JavaScript, you can use webdriver.
However, no matter how good your Haskell scraper is and which libraries you use, anti-scraping measures will be able to block it. Bypass them all with ZenRows, a scraping API with the most effective built-in anti-bot bypass functionality. Try ZenRows for free today!