How to Web Scrape With Haskell: Tutorial 2024

July 18, 2024 · 8 min read

Want to experiment with a new approach to building a web scraping script? Good idea! How about using a purely functional language such as Haskell? The language's scripting capabilities and concise syntax make it a great option for the job.

This tutorial will guide you through building a complete Haskell web scraping script using scalpel and webdriver.

Let's dive in!

Why Use Haskell for Web Scraping?

When it comes to web scraping, most developers go for Python or JavaScript due to their popularity and community support.

But while Haskell may not be the best language for web scraping, it's still an interesting choice for at least three reasons:

  1. It provides a different development experience from imperative scripting languages.
  2. It has an extremely concise syntax that relies on easy-to-understand pure functions.
  3. It comes with capable libraries for HTML parsing and browser automation.

Thanks to its rich ecosystem and its popularity among functional languages, web scraping in Haskell is worth a try. See for yourself: let's walk through building a scraper step by step.

Prerequisites

Prepare your Haskell environment for web scraping with the Scalpel package.

Install Haskell

To set up a Haskell project with external dependencies, you need GHC and Stack. GHC is the Haskell compiler, while Stack is a build tool that also manages your project's dependencies. The recommended way to install both is GHCup.

On Windows, launch the GHCup installer with the following PowerShell command:

Terminal
Set-ExecutionPolicy Bypass -Scope Process -Force;[System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; try { Invoke-Command -ScriptBlock ([ScriptBlock]::Create((Invoke-WebRequest https://www.haskell.org/ghcup/sh/bootstrap-haskell.ps1 -UseBasicParsing))) -ArgumentList $true } catch { Write-Error $_ }

On macOS and Linux, execute:

Terminal
curl --proto '=https' --tlsv1.2 -sSf https://get-ghcup.haskell.org | sh

During the setup process, you'll have to answer a few questions. Make sure to answer "Y" (Yes) when asked if you want to install Stack. For the other questions, the default answers are fine.

Great! You now have everything you need to create a Haskell web scraping project.


Create Your Haskell Project

Run the stack new command to initialize a new Haskell project called web-scraper:

Terminal
stack new web-scraper

Keep in mind that the project name must follow the Cabal naming convention. Otherwise, you'll get the following error:

Output
Expected a package name acceptable to Cabal, but got: <your_project_name>

An acceptable package name comprises an alphanumeric 'word'; or two or more
such words, with the words separated by a hyphen/minus character ('-'). A word
cannot be comprised only of the characters '0' to '9'.

An alphanumeric character is one in one of the Unicode Letter categories
(Lu (uppercase), Ll (lowercase), Lt (titlecase), Lm (modifier), or Lo (other))
or Number categories (Nd (decimal), Nl (letter), or No (other)).

Good! The web-scraper folder now contains your Haskell project. Open it in your favorite Haskell IDE. Visual Studio Code with the Haskell extension will do.

Take a look at the Main.hs file inside /app:

Main.hs
module Main (main) where

import Lib

main :: IO ()
main = someFunc

This defines a Main module where the main function calls someFunc from the imported Lib module. If you look at the Lib.hs file inside /src, you'll see that someFunc simply prints "someFunc".
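
For reference, the Lib.hs file generated by Stack's default template looks like this (the exact contents may vary slightly between template versions):

Lib.hs
module Lib
    ( someFunc
    ) where

someFunc :: IO ()
someFunc = putStrLn "someFunc"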

Launch the command below in the project folder to run your Haskell application:

Terminal
stack run

It may have to download a few extra dependencies the first time, so be patient.

The result in the terminal will be:

Output
someFunc

Well done! You're ready to get started with web scraping in Haskell.

How to Do Web Scraping With Haskell

Follow this step-by-step section to learn how to build a web scraper in Haskell. The target site will be ScrapeMe.

ScrapeMe Interface

Retrieve all product data from this site with web scraping in Haskell!

Step 1: Scrape by Requesting Your Target Page

The easiest way to perform web scraping with Haskell is to use Scalpel. This library provides both an HTTP client to retrieve HTML pages and HTML parsers to parse them. Add it to your project's dependencies by appending it to the dependencies section of package.yaml:

package.yaml
dependencies:
# ...
- scalpel

Now, run the stack command below to build your project and install Scalpel:

Terminal
stack build

Next, import it in Main.hs:

Main.hs
import Text.HTML.Scalpel

Call the scrapeURL function to retrieve the HTML document at the target URL and run a Scraper function on its content. To extract the raw HTML, apply the htmls function to anySelector:

Main.hs
{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel

main :: IO ()
main = do
    -- retrieve the HTML content from the URL
    -- and print it
    htmlCode <- scrapeURL "https://scrapeme.live/shop/" $ htmls anySelector
    maybe printError printHtml htmlCode
    where
        printError = putStrLn "Could not connect to the specified URL"
        printHtml = mapM_ putStrLn

This snippet also enables the OverloadedStrings language extension required by Scalpel. Note that you need to use maybe as scrapeURL returns an optional value.
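
If the maybe combinator feels unfamiliar, the same handling can be written as an explicit case expression; the two versions are functionally equivalent:

Main.hs
main :: IO ()
main = do
    -- retrieve the HTML content from the URL
    htmlCode <- scrapeURL "https://scrapeme.live/shop/" $ htmls anySelector
    -- pattern match on the optional result instead of using maybe
    case htmlCode of
        Nothing    -> putStrLn "Could not connect to the specified URL"
        Just pages -> mapM_ putStrLn pages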

Run your web scraping Haskell script, and it'll print:

Output
<!doctype html>
<html lang="en-GB">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=2.0">
<link rel="profile" href="http://gmpg.org/xfn/11">
<link rel="pingback" href="https://scrapeme.live/xmlrpc.php">
<title>Products – ScrapeMe</title>
<!-- Omitted for brevity... -->

Awesome! Your web scraper connects to the target page as desired. It's time to see how to extract some data from it.

Step 2: Extract Data From One Element

To scrape an HTML element on a page, select it in the DOM and then apply the data extraction logic. To define an effective node selection strategy, inspect the HTML code of the target page.

Open the target page of your script in the browser, right-click on a product HTML node, and select "Inspect." This will open the following DevTools section:

ScrapeMe Inspect

Expand the HTML code and note how each product node is a li element with a product class.

Inside a product node, you'll see the following elements:

  • An <a> node containing the URL of the product.
  • An <img> node displaying the product image.
  • An <h2> node with the product name.
  • A <span> node with a price class storing the product price.

You now have everything you need to implement web scraping in Haskell. Define a new scrapeProduct function and break it down into a few inner functions. Use Scalpel's chroot function to select the first product node and apply the data parsing logic to it:

Main.hs
scrapeProduct :: IO (Maybe String)
-- connect to the target page and apply the scraping function
scrapeProduct = scrapeURL "https://scrapeme.live/shop/" product
  where
    product :: Scraper String String
    -- select the product HTML element to scrape
    product = chroot ("li" @: [hasClass "product"]) productData

    productData :: Scraper String String
    productData = do
      -- data extraction logic
      url <- attr "href" "a"
      image <- attr "src" "img"
      name <- text "h2"
      price <- text $ "span" @: [hasClass "price"]
      -- return the scraped data as a string
      return $ "URL: " ++ url ++ ", Image: " ++ image ++ ", Name: " ++ name ++ ", Price: " ++ price

The text function returns the string content contained in the element. attr returns the value of the specified HTML attribute from the given node.
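
For reference, the Scalpel functions used so far have roughly the following shapes, here specialized to String (check the package documentation for the exact class constraints):

Example
-- simplified signatures, for orientation only
chroot :: Selector -> Scraper String a -> Scraper String a
text   :: Selector -> Scraper String String
attr   :: String -> Selector -> Scraper String String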

Call the scrapeProduct function inside main:

Main.hs
main :: IO ()
main = do
  -- call the product scraping function
  product <- scrapeProduct
  case product of
    Just x -> print x
    Nothing -> putStrLn "Did not find the desired element"

Your Main.hs file will now contain:

Main.hs
{-# LANGUAGE OverloadedStrings #-}

module Main (main) where

import Text.HTML.Scalpel

scrapeProduct :: IO (Maybe String)
-- connect to the target page and apply the scraping function
scrapeProduct = scrapeURL "https://scrapeme.live/shop/" product
  where
    product :: Scraper String String
    -- select the product HTML element to scrape
    product = chroot ("li" @: [hasClass "product"]) productData

    productData :: Scraper String String
    productData = do
      -- data extraction logic
      url <- attr "href" "a"
      image <- attr "src" "img"
      name <- text "h2"
      price <- text $ "span" @: [hasClass "price"]
      -- return the scraped data as a string
      return $ "URL: " ++ url ++ ", Image: " ++ image ++ ", Name: " ++ name ++ ", Price: " ++ price

main :: IO ()
main = do
  -- call the product scraping function
  product <- scrapeProduct
  case product of
    Just x -> print x
    Nothing -> putStrLn "Did not find the desired element"

Run your Haskell web scraping script, and it'll produce this output:

Output
"URL: https://scrapeme.live/shop/Bulbasaur/, Image: https://scrapeme.live/wp-content/uploads/2018/08/001-350x350.png, Name: Bulbasaur, Price: £63.00"

Amazing! The scraping logic returns the desired data. Now, let's learn how to scrape all the products on the page.

Step 3: Extract Data From All Elements

The target webpage contains several products. To keep track of the data extracted from each of them, define a Product type:

Main.hs
data Product = Product
  { url :: String
  , image :: String
  , name :: String
  , price :: String
  } deriving Show

Then, turn scrapeProduct into a scrapeProducts function that targets all product nodes and returns a list of Product values. Use chroots instead of chroot to select every matching node on the page:

Main.hs
scrapeProducts :: IO (Maybe [Product])
-- connect to the target page and apply the scraping function
scrapeProducts = scrapeURL "https://scrapeme.live/shop/" products
  where
    products :: Scraper String [Product]
    -- select all product HTML elements on the page
    products = chroots ("li" @: [hasClass "product"]) productData

    productData :: Scraper String Product
    productData = do
      -- data extraction logic
      url <- attr "href" "a"
      image <- attr "src" "img"
      name <- text "h2"
      price <- text $ "span" @: [hasClass "price"]
      -- return the scraped data as a Product
      return $ Product url image name price

Here’s what the Main.hs scraping script will look like:

Main.hs
{-# LANGUAGE OverloadedStrings #-}

module Main (main) where

import Text.HTML.Scalpel

-- custom type to represent the data
-- contained in a product HTML element
data Product = Product
  { url :: String
  , image :: String
  , name :: String
  , price :: String
  } deriving Show

scrapeProducts :: IO (Maybe [Product])
-- connect to the target page and apply the scraping function
scrapeProducts = scrapeURL "https://scrapeme.live/shop/" products
  where
    products :: Scraper String [Product]
    -- select all product HTML elements on the page
    products = chroots ("li" @: [hasClass "product"]) productData

    productData :: Scraper String Product
    productData = do
      -- data extraction logic
      url <- attr "href" "a"
      image <- attr "src" "img"
      name <- text "h2"
      price <- text $ "span" @: [hasClass "price"]
      -- return the scraped data as a Product
      return $ Product url image name price

main :: IO ()
main = do
  -- call the product scraping function
  products <- scrapeProducts
  case products of
    Just x -> print x
    Nothing -> putStrLn "Did not find the desired elements"

Run it, and it'll print:

Output
[Product {url = "https://scrapeme.live/shop/Bulbasaur/", image = "https://scrapeme.live/wp-content/uploads/2018/08/001-350x350.png", name = "Bulbasaur", price = "£63.00"},
-- omitted for brevity...
Product {url = "https://scrapeme.live/shop/Pidgey/", image = "https://scrapeme.live/wp-content/uploads/2018/08/016-350x350.png", name = "Pidgey", price = "£159.00"}]

That's it! The printed objects match the products on the page.

Step 4: Export Your Data Into a CSV File

The simplest way to save the scraped data in a CSV file is to use cassava. This package is a popular Haskell utility for parsing and encoding comma-separated values.

Add cassava to the dependencies section of package.yaml, along with bytestring, which you'll need to write the encoded output to a file:

package.yaml
dependencies:
# ...
- cassava
- bytestring

Reload the dependencies of your project with:

Terminal
stack build

Then, import the required libraries:

Main.hs
import Data.Csv
import qualified Data.ByteString.Lazy as BL

To export data to CSV, cassava requires your data type to implement a couple of its type classes. As you already have a custom Product type, you can speed up the process through instance derivation via GHC generics.

First, enable generic derivation with the following extension:

Main.hs
{-# LANGUAGE DeriveGeneric #-}

Then, import GHC.Generics:

Main.hs
import GHC.Generics

Add Generic to the deriving members of the Product type:

Main.hs
data Product = Product
  { url :: String
  , image :: String
  , name :: String
  , price :: String
  } deriving (Show, Generic)

You can now define the required cassava instances as below:

Main.hs
instance ToNamedRecord Product
instance DefaultOrdered Product
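
The derived instances use the record field names (url, image, name, and price) as CSV column headers. If you prefer different labels or a specific order, you could write the instances by hand instead, along these lines (the column names below are just an example):

Main.hs
instance ToNamedRecord Product where
  toNamedRecord (Product u i n p) =
    namedRecord
      [ "product_url" .= u
      , "image_url" .= i
      , "name" .= n
      , "price" .= p
      ]

instance DefaultOrdered Product where
  headerOrder _ = header ["product_url", "image_url", "name", "price"]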

Use them to export the collected data stored in products to CSV with encodeDefaultOrderedByName:

Main.hs
main :: IO ()
main = do
  -- call the product scraping function
  products <- scrapeProducts
  case products of
    Just x -> BL.writeFile "products.csv" $ encodeDefaultOrderedByName x
    Nothing -> putStrLn "Did not find the desired elements."

Put it all together, and you'll get:

Main.hs
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE DeriveGeneric #-}

module Main (main) where

import Text.HTML.Scalpel
import Data.Csv
import qualified Data.ByteString.Lazy as BL
import GHC.Generics

-- custom type to represent the data
-- contained in a product HTML element
data Product = Product
  { url :: String
  , image :: String
  , name :: String
  , price :: String
  } deriving (Show, Generic)

-- cassava instances required to export data to CSV
instance ToNamedRecord Product
instance DefaultOrdered Product

scrapeProducts :: IO (Maybe [Product])
-- connect to the target page and apply the scraping function
scrapeProducts = scrapeURL "https://scrapeme.live/shop/" products
  where
    products :: Scraper String [Product]
    -- select all product HTML elements on the page
    products = chroots ("li" @: [hasClass "product"]) productData

    productData :: Scraper String Product
    productData = do
      -- data extraction logic
      url <- attr "href" "a"
      image <- attr "src" "img"
      name <- text "h2"
      price <- text $ "span" @: [hasClass "price"]
      -- return the scraped data as a Product
      return $ Product url image name price

main :: IO ()
main = do
  -- call the product scraping function
  products <- scrapeProducts
  case products of
    -- export the scraped data to CSV
    Just x -> BL.writeFile "products.csv" $ encodeDefaultOrderedByName x
    Nothing -> putStrLn "Did not find the desired elements."

Launch the Haskell web scraping script:

Terminal
stack run

Wait for the script execution to end, and a products.csv file will appear in the project's folder. Open it, and you'll see the following:

ScrapeMe Scraped Data CSV

Et voilà! You've just performed web scraping with Haskell.

Haskell for Advanced Web Scraping

Now that you know the basics, it's time to dive into more advanced web scraping Haskell techniques.

Scrape Multiple Pages With Haskell

Currently, the Haskell scraping script retrieves data from the products on a single page. However, the target site consists of several web pages. To scrape them all, you need to go through each with web crawling.

Crawling a site involves discovering all its pages and visiting them by following their links. To avoid visiting the same page twice, the task requires supporting data structures and custom logic.

Web crawling in Haskell is possible but complex and error-prone.

However, there's a smarter approach: take a look at the URLs of the pagination pages on the site. These all have the following format:

Example
https://scrapeme.live/shop/page/<page>/

ScrapeMe Pages Pagination

Rename the scrapeProducts function to scrapeProductsPage and change its definition so that it accepts the URL of the page as input:

Main.hs
scrapeProductsPage :: String -> IO (Maybe [Product])
scrapeProductsPage pageUrl = scrapeURL pageUrl products
  where
    -- ...

Next, define a new scrapeProducts function. This will call scrapeProductsPage on the pagination pages and concatenate the results:

Main.hs
scrapeProducts :: IO (Maybe [Product])
scrapeProducts = do
  -- scrape products from each page and concatenate the results
  productLists <- mapM scrapeProductsPage [ "https://scrapeme.live/shop/page/" ++ show page | page <- [1..5] ]
  return $ concat <$> sequence productLists
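
One caveat: because of sequence, scrapeProducts returns Nothing if even a single page fails to load. If you'd rather keep whatever pages succeeded, a variant like the sketch below (the function name is made up) simply drops the failed ones:

Main.hs
import Data.Maybe (catMaybes)

scrapeProductsLenient :: IO [Product]
scrapeProductsLenient = do
  -- scrape each pagination page; failed pages yield Nothing
  productLists <- mapM scrapeProductsPage
    [ "https://scrapeme.live/shop/page/" ++ show page | page <- [1..5] ]
  -- keep only the pages that succeeded and flatten the result
  return $ concat (catMaybes productLists)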

Integrate the crawling logic into your web scraping Haskell script, and you'll get:

Main.hs
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE DeriveGeneric #-}

module Main (main) where

import Text.HTML.Scalpel
import Data.Csv
import qualified Data.ByteString.Lazy as BL
import GHC.Generics

-- custom type to represent the data
-- contained in a product HTML element
data Product = Product
  { url :: String
  , image :: String
  , name :: String
  , price :: String
  } deriving (Show, Generic)

-- cassava instances required to export data to CSV
instance ToNamedRecord Product
instance DefaultOrdered Product

scrapeProductsPage :: String -> IO (Maybe [Product])
-- connect to the target page and apply the scraping function
scrapeProductsPage pageUrl = scrapeURL pageUrl products
  where
    products :: Scraper String [Product]
    -- select all product HTML elements on the page
    products = chroots ("li" @: [hasClass "product"]) productData

    productData :: Scraper String Product
    productData = do
      -- data extraction logic
      url <- attr "href" "a"
      image <- attr "src" "img"
      name <- text "h2"
      price <- text $ "span" @: [hasClass "price"]
      -- return the scraped data as a Product
      return $ Product url image name price

scrapeProducts :: IO (Maybe [Product])
scrapeProducts = do
  -- scrape products from each page and concatenate the results
  productLists <- mapM scrapeProductsPage [ "https://scrapeme.live/shop/page/" ++ show page | page <- [1..5] ]
  return $ concat <$> sequence productLists

main :: IO ()
main = do
  -- call the product scraping function
  products <- scrapeProducts
  case products of
    -- export the scraped data to CSV
    Just x -> BL.writeFile "products.csv" $ encodeDefaultOrderedByName x
    Nothing -> putStrLn "Did not find the desired elements."

Run the scraper again:

Terminal
stack run

This time, it'll retrieve data from 5 product pagination pages. The output CSV file will then contain many more records:

ScrapeMe Pagination Products CSV

Congrats! You've just learned how to perform web crawling and scraping in Haskell.

Avoid Getting Blocked When Scraping With Haskell

Companies are aware of how valuable their data is. That's why they don't want to give it up for free, even when it's publicly available on their site.

More and more sites are adopting anti-bot technologies to protect data and improve user experience. These systems can detect and block automated scripts, such as your Haskell scraper.

Anti-bot systems pose one of the biggest challenges for Haskell web scraping. While it may not be easy, the right workarounds will let you scrape without getting blocked. Two effective ways of eluding less sophisticated anti-bots are:

  1. Setting a real User-Agent header.
  2. Configuring a proxy to hide your IP.

Adopt them with the instructions below!

Scalpel doesn't natively support User-Agent and proxy customization. What you can do instead is customize http-client (Network.HTTP.Client), the HTTP library Scalpel uses behind the scenes.

Get the URL of a proxy server from a site like Free Proxy List and the User-Agent string of a real browser. Then, use them to configure a custom http-client manager:

Main.hs
{-# LANGUAGE NamedFieldPuns #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.Default (def)
import qualified Network.HTTP.Client as HTTP
import qualified Network.HTTP.Client.TLS as HTTP
import qualified Network.HTTP.Types.Header as HTTP
import Text.HTML.Scalpel

-- create a new manager based on the default TLS manager
-- with a custom user agent and proxy configuration
managerSettings :: HTTP.ManagerSettings
managerSettings =
  HTTP.tlsManagerSettings
    { HTTP.managerModifyRequest = \req -> do
        req' <- HTTP.managerModifyRequest HTTP.tlsManagerSettings req
        return $
          req'
            {
              -- custom user agent
              HTTP.requestHeaders =
                (HTTP.hUserAgent, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36")
                  : HTTP.requestHeaders req'
              -- custom proxy
              , HTTP.proxy = Just $ HTTP.Proxy "218.85.21.58" 8080
            }
    }

main :: IO ()
main = do
  -- perform an HTTP request to the target page
  -- via the customized HTTP client
  manager <- Just <$> HTTP.newManager managerSettings
  htmlCode <- scrapeURLWithConfig (def {manager}) "https://scrapeme.live/shop/" $ htmls anySelector
  maybe printError printHtml htmlCode
  where
    printError = putStrLn "ERROR: Could not connect to the specified URL"
    printHtml = mapM_ putStrLn

For the code above to work, you'll also need to add the following dependencies to package.yaml:

package.yaml
dependencies:
# ...
- http-types
- data-default
- http-client
- http-client-tls 

Those two tips may be enough to bypass simple anti-bot measures. But what about more advanced and sophisticated solutions such as Cloudflare?

Check what happens when running the above script against this Cloudflare-protected page:

Example
https://www.g2.com/products/notion/reviews

The result will be the HTML of the following 403 Forbidden page:

Output
<!DOCTYPE html>
<html class="no-js" lang="en-US">
<head>
<title>Attention Required! | Cloudflare</title>
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge" />
<meta name="robots" content="noindex, nofollow" />
<!-- omitted for brevity -->

Unsurprisingly, Cloudflare detected your script as a bot. However, don't give up. What you need is a web scraping API, such as ZenRows. This next-generation tool supports User Agent and IP rotation and comes with the best anti-bot toolkit.

Let's see how to boost your Haskell scraping script with ZenRows. Sign up for free to receive your first 1,000 credits and then reach the Request Builder page:

building a scraper with zenrows

Let's use the G2.com page mentioned earlier as the destination:

  1. Paste the target URL (https://www.g2.com/products/notion/reviews) into the "URL to Scrape" input. 
  2. Enable the "JS Rendering" mode (User Agent rotation and the AI-powered anti-bot toolkit are always included by default). 
  3. Toggle the "Premium Proxy" check to get rotating IPs.
  4. Select "cURL" and then the "API" mode to get the ZenRows API URL to call in your script.

Use the generated URL in scrapeURL:

Main.hs
{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel

main :: IO ()
main = do
    -- retrieve the HTML content from the URL
    -- and print it
    htmlCode <- scrapeURL "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com%2Fproducts%2Fnotion%2Freviews&js_render=true&premium_proxy=true" $ htmls anySelector
    maybe printError printHtml htmlCode
    where
        printError = putStrLn "Could not connect to the specified URL"
        printHtml = mapM_ putStrLn

Launch the above script. This time, it'll return the HTML of the target Cloudflare-protected page as desired:

Output
<!DOCTYPE html>
<head>
  <meta charset="utf-8" />
  <link href="https://www.g2.com/assets/favicon-fdacc4208a68e8ae57a80bf869d155829f2400fa7dd128b9c9e60f07795c4915.ico" rel="shortcut icon" type="image/x-icon" />
  <title>Notion Reviews 2024: Details, Pricing, &amp; Features | G2</title>
  <!-- omitted for brevity ... -->

Nice one! That's how easy it is to use ZenRows for web scraping in Haskell.
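
As a side note, the target URL must be percent-encoded when embedded in the url query parameter, as in the snippet above. If you'd rather build the ZenRows URL programmatically than paste it from the Request Builder, a small helper like this sketch will do (zenRowsUrl is a made-up name; urlEncode comes from the http-types package added earlier):

Main.hs
import qualified Data.ByteString.Char8 as BS
import Network.HTTP.Types.URI (urlEncode)

-- assemble the ZenRows API URL for a given API key and target page
zenRowsUrl :: String -> String -> String
zenRowsUrl apiKey target =
  "https://api.zenrows.com/v1/?apikey=" ++ apiKey
    ++ "&url=" ++ BS.unpack (urlEncode True (BS.pack target))
    ++ "&js_render=true&premium_proxy=true"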

Use a Headless Browser With Haskell

Scalpel can only deal with static web pages. If your target pages use JavaScript to load or render data dynamically, you need a different solution.

You need a tool that renders pages in a controllable browser instance. One of the most popular browser automation tools is Selenium. Haskell isn't officially supported by the project, but there's a community-driven client library called webdriver.

Add it to your project's dependencies. As the library works with Data.Text values rather than plain Strings, you'll also need the text package:

package.yaml
dependencies:
# ...
- text
- webdriver

webdriver is an older library that still relies on Selenium 2. Download the Selenium Server 2 standalone .jar file and put it in your project's directory. You'll need Java to run it.

You'll also need the driver executable matching the version of the browser you want to control. Here, we're going to use Chrome. Download the right version of ChromeDriver and copy the executable to the same folder as the Selenium 2 server.

Execute the Selenium 2 server with:

Terminal
java --add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.nio=ALL-UNNAMED --add-opens java.base/sun.nio.ch=ALL-UNNAMED -jar selenium-server-standalone-2.53.1.jar

Let's target a new page to showcase Selenium's capabilities in Haskell. The Infinite Scrolling demo uses JavaScript for rendering purposes and is a perfect example of a dynamic content page. It loads new products in the browser as the user scrolls down:

infinite scrolling demo page

Use webdriver to scrape data from it in Haskell:

Main.hs
{-# LANGUAGE OverloadedStrings #-}

module Main (main) where

import qualified Data.Text as T
import Test.WebDriver

data Product = Product
  { name :: String,
    price :: String
  }
  deriving (Show)

extractProduct :: Element -> WD Product
extractProduct productElement = do
  -- select the name and price elements
  nameElement <- findElemFrom productElement $ ByCSS "h4"
  priceElement <- findElemFrom productElement $ ByCSS "h5"

  -- extract the text content of name and price elements
  name <- getText nameElement
  price <- getText priceElement

  -- create a Product object
  return $ Product (T.unpack name) (T.unpack price)

scrapeProducts :: IO [Product]
scrapeProducts = runSession chromeConfig $ do
  -- visit the target site
  openPage "https://scrapingclub.com/exercise/list_infinite_scroll/"

  -- select the product elements
  productElements <- findElems $ ByCSS ".post"
  -- iterate over the product elements and apply
  -- the scraping logic on each of them
  products <- traverse extractProduct productElements

  -- close selenium
  closeSession

  return products

-- configure the Chrome instance to control
chromeConfig :: WDConfig
chromeConfig = useBrowser chrome defaultConfig

main :: IO ()
main = do
  -- scrape the products and print them
  products <- scrapeProducts
  mapM_ print products

Run this script:

Terminal
stack run

It'll produce:

Output
Product {name = "Short Dress", price = "$24.99"}
Product {name = "Patterned Slacks", price = "$29.99"}
Product {name = "Short Chiffon Dress", price = "$49.99"}
Product {name = "Off-the-shoulder Dress", price = "$59.99"}
Product {name = "V-neck Top", price = "$24.99"}
Product {name = "Short Chiffon Dress", price = "$49.99"}
Product {name = "V-neck Top", price = "$24.99"}
Product {name = "V-neck Top", price = "$24.99"}
Product {name = "Short Lace Dress", price = "$59.99"}
Product {name = "Fitted Dress", price = "$34.99"}

Congrats! You're now a Haskell web scraping champion.

Conclusion

This guided tutorial walked you through the process of performing web scraping in Haskell. You learned the fundamentals and then explored more complex aspects and techniques.

Haskell is one of the most popular purely functional programming languages, and it boasts a large ecosystem of libraries. scalpel provides an API to perform web scraping and crawling in Haskell on static pages. For sites relying on JavaScript, you can use webdriver.

However, no matter how good your Haskell scraper is and which libraries you use, anti-scraping measures will be able to block it. Bypass them all with ZenRows, a scraping API with the most effective built-in anti-bot bypass functionality. Try ZenRows for free today!
