How to Scrape a Website that Requires a Login with Python

January 31, 2023 · 13 min read

While web scraping, you might find some data available only after you've signed in. In this tutorial, we'll cover the security measures these sites use and three effective methods to scrape a website that requires a login with Python.

Let's find a solution!

Can You Scrape Websites that Require a Login?

Yes, it's technically possible to scrape behind a login. But you must be mindful of the target site's scraping rules and of laws like the GDPR that govern personal data and privacy.

To get started, it's essential to have some general knowledge about HTTP Request Methods. And if web scraping is new for you, read our beginner-friendly guide on web scraping with Python to master the fundamentals.
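If HTTP methods are new to you, the sketch below shows the two we'll rely on throughout this tutorial: a GET to fetch a page's HTML and a POST to submit form data. It uses the test site from the next section.

Example
# pip install requests
import requests

# GET: retrieve a page's HTML
page = requests.get("http://testphp.vulnweb.com/")
print(page.status_code)  # 200 means the request succeeded

# POST: submit form data (e.g., login credentials) in the request body
form_data = {"uname": "test", "pass": "test"}
result = requests.post("http://testphp.vulnweb.com/userinfo.php", data=form_data)
print(result.status_code)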

How Do You Log into a Website with Python?

The first step to scraping a login-protected website with Python is figuring out your target domain's login type. Some old websites just require sending a username and password. However, modern ones use more advanced security measures. These include:

  • Client-side validations
  • CSRF tokens
  • Web Application Firewalls (WAFs)

Keep reading to learn the techniques to get around these strict security protections.

How Do You Scrape a Website behind a Login in Python?

Time to explore each step of scraping data behind site logins with Python. We'll start with forms requiring only a username and password and then increase the difficulty progressively.

Remember that the methods showcased in this tutorial are for educational purposes only.

Three, two, one... let's code!

Sites Requiring a Simple Username and Password Login

We assume you've already set up Python 3 and Pip; otherwise, you should check a guide on properly installing Python.

As dependencies, we'll use the Requests and Beautiful Soup libraries. Start by installing them:

Terminal
pip install requests beautifulsoup4

Tip: If you need any help during the installation, visit this page for Requests and this one for Beautiful Soup.

Now, go to Acunetix's User Information page. It's a test page made explicitly for learning purposes and protected by a simple login, so visiting it redirects you to the login form.

Before going further, we'll analyze what happens when attempting a login. For that, use test as both the username and password, hit the login button, and check the Network tab in your browser's DevTools.

[Image: scraping-simple-login-websites-python]

Submitting the form generates a POST request to the User Information page, with the server responding with a cookie and then serving the requested page. The screenshot below shows the headers, payload, response, and cookies.

[Image: scrape-login-websites-python]

The following scraping script will bypass the auth wall. It creates a similar payload and posts the request to the User Information page. Once the response arrives, the program uses Beautiful Soup to parse the response text and print the page name.

program.py
from bs4 import BeautifulSoup
import requests

URL = "http://testphp.vulnweb.com/userinfo.php"

payload = {
	"uname": "test",
	"pass": "test"
}

s = requests.session()
response = s.post(URL, data=payload)
print(response.status_code)  # a 200 status means the request went through

soup = BeautifulSoup(response.content, "html.parser")
protected_content = soup.find(attrs={"id": "pageName"}).text
print(protected_content)

This is our output:

[Image: scrape-login-websites-test]

Great! 🎉 You just learned how to scrape sites behind simple logins with Python. Now, let's move on to slightly more complex protections.

Scraping Websites with CSRF Token Authentication for Login

It's not that easy to log into a website these days. Most have implemented additional security measures to stop hackers and malicious bots. One of these measures requires a CSRF (Cross-Site Request Forgery) token in the authentication process.

To find out if your target website requires CSRF or an authenticity_token, make the most of your browser's Developer Tools. It doesn't matter whether you use Safari, Chrome, Edge, Chromium, or Firefox since all provide a similar set of powerful tools for developers. To learn more, we suggest checking out the Chrome DevTools or Mozilla DevTools documentation.

Let's dive into scraping GitHub!

Step #1: Log into a GitHub Account

GitHub is one of the websites that use CSRF token authentication for logins. We'll scrape all the repositories in our test account for demonstration.

Open a web browser (we use Chrome) and navigate to GitHub's login page. Now, press the F12 key to see the DevTools window in your browser and inspect the HTML to check if the login form element has an action attribute:

[Image: scrape-git-tutorial]

Select the Network tab, then fill in the login form and click the Sign in button to submit it. This'll trigger a few HTTP requests, visible in this tab.

[Image: scraping-git-login-python]

Let's look at what we've got after clicking on the Sign in button. To do so, explore the POST request named session that has just been sent.

In the Headers section, you'll find the full URL where the credentials are posted. We'll use it to send a login request in our script.

[Image: scraping-git-login-tutorial]

Step #2: Set up Payload for the CSRF-protected Login Request

Now, you might be wondering how we know there's CSRF protection. The answer is right in front of you:

Navigate to the Payload section of the session request. Notice that, in addition to login and password, we have payload data for the authentication token and the timestamps. This auth token is the CSRF token and must be passed in the payload of the login POST request.

[Image: scraping-login-website-tutorial]

Manually copying these fields from the Payload section for each new login request will be tedious. Instead, we'll write code to get that programmatically.

Let's go back to the HTML source of the login form. You'll see all the Payload fields are present in the form.

[Image: scraping-login-website-python-tutorial]

The following script gets the CSRF token, timestamp, and timestamp_secret from the login page:

program.py
import requests
from bs4 import BeautifulSoup

login_url = "https://github.com/session"
login = "Your Git Username Here"
password = "Your Git Password Here"

with requests.session() as s:
	req = s.get(login_url).text
	html = BeautifulSoup(req, "html.parser")
	token = html.find("input", {"name": "authenticity_token"}).attrs["value"]
	time = html.find("input", {"name": "timestamp"}).attrs["value"]
	timeSecret = html.find("input", {"name": "timestamp_secret"}).attrs["value"]

We can now populate the payload dictionary for our Python login request as:

program.py
payload = { 
	"authenticity_token": token, 
	"login": login, 
	"password": password, 
	"timestamp": time, 
	"timestamp_secret": timeSecret 
}

Note: If you can't find the CSRF token in the HTML, it's probably stored in a cookie. In Chromium-based browsers, go to the Application tab in the DevTools. Then, in the left panel, expand Cookies and select the domain of your target website.

[Image: scraping-login-website-cookies]
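
As a quick sketch of that fallback, you can read the token straight from the requests session that fetched the login page. The cookie name below is a placeholder; use the one you actually see in DevTools.

Example
# Sketch: pull the CSRF token from the session cookies if it isn't in the HTML.
# "csrf_token" is a hypothetical name; replace it with the one from DevTools.
cookie_token = s.cookies.get("csrf_token")
if cookie_token:
	payload["authenticity_token"] = cookie_token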

There you have it!

Step #3: Set Headers

It's possible to access auth-wall websites by sending a POST request with the payload. However, using this method alone won't be enough to scrape sites with advanced security measures since they're usually smart enough to identify non-human behavior. Thus, implementing measures to make the scraper appear more human-like is necessary.

The most realistic way to do this is by adding actual browser headers to our requests. Copy the ones from the Headers tab of your browser request and add them to the Python login request. Try this guide if you need to learn more about header settings for requests.
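
As a hedged sketch, this is how you'd attach such headers to the requests session from the previous step. The values below are only illustrative; copy the real ones from your own DevTools request.

Example
# Illustrative browser-like headers; replace the values with the ones copied
# from the Headers tab of your own DevTools request.
headers = {
	"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
	"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
	"Accept-Language": "en-US,en;q=0.9",
	"Referer": "https://github.com/login",
}
s.headers.update(headers)  # every request in this session now carries them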

Alternatively, you can use a web scraping API like ZenRows to get around those annoying anti-bot systems for you.

Step #4: The Login in Action

Luckily, GitHub doesn't require extra headers here, so we're ready to send our login request through Python:

program.py
res = s.post(login_url, data=payload) 
print(res.url)

If the login is successful, the output will be https://github.com/. Otherwise, we'll get https://github.com/session.
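
If you'd rather have the script fail fast instead of silently scraping the login page, a small check along these lines (a sketch, not part of the original flow) does the trick:

Example
# Stop early if the final URL shows the login didn't go through
if res.url.rstrip("/").endswith("/session"):
	raise SystemExit("Login failed: check the credentials and the CSRF payload")
print("Logged in, landed on:", res.url)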

👏 Amazing, we just nailed a CSRF-protected login bypass! Let's now scrape the data in the protected GitHub repositories.

Step #5: Scrape Protected GitHub Repositories

Recall that our earlier code started with the with requests.session() as s: statement, which creates a requests session. Once you log in through a request, you don't need to log in again for subsequent requests in the same session.

It's time to get to the repositories. Send a GET request to the repositories page, then parse the response using Beautiful Soup.

program.py
repos_url = "https://github.com/" + login + "/?tab=repositories" 
r = s.get(repos_url) 
soup = BeautifulSoup(r.content, "html.parser")

We'll extract the username and a list of repositories.

For the former, navigate to the repositories page in your browser, then right-click on the username and select Inspect Element. The information is contained in a span element with the CSS class p-nickname vcard-username d-block, inside the <h1> tag.

[Image: scraping-git-login-python-tutorial]

For the latter, right-click on any repository name and select Inspect Element. The DevTools window will display the following:

[Image: scraping-login-tutorial]

The repository names are inside hyperlinks in <h3> tags with the class wb-break-all. OK, we know enough about the target elements now, so let's extract them:

program.py
usernameDiv = soup.find("span", class_="p-nickname vcard-username d-block")
print("Username: " + usernameDiv.getText())

repos = soup.find_all("h3", class_="wb-break-all")
for r in repos:
	repoName = r.find("a").getText()
	print("Repository Name: " + repoName)

Since the target page can contain multiple repositories, the script uses the find_all() method to grab them all. The loop then iterates through each <h3> tag and prints the text of the enclosed <a> tag.

Here's what the complete code looks like:

program.py
import requests
from bs4 import BeautifulSoup

login = "Your Username Here"
password = "Your Password Here"
login_url = "https://github.com/session"
repos_url = "https://github.com/" + login + "/?tab=repositories"

with requests.session() as s:
	req = s.get(login_url).text
	html = BeautifulSoup(req, "html.parser")
	token = html.find("input", {"name": "authenticity_token"}).attrs["value"]
	time = html.find("input", {"name": "timestamp"}).attrs["value"]
	timeSecret = html.find("input", {"name": "timestamp_secret"}).attrs["value"]

	payload = {
		"authenticity_token": token,
		"login": login,
		"password": password,
		"timestamp": time,
		"timestamp_secret": timeSecret
	}
	res = s.post(login_url, data=payload)

	r = s.get(repos_url)
	soup = BeautifulSoup(r.content, "html.parser")
	usernameDiv = soup.find("span", class_="p-nickname vcard-username d-block")
	print("Username: " + usernameDiv.getText())

	repos = soup.find_all("h3", class_="wb-break-all")
	for r in repos:
		repoName = r.find("a").getText()
		print("Repository Name: " + repoName)

And the output:

[Image: scraping-login-websites-repositories]

👏 Excellent! We just scraped a CSRF-authenticated website.

Scraping behind the Login on WAF-protected Websites

On many websites, you'll still hit an Access Denied screen or receive an HTTP error like 403 Forbidden even after sending the correct username, password, and CSRF token. Not even using the proper request headers will work. This indicates that the site uses advanced protections, like client-side browser verification.
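
You can spot this situation programmatically. Here's a minimal sketch, assuming a response object from a requests-based login attempt like the earlier ones:

Example
# Rough heuristic for a WAF block: a 403/429 status or an "Access Denied" page.
# `response` is assumed to come from a requests-based login attempt.
blocked = response.status_code in (403, 429) or "Access Denied" in response.text
if blocked:
	print("Likely blocked by a WAF; plain requests won't be enough here")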

Client-side verification is a security measure to block bots and scrapers from accessing websites, mostly implemented by WAFs (Web Application Firewalls), like Cloudflare, Akamai, and PerimeterX.

Let's see how to find a solution.

Basic WAF Protections with Selenium

The risk of being blocked is too high if you use just the Requests and Beautiful Soup libraries to handle logins that require human-like interaction. The alternative? Headless browsers. They're the standard browsers you know, like Chrome or Firefox, but they don't have any GUI for a human user to interact with. The beauty of them is that they can be controlled programmatically.

Browser automation tools such as Selenium work pretty decently to bypass WAFs' basic login protections. Moreover, they enable you to log in to websites that use two-step verification (you type an email, and then a password field appears), like Twitter.

Selenium has a set of tools that help you create a headless browser instance and control it with code. Although the base Selenium implementation isn't enough to scrape WAF-protected sites, extended libraries are available to help. undetected-chromedriver is a ChromeDriver automation library that uses several evasion techniques to avoid detection. Let's see how it works.

Assume we want to scrape DataCamp, an e-learning website for data analytics enthusiasts that has a two-step login. We'll have to do the following:

  1. Create an account on DataCamp and enroll in a Python course to scrape our data next.
  2. Log in to DataCamp using undetected-chromedriver.
  3. Navigate and scrape https://app.datacamp.com/learn.
  4. Extract profile name and enrolled courses from the parsed HTML.

Let's begin by installing and importing the required modules and libraries.

Terminal
pip install selenium undetected-chromedriver
program.py
import undetected_chromedriver as uc 
import time 
from selenium.webdriver.common.by import By

Now, create an undetectable headless browser instance with uc.Chrome() and navigate to the login page.

program.py
chromeOptions = uc.ChromeOptions() 
chromeOptions.headless = True 
driver = uc.Chrome(use_subprocess=True, options=chromeOptions) 

driver.get("https://www.datacamp.com/users/sign_in")
time.sleep(10) # To let the login page load.

To enter the email and password fields programmatically, you need to get the id of the input fields from the login form. For that, open the login page in your browser and right-click the email field to inspect the element. This'll open the corresponding HTML code in the DevTools window.

The following screenshot shows the HTML source for the email field. That's the first one we need:

[Image: scrape-datacamp-login]

As the login follows a two-step process, we initially have only the Email address field on the form with id="user_email". Let's fill it in programmatically and click the Next button.

program.py
uname = driver.find_element(By.ID, "user_email") 
uname.send_keys("Your_Email_Here") 
driver.find_element(By.CSS_SELECTOR, ".js-account-check-email").click() 
time.sleep(5)

Note that the 5-second sleep is added to let the JavaScript dynamically load the Password field. The following code enters the password and clicks the submit button to request a login:

program.py
passwordF = driver.find_element(By.ID, "user_password") 
passwordF.send_keys("Your_Password_Here") 
driver.find_element(By.NAME, "commit").click()
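
As an alternative to fixed sleeps, Selenium's explicit waits poll until an element appears. Here's a sketch that waits for the password field instead of sleeping for five seconds:

Example
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 15 seconds for the dynamically loaded password field
passwordF = WebDriverWait(driver, 15).until(
	EC.presence_of_element_located((By.ID, "user_password"))
)
passwordF.send_keys("Your_Password_Here")
driver.find_element(By.NAME, "commit").click()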

Congratulations! You are logged in. 😃

Once your headless instance logs in successfully, you can move to any page available. Since we want to scrape the profile name and registered course from the dashboard, we'll find those here:

[Image: scrape-datacamp-tutorial]

The code below will retrieve and parse the target URL to display the profile name and registered course.

program.py
driver.get("https://app.datacamp.com/learn") 
myName = driver.find_element(By.CLASS_NAME, "mfe-app-learn-hub-15alavv") 
myCourse = driver.find_element(By.CLASS_NAME, "mfe-app-learn-hub-1f1m67o") 
 
print("Profile Name: " + myName.get_attribute("innerHTML")) 
print("Course Enrolled: " + myCourse.get_attribute("innerHTML")) 
driver.close()

Let's combine all previous code blocks to see what our complete scraping script looks like.

program.py
import undetected_chromedriver as uc 
import time 
from selenium.webdriver.common.by import By

chromeOptions = uc.ChromeOptions() 
chromeOptions.headless = True 
driver = uc.Chrome(use_subprocess=True, options=chromeOptions) 

driver.get("https://www.datacamp.com/users/sign_in")
time.sleep(10)

uname = driver.find_element(By.ID, "user_email") 
uname.send_keys("Your_Email_Here") 
driver.find_element(By.CSS_SELECTOR, ".js-account-check-email").click() 
time.sleep(5)

passwordF = driver.find_element(By.ID, "user_password") 
passwordF.send_keys("Your_Password_Here") 
driver.find_element(By.NAME, "commit").click()
time.sleep(5)

driver.get("https://app.datacamp.com/learn") 
time.sleep(5)

myName = driver.find_element(By.CLASS_NAME, "mfe-app-learn-hub-15alavv") 
myCourse = driver.find_element(By.CLASS_NAME, "mfe-app-learn-hub-1f1m67o") 
 
print("Profile Name: " + myName.get_attribute("innerHTML")) 
print("Course Enrolled: " + myCourse.get_attribute("innerHTML")) 
driver.close()

We recommend changing the headless option to False to see what's going on behind the scenes. Depending on your profile name and registered courses, the output should look like this:

[Image: scrape-login-protected-websites]

Great! We successfully scraped content behind a WAF-protected login. But will this work for every website? 🤔 Unfortunately not.

Currently, the undetected-chromedriver package only supports Chromium browsers with version 109 or above. Moreover, WAF-protected sites can easily detect its headless mode.

So, if you want to scrape a login-protected website with Python, you may rely on undetected-chromedriver only if the protections are basic. However, assume your target uses advanced Cloudflare protection (e.g., G2) or other DDoS mitigation services. In that case, this solution most probably won't work.

ZenRows comes to the rescue. It's a web scraping API that can easily handle all sorts of anti-bot bypasses for you, including complex ones. Moreover, since it's an API, it doesn't require any browser installation.

Advanced Protections Using ZenRows

Scraping content behind a login on a website with advanced protection measures requires the right tool. We'll use the ZenRows API.

Our mission is to get through G2.com's two-step login and extract the homepage welcome message once we're in.

But before getting our hands dirty with code, we must first explore our target with DevTools. The list below contains the necessary information about the HTML elements we'll interact with throughout the script. Keep those in mind for the upcoming steps.

  • G2 login (step 1): email input (class: input-group-field)
  • G2 login (step 1): Next button to proceed to the second login step (class: js-button-submit)
  • G2 login (step 2): password field (id: password_input)
  • G2 login (step 2): login form submit button (CSS selector: input[value='Sign In'])
  • Homepage welcome message (class: l4 color-white my-1)

As mentioned, with ZenRows, you don't need to install any particular browser drivers (as opposed to Selenium). Moreover, you don't need to worry about advanced Cloudflare protection, exposing your identity, or other DDoS mitigation services. Additionally, being a scalable API, it frees you from infrastructure scalability issues.

Just sign up for free to get to the Request Builder and fill in the details shown below.

[Image: ZenRows Request Builder Page]

Let's go through each step of the request creation:

  1. Set the initial target (the G2 login page in our case).
  2. Choose Plain HTML. We'll parse it further using Beautiful Soup later in the code. If you prefer, you can use CSS Selectors to scrape only specific elements from the target.
  3. Enabling Premium Proxies helps you scrape region-specific data and masks your identity.
  4. Enabling JavaScript Rendering is mandatory for running the JavaScript instructions in step #6.
  5. Selecting Antibot helps you bypass advanced WAF security measures.
  6. Checking JS Instructions lets you add an encoded string of JavaScript instructions to run on the target. In turn, this allows control similar to a headless browser.
  7. A text box appears when you tick the instructions checkbox. You can write any number of instructions; we put in the following:
program.py
[ 
	{"wait": 2000}, 
	{"evaluate": "document.querySelector('.input-group-field').value = 'Your_G2_Login_Email';"}, 
	{"wait": 1000}, 
	{"click": ".js-button-submit"}, 
	{"wait": 2000}, 
	{"evaluate": "document.querySelector('#password_input').value = 'Your_G2_Login_Password';"}, 
	{"wait": 1000}, 
	{"click": "input[value='Sign In']"}, 
	{"wait": 6000} 
] 

Note: Update the code above by adding your own login credentials.

  8. Choose Python.
  9. Select API and copy the whole code. Remember to install the Requests package using pip install requests.

Paste this script into your Python project and execute it. We've copied the API code and modified it to make it more portable and easier to understand.

Example
# pip install requests
import requests
import urllib.parse
import json 

url = 'https://www.g2.com/login?form=signup#state.email.showform'
apikey = '<YOUR_ZENROWS_API_KEY>'

js_instructions = [ 
	{"wait": 2000}, 
	{"evaluate": "document.querySelector('.input-group-field').value = 'Your_G2_Login_Email';"}, 
	{"wait": 1000}, 
	{"click": ".js-button-submit"}, 
	{"wait": 2000}, 
	{"evaluate": "document.querySelector('#password_input').value = 'Your_G2_Login_Password';"}, 
	{"wait": 1000}, 
	{"click": "input[value='Sign In']"}, 
	{"wait": 6000} 
] 

params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',
    'antibot': 'true',
    'js_instructions': urllib.parse.quote(json.dumps(js_instructions)),
    'premium_proxy': 'true',
}
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)

That snippet fetches and prints the plain HTML of the G2 homepage after logging in. Now, we'll use Beautiful Soup to further parse the HTML and extract the data we want.

program.py
from bs4 import BeautifulSoup 
soup = BeautifulSoup(response.text, "html.parser") 
welcome = soup.find("div", attrs={"class": "l4 color-white my-1"})
print(welcome.text)

It's a success! 🥳

[Image: web-scraping-login-zenrows]

Here's the complete code:

program.py
# pip install requests
import requests
from bs4 import BeautifulSoup 
import urllib.parse
import json 

url = 'https://www.g2.com/login?form=signup#state.email.showform'
apikey = '<YOUR_ZENROWS_API_KEY>'

js_instructions = [ 
	{"wait": 2000}, 
	{"evaluate": "document.querySelector('.input-group-field').value = 'Your_G2_Login_Email';"}, 
	{"wait": 1000}, 
	{"click": ".js-button-submit"}, 
	{"wait": 2000}, 
	{"evaluate": "document.querySelector('#password_input').value = 'Your_G2_Login_Password';"}, 
	{"wait": 1000}, 
	{"click": "input[value='Sign In']"}, 
	{"wait": 6000} 
] 

params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',
    'antibot': 'true',
    'js_instructions': urllib.parse.quote(json.dumps(js_instructions)),
    'premium_proxy': 'true',
}
response = requests.get('https://api.zenrows.com/v1/', params=params)

soup = BeautifulSoup(response.text, "html.parser") 
welcome = soup.find("div", attrs={"class": "l4 color-white my-1"})
print(welcome.text)

Conclusion

How do you successfully scrape a website that requires a login with Python?

Inspecting the HTML with Beautiful Soup and handling the cookies with Requests can get you there. However, modern websites rely on robust anti-bot solutions, meaning you need undetectable headless browsers. The problem with those is scalability, cost, and performance limitations. Moreover, they can still get blocked by websites with advanced WAFs in place.

If you're looking for an easy and scalable solution to scrape a website with Python, ZenRows offers an API-based service that works best, as seen above.

Don't stop learning! Here are a few tips you should have in mind to avoid being blocked. Also, check out our guides on web scraping with Selenium in Python and bypassing Cloudflare with Selenium to add some valuable skills to your tool belt.
