Have you hit a roadblock when trying to scrape data behind a login wall? This task can be challenging, especially when faced with CSRF and anti-bot protections. But we've got you covered!
This tutorial walks you through scraping data behind a login with Python, from simple authentication to handling more advanced challenges, including bypassing CSRF tokens and anti-bot measures.
For educational purposes only, we'll go through the following methods:
- Scraping sites that require a simple username and password.
- Scraping websites with CSRF token authentication for login.
- Scraping behind the login on WAF-protected websites.
Let's go!
Can You Scrape Websites That Require a Login?
Yes, it's technically possible to scrape behind a login. However, you must be mindful of the target site's scraping rules and of personal data and privacy laws, such as the General Data Protection Regulation (GDPR).
It's also essential to have some general knowledge about HTTP Request methods. If you're new to web scraping, read our beginner-friendly guide on web scraping with Python to master the fundamentals.
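As a quick refresher, a GET request retrieves a resource, while a POST request submits data, such as form credentials, to the server. Here's a minimal sketch using Python's Requests against the public httpbin.org echo service (the credentials below are placeholders):
# pip3 install requests
import requests

# GET retrieves a resource (e.g., a login page)
get_response = requests.get("https://httpbin.org/get")
print(get_response.status_code)

# POST submits form data (e.g., login credentials) to the server
post_response = requests.post(
    "https://httpbin.org/post",
    data={"email": "[email protected]", "password": "example"},
)
print(post_response.status_code)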
In the next sections, we'll explore the steps of scraping data behind site logins with Python. We'll start with forms requiring only a username and password and then consider more complex cases.
How to Scrape Sites Requiring Simple Username and Password Logins
This tutorial assumes you've set up Python3 on your machine. If you haven't, download and install the latest version from the Python download page.
We'll use Python's Requests as the HTTP client and parse HTML content with BeautifulSoup. Install both libraries using pip:
pip3 install requests beautifulsoup4
The test website for this section is the simple Login Challenge page.
Here's what the page looks like, requiring authentication before viewing product data:
Before going further, open that page with a browser such as Chrome and analyze what happens when attempting to log in.
Right-click anywhere on the page and select Inspect to open the developer console. Then, go to the Network tab.
Now, fill in the credentials and hit the login button (use the demo credentials attached to the top of the login form). In the Network tab, click All, then select the login request that appears in the requests table after a moment. Go to the Payload section. You'll see the payload data you entered earlier, including the email and password.
Create a similar payload in your Python script and post the request to the Login page to bypass the authentication wall. Once the response arrives, the program uses BeautifulSoup to parse the HTML of the page and extract its title. Here's the code to do that:
# pip3 install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
# the URL of the login page
login_url = "https://www.scrapingcourse.com/login"
# the payload with your login credentials
payload = {
    "email": "[email protected]",
    "password": "password",
}
# send the POST request to login
response = requests.post(login_url, data=payload)
# if the request succeeded, you should get a 200 status code
print(f"Status code: {response.status_code}")
# parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")
# find the page title
page_title = soup.title.string
# print the result page title
print(f"Page title: {page_title}")
See the output below with the status code and dashboard page title, indicating that you've logged in successfully:
Status code: 200
Page title: Success Page - ScrapingCourse.com
Great! You've just learned to scrape a website behind a simple login with Python. Now, let's try using a bit more complex protection.
Scraping Websites With CSRF Token Authentication for Login
Most websites implement additional security measures to stop hackers and malicious bots, which makes logging in programmatically more difficult. One of these measures is requiring a CSRF (Cross-Site Request Forgery) token during authentication.
This time, we'll use the Login with CSRF challenge page as the test website to show you how to access a CSRF-protected login.
See what the page looks like below:
Try the previous scraper with this page by pointing login_url at the CSRF challenge URL. You'll see it outputs the following error, indicating that you can't bypass the CSRF protection:
Status code: 419
Page title: Page Expired
Step 1: Inspect the Page Network Tab
You'll use your browser's Developer Tools to determine if your target website requires a CSRF or authenticity_token.
Open that page in a browser like Chrome, right-click any part of it, and click Inspect. Go to the Network tab. Enter the given credentials (provided at the top of the login form), hit the login button, and click the csrf request in the requests table.
You'll see an extra _token payload now sent with the email and password, showing that the website requires a CSRF token:
You could copy and paste this token into your payload. However, that's not recommended because the CSRF token is dynamic and changes with every session. A better approach is to grab the CSRF token dynamically while performing the request.
Step 2: Retrieve the CSRF Token Dynamically
You'll now retrieve the CSRF token from the login form's HTML. Go back to the CSRF Login Challenge page, right-click the login form, and click Inspect. You'll see a hidden _token input field in the form:
The next step is to dynamically obtain the value of the hidden _token input field and add it to the payload. Let's do that.
Step 3: Add the CSRF Token to the Payload
Here, you'll use the Requests library's Session object, which persists cookies across requests. Once you log in through a session, you don't need to log in again for subsequent requests in the same session.
Obtain the login page using the Session object:
# ...
# create a session object
session = requests.Session()
# retrieve the page with CSRF token
response = session.get(login_url)
Use BeautifulSoup to extract the CSRF token dynamically from the hidden input field you inspected previously. Add the extracted token to the payload and send a POST request using the current session:
# ...
# parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")
# extract the CSRF token from the HTML dynamically
csrf_token = soup.find("input", {"name": "_token"})["value"]
# the payload with your login credentials and the CSRF token
payload = {
    "_token": csrf_token,
    "email": "[email protected]",
    "password": "password",
}
# send the POST request to login
response = session.post(login_url, data=payload)
You can add request headers, such as a User-Agent, to make your scraper look more human. Check out our tutorial on changing the User Agent in Python Requests to learn more.
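For instance, here's a minimal sketch that sets a browser-like User-Agent on the existing session so every request in that session carries it (the User-Agent string below is only an example; set it right after creating the session, before sending any requests):
# ...
# set a browser-like User-Agent on the session (example string)
session.headers.update(
    {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/126.0.0.0 Safari/537.36"
        )
    }
)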
Step 4: Extract Product Data
Let's extract specific product data from the result product page using the current session. Before you begin, inspect the page to view its product elements. Right-click the first product and select Inspect.
You'll see that each product is in a div tag with the class name product-item:
Remember you parsed the login page HTML earlier. Re-parse the result page separately in another BeautifulSoup instance. Then, scrape the product name and price from each parent element using a for loop. Append the scraped data to an empty list and print it:
# ...
# re-parse the HTML content of the current product page using BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")
# extract the parent div
parent_element = soup.find_all(class_="product-item")
# specify an empty product_data list to collect extracted data
product_data = []
# loop through each parent div to extract specific data
for product in parent_element:
    data = {
        "Name": product.find(class_="product-name").text,
        "Price": product.find(class_="product-price").text,
    }
    # append the extracted data to the empty list
    product_data.append(data)
# print the product data
print(product_data)
Merge the snippets, and you'll get the following final code:
import requests
from bs4 import BeautifulSoup
# the URL of the login page
login_url = "https://www.scrapingcourse.com/login/csrf"
# create a session object
session = requests.Session()
# retrieve the page with the CSRF token
response = session.get(login_url)
# parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")
# extract the CSRF token from the HTML dynamically
csrf_token = soup.find("input", {"name": "_token"})["value"]
# the payload with your login credentials and the CSRF token
payload = {
    "_token": csrf_token,
    "email": "[email protected]",
    "password": "password",
}
# send the POST request to login
response = session.post(login_url, data=payload)
# re-parse the HTML content of the current product page using BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")
# extract the parent div
parent_element = soup.find_all(class_="product-item")
# specify an empty product_data list to collect extracted data
product_data = []
# loop through each parent div to extract specific data
for product in parent_element:
    data = {
        "Name": product.find(class_="product-name").text,
        "Price": product.find(class_="product-price").text,
    }
    # append the extracted data to the empty list
    product_data.append(data)
# print the product data
print(product_data)
Here's the output:
[
{'Name': 'Chaz Kangeroo Hoodie', 'Price': '$52'},
{'Name': 'Teton Pullover Hoodie', 'Price': '$70'},
#... other products omitted for brevity
{'Name': 'Grayson Crewneck Sweatshirt', 'Price': '$64'},
{'Name': 'Ajax Full-Zip Sweatshirt', 'Price': '$69'}
]
Excellent! You just scraped a CSRF-authenticated website with Python's Requests and BeautifulSoup.
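Since the session persists your authentication cookies, you can also reuse it to fetch other pages behind the same login without logging in again. Here's a minimal sketch; the dashboard URL below is hypothetical, so substitute any page that sits behind the same login:
# ...
# reuse the authenticated session for another protected page
# (hypothetical URL: replace it with a real page behind the same login)
dashboard_response = session.get("https://www.scrapingcourse.com/dashboard")
print(dashboard_response.status_code)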
Scraping Behind the Login on WAF-Protected Websites
On many websites, you'll still get an Access Denied screen or an HTTP error like 403 Forbidden despite sending the correct username, password, and CSRF token. Even using the proper request headers won't help. All these signs indicate that the site uses advanced protections, such as client-side browser verification.
Client-side verification is a security measure to block bots and scrapers from accessing websites, implemented mainly by WAFs (Web Application Firewalls) like Cloudflare, Akamai, and PerimeterX.
Let's find a solution to this problem.
Bypassing Basic WAF Protections With Selenium and Undetected ChromeDriver
The risk of being blocked is high if you use just the Requests and BeautifulSoup libraries to handle logins that require human-like interactions.
One alternative that can help mitigate this issue is a headless browser. These tools automate user interactions in standard browsers, like Chrome or Firefox, but without a GUI for a human to interact with.
Although a base Selenium implementation isn't enough for scraping WAF-protected sites, some extensions, such as Undetected ChromeDriver, are available to aid you.
Undetected ChromeDriver is a stealth ChromeDriver automation library that uses several evasion techniques to avoid detection. Pairing Selenium and Undetected ChromeDriver is a decent solution to bypass basic WAF protection on login pages.
Let's see how it works using a simple Cloudflare-protected DataCamp login page as a demo website. We assume you already have a DataCamp account. Otherwise, create one to get your credentials.
Now, install Selenium and Undetected ChromeDriver using pip:
pip3 install selenium undetected-chromedriver
Import the required libraries:
# pip3 install selenium undetected-chromedriver
import undetected_chromedriver as uc
Create an undetectable browser instance in non-headless mode and navigate to the login page:
if __name__ == "__main__":
    # instantiate Chrome options
    options = uc.ChromeOptions()
    # run in non-headless (GUI) mode
    options.headless = False
    # instantiate a Chrome browser and add the options
    driver = uc.Chrome(
        use_subprocess=False,
        options=options,
    )
    # visit the target URL
    driver.get("https://www.datacamp.com/users/sign_in")
To programmatically fill in the email and password fields, you need to get their element selectors from the login form (class name or ID). To do so, open the login page in your browser, right-click the email field, and click Inspect to open the element in the developer console.
The email address field has an ID of user_email, and the Next button has the unique class name js-account-check-email:
Keep in mind that the form works in two steps. Enter your email address into the form in your browser and press Next to reveal the password field. Then, right-click the password field and select Inspect to view its element.
So, the Password field has an ID of user_password:
Finally, inspect the Sign In button. Here's its element with the attributes:
Now, let's fill out the form with Selenium. Add the By and time modules to your imports and automate the form-filling and login process. Include a sleep timer between each operation to give the DOM time to load at each step:
# ...
from selenium.webdriver.common.by import By
import time
if __name__ == "__main__":
    # ...
    time.sleep(5)
    # fill in the email field
    username = driver.find_element(By.ID, "user_email")
    username.send_keys("<YOUR_EMAIL_ADDRESS>")
    # click next
    driver.find_element(By.CSS_SELECTOR, ".js-account-check-email").click()
    time.sleep(5)
    # fill in the password field
    password = driver.find_element(By.ID, "user_password")
    password.send_keys("<YOUR_PASSWORD>")
    # submit the login form
    driver.find_element(By.NAME, "commit").click()
    time.sleep(5)
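The fixed sleep() calls keep the example simple, but they can waste time or fail on slow connections. As an alternative, Selenium's explicit waits poll for a condition instead of pausing for a fixed period. Here's a minimal sketch that waits for the password field to appear before typing into it; it assumes the same driver, By import, and login flow as above:
# ...
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

if __name__ == "__main__":
    # ...
    # wait up to 10 seconds for the password field to appear instead of sleeping
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "user_password"))
    )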
The program logs in after clicking the submit button. Once inside the dashboard, let's extract the profile name and the registered course. Then, close the browser instance.
if __name__ == "__main__":
    # ...
    # retrieve and log the profile name and enrolled course
    my_name = driver.find_element(By.TAG_NAME, "h1")
    my_course = driver.find_element(By.CLASS_NAME, "mfe-parcel-home-hub-learn-1h09ymt")
    print("Profile Name: " + my_name.text)
    print("Course Enrolled: " + my_course.text)
    # close the browser
    driver.quit()
Let's combine all previous code snippets to see what our complete scraping script looks like:
# pip3 install selenium undetected-chromedriver
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
import time
if __name__ == "__main__":
    # instantiate Chrome options
    options = uc.ChromeOptions()
    # run in non-headless (GUI) mode
    options.headless = False
    # instantiate a Chrome browser and add the options
    driver = uc.Chrome(
        use_subprocess=False,
        options=options,
    )
    # visit the target URL
    driver.get("https://www.datacamp.com/users/sign_in")
    time.sleep(5)
    # fill in the email field
    username = driver.find_element(By.ID, "user_email")
    username.send_keys("<YOUR_EMAIL_ADDRESS>")
    # click next
    driver.find_element(By.CSS_SELECTOR, ".js-account-check-email").click()
    time.sleep(5)
    # fill in the password field
    password = driver.find_element(By.ID, "user_password")
    password.send_keys("<YOUR_PASSWORD>")
    # submit the login form
    driver.find_element(By.NAME, "commit").click()
    time.sleep(5)
    # retrieve and log the profile name and enrolled course
    my_name = driver.find_element(By.TAG_NAME, "h1")
    my_course = driver.find_element(By.CLASS_NAME, "mfe-parcel-home-hub-learn-1h09ymt")
    print("Profile Name: " + my_name.text)
    print("Course Enrolled: " + my_course.text)
    # close the browser
    driver.quit()
Depending on your profile name and registered courses, the output should look like this:
Profile Name: Hey, <PROFILE_NAME>!
Course Enrolled: Introduction to Python
2%
4 hours to go
Keep Making Progress
Great! You've successfully scraped content behind a login page guarded by basic WAF protection. But will this work for every website? Unfortunately, the answer is no.
The Undetected ChromeDriver package still leaks some bot-like attributes, such as WebDriver automation properties, and won't work against advanced protection measures. Moreover, WAF-protected sites can easily detect its headless mode.
You may rely on Undetected ChromeDriver only if the protections are basic. However, if your target uses advanced Cloudflare protection (e.g., this heavily protected Login Challenge page) or other DDoS mitigation services, this solution won't work.
Here's where ZenRows comes to the rescue. ZenRows is a web scraping API that can easily handle all sorts of anti-bot bypasses for you, including complex ones. Moreover, it works with any programming language and doesn't require any browser installation. You'll see how it works in the next section.
Bypassing Advanced Protections Using ZenRows
Scraping content behind a login on a website with advanced protection measures requires the right tool. We'll use the ZenRows API.
The goal is to bypass the Cloudflare Login Challenge page and extract specific product data from the result page.
First, we must explore our target website with DevTools. Right-click each field (Email Address, Password, and Submit button) and select Inspect in each case to expose their elements.
The Email Address, Password, and Submit fields have IDs of email, password, and submit-button, respectively:
To use ZenRows, sign up to load the Request Builder. Paste the target URL in the link box, and activate Premium Proxies and JS Rendering. Toggle on JS Instructions and input the login credentials using the form field selectors and relevant JavaScript actions.
Select Python as your programming language and choose the API connection mode. Copy and paste the generated code into your scraper file.
The generated code should look like this with the JavaScript instructions:
# pip3 install requests
import requests
url = "https://www.scrapingcourse.com/login/cf-antibot"
apikey = "<YOUR_ZENROWS_API_KEY>"
params = {
    "url": url,
    "apikey": apikey,
    "js_render": "true",
    "js_instructions": """
        [{"fill":["#email","[email protected]"]},
        {"fill":["#password","password"]},
        {"click":"#submit-button"},
        {"wait":500}]
    """,
    "premium_proxy": "true",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
Parse the response HTML with BeautifulSoup and extract target product data. Add BeautifulSoup to your imports, get the parent product elements, and extract product data from each iteratively:
# pip3 install requests beautifulsoup4
# ...
from bs4 import BeautifulSoup
# ...
# parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")
# extract the parent div
parent_element = soup.find_all(class_="product-item")
# specify an empty product_data list to collect extracted data
product_data = []
# loop through each parent div to extract specific data
for product in parent_element:
    data = {
        "Name": product.find(class_="product-name").text,
        "Price": product.find(class_="product-price").text,
    }
    # append the extracted data to the empty list
    product_data.append(data)
# print the product data
print(product_data)
Here's the final code after combining both snippets:
# pip3 install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
url = "https://www.scrapingcourse.com/login/cf-antibot"
apikey = "<YOUR_ZENROWS_API_KEY>"
params = {
    "url": url,
    "apikey": apikey,
    "js_render": "true",
    "js_instructions": """
        [{"fill":["#email","[email protected]"]},
        {"fill":["#password","password"]},
        {"click":"#submit-button"},
        {"wait":500}]
    """,
    "premium_proxy": "true",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
# parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")
# extract the parent div
parent_element = soup.find_all(class_="product-item")
# specify an empty product_data list to collect extracted data
product_data = []
# loop through each parent div to extract specific data
for product in parent_element:
    data = {
        "Name": product.find(class_="product-name").text,
        "Price": product.find(class_="product-price").text,
    }
    # append the extracted data to the empty list
    product_data.append(data)
# print the product data
print(product_data)
The above code bypasses the Cloudflare protection, logs into the website, and scrapes its product data. See the output below:
[
{'Name': 'Chaz Kangeroo Hoodie', 'Price': '$52'},
{'Name': 'Teton Pullover Hoodie', 'Price': '$70'},
#... other products omitted for brevity
{'Name': 'Grayson Crewneck Sweatshirt', 'Price': '$64'},
{'Name': 'Ajax Full-Zip Sweatshirt', 'Price': '$69'}
]
Congratulations! You just bypassed advanced Cloudflare protection, performed CSRF-based authentication, and extracted product data using ZenRows.
Conclusion
You've seen how to perform scraping authentication and extract data behind a website's login using Python's Requests. Here's a recap of what you've learned:
- Scrape data behind a simple login requiring only a username and a password.
- Extract data behind a login page protected by a CSRF token.
- Retrieve data behind a basic WAF-protected login page.
- Bypass an advanced WAF-protected login page and scrape its product data.
However, accessing and scraping data behind an anti-bot-protected login page is difficult. For an easy and scalable way to bypass any anti-bot protection with Python, ZenRows offers all the tools you need, including CAPTCHA and anti-bot auto-bypass, premium proxy rotation, headless browsing, and more.
Don't stop learning! Here are a few tips to avoid getting blocked. Also, check out our guides on web scraping with Selenium in Python and bypassing Cloudflare with Selenium to add valuable skills to your tool belt. Remember to respect ethical and legal considerations while scraping.
Frequently Asked Questions
How Do You Log Into a Website With Python?
The first step to scraping a login-protected website with Python is understanding how your target website handles login. Some older websites only require sending a username and password. However, modern ones use more advanced security measures, such as CSRF tokens, client-side validations, and WAFs. You'll need to handle each accordingly to scrape your target data.