
How to Scrape a Website that Requires a Login with Python

January 31, 2023 · 13 min read

While web scraping, you'll sometimes find data that's only available after you've signed in. In this tutorial, we'll look at the security measures such sites use and three effective methods to scrape a website that requires a login with Python.

Let's find a solution!

Can You Scrape Websites that Require a Login?

Yes, it's technically possible to do web scraping behind a login. But you have to be mindful of the target site's scraping rules, as well as laws like the GDPR that govern personal data and privacy.

To get started, it's essential to have some general knowledge about HTTP request methods. And if web scraping is new to you, we recommend reading our guide on web scraping with Python to master the fundamentals.
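If you need a refresher, here's a minimal sketch contrasting the two methods you'll use most in this tutorial. It hits httpbin.org, a public echo service, purely for illustration:

import requests

# GET retrieves a page; query parameters travel in the URL.
get_res = requests.get("https://httpbin.org/get", params={"q": "demo"})
print(get_res.status_code, get_res.json()["args"])

# POST submits data, such as a login form; form fields travel in the body.
post_res = requests.post("https://httpbin.org/post", data={"uname": "test", "pass": "test"})
print(post_res.status_code, post_res.json()["form"])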

How Do You Log into a Website with Python?

The first step to scraping a website that requires a login with Python is figuring out what login type your target domain uses. Some old websites just require sending a username and password. However, modern websites use more advanced security measures. They include:
  • Client-side validations.
  • CSRF tokens.
  • Web Application Firewalls (WAFs).

Keep reading to learn techniques to get around these strict security protections.

How Do You Scrape a Website behind a Login in Python?

We'll go step by step through scraping data behind site logins with Python. We'll start with forms requiring only a username and password and then increase the difficulty progressively.

Just be aware that the methods showcased in this tutorial are for educational purposes only.

Three, two, one… let's code!

Sites Requiring a Simple Username and Password Login

We assume that you've already set up Python 3 and pip; otherwise, check a guide on properly installing Python.

As dependencies, we'll use the Requests and BeautifulSoup libraries. Start by installing them:

pip install requests beautifulsoup4

Tip: If you have any trouble during the installation, visit this page for Requests and this one for Beautiful Soup.

Now, go to Acunetix's User Information. This is a test page made specifically for learning purposes and is protected by a simple login, so you'll be redirected to a login page.

Before going further, we'll analyze what happens when attempting a login. For that, use test as both the username and password, hit the login button and check the Network section in your browser's DevTools.

Simple login example

Submitting the form sends a POST request to the User Information page; the server responds with a cookie and serves the requested page. The screenshot below shows the headers, payload, response and cookies.

POST request response

The following web scraping script will bypass the login. It creates a similar payload and posts the request to the User Information page. Once the response arrives, the program uses Beautiful Soup to parse the response text and print the page name.

import requests
from bs4 import BeautifulSoup

URL = "http://testphp.vulnweb.com/userinfo.php"

payload = {
	"uname": "test",
	"pass": "test"
}

s = requests.session()
response = s.post(URL, data=payload)
print(response.status_code)  # If the request went OK, we usually get a 200 status.

soup = BeautifulSoup(response.content, "html.parser")
protected_content = soup.find(attrs={"id": "pageName"}).text
print(protected_content)

This is our output:

Output simple login

Great! πŸŽ‰ You just learned scraping sites behind simple logins with Python. Now, let's try with a bit more complex protections.

Scraping Websites with CSRF Token Authentication for Login

In 2023, it's not so easy to log into a website. Most websites have implemented additional security measures to stop hackers and malicious bots. One of these measures requires a CSRF (Cross-Site Request Forgery) token in the authentication process.

To find out if your target website requires a CSRF or authenticity_token, make the most of your browser's Developer Tools. It doesn't matter whether you use Safari, Chrome, Edge, Chromium or Firefox because they all have a similar set of powerful developer tools. To learn more, we suggest checking out the Chrome DevTools or Mozilla DevTools documentation.

Let's dive into scraping GitHub!

Step 1: Log into a GitHub Account

GitHub is one of the websites that use CSRF token authentication for logins. We'll scrape all the repositories in our test account for demonstration.

Open a web browser (Chrome, in our case) and navigate to GitHub's login page. Now, press the F12 key to see the DevTools window in your browser and inspect the HTML of the page to check if the login form element has an action attribute:

Git login inspect

Select the Network tab in the DevTools window, then fill in the form and click the Sign in button to submit it. This will trigger a few HTTP requests, visible in this tab.

Git login page

Let's see what happened after clicking the Sign in button by examining the POST request named session that has just been sent.

In the Headers section, you'll find the full URL where the login credentials are posted. We'll use it to send a login request in our script.

Git login request

Step 2: Set up Payload for the CSRF-protected Login Request

Now, you might be wondering how we know there's CSRF protection. The answer is in front of us:

Navigate to the Payload section of the session request. Notice that, in addition to login and password, we have payload data for the authenticity token and the timestamps. This authenticity token is the CSRF token and must be passed in the payload of the login POST request.

Git login required fields

Manually copying these fields from the Payload section for each new login request is tedious, so we'll write code to fetch them programmatically instead.

Next, look again at the HTML source of the login form. You'll see all the Payload fields are present in the form.

Git login form HTML

The following script gets the CSRF token, timestamp and timestamp_secret from the login page:

import requests
from bs4 import BeautifulSoup

login_url = "https://github.com/session"
login = "Your Git Username Here"
password = "Your Git Password Here"

with requests.session() as s:
	req = s.get(login_url).text
	html = BeautifulSoup(req, "html.parser")
	token = html.find("input", {"name": "authenticity_token"}).attrs["value"]
	time = html.find("input", {"name": "timestamp"}).attrs["value"]
	timeSecret = html.find("input", {"name": "timestamp_secret"}).attrs["value"]

We can now populate our payload dictionary for our Python login request as:

payload = { 
	"authenticity_token": token, 
	"login": login, 
	"password": password, 
	"timestamp": time, 
	"timestamp_secret": timeSecret 
}

Note: If you can't find the CSRF token in the HTML, it's probably stored in a cookie. In Chromium-based browsers like Chrome, open DevTools and go to the Application tab. Then, in the left panel, expand Cookies and select your target website's domain.

Cookies on DevTools
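As a rough sketch, assuming the site stores the token in a cookie named csrf_token (the actual cookie name varies from site to site, so check DevTools for the real one), you could read it from the session's cookie jar after the first GET request:

import requests

login_url = "https://example.com/login"  # hypothetical login page

with requests.session() as s:
	s.get(login_url)  # the server sets its cookies on this first request
	token = s.cookies.get("csrf_token")  # hypothetical cookie name
	print(token)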

There you have it!

Step 3: Set Headers

It's possible to access websites that require a login by simply sending a POST request with the payload. However, relying on this method alone for sites with advanced security measures is naive, since they're usually smart enough to identify non-human behavior. You may need to take steps to make the scraper look more like a human and less like a bot.

The most basic and realistic way to do this is by adding real browser headers to our requests. Copy the headers from the Headers tab of your browser request and add them to the Python login request, as in the sketch below. You might need to learn more about header settings for Requests.
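For instance, here's a sketch that reuses the session s and the payload from the previous steps and attaches a few typical browser headers (replace the values with the ones copied from your own browser):

headers = {
	"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
	"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
	"Accept-Language": "en-US,en;q=0.9",
}

# Pass the headers along with the login POST request.
res = s.post(login_url, data=payload, headers=headers)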

Alternatively, you can use a web scraping API like ZenRows to get around a great number of annoying anti-bot systems for you.

Step 4: The Login in Action

Luckily, GitHub doesn't require extra headers, so we're ready to send our login request from Python:

res = s.post(login_url, data=payload) 
print(res.url)

If the login was successful, the output would be https://github.com/, or https://github.com/session if not.
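As a small sanity check, you might compare the final URL against those two outcomes (a minimal sketch):

if res.url == "https://github.com/":
	print("Login successful")
else:
	print("Login failed, still on: " + res.url)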

πŸ‘ Amazing, we just nailed a CSRF-protected login bypass! Let's now scrape the data in the protected git repositories.

Step 5: Scrape Protected GitHub Repositories

Recall that our earlier code started with the with requests.session() as s: statement, which creates a request session. Once you log in with one request in a session, you don't need to log in again for subsequent requests in the same session.

It's time to get to the repositories. Send a GET request, then parse the response using BeautifulSoup.

repos_url = "https://github.com/" + login + "/?tab=repositories" 
r = s.get(repos_url) 
soup = BeautifulSoup(r.content, "html.parser")

We'll extract the username and a list of repositories.

First, for the username, navigate to the repositories page in your browser, then right-click on the username and select Inspect Element. The username is contained in a span element with the CSS classes p-nickname vcard-username d-block, inside the <h1> tag.

Git username source

Second, for repositories, right-click on any repository name and select Inspect Element. The DevTools window will show the following:

Repositories HTML source

The repository names are inside hyperlinks within <h3> tags with the class wb-break-all. OK, we now know enough about the target elements, so let's extract them:

usernameDiv = soup.find("span", class_="p-nickname vcard-username d-block") 
print("Username: " + usernameDiv.getText()) 
repos = soup.find_all("h3",class_="wb-break-all") 
for r in repos: 
	repoName = r.find("a").getText() 
	print("Repository Name: " + repoName)

Since the target web page can contain multiple repositories, the script uses the find_all() method to get them all. The loop then iterates over each <h3> tag and prints the text of the enclosed <a> tag.
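As an aside, the same extraction can be written with a single CSS selector through select(), which some find more readable. This sketch is equivalent to the loop above:

# Every <a> inside an <h3 class="wb-break-all"> holds a repository name.
for link in soup.select("h3.wb-break-all a"):
	print("Repository Name: " + link.get_text(strip=True))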

Here's what the complete code looks like:

import requests 
from bs4 import BeautifulSoup 
 
login = "Your Username Here" 
password = "Your Password Here" 
login_url = "https://github.com/session" 
repos_url = "https://github.com/" + login + "/?tab=repositories" 
 
with requests.session() as s: 
	req = s.get(login_url).text 
	html = BeautifulSoup(req, "html.parser")
	token = html.find("input", {"name": "authenticity_token"}).attrs["value"] 
	time = html.find("input", {"name": "timestamp"}).attrs["value"] 
	timeSecret = html.find("input", {"name": "timestamp_secret"}).attrs["value"] 
 
	payload = { 
		"authenticity_token": token, 
		"login": login, 
		"password": password, 
		"timestamp": time, 
		"timestamp_secret": timeSecret 
	} 
	res = s.post(login_url, data=payload)
 
	r = s.get(repos_url) 
	soup = BeautifulSoup(r.content, "html.parser")
	usernameDiv = soup.find("span", class_="p-nickname vcard-username d-block") 
	print("Username: " + usernameDiv.getText()) 
 
	repos = soup.find_all("h3", class_="wb-break-all") 
	for r in repos: 
		repoName = r.find("a").getText() 
		print("Repository Name: " + repoName)

And the output:

Repositories output

πŸ‘ Excellent! We just scraped a CSRF-authenticated website.

Scraping behind the Login on WAF-protected Websites

On many websites, you'll still hit an Access Denied screen or receive an HTTP error like 403 even after sending the correct username, password and CSRF token. Not even the proper request headers will help. This indicates that the website uses advanced protections, like client-side browser verification.
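A quick way to spot this situation in a Requests-based script is to check the status code and response body after the login attempt. The snippet below is only a sketch that reuses the session, URL, payload and headers from the earlier examples; the exact block-page text varies from one WAF to another:

res = s.post(login_url, data=payload, headers=headers)

# Typical symptoms of a WAF block: a 403/503 status or a challenge page in the body.
if res.status_code in (403, 503) or "Access Denied" in res.text:
	print("Blocked by a WAF or anti-bot challenge")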

Client-side verification is a security measure to block bots and scrapers from accessing websites, mostly implemented by WAFs (Web Application Firewalls), like Cloudflare, Akamai and PerimeterX.

Let's see how to find a solution.

Basic WAF Protections with Selenium

The risk of being blocked is too high if you rely on the Requests and BeautifulSoup libraries alone for logins that require human-like interaction. The alternative? Headless browsers. They're the standard browsers you know, like Chrome or Firefox, but without a GUI for a human user to interact with. The beauty of them is that they can be controlled programmatically.

Browser automation tools such as Selenium work pretty decently for bypassing a WAF's basic login protections. Moreover, they let you log in to websites that use two-step verification (you type an email, and then a password field appears) in their login process, like Twitter.

Selenium provides a set of tools to create a headless browser instance and control it with code. Although base Selenium isn't enough for scraping WAF-protected sites, some extended libraries are available to help. undetected-chromedriver is a ChromeDriver automation library that uses several evasion techniques to avoid detection. We'll use it in this tutorial.

Our target site for this case is DataCamp, an e-learning website for data analytics enthusiasts, which has a two-step login. We'll do this:
  1. Create an account on DataCamp and enroll in a Python course to scrape our data next.
  2. Log in to DataCamp using undetected-chromedriver.
  3. Navigate and scrape https://app.datacamp.com/learn.
  4. Extract profile name and enrolled courses from the parsed HTML.

Let's begin by installing and importing the required modules and libraries.

pip install selenium undetected-chromedriver

import undetected_chromedriver as uc 
import time 
from selenium.webdriver.common.by import By

Now, create an undetectable headless browser instance using the uc object and move to the login page.

chromeOptions = uc.ChromeOptions() 
chromeOptions.headless = True 
driver = uc.Chrome(use_subprocess=True, options=chromeOptions) 
driver.get("https://www.datacamp.com/users/sign_in")

To enter the email and password fields programmatically, you need to get the id of the input fields from the login form. For that, open the login page in your browser and right-click the email field to inspect the element. This will open the corresponding HTML code in the DevTools window.

The following screenshot shows the HTML source for the email field, the first one we need:

Datacamp

As the login follows a 2-step process, we initially have only the Email address field on the form with id="user_email". Let's programmatically fill it and click the Next button.

uname = driver.find_element(By.ID, "user_email") 
uname.send_keys("Your Email Here") 
driver.find_element(By.CSS_SELECTOR, ".js-account-check-email").click() 
time.sleep(10)

Note that the 10-second sleep is added to give the JavaScript time to dynamically load the Password field.
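If you'd rather not rely on a fixed delay, an explicit wait is a more robust alternative. Here's a sketch using Selenium's WebDriverWait to pause until the password input shows up:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the password field to appear instead of sleeping blindly.
WebDriverWait(driver, 10).until(
	EC.presence_of_element_located((By.ID, "user_password"))
)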

The following code enters the password and clicks the submit button to request a login:

passwordF = driver.find_element(By.ID, "user_password") 
passwordF.send_keys("Your Password Here") 
driver.find_element(By.NAME, "commit").click()

Congratulations! You are logged in. πŸ˜ƒ

Once your headless instance logs in successfully, you can move to any web page available in your dashboard. Since we want to scrape the profile name and registered course from the dashboard page, we'll find them where the following screenshot indicates:

Datacamp learn page

The code below will retrieve and parse the target URL to display the profile name and registered course.

driver.get("https://app.datacamp.com/learn") 
myName = driver.find_element(By.CLASS_NAME, "mfe-app-learn-hub-15alavv") 
myCourse = driver.find_element(By.CLASS_NAME, "mfe-app-learn-hub-1f1m67o") 
 
print("Profile Name: " + myName.get_attribute("innerHTML")) 
print("Course Enrolled: " + myCourse.get_attribute("innerHTML")) 
driver.close()

Let's combine all previous code blocks to see what the complete scraping script looks like.

import undetected_chromedriver as uc 
import time 
from selenium.webdriver.common.by import By 

username = "Your Username Here" 
password = "Your Password Here" 
chromeOptions = uc.ChromeOptions() 
chromeOptions.headless = True 
driver = uc.Chrome(use_subprocess=True, options=chromeOptions) 
driver.get("https://www.datacamp.com/users/sign_in") 
uname = driver.find_element(By.ID, "user_email") 
uname.send_keys(username) 
driver.find_element(By.CSS_SELECTOR, ".js-account-check-email").click() 
 
time.sleep(5) 
passwordF = driver.find_element(By.ID, "user_password") 
passwordF.send_keys(password) 
driver.find_element(By.NAME, "commit").click() 
time.sleep(2) 
driver.get("https://app.datacamp.com/learn") 
time.sleep(2) 
myName = driver.find_element(By.CLASS_NAME, "mfe-app-learn-hub-15alavv") 
myCourse = driver.find_element(By.CLASS_NAME, "mfe-app-learn-hub-1f1m67o") 
 
print("Profile Name: " + myName.get_attribute("innerHTML")) 
print("Course Enrolled: " + myCourse.get_attribute("innerHTML")) 
driver.close()

We recommend changing the headless option to False to see what's going on behind the scenes. Depending on your profile name and registered courses, the output should look like this:

Output two step login

Great! We just scraped content behind a WAF-protected login. But will the same work for every website? πŸ€” Unfortunately, not.

Currently, the undetected-chromedriver package only supports Chromium browsers with version 109 or greater. Moreover, WAF-protected sites can easily detect its headless mode.

To scrape a website that requires a login with Python, undetected-chromedriver may be enough if the protections are basic. But let's assume the site uses advanced Cloudflare protection (e.g., G2) or other DDoS mitigation services. In this case, the solution we've seen may not be reliable.

ZenRows comes to the rescue. It's a web scraping API that can easily handle all sorts of anti-bot bypasses for us, including complex ones. And it doesn't require you to have any web browser installed because it's an API.

Advanced Protections Using ZenRows

Scraping content behind a login on a website with higher protection measures requires the right tool. We'll use ZenRows API for that purpose.

Our mission is to get past G2.com's two-step login page and then extract the welcome message from the homepage once we're logged in.

But before starting with the code, let's first explore our target with DevTools. The following table lists the necessary information regarding the HTML elements that we'll interact with throughout the script. Please keep it in mind for the upcoming steps.

| Element/Purpose | Element Type | Attribute | Value |
|---|---|---|---|
| G2 login (step 1): Email input | <input type="email"> | Class | input-group-field |
| G2 login (step 1): Next button to proceed to the next login step | <button> | Class | js-button-submit |
| G2 login (step 2): Password field | <input type="password"> | Id | password_input |
| G2 login (step 2): Login form submit button | <input type="submit"> | CSS Selector | input[value='Sign In'] |
| Welcome message at Homepage | <div> | Class | l4 color-white my-1 |

With ZenRows, you don't need to install any particular browser drivers (as you would with Selenium). You also don't need to worry about advanced Cloudflare protection, having your identity revealed or other DDoS mitigation services. Additionally, the API scales for you, freeing you from infrastructure concerns.

Just sign up for free to get to the Request Builder and fill in the details as per the screenshot below.

ZenRows 2 step scraping
Let's walk through the request creation step by step:
  1. Set the initial target (i.e., the G2 login page in our case).
  2. Choose Plain HTML. We'll parse it further using BeautifulSoup later in the code. If you prefer, you can use CSS Selectors to scrape only specific elements from the target.
  3. Setting Premium Proxies helps you scrape region-specific data and keeps your identity hidden.
  4. Setting JavaScript Rendering is mandatory for running the JavaScript instructions we'll add in steps 6 and 7.
  5. Selecting Antibot helps you bypass advanced WAF security measures.
  6. Checking JavaScript Instructions lets you add an encoded string of JavaScript instructions to run on the target, giving you control similar to a headless browser.
  7. A text box will appear when you check the JavaScript Instructions checkbox. You can write any number of JS instructions; in our case, we used the following:
[ 
	{"wait": 2000}, 
	{"evaluate": "document.querySelector('.input-group-field').value = 'Your Business Email Here';"}, 
	{"wait": 1000}, 
	{"click": ".js-button-submit"}, 
	{"wait": 2000}, 
	{"evaluate": "document.querySelector('#password_input').value = 'Your Password Here';"}, 
	{"wait": 1000}, 
	{"click": "input[value='Sign In']"}, 
	{"wait": 6000} 
]
Note: Update the code above by adding your own login credentials.
  8. Choose Python.
  9. Select SDK and copy the whole code. Remember to install the ZenRows SDK package using pip install zenrows.

Now, you can paste this code into your Python project and execute it. We've copied the SDK code and modified it to make it more portable and easier to understand.

# pip install zenrows 
from zenrows import ZenRowsClient 
import urllib 
import json 
 
client = ZenRowsClient("Your ZenRows API Key Here") 
url = "https://www.g2.com/login?form=signup#state.email.showform" 
 
js_instructions = [ 
	{"wait": 2000}, 
	{"evaluate": "document.querySelector('.input-group-field').value = 'Your G2 Login Email Here';"}, 
	{"wait": 1000}, 
	{"click": ".js-button-submit"}, 
	{"wait": 2000}, 
	{"evaluate": "document.querySelector('#password_input').value = 'Your G2 Password Here';"}, 
	{"wait": 1000}, 
	{"click": "input[value='Sign In']"}, 
	{"wait": 6000} 
] 
 
params = { 
	"js_render":"true", 
	"antibot":"true", 
	"js_instructions":urllib.parse.quote(json.dumps(js_instructions)), 
	"premium_proxy":"true" 
} 
 
response = client.get(url, params=params) 
 
print(response.text)

That code snippet fetches and prints the plain HTML of the G2 homepage after logging in. Now, we'll use BeautifulSoup to parse the HTML further and extract the data we want.

from bs4 import BeautifulSoup 
soup = BeautifulSoup(response.text, "html.parser") 
welcome = soup.find("div", attrs={"class": "l4 color-white my-1"}) 
print(welcome.text)

And we succeeded! πŸ₯³

Output ZenRows 2 step login

Here's the full code:

# pip install zenrows 
from zenrows import ZenRowsClient 
from bs4 import BeautifulSoup 
import urllib 
import json 
 
client = ZenRowsClient("Your ZenRows API Key Here") 
url = "https://www.g2.com/login?form=signup#state.email.showform" 
 
js_instructions = [ 
	{"wait": 2000}, 
	{"evaluate": "document.querySelector('.input-group-field').value = 'Your G2 Login Email Here';"}, 
	{"wait": 1000}, 
	{"click": ".js-button-submit"}, 
	{"wait": 2000}, 
	{"evaluate": "document.querySelector('#password_input').value = 'Your G2 Password Here';"}, 
	{"wait": 1000}, 
	{"click": "input[value='Sign In']"}, 
	{"wait": 6000} 
] 
 
params = { 
	"js_render":"true", 
	"antibot":"true", 
	"js_instructions":urllib.parse.quote(json.dumps(js_instructions)), 
	"premium_proxy":"true" 
} 
 
response = client.get(url, params=params) 
 
soup = BeautifulSoup(response.text, "html.parser") 
welcome = soup.find("div", attrs={"class": "l4 color-white my-1"}) 
print(welcome.text)

Conclusion

What works to scrape a website that requires a login with Python? As we've seen, inspecting the HTML with BeautifulSoup and handling cookies with the Requests library can get you far. However, for modern websites with robust anti-bot solutions, you need undetectable headless browsers. The problem with them is scalability, cost and performance limitations. Moreover, they can still get blocked by websites with advanced WAFs in place.

If you're looking for an easy and scalable solution to scrape a website with Python, ZenRows offers an API-based service that works best as we just saw.

Here are a few tips you should keep in mind to avoid being blocked. You might also be interested in our guides on web scraping with Selenium in Python and how to bypass Cloudflare with Selenium.
