How to Scrape a Website that Requires a Login with Python

January 31, 2023 · 13 min read

While web scraping, you might find some data available only after you've signed in. In this tutorial, we'll cover the security measures these sites use and three effective methods to scrape a website that requires a login with Python.

Let's find a solution!

Can You Scrape Websites that Require a Login?

Yes, it's technically possible to scrape behind a login. But you must be mindful of the target site's scraping rules and of laws like the GDPR that govern personal data and privacy.

To get started, it's essential to have some general knowledge about HTTP Request Methods. And if web scraping is new for you, read our beginner-friendly guide on web scraping with Python to master the fundamentals.
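If HTTP methods are new to you, the sketch below shows the two we'll rely on throughout this tutorial: a GET to fetch a page's HTML and a POST to submit form data. It uses the test site from the next section.

Example
# pip install requests
import requests

# GET: retrieve a page's HTML
page = requests.get("http://testphp.vulnweb.com/")
print(page.status_code)  # 200 means the request succeeded

# POST: submit form data (e.g., login credentials) in the request body
form_data = {"uname": "test", "pass": "test"}
result = requests.post("http://testphp.vulnweb.com/userinfo.php", data=form_data)
print(result.status_code)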

How Do You Log into a Website with Python?

The first step to scraping a login-protected website with Python is figuring out your target domain's login type. Some old websites just require sending a username and password. However, modern ones use more advanced security measures. These include:

  • Client-side validations
  • CSRF tokens
  • Web Application Firewalls (WAFs)

Keep reading to learn the techniques to get around these strict security protections.

How Do You Scrape a Website behind a Login in Python?

Time to explore each step of scraping data behind site logins with Python. We'll start with forms requiring only a username and password and then increase the difficulty progressively.

Remember that the methods showcased in this tutorial are for educational purposes only.

Three, two, one... let's code!

Sites Requiring a Simple Username and Password Login

We assume you've already set up Python 3 and Pip; otherwise, you should check a guide on properly installing Python.

As dependencies, we'll use the Requests and Beautiful Soup libraries. Start by installing them:

Terminal
pip install requests beautifulsoup4

Tip: If you need any help during the installation, visit this page for Requests and this one for Beautiful Soup.

Now, go to Acunetix's User Information page. It's a test page made explicitly for learning purposes and protected by a simple login, so visiting it redirects you to the login form.

Before going further, we'll analyze what happens when attempting a login. For that, use test as both the username and password, hit the login button, and check the Network tab in your browser's DevTools.

[Image: scraping-simple-login-websites-python]

Submitting the form generates a POST request to the User Information page, with the server responding with a cookie and then serving the requested page. The screenshot below shows the headers, payload, response, and cookies.

[Image: scrape-login-websites-python]

The following scraping script will bypass the auth wall. It creates a similar payload and posts the request to the User Information page. Once the response arrives, the program uses Beautiful Soup to parse the response text and print the page name.

program.py
from bs4 import BeautifulSoup
import requests

URL = "http://testphp.vulnweb.com/userinfo.php"

payload = {
	"uname": "test",
	"pass": "test"
}

s = requests.session()
response = s.post(URL, data=payload)
print(response.status_code)  # a 200 status means the request went through

soup = BeautifulSoup(response.content, "html.parser")
protected_content = soup.find(attrs={"id": "pageName"}).text
print(protected_content)

This is our output:

[Image: scrape-login-websites-test]

Great! 🎉 You just learned how to scrape sites behind simple logins with Python. Now, let's move on to slightly more complex protections.

Scraping Websites with CSRF Token Authentication for Login

It's not that easy to log into a website these days. Most have implemented additional security measures to stop hackers and malicious bots. One of these measures requires a CSRF (Cross-Site Request Forgery) token in the authentication process.

To find out if your target website requires CSRF or an authenticity_token, make the most of your browser's Developer Tools. It doesn't matter whether you use Safari, Chrome, Edge, Chromium, or Firefox since all provide a similar set of powerful tools for developers. To learn more, we suggest checking out the Chrome DevTools or Mozilla DevTools documentation.

Let's dive into scraping GitHub!

Step #1: Log into a GitHub Account

GitHub is one of the websites that use CSRF token authentication for logins. We'll scrape all the repositories in our test account for demonstration.

Open a web browser (we use Chrome) and navigate to GitHub's login page. Now, press the F12 key to see the DevTools window in your browser and inspect the HTML to check if the login form element has an action attribute:

[Image: scrape-git-tutorial]

Select the Network tab, then fill in the login form and click the Sign in button to submit it. This'll trigger a few HTTP requests, visible in this tab.

[Image: scraping-git-login-python]

Let's look at what we've got after clicking on the Sign in button. To do so, explore the POST request named session that has just been sent.

In the Headers section, you'll find the full URL where the credentials are posted. We'll use it to send a login request in our script.

[Image: scraping-git-login-tutorial]

Step #2: Set up Payload for the CSRF-protected Login Request

Now, you might be wondering how we know there's CSRF protection. The answer is right in front of you:

Navigate to the Payload section of the session request. Notice that, in addition to login and password, we have payload data for the authentication token and the timestamps. This auth token is the CSRF token and must be passed in the payload of the login POST request.

[Image: scraping-login-website-tutorial]

Manually copying these fields from the Payload section for each new login request will be tedious. Instead, we'll write code to get that programmatically.

Let's go back to the HTML source of the login form. You'll see all the Payload fields are present in the form.

[Image: scraping-login-website-python-tutorial]

The following script gets the CSRF token, timestamp, and timestamp_secret from the login page:

program.py
import requests
from bs4 import BeautifulSoup

login_url = "https://github.com/session"
login = "Your Git Username Here"
password = "Your Git Password Here"

with requests.session() as s:
	req = s.get(login_url).text
	html = BeautifulSoup(req, "html.parser")
	token = html.find("input", {"name": "authenticity_token"}).attrs["value"]
	time = html.find("input", {"name": "timestamp"}).attrs["value"]
	timeSecret = html.find("input", {"name": "timestamp_secret"}).attrs["value"]

We can now populate the payload dictionary for our Python login request as:

program.py
payload = { 
	"authenticity_token": token, 
	"login": login, 
	"password": password, 
	"timestamp": time, 
	"timestamp_secret": timeSecret 
}

Note: If you can't find the CSRF token in the HTML, it's probably stored in a cookie. In Chromium-based browsers, go to the Application tab in the DevTools. Then, in the left panel, expand Cookies and select the domain of your target website.

[Image: scraping-login-website-cookies]
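
As a quick sketch of that fallback, you can read the token straight from the requests session that fetched the login page. The cookie name below is a placeholder; use the one you actually see in DevTools.

Example
# Sketch: pull the CSRF token from the session cookies if it isn't in the HTML.
# "csrf_token" is a hypothetical name; replace it with the one from DevTools.
cookie_token = s.cookies.get("csrf_token")
if cookie_token:
	payload["authenticity_token"] = cookie_token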

There you have it!

Step #3: Set Headers

It's possible to access auth-wall websites by sending a POST request with the payload. However, using this method alone won't be enough to scrape sites with advanced security measures since they're usually smart enough to identify non-human behavior. Thus, implementing measures to make the scraper appear more human-like is necessary.

The most realistic way to do this is by adding actual browser headers to our requests. Copy the ones from the Headers tab of your browser request and add them to the Python login request. Try this guide if you need to learn more about header settings for requests.
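
As a hedged sketch, this is how you'd attach such headers to the requests session from the previous step. The values below are only illustrative; copy the real ones from your own DevTools request.

Example
# Illustrative browser-like headers; replace the values with the ones copied
# from the Headers tab of your own DevTools request.
headers = {
	"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
	"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
	"Accept-Language": "en-US,en;q=0.9",
	"Referer": "https://github.com/login",
}
s.headers.update(headers)  # every request in this session now carries them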

Alternatively, you can use a web scraping API like ZenRows to get around those annoying anti-bot systems for you.

Step #4: The Login in Action

Luckily, GitHub doesn't require extra headers here, so we're ready to send our login request through Python:

program.py
res = s.post(login_url, data=payload) 
print(res.url)

If the login is successful, the output will be https://github.com/. Otherwise, we'll get https://github.com/session.
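
If you'd rather have the script fail fast instead of silently scraping the login page, a small check along these lines (a sketch, not part of the original flow) does the trick:

Example
# Stop early if the final URL shows the login didn't go through
if res.url.rstrip("/").endswith("/session"):
	raise SystemExit("Login failed: check the credentials and the CSRF payload")
print("Logged in, landed on:", res.url)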

👏 Amazing, we just nailed a CSRF-protected login bypass! Let's now scrape the data in the protected GitHub repositories.

Step #5: Scrape Protected GitHub Repositories

Recall that our earlier code started with the with requests.session() as s: statement, which creates a requests session. Once you log in through a request, you don't need to log in again for subsequent requests in the same session.

It's time to get to the repositories. Send a GET request to the repositories page, then parse the response using Beautiful Soup.

program.py
repos_url = "https://github.com/" + login + "/?tab=repositories" 
r = s.get(repos_url) 
soup = BeautifulSoup(r.content, "html.parser")

We'll extract the username and a list of repositories.

For the former, navigate to the repositories page in your browser, then right-click on the username and select Inspect Element. The information is contained in a span element with the CSS class p-nickname vcard-username d-block, inside the <h1> tag.

[Image: scraping-git-login-python-tutorial]

For the latter, right-click on any repository name and select Inspect Element. The DevTools window will display the following:

[Image: scraping-login-tutorial]

The repository names are inside hyperlinks in <h3> tags with the class wb-break-all. OK, we know enough about the target elements now, so let's extract them:

program.py
usernameDiv = soup.find("span", class_="p-nickname vcard-username d-block")
print("Username: " + usernameDiv.getText())

repos = soup.find_all("h3", class_="wb-break-all")
for r in repos:
	repoName = r.find("a").getText()
	print("Repository Name: " + repoName)

Since the target page can contain multiple repositories, the script uses the find_all() method to grab them all. The loop then iterates through each <h3> tag and prints the text of the enclosed <a> tag.

Here's what the complete code looks like:

program.py
import requests
from bs4 import BeautifulSoup

login = "Your Username Here"
password = "Your Password Here"
login_url = "https://github.com/session"
repos_url = "https://github.com/" + login + "/?tab=repositories"

with requests.session() as s:
	req = s.get(login_url).text
	html = BeautifulSoup(req, "html.parser")
	token = html.find("input", {"name": "authenticity_token"}).attrs["value"]
	time = html.find("input", {"name": "timestamp"}).attrs["value"]
	timeSecret = html.find("input", {"name": "timestamp_secret"}).attrs["value"]

	payload = {
		"authenticity_token": token,
		"login": login,
		"password": password,
		"timestamp": time,
		"timestamp_secret": timeSecret
	}
	res = s.post(login_url, data=payload)

	r = s.get(repos_url)
	soup = BeautifulSoup(r.content, "html.parser")
	usernameDiv = soup.find("span", class_="p-nickname vcard-username d-block")
	print("Username: " + usernameDiv.getText())

	repos = soup.find_all("h3", class_="wb-break-all")
	for r in repos:
		repoName = r.find("a").getText()
		print("Repository Name: " + repoName)

And the output:

[Image: scraping-login-websites-repositories]

👏 Excellent! We just scraped a CSRF-authenticated website.

Scraping behind the Login on WAF-protected Websites

On many websites, you'll still hit an Access Denied screen or receive an HTTP error like 403 Forbidden even after sending the correct username, password, and CSRF token. Not even using the proper request headers will work. This indicates that the site uses advanced protections, like client-side browser verification.
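
You can spot this situation programmatically. Here's a minimal sketch, assuming a response object from a requests-based login attempt like the earlier ones:

Example
# Rough heuristic for a WAF block: a 403/429 status or an "Access Denied" page.
# `response` is assumed to come from a requests-based login attempt.
blocked = response.status_code in (403, 429) or "Access Denied" in response.text
if blocked:
	print("Likely blocked by a WAF; plain requests won't be enough here")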

Client-side verification is a security measure to block bots and scrapers from accessing websites, mostly implemented by WAFs (Web Application Firewalls), like Cloudflare, Akamai, and PerimeterX.

Let's see how to find a solution.

Basic WAF Protections with Selenium

The risk of being blocked is too high if you use just the Requests and Beautiful Soup libraries to handle logins that require human-like interaction. The alternative? Headless browsers. They're the standard browsers you know, like Chrome or Firefox, but they don't have any GUI for a human user to interact with. The beauty of them is that they can be controlled programmatically.

Browser automation tools such as Selenium work pretty decently to bypass WAFs' basic login protections. Moreover, they enable you to log in to websites that use two-step verification (you type an email, and then a password field appears), like Twitter.

Selenium has a set of tools that help you create a headless browser instance and control it with code. Although the base Selenium implementation isn't enough to scrape WAF-protected sites, extended libraries are available to help. undetected-chromedriver is a ChromeDriver automation library that uses several evasion techniques to avoid detection. Let's see how it works.

Assume we want to scrape DataCamp, an e-learning website for data analytics enthusiasts that has a two-step login. We'll have to do the following:

  1. Create an account on DataCamp and enroll in a Python course to scrape our data next.
  2. Log in to DataCamp using undetected-chromedriver.
  3. Navigate and scrape https://app.datacamp.com/learn.
  4. Extract profile name and enrolled courses from the parsed HTML.

Let's begin by installing and importing the required modules and libraries.

Terminal
pip install selenium undetected-chromedriver
program.py
import undetected_chromedriver as uc 
import time 
from selenium.webdriver.common.by import By

Now, create an undetectable headless browser instance with uc.Chrome() and navigate to the login page.

program.py
chromeOptions = uc.ChromeOptions() 
chromeOptions.headless = True 
driver = uc.Chrome(use_subprocess=True, options=chromeOptions) 

driver.get("https://www.datacamp.com/users/sign_in")
time.sleep(10) # To let the login page load.

To enter the email and password fields programmatically, you need to get the id of the input fields from the login form. For that, open the login page in your browser and right-click the email field to inspect the element. This'll open the corresponding HTML code in the DevTools window.

The following screenshot shows the HTML source for the email field. That's the first one we need:

[Image: scrape-datacamp-login]

As the login follows a two-step process, we initially have only the Email address field on the form with id="user_email". Let's fill it in programmatically and click the Next button.

program.py
uname = driver.find_element(By.ID, "user_email") 
uname.send_keys("Your_Email_Here") 
driver.find_element(By.CSS_SELECTOR, ".js-account-check-email").click() 
time.sleep(5)

Note that the 5-second sleep is added to let the JavaScript dynamically load the Password field. The following code enters the password and clicks the submit button to request a login:

program.py
passwordF = driver.find_element(By.ID, "user_password") 
passwordF.send_keys("Your_Password_Here") 
driver.find_element(By.NAME, "commit").click()
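
As an alternative to fixed sleeps, Selenium's explicit waits poll until an element appears. Here's a sketch that waits for the password field instead of sleeping for five seconds:

Example
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 15 seconds for the dynamically loaded password field
passwordF = WebDriverWait(driver, 15).until(
	EC.presence_of_element_located((By.ID, "user_password"))
)
passwordF.send_keys("Your_Password_Here")
driver.find_element(By.NAME, "commit").click()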

Congratulations! You are logged in. 😃

Once your headless instance logs in successfully, you can move to any page available. Since we want to scrape the profile name and registered course from the dashboard, we'll find those here:

[Image: scrape-datacamp-tutorial]

The code below will retrieve and parse the target URL to display the profile name and registered course.

program.py
driver.get("https://app.datacamp.com/learn") 
myName = driver.find_element(By.CLASS_NAME, "mfe-app-learn-hub-15alavv") 
myCourse = driver.find_element(By.CLASS_NAME, "mfe-app-learn-hub-1f1m67o") 
 
print("Profile Name: " + myName.get_attribute("innerHTML")) 
print("Course Enrolled: " + myCourse.get_attribute("innerHTML")) 
driver.close()

Let's combine all previous code blocks to see what our complete scraping script looks like.

program.py
import undetected_chromedriver as uc 
import time 
from selenium.webdriver.common.by import By

chromeOptions = uc.ChromeOptions() 
chromeOptions.headless = True 
driver = uc.Chrome(use_subprocess=True, options=chromeOptions) 

driver.get("https://www.datacamp.com/users/sign_in")
time.sleep(10)

uname = driver.find_element(By.ID, "user_email") 
uname.send_keys("Your_Email_Here") 
driver.find_element(By.CSS_SELECTOR, ".js-account-check-email").click() 
time.sleep(5)

passwordF = driver.find_element(By.ID, "user_password") 
passwordF.send_keys("Your_Password_Here") 
driver.find_element(By.NAME, "commit").click()
time.sleep(5)

driver.get("https://app.datacamp.com/learn") 
time.sleep(5)

myName = driver.find_element(By.CLASS_NAME, "mfe-app-learn-hub-15alavv") 
myCourse = driver.find_element(By.CLASS_NAME, "mfe-app-learn-hub-1f1m67o") 
 
print("Profile Name: " + myName.get_attribute("innerHTML")) 
print("Course Enrolled: " + myCourse.get_attribute("innerHTML")) 
driver.close()

We recommend changing the headless option to False to see what's going on behind the scenes. Depending on your profile name and registered courses, the output should look like this:

[Image: scrape-login-protected-websites]

Great! We successfully scraped content behind a WAF-protected login. But will this work for every website? 🤔 Unfortunately not.

Currently, the undetected-chromedriver package only supports Chromium browsers with version 109 or above. Moreover, WAF-protected sites can easily detect its headless mode.

So, if you want to scrape a login-protected website with Python, you may rely on undetected-chromedriver only if the protections are basic. However, assume your target uses advanced Cloudflare protection (e.g., G2) or other DDoS mitigation services. In that case, this solution most probably won't work.

ZenRows comes to the rescue. It's a web scraping API that can easily handle all sorts of anti-bot bypasses for you, including complex ones. Moreover, since it's an API, it doesn't require any browser installation.

Advanced Protections Using ZenRows

Scraping content behind a login on a website with advanced protection measures requires the right tool. We'll use the ZenRows API.

Our mission is to get through G2.com's two-step login and extract the homepage welcome message once we're in.

But before getting our hands dirty with code, we must first explore our target with DevTools. The list below contains the necessary information about the HTML elements we'll interact with throughout the script. Keep those in mind for the upcoming steps.

  • G2 login (step 1): email input (class: input-group-field)
  • G2 login (step 1): Next button to proceed to the second login step (class: js-button-submit)
  • G2 login (step 2): password field (id: password_input)
  • G2 login (step 2): login form submit button (CSS selector: input[value='Sign In'])
  • Homepage welcome message (class: l4 color-white my-1)

As mentioned, with ZenRows, you don't need to install any particular browser drivers (as opposed to Selenium). Moreover, you don't need to worry about advanced Cloudflare protection, exposing your identity, or other DDoS mitigation services. Additionally, being a scalable API, it frees you from infrastructure scalability issues.

Just sign up for free to get to the Request Builder and fill in the details shown below.

[Image: ZenRows Request Builder Page]

Let's go through each step of the request creation:

  1. Set the initial target (the G2 login page in our case).
  2. Choose Plain HTML. We'll parse it further using Beautiful Soup later in the code. If you prefer, you can use CSS Selectors to scrape only specific elements from the target.
  3. Enabling Premium Proxies helps you scrape region-specific data and masks your identity.
  4. Enabling JavaScript Rendering is mandatory for running the JavaScript instructions in step #6.
  5. Selecting Antibot helps you bypass advanced WAF security measures.
  6. Checking JS Instructions lets you add an encoded string of JavaScript instructions to run on the target. In turn, this allows control similar to a headless browser.
  7. A text box appears when you tick the instructions checkbox. You can write any number of instructions; we put in the following:
program.py
[ 
	{"wait": 2000}, 
	{"evaluate": "document.querySelector('.input-group-field').value = 'Your_G2_Login_Email';"}, 
	{"wait": 1000}, 
	{"click": ".js-button-submit"}, 
	{"wait": 2000}, 
	{"evaluate": "document.querySelector('#password_input').value = 'Your_G2_Login_Password';"}, 
	{"wait": 1000}, 
	{"click": "input[value='Sign In']"}, 
	{"wait": 6000} 
] 

Note: Update the code above by adding your own login credentials.

  8. Choose Python.
  9. Select API and copy the whole code. Remember to install the Requests package using pip install requests.

Paste this script into your Python project and execute it. We've copied the API code and modified it to make it more portable and easier to understand.

Example
# pip install requests
import requests
import urllib.parse
import json 

url = 'https://www.g2.com/login?form=signup#state.email.showform'
apikey = '<YOUR_ZENROWS_API_KEY>'

js_instructions = [ 
	{"wait": 2000}, 
	{"evaluate": "document.querySelector('.input-group-field').value = 'Your_G2_Login_Email';"}, 
	{"wait": 1000}, 
	{"click": ".js-button-submit"}, 
	{"wait": 2000}, 
	{"evaluate": "document.querySelector('#password_input').value = 'Your_G2_Login_Password';"}, 
	{"wait": 1000}, 
	{"click": "input[value='Sign In']"}, 
	{"wait": 6000} 
] 

params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',
    'antibot': 'true',
    'js_instructions': urllib.parse.quote(json.dumps(js_instructions)),
    'premium_proxy': 'true',
}
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)

That snippet fetches and prints the plain HTML of the G2 homepage after logging in. Now, we'll use Beautiful Soup to further parse the HTML and extract the data we want.

program.py
from bs4 import BeautifulSoup 
soup = BeautifulSoup(response.text, "html.parser") 
welcome = soup.find("div", attrs={"class": "l4 color-white my-1"})
print(welcome.text)

It's a success! 🥳

[Image: web-scraping-login-zenrows]

Here's the complete code:

program.py
# pip install requests
import requests
from bs4 import BeautifulSoup 
import urllib.parse
import json 

url = 'https://www.g2.com/login?form=signup#state.email.showform'
apikey = '<YOUR_ZENROWS_API_KEY>'

js_instructions = [ 
	{"wait": 2000}, 
	{"evaluate": "document.querySelector('.input-group-field').value = 'Your_G2_Login_Email';"}, 
	{"wait": 1000}, 
	{"click": ".js-button-submit"}, 
	{"wait": 2000}, 
	{"evaluate": "document.querySelector('#password_input').value = 'Your_G2_Login_Password';"}, 
	{"wait": 1000}, 
	{"click": "input[value='Sign In']"}, 
	{"wait": 6000} 
] 

params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',
    'antibot': 'true',
    'js_instructions': urllib.parse.quote(json.dumps(js_instructions)),
    'premium_proxy': 'true',
}
response = requests.get('https://api.zenrows.com/v1/', params=params)

soup = BeautifulSoup(response.text, "html.parser") 
welcome = soup.find("div", attrs={"class": "l4 color-white my-1"})
print(welcome.text)

Conclusion

How do you successfully scrape a website that requires a login with Python?

Inspecting the HTML with Beautiful Soup and handling the cookies with Requests can get you there. However, modern websites rely on robust anti-bot solutions, meaning you need undetectable headless browsers. The problem with those is scalability, cost, and performance limitations. Moreover, they can still get blocked by websites with advanced WAFs in place.

If you're looking for an easy and scalable solution to scrape a website with Python, ZenRows offers an API-based service that works best, as seen above.

Don't stop learning! Here are a few tips you should have in mind to avoid being blocked. Also, check out our guides on web scraping with Selenium in Python and bypassing Cloudflare with Selenium to add some valuable skills to your tool belt.
