How to Scrape a Website that Requires a Login with Python
While web scraping, you'll sometimes find data that's only available after you've signed in. In this tutorial, we'll look at the security measures such sites use and three effective methods to scrape a website that requires a login with Python.
Let's find a solution!
Can You Scrape Websites that Require a Login?
Yes, it's technically possible to do web scraping behind a login. But you have to be mindful of the target site's scraping rules, as well as laws like the GDPR, to stay compliant with personal data and privacy requirements.
To get started, it's essential to have some general knowledge about HTTP Request Methods. And if web scraping is new for you, we recommend reading our guide on web scraping with Python to master the fundamentals.
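For instance, here's a minimal refresher on the two methods we'll rely on, GET and POST, using httpbin.org (a public request-testing service) as a stand-in target:
import requests

# GET retrieves a page; POST submits data, which is how login forms work
r_get = requests.get("https://httpbin.org/get")
r_post = requests.post("https://httpbin.org/post", data={"key": "value"})
print(r_get.status_code, r_post.status_code)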
How Do You Log into a Website with Python?
Logging in with Python means replicating the requests your browser sends when you submit a login form. Depending on the site, you may have to deal with one or more of these security measures:
- Client-side validations.
- CSRF tokens.
- Web Application Firewalls (WAFs).
Keep reading to learn techniques to get around these strict security protections.
How Do You Scrape a Website behind a Login in Python?
We'll walk through, step by step, how to scrape data behind site logins with Python. We'll start with forms requiring only a username and password and then increase the difficulty progressively.
Just be aware that the methods showcased in this tutorial are for educational purposes only.
Three, two, one… let's code!
Sites Requiring a Simple Username and Password Login
We assume that you've already set up Python 3 and Pip, otherwise you should check a guide on properly installing Python.
As dependencies, we'll use the Requests and BeautifulSoup libraries. Start by installing them:
pip install requests beautifulsoup4
Tip: If you have any trouble during the installation, visit this page for Requests and this one for Beautiful Soup.
Now, go to Acunetix's User Information. This is a test page made specifically for learning purposes and is protected by a simple login, so you'll be redirected to a login page.
Before going further, we'll analyze what happens when attempting a login. For that, use test as both the username and password, hit the login button, and check the Network section of your browser's DevTools.

Submitting the form generates a POST request to the User Information page, with the server responding with a cookie and returning the requested content. The screenshot below shows the headers, payload, response and cookies.

The following web scraping script will bypass the login. It creates a similar payload and posts the request to the User Information page. Once the response arrives, the program uses Beautiful Soup to parse the response text and print the page name.
import requests
from bs4 import BeautifulSoup

URL = "http://testphp.vulnweb.com/userinfo.php"

payload = {
    "uname": "test",
    "pass": "test"
}

# Use a session so the login cookie is reused in later requests
s = requests.Session()
response = s.post(URL, data=payload)
print(response.status_code)  # If the request went OK, we usually get a 200 status

# Parse the protected page and print its name
soup = BeautifulSoup(response.content, "html.parser")
protected_content = soup.find(attrs={"id": "pageName"}).text
print(protected_content)
This is our output:

Great! You've just learned how to scrape sites behind simple logins with Python. Now, let's try a bit more complex protection.
Scraping Websites with CSRF Token Authentication for Login
In 2023, it's not so easy to log into a website. Most websites have implemented additional security measures to stop hackers and malicious bots. One of these measures requires a CSRF (Cross-Site Request Forgery) token in the authentication process.
To find out if your target website requires a CSRF or authenticity_token, make the most of your browser's Developer Tools. It doesn't matter whether you use Safari, Chrome, Edge, Chromium or Firefox because they all have a similar set of powerful tools for developers. To learn more, we suggest checking out the Chrome DevTools or Mozilla DevTools documentation.
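You can also check for the token programmatically. Here's a minimal sketch, assuming the target is GitHub's login page (which, as we'll see below, names its hidden field authenticity_token):
import requests
from bs4 import BeautifulSoup

# Fetch the login page and look for a hidden CSRF-style input
html = BeautifulSoup(requests.get("https://github.com/login").text, "html.parser")
token_input = html.find("input", {"name": "authenticity_token"})
print("CSRF token found" if token_input else "No authenticity_token input found")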
Let's dive into scraping GitHub!
Step 1: Log into a GitHub Account
GitHub is one of the websites that use CSRF token authentication for logins. We'll scrape all the repositories in our test account for demonstration.
Open a web browser (Chrome, in our case) and navigate to GitHub's login page. Now, press the F12 key to open the DevTools window and inspect the HTML of the page to check whether the login form element has an action attribute:

Select the Network tab in the DevTools window, then fill in the form and click the Sign in button to submit it. This will perform a few HTTP requests, visible in this tab.

Let's look at what we've got after clicking the Sign in button by examining the POST request named session that has just been sent.
In the Headers section, you'll find the full URL the login credentials are posted to. We'll use it to send a login request in our script.

Step 2: Set up Payload for the CSRF-protected Login Request
Now, you might be wondering how we know there's CSRF protection. The answer is in front of us:
Navigate to the Payload section of the session request. Notice that, in addition to login and password, we have payload data for the authenticity token and the timestamps. This authenticity token is the CSRF token and must be passed as part of the payload along with the login POST request.

Manually copying these fields from the Payload section for each new login request is tedious, so we'll write code to get them programmatically.
Next, look again at the HTML source of the login form. You'll see all the payload fields are present in the form.

The following script gets the CSRF token, timestamp and timestamp_secret from the login page:
import requests
from bs4 import BeautifulSoup

login_url = "https://github.com/session"
login = "Your Git Username Here"
password = "Your Git Password Here"

with requests.Session() as s:
    # Fetch the login page and pull the hidden form fields out of the HTML
    req = s.get(login_url).text
    html = BeautifulSoup(req, "html.parser")
    token = html.find("input", {"name": "authenticity_token"}).attrs["value"]
    time = html.find("input", {"name": "timestamp"}).attrs["value"]
    timeSecret = html.find("input", {"name": "timestamp_secret"}).attrs["value"]
We can now populate the payload dictionary for our Python login request:
payload = {
    "authenticity_token": token,
    "login": login,
    "password": password,
    "timestamp": time,
    "timestamp_secret": timeSecret
}
Note: If you can't find the CSRF token in the HTML, it's probably stored in a cookie. In Chromium-based browsers like Chrome, open the Application tab in DevTools. Then, in the left panel, look for Cookies and select the domain of your target website.

There you have it!
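If the token does come from a cookie, a minimal sketch for reading it with Requests could look like this (the cookie name csrf_token and the URL are placeholders; check DevTools for the real ones):
import requests

with requests.Session() as s:
    # Visiting the login page makes the server set its cookies on the session
    s.get("https://example.com/login")
    token = s.cookies.get("csrf_token")  # returns None if the cookie isn't there
    print(token)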
Step 3: Set Headers
It's possible to access websites that require a login by simply sending a POST request with the payload. However, using this method alone to scrape sites with advanced security measures is naive since they're usually smart enough to identify non-human behavior. Thus, implementing measures to make the scraper appear more human might be necessary.
The most basic and realistic way to do this is by adding real browser headers to our requests. Copy the headers from the Headers tab of your browser request and add them to the Python login request. You might need to learn more about header settings for requests.
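For illustration, here's a sketch of what that could look like with Requests, reusing the session and payload from the previous steps. The header values below are placeholders, so copy the exact ones your browser sends from the Network tab:
# Example browser-like headers (placeholders; use the values from your own browser)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://github.com/login",
}

# Pass them along with the login request
res = s.post(login_url, data=payload, headers=headers)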
Alternatively, you can use a web scraping API like ZenRows to get around a great number of annoying anti-bot systems for you.
Step 4: The Login in Action
This is our lucky day: GitHub doesn't require extra headers, so we're ready to send our login request with Python:
res = s.post(login_url, data=payload)
print(res.url)
If the login was successful, the output will be https://github.com/; otherwise, it will be https://github.com/session.
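As a quick sanity check, you could turn that into a small test (a sketch based on the redirect behavior described above):
# Requests follows redirects, so res.url is the final URL after the login attempt
if res.url == "https://github.com/":
    print("Logged in successfully")
else:
    print("Login failed, still at: " + res.url)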
Amazing, we just nailed a CSRF-protected login bypass! Let's now scrape the data in the protected Git repositories.
Step 5: Scrape Protected GitHub Repositories
Recall that the earlier code started with the with requests.Session() as s: statement, which creates a request session. Once you log in through a request in a session, you don't need to log in again for subsequent requests in the same session.
It's time to get to the repositories. Send a GET request to the repositories page, then parse the response using BeautifulSoup:
repos_url = "https://github.com/" + login + "/?tab=repositories"
r = s.get(repos_url)
soup = BeautifulSoup(r.content, "html.parser")
We'll extract the username and a list of repositories.
First, for the username, navigate to the repositories page in your browser, then right-click on the username and select Inspect Element. The username is contained in a span element with the CSS class p-nickname vcard-username d-block, inside the <h1> tag.

Second, for the repositories, right-click on any repository name and select Inspect Element. The DevTools window will show the following:

The repository names are inside hyperlinks in <h3> tags with the class wb-break-all. OK, we have enough knowledge of the target elements now, so let's extract them:
usernameDiv = soup.find("span", class_="p-nickname vcard-username d-block")
print("Username: " + usernameDiv.getText())

repos = soup.find_all("h3", class_="wb-break-all")
for r in repos:
    repoName = r.find("a").getText()
    print("Repository Name: " + repoName)
Since there can be multiple repositories on the target web page, the script uses the find_all() method to extract them all. The loop then iterates through each <h3> tag and prints the text of the enclosed <a> tag.
Here's what the complete code looks like:
import requests
from bs4 import BeautifulSoup

login = "Your Username Here"
password = "Your Password Here"
login_url = "https://github.com/session"
repos_url = "https://github.com/" + login + "/?tab=repositories"

with requests.Session() as s:
    # Get the hidden form fields needed for the login payload
    req = s.get(login_url).text
    html = BeautifulSoup(req, "html.parser")
    token = html.find("input", {"name": "authenticity_token"}).attrs["value"]
    time = html.find("input", {"name": "timestamp"}).attrs["value"]
    timeSecret = html.find("input", {"name": "timestamp_secret"}).attrs["value"]

    payload = {
        "authenticity_token": token,
        "login": login,
        "password": password,
        "timestamp": time,
        "timestamp_secret": timeSecret
    }

    # Log in, then reuse the same session to fetch the repositories page
    res = s.post(login_url, data=payload)
    r = s.get(repos_url)
    soup = BeautifulSoup(r.content, "html.parser")

    usernameDiv = soup.find("span", class_="p-nickname vcard-username d-block")
    print("Username: " + usernameDiv.getText())

    repos = soup.find_all("h3", class_="wb-break-all")
    for r in repos:
        repoName = r.find("a").getText()
        print("Repository Name: " + repoName)
And the output:

Excellent! We just scraped a CSRF-authenticated website.
Scraping behind the Login on WAF-protected Websites
On many websites, you'll still hit an Access Denied screen or receive an HTTP error like 403 after sending the correct username, password and CSRF token. Not even using the proper request headers will work. This indicates that the website uses advanced protections, like client-side browser verification.
Client-side verification is a security measure to block bots and scrapers from accessing websites, mostly implemented by WAFs (Web Application Firewalls), like Cloudflare, Akamai and PerimeterX.
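As an illustration of how that failure typically surfaces in a Requests-based script (the URL, payload and exact status code or message are placeholders and vary per WAF):
import requests

# A WAF-blocked login attempt usually returns a 403 or a challenge page
res = requests.post("https://example.com/login", data={"user": "x", "pass": "y"})
if res.status_code == 403 or "Access Denied" in res.text:
    print("Blocked by a WAF; switch to a headless-browser approach")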
Let's see how to find a solution.
Basic WAF Protections with Selenium
If you rely only on the Requests and BeautifulSoup libraries to handle logins that require human-like interaction, the risk of being blocked is too high. The alternative? Headless browsers. They're the standard browsers you know, like Chrome or Firefox, but without a GUI for a human user to interact with. The beauty of them is that they can be controlled programmatically.
Browser automation tools such as Selenium work pretty decently to bypass WAFs' basic login protections. Moreover, they let you log into websites that use a two-step login flow (you type an email, and then a password field appears), like Twitter.
Selenium provides a set of tools to create a headless browser instance and control it with code. Although a base Selenium setup isn't enough to scrape WAF-protected sites, some extended libraries are available to help. undetected-chromedriver is a ChromeDriver automation library that uses several evasion techniques to avoid detection, and it's what we'll use in this tutorial. Here's the plan:
- Create an account on DataCamp and enroll in a Python course to scrape our data next.
- Log in to DataCamp using undetected-chromedriver.
- Navigate to and scrape https://app.datacamp.com/learn.
- Extract the profile name and enrolled courses from the parsed HTML.
Let's begin by installing and importing the required modules and libraries.
pip install selenium undetected-chromedriver
import undetected_chromedriver as uc
import time
from selenium.webdriver.common.by import By
Now, create an undetectable headless browser instance using the uc module and navigate to the login page.
chromeOptions = uc.ChromeOptions()
chromeOptions.headless = True
driver = uc.Chrome(use_subprocess=True, options=chromeOptions)
driver.get("https://www.datacamp.com/users/sign_in")
To enter the email and password programmatically, you need to get the id of the input fields from the login form. For that, open the login page in your browser and right-click the email field to inspect the element. This will open the corresponding HTML code in the DevTools window.
The following screenshot shows the HTML source for the email field, the first one we need:

As the login follows a two-step process, we initially have only the Email address field on the form, with id="user_email". Let's fill it programmatically and click the Next button.
uname = driver.find_element(By.ID, "user_email")
uname.send_keys("Your Email Here")
driver.find_element(By.CSS_SELECTOR, ".js-account-check-email").click()
time.sleep(10)
Note that the 10-second sleep is there to let the JavaScript dynamically load the Password field.
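If you prefer not to rely on a fixed sleep, Selenium's explicit waits can poll for the field instead. A minimal sketch, reusing the driver from above:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the password field to appear in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "user_password"))
)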
The following code enters the password and clicks the submit button to request a login:
passwordF = driver.find_element(By.ID, "user_password")
passwordF.send_keys("Your Password Here")
driver.find_element(By.NAME, "commit").click()
Congratulations! You are logged in.
Once your headless instance logs in successfully, you can move to any page available in your dashboard. Since we want to scrape the profile name and the registered course from the dashboard page, we'll target the elements indicated in the following screenshot:

The code below will retrieve and parse the target URL to display the profile name and registered course.
driver.get("https://app.datacamp.com/learn")
myName = driver.find_element(By.CLASS_NAME, "mfe-app-learn-hub-15alavv")
myCourse = driver.find_element(By.CLASS_NAME, "mfe-app-learn-hub-1f1m67o")
print("Profile Name: " + myName.get_attribute("innerHTML"))
print("Course Enrolled: " + myCourse.get_attribute("innerHTML"))
driver.close()
Let's combine all previous code blocks to see what the complete scraping script looks like.
import undetected_chromedriver as uc
import time
from selenium.webdriver.common.by import By

email = "Your Email Here"
password = "Your Password Here"

chromeOptions = uc.ChromeOptions()
chromeOptions.headless = True
driver = uc.Chrome(use_subprocess=True, options=chromeOptions)
driver.get("https://www.datacamp.com/users/sign_in")

# Step 1: enter the email and move to the password step
uname = driver.find_element(By.ID, "user_email")
uname.send_keys(email)
driver.find_element(By.CSS_SELECTOR, ".js-account-check-email").click()
time.sleep(5)

# Step 2: enter the password and submit the login form
passwordF = driver.find_element(By.ID, "user_password")
passwordF.send_keys(password)
driver.find_element(By.NAME, "commit").click()
time.sleep(2)

# Scrape the dashboard
driver.get("https://app.datacamp.com/learn")
time.sleep(2)
myName = driver.find_element(By.CLASS_NAME, "mfe-app-learn-hub-15alavv")
myCourse = driver.find_element(By.CLASS_NAME, "mfe-app-learn-hub-1f1m67o")
print("Profile Name: " + myName.get_attribute("innerHTML"))
print("Course Enrolled: " + myCourse.get_attribute("innerHTML"))
driver.close()
We recommend setting the headless option to False to see what's going on behind the scenes. Depending on your profile name and registered courses, the output should look like this:

Great! We just scraped content behind a WAF-protected login. But will the same work for every website? Unfortunately not.
Currently, the undetected-chromedriver package only supports Chromium-based browsers with version 109 or greater. Moreover, WAF-protected sites can easily detect its headless mode.
To scrape a website that requires a login with Python, undetected-chromedriver may be enough if the protections are basic. But if the site uses advanced Cloudflare protection (e.g., G2) or other DDoS mitigation services, the solution we've seen may not be reliable.
ZenRows comes to the rescue. It's a web scraping API that can easily handle all sorts of anti-bot bypasses for us, including complex ones. And since it's an API, it doesn't require you to have any web browser installed.
Advanced Protections Using ZenRows
Scraping content behind a login on a website with higher protection measures requires the right tool. We'll use the ZenRows API for that purpose.
Our mission is to get through G2.com's two-step login, starting from its login page, and then extract the welcome message from the homepage once we're logged in.
But before starting with the code, let's first explore our target with DevTools. The following table lists the necessary information regarding the HTML elements that we'll interact with throughout the script. Please keep it in mind for the upcoming steps.
| Element/Purpose | Element Type | Attribute | Value |
| --- | --- | --- | --- |
| G2 login (step 1): Email input | <input type="email"> | Class | input-group-field |
| G2 login (step 1): Next button to proceed to the next login step | <button> | Class | js-button-submit |
| G2 login (step 2): Password field | <input type="password"> | Id | password_input |
| G2 login (step 2): Login form submit button | <input type="submit"> | CSS Selector | input[value='Sign In'] |
| Welcome message on the homepage | <div> | Class | l4 color-white my-1 |
With ZenRows, you don't need to install any particular browser drivers (as you would with Selenium). Moreover, you don't need to worry about advanced Cloudflare protection, identity checks or other DDoS mitigation services. Additionally, the API frees you from infrastructure scalability issues.
Just sign up for free to get to the Request Builder and fill in the details as per the screenshot below.

- Set the initial target (i.e., G2 login page in our case).
- Choose Plain HTML. We'll parse it further using BeautifulSoup later in the code. If you prefer, you can use CSS Selectors to scrape only specific elements from the target.
- Setting Premium Proxies helps you scrape region-specific data and masks your identity.
- Setting JavaScript Rendering is mandatory for running some JavaScript instructions in Step 6.
- Selecting Antibot helps you bypass advanced WAF security measures.
- Checking JavaScript Instructions allows you to add an encoded string of JavaScript instructions to run on the target. It allows controls similar to a headless browser.
- A text box will appear when you check the JavaScript Instructions checkbox. You can write any number of JS instructions; we used the following in our case:
[
    {"wait": 2000},
    {"evaluate": "document.querySelector('.input-group-field').value = 'Your Business Email Here';"},
    {"wait": 1000},
    {"click": ".js-button-submit"},
    {"wait": 2000},
    {"evaluate": "document.querySelector('#password_input').value = 'Your Password Here';"},
    {"wait": 1000},
    {"click": "input[value='Sign In']"},
    {"wait": 6000}
]
- Choose Python.
- Select SDK and copy the whole code. Remember to install the ZenRows SDK package using pip install zenrows.
Now, you can paste this code into your Python project and execute it. We've copied the SDK code and modified it to make it more portable and easier to understand.
# pip install zenrows
from zenrows import ZenRowsClient
import urllib.parse
import json

client = ZenRowsClient("Your ZenRows API Key Goes Here")
url = "https://www.g2.com/login?form=signup#state.email.showform"

# Browser-like actions to run on the page: fill in the two-step login form
js_instructions = [
    {"wait": 2000},
    {"evaluate": "document.querySelector('.input-group-field').value = 'Your G2 Login Email Here';"},
    {"wait": 1000},
    {"click": ".js-button-submit"},
    {"wait": 2000},
    {"evaluate": "document.querySelector('#password_input').value = 'Your G2 Password Here';"},
    {"wait": 1000},
    {"click": "input[value='Sign In']"},
    {"wait": 6000}
]

params = {
    "js_render": "true",
    "antibot": "true",
    "js_instructions": urllib.parse.quote(json.dumps(js_instructions)),
    "premium_proxy": "true"
}

response = client.get(url, params=params)
print(response.text)
That code snippet retrieves and prints the plain HTML of the G2 homepage after logging in. Now, we'll use BeautifulSoup to parse the HTML further and extract the data we want.
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")
welcome = soup.find("div", attrs={"class": "l4 color-white my-1"})
print(welcome.text)
And we succeeded!

Here's the full code:
# pip install zenrows
from zenrows import ZenRowsClient
from bs4 import BeautifulSoup
import urllib.parse
import json

client = ZenRowsClient("Your ZenRows API Key Goes Here")
url = "https://www.g2.com/login?form=signup#state.email.showform"

# Browser-like actions to run on the page: fill in the two-step login form
js_instructions = [
    {"wait": 2000},
    {"evaluate": "document.querySelector('.input-group-field').value = 'Your G2 Login Email Here';"},
    {"wait": 1000},
    {"click": ".js-button-submit"},
    {"wait": 2000},
    {"evaluate": "document.querySelector('#password_input').value = 'Your G2 Password Here';"},
    {"wait": 1000},
    {"click": "input[value='Sign In']"},
    {"wait": 6000}
]

params = {
    "js_render": "true",
    "antibot": "true",
    "js_instructions": urllib.parse.quote(json.dumps(js_instructions)),
    "premium_proxy": "true"
}

response = client.get(url, params=params)

# Parse the returned HTML and extract the welcome message
soup = BeautifulSoup(response.text, "html.parser")
welcome = soup.find("div", attrs={"class": "l4 color-white my-1"})
print(welcome.text)
Conclusion
What works to scrape a website that requires a login with Python? As seen, inspecting the HTML with BeautifulSoup and getting the cookies with the Requests library can help you. However, for modern websites with robust anti-bot solutions, you need undetectable headless browsers. The problem with them is scalability, cost and performance limitations. Moreover, they can still get blocked by websites with advanced WAFs in place.
If you're looking for an easy and scalable way to scrape a website with Python, ZenRows offers an API-based service that, as we just saw, handles the hard parts for you.
Here are a few tips to keep in mind to avoid getting blocked. Also, you might be interested in our guides on web scraping with Selenium in Python and on how to bypass Cloudflare with Selenium.
Did you find the content helpful? Spread the word and share it on Twitter, LinkedIn, or Facebook.