Web Scraping Instagram: Extract Data Using Python

November 2, 2022 · 8 min read

Web scraping Instagram is a daunting task due to the complex security algorithms integrated into its framework. Such protection becomes an obstacle during scraping because it blocks automated requests or prevents using standard parsing algorithms.

But what if I told you that it's not an impossible mission, and you can scrape IG, for example, to find potential buyers or learn more about the target audience for your product in a particular area?

This tutorial covers web scraping one of the most popular social networks: Instagram. You'll learn how to set up your system and build a Python-based crawler with the Selenium library. With this practical knowledge, you should be able to apply the same approach to extract information from other online sources.

Instagram homepage

What is Instagram Web Scraping?

Generally, crawling the IG website means obtaining publicly available information with a scraping tool. Instagram is an appealing target for such activity as it provides a wide range of data. Just look at the picture below!

Types of Instagram data

From IG you can collect the data underlying users' profiles, such as emails, phone numbers, and biographies, as well as scrape their posts, comments, locations, number of followers, likes, and hashtags.

But, as you can guess by now, extracting confidential information violates the company's policy. Therefore, it's important to ensure that the data you collect isn't protected by the GDPR, the CCPA, or intellectual property rights.

How can we collect data from Instagram.com?

Given the popularity of Instagram, it's not surprising that there are many techniques for scraping users' data, ranging from the official APIs to unofficial tools.

While the official IG APIs limit what types of data you can scrape (e.g., collecting comments, posts, and images left by other users is usually not allowed, nor is extracting certain hashtags and locations), unofficial utilities open up many more possibilities for web scraping Instagram.

Here, we're particularly interested in exploring the main aspects of the Selenium package. By scraping hashtags and posts, we'll check its ability to get past security blocks and access hidden content. Then we'll look at a complete, ready-to-use scraping solution to see the advantages of using such industrial-grade tools for data extraction.

Frustrated that your web scrapers are blocked once and again? ZenRows API handles rotating proxies and headless browsers for you.

How to scrape Instagram

When you open Instagram.com without logging in, you'll discover that some content is freely accessible (for example, top hashtag or post search results and comments), while some pages may require a login.

Unfortunately, it's impossible to predict which features will be available tomorrow because IG constantly changes its protocols and security rules. That's why, from here on, we consider scraping as a logged-in user and the problems associated with this scenario.

Selenium is a powerful scraping tool that works in both scenarios:
  • It can automate complex processes:
    • Dismissing pop-up notifications without manual intervention;
    • Dealing with log-in blocks;
    • Even crawling specific content elements.
  • Selenium can also control the scraper's speed, so you're less likely to be blocked for suspicious activity.
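
One simple way to slow a scraper down is to pause for a random interval between actions. The helper below is a minimal sketch of that idea using only the standard library; polite_pause and its default bounds are hypothetical choices, and pausing alone doesn't guarantee that requests won't be blocked.

```python
import random
import time


def polite_pause(min_seconds=2.0, max_seconds=5.0, sleep=time.sleep, rng=random):
    """Sleep for a random duration in [min_seconds, max_seconds] and return it.

    `sleep` and `rng` are injectable so the helper can be exercised
    without actually waiting.
    """
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("expected 0 <= min_seconds <= max_seconds")
    delay = rng.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Calling polite_pause() between page loads or clicks spreads the requests out instead of firing them back to back.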

Thus, this Python library is the best fit for our scraper, and now we're ready to explore it further!

Step 1: Set up the environment

Let's start our Instagram scraping adventure by installing some essential dependencies. To run Selenium smoothly, first download the latest stable release of ChromeDriver (if your primary browser is Chrome) or geckodriver (for Firefox users).

Then, install the Selenium package in your environment (the time module used later ships with Python's standard library, so it needs no installation):

pip install selenium

Step 2: Log in to the Instagram account

Now, let's build the IG scraper by importing the Selenium package at the top of the script:

# Let's start by importing basic scraping packages 
from selenium import webdriver 
from selenium.webdriver.common.keys import Keys 
from selenium.webdriver.support import expected_conditions as EC 
from selenium.webdriver.common.by import By 
from selenium.webdriver.support.wait import WebDriverWait

Of course, don't forget to connect the ChromeDriver or geckodriver you downloaded previously to our crawler by specifying the path to the file on your local system:

# Then, we need to specify the path to the chromedriver.exe that we downloaded previously
# (Selenium 4 expects the path wrapped in a Service object)
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service("C:/Users/your/path/to/chromedriver.exe"))

# and open the IG homepage
driver.get("https://www.instagram.com")

After setting up the WebDriver, call its .get() method to open our target website. If you run the script now, you'll see a new browser window open.

But it's too early to celebrate, as we're still missing a couple of crucial steps: to log in to an account, we need to locate the username and password input fields. Note that it's better to reduce the risk of losing your Instagram account by not performing such actions with your primary account.

Getting Instagram data

As you can see above, to obtain the name attributes for these two input fields, we should open developer tools and select an element on the webpage that we aim to inspect (alternatively, press Ctrl + Shift + C on the keyboard and click on the username/password fields).

Once the required element is found, you'll see these source code snippets, where we can extract name attributes for our input boxes.

While web scraping Instagram, it's crucial to remember that each web page has its own loading cycle; therefore, Selenium can't parse a target element that hasn't yet been loaded on the page.
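
A generic way to cope with content that loads late is to poll until a condition holds or a timeout expires. The sketch below illustrates that idea in plain Python; it isn't Selenium's own code, and wait_until and its parameters are hypothetical names chosen for this example.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = clock() + timeout
    while True:
        result = condition()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        sleep(poll_interval)
```

The condition is always checked at least once, so an element that is already present is returned immediately without any delay.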

Thus, to fully automate the scraper, we create a WebDriverWait() object and call its until() method, which gives the elements enough time to load. Then we can finally enter our login credentials and continue by clicking the login button.

# Afterwards, let's target the username & password 
username = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='username']"))) 
password = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='password']"))) 
 
# and enter your username and password 
username.clear() 
username.send_keys("yourusername") 
password.clear() 
password.send_keys("yourpassword")
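
Rather than typing credentials directly into the script, you could read them from environment variables so they never end up in a shared file. This is a minimal sketch under that assumption; load_credentials and the variable names IG_USERNAME and IG_PASSWORD are hypothetical.

```python
import os


def load_credentials(env=os.environ):
    """Return (username, password) read from IG_USERNAME and IG_PASSWORD.

    Both variable names are hypothetical; pick whatever fits your setup.
    """
    required = ("IG_USERNAME", "IG_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return env["IG_USERNAME"], env["IG_PASSWORD"]
```

The two returned values can then be passed to username.send_keys() and password.send_keys() in place of the hard-coded strings.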

Step 3: Handling pop-up messages

One of the first obstacles we inevitably face at this stage is the multiple pop-up messages appearing on the web page when you open it for the first time.

To automatically dismiss these pop-ups, we'll slightly adjust our previous call by changing the By condition in WebDriverWait() from a CSS selector to an XPath selector (which we can find in the developer tools the same way as before).

First challenge
# At this point, we face the first challenge: a pop-up message asking us to accept cookies
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//button[contains(text(), "Only allow essential cookies")]'))).click()

# Then, we can finally target the login button and click it
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//*[@id='loginForm']/div/div[3]/button"))).click()

# But don't be too happy yet, as we still need to pass two more pop-up notifications
# Handle the first one with this line
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//button[contains(text(), "Not Now")]'))).click()

# In case you get a second one, run this:
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//button[contains(text(), "Not Now")]'))).click()
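
Because some of these pop-ups may not appear on every run, it can help to treat each click as optional instead of letting a timeout stop the script. The helper below is a plain-Python sketch of that pattern; run_optional is a hypothetical name, and it is not part of Selenium.

```python
def run_optional(step, ignored=(TimeoutError,)):
    """Run step() and report whether it succeeded.

    Returns True on success and False if step() raised one of the
    `ignored` exception types; any other exception propagates.
    """
    try:
        step()
    except ignored:
        return False
    return True
```

With Selenium, step could be a lambda wrapping one of the WebDriverWait(...).until(...).click() calls, and ignored could be (TimeoutException,), with TimeoutException imported from selenium.common.exceptions.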

Congrats! You're now logged in!

Login Instagram

Step 4: Crawling Instagram Hashtags and Posts URLs with Selenium

As soon as we enter the IG account, a search bar appears at the top of the page, and from there, we can easily navigate through our scraping cycle. We can use various exploration pages for Instagram users' data collection.

One of the most popular techniques is to search keywords or hashtags (as shown in the image below):

Hashtag in search bar

So, let's narrow the crawl scope by specifying the hashtag #pets. Generally, interacting with the search bar is very similar to what we did earlier with the login input fields; only this time, after locating the search box, we pause between keystrokes with the time.sleep(seconds) function from Python's time library instead of Selenium's WebDriverWait().

import time 
 
# Specify the search box here 
search = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//input[@placeholder='Search']"))) 
search.clear() 
 
# Search for the hashtag pets
keyword = "#pets" 
search.send_keys(keyword) 
 
# Wait for 5 seconds 
time.sleep(5) 
search.send_keys(Keys.ENTER) 
time.sleep(5) 
search.send_keys(Keys.ENTER) 
time.sleep(5)

Note that Instagram's search box has no submit button. To avoid any manual input while our IG scraper runs, we send the Enter key press from the code itself.

Hashtag search results

Next, when you reach the #pets results page, you can scroll down the page and scrape all the items loaded so far:

# Scroll down the page
driver.execute_script("window.scrollTo(0, 4000);")

# Collect every link on the page
posts = driver.find_elements(By.TAG_NAME, "a")
posts = [a.get_attribute("href") for a in posts]

# Narrow down all links to post links only; this also drops
# unrelated links such as your avatar and IG's logo
posts = [a for a in posts if str(a).startswith("https://www.instagram.com/p/")]

# Show the result
print(posts)

This operation gives us the URL of each post:

[ 
	'https://www.instagram.com/p/CkIEQ4RSn6w/', 
	'https://www.instagram.com/p/CkJ7r48vhgR/', 
	'https://www.instagram.com/p/CkHDE3WyDSI/', 
	'https://www.instagram.com/p/CkHgXsZIrYg/', 
	'https://www.instagram.com/p/CkIva4bSKat/' 
]
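
The link filtering can also be wrapped in a small reusable function that skips empty values and removes duplicates. This is a minimal sketch using only the standard library; extract_post_urls is a hypothetical helper, and it assumes post links always live under the /p/ path on www.instagram.com, as in the output above.

```python
from urllib.parse import urlsplit


def extract_post_urls(hrefs):
    """Keep unique Instagram post URLs (paths starting with /p/), in order."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue  # get_attribute("href") can return None
        parts = urlsplit(href)
        if parts.netloc != "www.instagram.com" or not parts.path.startswith("/p/"):
            continue
        clean = f"https://www.instagram.com{parts.path}"  # drop query and fragment
        if clean not in seen:
            seen.add(clean)
            result.append(clean)
    return result
```

Passing the raw list of href values to extract_post_urls() returns the post URLs in the order they appeared on the page.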

But you could do more! Suppose you're creating an ML algorithm that needs to be trained on a vast number of images. In this case, you can go one step further and extract the direct links to the images.

To do so, open the developer tools one more time and find the tag of the elements you need. For example, here we scrape the elements with the img tag:

images = [] 
 
# You can continue and extract directly image links 
for a in posts: 
	driver.get(a) 
	time.sleep(5) 
	img = driver.find_elements(By.TAG_NAME,'img') 
	img = [i.get_attribute('src') for i in img] 
	images.append(img[1]) 
 
images[:5]

So we create an empty list for images, which is filled with extracted URLs as the for loop runs. Once all the posts have been visited, you'll get the desired result!

[ 
	'https://instagram.fflr3-2.fna.fbcdn.net/v/t51.2885-19/308665418_399722812341209_7392476696729906026_n.jpg?stp=dst-jpg_s150x150&_nc_ht=instagram.fflr3-2.fna.fbcdn.net&_nc_cat=1&_nc_ohc=c5dFGsuYeiAAX95Nirg&edm=ALQROFkBAAAA&ccb=7-5&oh=00_AfA20w3aGTe5AbrH8PX9V28TBIm-4g86d2PiEmGeldnwKg&oe=635EEE12&_nc_sid=30a2ef', 
	# ... 
]
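
If the links are meant for later use, such as building a training set, it may be convenient to store each post URL next to its image URL. The sketch below writes such pairs as CSV with the standard library; write_image_index and the column names are hypothetical, and it assumes the two lists are aligned (one image per post, in the same order).

```python
import csv
import io


def write_image_index(posts, images, out):
    """Write one CSV row per (post URL, image URL) pair and return the row count."""
    writer = csv.writer(out)
    writer.writerow(["post_url", "image_url"])
    count = 0
    for post_url, image_url in zip(posts, images):
        writer.writerow([post_url, image_url])
        count += 1
    return count


# Example with placeholder values instead of scraped data
buffer = io.StringIO()
rows = write_image_index(
    ["https://www.instagram.com/p/CkIEQ4RSn6w/"],
    ["https://example.com/image.jpg"],
    buffer,
)
```

To write to disk instead, pass a file opened with open("images.csv", "w", newline="") as the out argument.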

Great job!

Step 5: Crawling Profiles with Instagram Scraper

Despite its advantages over the official APIs, the Selenium library isn't an ideal solution for scraping websites like Instagram:
  • Selenium can't keep up with all the changes in IG's security protocols;
  • It's impossible to crawl some content elements unless you're logged in;
  • Scraping data as a logged-in user may cause you to lose your account.

Therefore, an easier way to scrape Instagram is to use a ready-made scraper developed by ZenRows.

The ZenRows web scraping tool exposes an API that you can easily incorporate into your code to quickly collect the data you need. Moreover, with the ZenRows scraper, you can access rich data and content that aren't visible at first sight and get them as CSV or JSON files.

To use the IG Scraper, simply follow these steps:
  • Create a free ZenRows account.
  • Open the Instagram Scraper.
  • Select your target Instagram page.
  • Finally, get the content in JSON (auto-parsed output).

So, once you find the Instagram profile you want to crawl and log in to the ZenRows platform under your account, you should see the request building page:

ZenRows Builder

Next, enter the Instagram URL you want to scrape into the dedicated field and click the Try it button. Alternatively, you can copy the API request generated by ZenRows into your script and run it directly from your Python environment.
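
For the scripted route, the request is essentially a URL that carries your API key and the target page. The sketch below only assembles such a URL with the standard library; the endpoint and the parameter names apikey, url, and autoparse are assumptions here, so check them against the request shown in the builder page before relying on them. build_request_url is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm it against the request builder in your dashboard.
ZENROWS_ENDPOINT = "https://api.zenrows.com/v1/"


def build_request_url(api_key, target_url, **options):
    """Return the endpoint URL with the key, target page, and extra options encoded."""
    params = {"apikey": api_key, "url": target_url}
    params.update(options)  # e.g. autoparse="true" (assumed option name)
    return ZENROWS_ENDPOINT + "?" + urlencode(params)
```

The resulting string can be fetched with any HTTP client; urlencode takes care of escaping the target URL so it travels safely as a query parameter.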

As a result of scraping an IG profile page, you'll get a JSON file with all the extracted data, including the username, captions, and comment and like counts.

ZenRows' JSON Output

Furthermore, you can modify the default parsing settings and specify the elements to extract via CSS Selectors or Autoparser. For example, you might want to extract all images from the profile page:

CSS Selectors

That will result in the following output:

CSS Selectors' output

Conclusion

In this tutorial, we went through the main stages of web scraping Instagram with the Python-based Selenium library and saw the challenges it faces. Finally, we looked at a more efficient solution and parsed IG profile pages with the ZenRows API.

In other words, with the ZenRows Instagram Scraper, you can forget about sleepless nights spent coding workaround after workaround for new security measures and just enjoy scraping IG in a few steps.
