Web scraping Instagram is a daunting task due to the complex security mechanisms built into the platform. This protection becomes an obstacle during scraping because it blocks automated requests and defeats standard parsing approaches.
But what if I told you it's not an impossible mission, and you can scrape IG, for example, to find potential buyers or learn more about the target audience for your product in a particular area?
This tutorial will cover web scraping from one of the most popular social networks: Instagram. You'll learn how to set up your system and build a Python-based crawler with the Selenium library. With this practical knowledge, you should be able to quickly and efficiently extract information from any online source.
What is Instagram Web Scraping?
Generally, crawling the IG website means obtaining publicly available information using a specific scraping tool. Instagram is an appealing target for such activity, as it provides a wide range of data. Just look at the picture below!
From IG you can collect the data underlying users' profiles, such as emails, phone numbers, and biographies, as well as scrape their posts, comments, locations, number of followers, likes, and hashtags.
But, as you can guess by now, extracting confidential information contradicts the company's policy. Therefore, it's important to ensure that the output data doesn't fall under the GDPR, the CCPA, or other privacy and intellectual property rules.
How can we collect data from Instagram.com?
While the official IG APIs limit what types of data you can scrape (i.e., collecting comments, posts, and images left by other users is usually unacceptable, as is extracting certain hashtags and locations), unofficial utilities open up many more possibilities for web scraping Instagram.
Here, we're particularly interested in exploring the Selenium package's main aspects. So, by scraping hashtags and posts, we check its ability to bypass security blocks and access hidden content. Then we'll look at a complete and ready-to-use scraping solution to see the advantages of using such industrial tools in data extraction.
How to scrape Instagram
When you open Instagram.com as an unauthenticated visitor, you'll discover that some content is freely accessible (for example, top hashtag or post search results and comments), while some pages may require a log-in.
Unfortunately, it's impossible to predict which features will be available tomorrow because IG constantly changes its protocols and security rules. That's why, hereafter, we consider the cases of scraping for authorized users and the problems associated with this scenario.
So why choose Selenium? This library fits our task well:

- It can automate complex processes:
  - Dismissing pop-up notifications without manual intervention;
  - Dealing with log-in blocks;
  - Even crawling specific content elements.
- Selenium can also control the parser's speed, so you won't get blocked for suspicious activity.
Thus, this Python library is the best fit for our scraper, and now we're ready to explore it further!
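As an illustration of the speed control mentioned above, here's a minimal throttling sketch. The helper name `polite_pause` and the delay range are our own choices for this example, not part of Selenium's API:

```python
import random
import time

def polite_pause(min_s=2.0, max_s=5.0):
    """Sleep for a random interval to mimic human browsing speed."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# Call polite_pause() between page loads or clicks so requests
# don't arrive at a machine-gun pace that triggers anti-bot checks.
```

Randomized delays are harder to fingerprint than a fixed `time.sleep(5)` after every action.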
Step 1: Set up the environment
Let's start our Instagram scraping adventure by installing some essential dependencies. To run Selenium smoothly, first download the latest stable release of ChromeDriver (if your primary browser is Chrome) or geckodriver (for Firefox users).
Then, proceed with the installation of the Selenium library in your environment:

```
pip install selenium
```

Note that `time`, which we'll also use later, is part of the Python standard library and doesn't need to be installed separately.
Step 2: Log in to the Instagram account
Now, let's build the IG scraper by importing the Selenium package on the top of the code:
```python
# Let's start by importing basic scraping packages
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
```
Of course, don't forget to connect the ChromeDriver/geckodriver you downloaded previously to our crawler by specifying the path to the file on your local system:
```python
# Then, we need to specify the path to chromedriver.exe that we downloaded and saved previously
driver = webdriver.Chrome("C:/Users/your/path/to/chromedriver.exe")

# and open the IG homepage
driver.get("http://www.instagram.com")
```
After establishing the WebDriver settings, call its `.get()` method to open our target website. (Note that recent Selenium 4 releases deprecate passing the driver path directly to `webdriver.Chrome()`; they expect it wrapped in a `Service` object instead.) If you run the script now, you'll see a new browser window pop up.
But it's too early to celebrate, as we're still missing a couple of crucial steps: to log into an account, we need to locate the username and password input fields. Note that it's better to reduce the risk of losing your Instagram account by not performing such actions with your primary account.
As you can see above, to obtain the `name` attributes for these two input fields, open the developer tools and select the element on the webpage you want to inspect (alternatively, press Ctrl + Shift + C and click on the username/password fields).
Once the required element is found, you'll see the source code snippets where we can extract the `name` attributes for our input boxes.
While web scraping Instagram, it's crucial to remember that each web page has its own loading cycle; an element that hasn't yet been rendered on the page can't be located by Selenium.
Thus, to achieve complete automation, we call `WebDriverWait()` to give the elements enough time to load. Then we can finally enter our authorization information and continue by clicking the login button.
```python
# Afterwards, let's target the username & password fields
username = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='username']")))
password = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='password']")))

# and enter your username and password
username.clear()
username.send_keys("yourusername")
password.clear()
password.send_keys("yourpassword")
```
Step 3: Handling pop-up messages
One of the first obstacles we inevitably face at this stage is the multiple pop-up messages appearing on the web page when you open it for the first time.
To bypass this security check automatically, we'll slightly adjust our first function by changing the locator strategy in the `By` condition of `WebDriverWait()` from a CSS selector to an XPath expression (which we can find in the developer tools the same way as before).
```python
# At this point, we face the first challenge: a pop-up message asking to accept cookies
cookies = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//button[contains(text(), "Only allow essential cookies")]'))).click()

# Then, we can finally target the login button and click it
button = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//*[@id='loginForm']/div/div/button"))).click()

# But don't be too happy yet, as we still need to pass two more pop-up notifications
# Handle the first one with this line
popup1 = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//button[contains(text(), "Not Now")]'))).click()

# In case you get a second one, run this:
popup2 = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//button[contains(text(), "Not Now")]'))).click()
```
Congrats! You're now logged in!
Step 4: Crawling Instagram Hashtags and Posts URLs with Selenium
As soon as we log into the account, a search bar appears at the top of the page, and from there we can easily navigate through our scraping cycle. We can use various exploration pages for collecting Instagram users' data.
One of the most popular techniques is to search keywords or hashtags (as shown in the image below):
So, let's narrow the crawl scope by searching for the hashtag #pets. Interacting with the search bar is very similar to what we did earlier with the authorization input fields; only this time, we supplement Selenium's `WebDriverWait()` with the `time.sleep(seconds)` function provided by Python's built-in time module.
```python
import time

# Specify the search box here
search = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//input[@placeholder='Search']")))
search.clear()

# Search for the hashtag #pets
keyword = "#pets"
search.send_keys(keyword)

# Wait for 5 seconds
time.sleep(5)
search.send_keys(Keys.ENTER)
time.sleep(5)
search.send_keys(Keys.ENTER)
time.sleep(5)
```
Note that due to Instagram's UI design, there's no explicit search button in the search box. To avoid any manual input while our IG scraper runs, we simulate pressing the Enter key in the code itself.
Next, when you reach the #pets search results page, you can scroll down and scrape all the items above the current position:
```python
# Scroll down the page
driver.execute_script("window.scrollTo(0, 4000);")

# Collect all links on the page
posts = driver.find_elements(By.TAG_NAME, "a")
posts = [a.get_attribute("href") for a in posts]

# Narrow all links down to post links only
posts = [a for a in posts if str(a).startswith("https://www.instagram.com/p/")]

# Drop your avatar and IG's logo from the results
posts = posts[:-2]
```
This operation results in the URLs for each post:
```
[
 'https://www.instagram.com/p/CkIEQ4RSn6w/',
 'https://www.instagram.com/p/CkJ7r48vhgR/',
 'https://www.instagram.com/p/CkHDE3WyDSI/',
 'https://www.instagram.com/p/CkHgXsZIrYg/',
 'https://www.instagram.com/p/CkIva4bSKat/'
]
```
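Before moving on, it can be handy to persist these URLs so a later run doesn't have to re-scroll the feed. Here's a small sketch using the standard library; the `pets_posts.csv` filename is our own choice for this example:

```python
import csv

# A few of the scraped post URLs (feed pages often repeat links)
post_urls = [
    'https://www.instagram.com/p/CkIEQ4RSn6w/',
    'https://www.instagram.com/p/CkJ7r48vhgR/',
    'https://www.instagram.com/p/CkIEQ4RSn6w/',
]

# Deduplicate while preserving the original order
unique_urls = list(dict.fromkeys(post_urls))

# Write the crawl results to a CSV file for later steps
with open('pets_posts.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['post_url'])
    writer.writerows([u] for u in unique_urls)
```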
But you could do more! Suppose you're creating an ML algorithm that needs to be trained on a vast number of images. In this case, you can proceed and extract the direct links to the images using the following command.
So, don't forget to open the developer tools one more time and retrieve the tag of the presented elements. For example, here we scrape the elements tagged with `img`:
```python
# You can continue and extract direct image links
images = []
for a in posts:
    driver.get(a)
    time.sleep(5)
    img = driver.find_elements(By.TAG_NAME, 'img')
    img = [i.get_attribute('src') for i in img]
    images.append(img)

images[:5]
```
So we create an empty list, `images`, which is filled with extracted URLs as the `for` loop runs. Finally, as soon as all the elements with the `img` tag are extracted, you'll get the desired result!
```
[
 'https://instagram.fflr3-2.fna.fbcdn.net/v/t51.2885-19/308665418_399722812341209_7392476696729906026_n.jpg?stp=dst-jpg_s150x150&_nc_ht=instagram.fflr3-2.fna.fbcdn.net&_nc_cat=1&_nc_ohc=c5dFGsuYeiAAX95Nirg&edm=ALQROFkBAAAA&ccb=7-5&oh=00_AfA20w3aGTe5AbrH8PX9V28TBIm-4g86d2PiEmGeldnwKg&oe=635EEE12&_nc_sid=30a2ef',
 # ...
]
```
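Since `images` is a list of lists (one sub-list per post), a small post-processing step can flatten it and skip obvious avatar thumbnails. Note that the `s150x150` size marker below is a heuristic guessed from the sample URL above, not a documented Instagram convention:

```python
def flatten_image_links(images, skip_markers=("s150x150",)):
    """Flatten per-post URL lists and drop small thumbnail images."""
    flat = []
    for per_post in images:
        for url in per_post:
            # Skip empty src attributes and URLs carrying a thumbnail size hint
            if url and not any(marker in url for marker in skip_markers):
                flat.append(url)
    return flat

# Example with dummy data standing in for scraped results
sample = [
    ["https://cdn.example/photo_full.jpg", "https://cdn.example/avatar.jpg?stp=dst-jpg_s150x150"],
    [None, "https://cdn.example/photo2_full.jpg"],
]
print(flatten_image_links(sample))
```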
Step 5: Crawling Profiles with Instagram Scraper
While Selenium got us this far, it has serious limitations:

- Selenium can't keep up with every change in IG's security protocols;
- Some content elements are impossible to crawl unless you're logged in;
- Scraping data as an authorized user may cost you your account.
Therefore, the easiest way to scrape Instagram is to implement a ready-to-use crawling algorithm developed by ZenRows.
The ZenRows web scraping tool provides APIs that you can easily incorporate into your code to quickly collect all the data you need. Moreover, using the ZenRows scraper, you can access rich data and content that aren't visible at first sight, exported as native CSV and JSON files.
So, once you find the Instagram profile you want to crawl and log in to the ZenRows platform under your account, you should see the request building page:
Next, enter the Instagram URLs you want to scrape into the dedicated area and click the Try it button. Alternatively, you can copy the API requests offered by ZenRows into your script and run the code directly from your Python environment.
As a result of scraping an IG profile page, you'll get a JSON file with all the extracted data: the username, as well as captions, comment and like counts, and more.
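To give a feel for working with such output, here's a sketch that loads a profile JSON with Python's standard library. The field names below are hypothetical stand-ins based on the data just described; the actual ZenRows schema may differ:

```python
import json

# Hypothetical profile payload; real field names may differ
raw = '''
{
  "username": "some_profile",
  "posts": [
    {"caption": "Morning walk", "comments_count": 12, "likes_count": 240},
    {"caption": "Lazy Sunday", "comments_count": 3, "likes_count": 98}
  ]
}
'''

profile = json.loads(raw)

# Aggregate a simple engagement metric across the scraped posts
total_likes = sum(p["likes_count"] for p in profile["posts"])
print(profile["username"], total_likes)
```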
Furthermore, you have the flexibility to modify the default parsing settings and target specific elements via CSS selectors or the autoparser. For example, in case you want to extract all images from the profile page:
That results in the following output:
In this tutorial, we walked through the main stages of web scraping Instagram with the Python-based Selenium library and saw the challenges it faces. Finally, we looked at a more efficient solution and performed simple parsing of IG profile pages with its API.
In other words, with the ZenRows Instagram scraper, you can forget about sleepless nights spent coding around ever-changing security gates and simply enjoy scraping IG in a few clicks.