People scrape real-time data from websites for many reasons: to track fast-changing data such as stock prices, crypto rates and football scores, or to stay up to date with a store's inventory.
In this article, we'll discuss what real time web scraping is, how to scrape data and the best real time web scraping tool for your project.
Let's dive right in!
What is Real Time Web Scraping?
Real time web scraping is the process of using scrapers and crawlers to scrape data from a webpage at almost the same time as changes occur on the websites.
The idea behind real-time web scraping is to capture the data as soon as it changes, whether that change happens every few minutes or every few seconds. We can approach it in two ways: by calling the real-time API that the service itself uses, or by parsing the HTML, which has limitations that we'll discuss.
What is the difference between Offline and Real-time Web Scraping?
Offline web scraping works by downloading the portion of a website that you want to scrape, then parsing it to extract the data and saving it in a database, CSV or JSON file. Real-time web scraping, on the other hand, works by calling a real-time API or parsing the HTML at very short intervals, making it possible to extract the data as soon as it changes.
So what is the fastest way to scrape web pages in real time? The optimal solution is to use the service's real-time data API. However, many sites don't offer an API, or they protect it well. In that case we can still get real-time data by parsing the HTML with a 2-5 threshold, although this isn't an optimal solution.
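To make the HTML-parsing approach concrete, here is a minimal polling sketch. It is not tied to any particular site: `fetch` is a hypothetical callable that returns the page content as a string, and `interval_seconds` and `max_polls` are assumed parameters that you would tune yourself. The loop re-fetches the page at a fixed interval and only reports the content when it differs from the previous poll.

```python
import hashlib
import time
from typing import Callable, Iterator


def poll_for_changes(
    fetch: Callable[[], str],
    interval_seconds: float,
    max_polls: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield the fetched content each time it differs from the previous poll."""
    last_digest = None
    for attempt in range(max_polls):
        content = fetch()
        # Compare digests instead of keeping the whole previous page in memory.
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest != last_digest:
            last_digest = digest
            yield content
        # Wait before the next poll, but not after the final one.
        if attempt < max_polls - 1:
            sleep(interval_seconds)
```

In a real scraper, `fetch` could wrap an HTTP request, and each yielded page would then be parsed for the values you care about.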
What are the benefits of real-time web scraping?
The benefits of real-time web scraping lie in the ability to extract live data and make use of it, either for business or personal purposes. For example, real-time stock data can feed trading analyses and decisions, and businesses use real-time data to manage products and optimize operations. Common uses include:
- Improving customer service.
- Keeping stock count.
- Stock analysis.
- Improving campaign performance for marketers.
Challenges involved in web scraping real time data
Is it possible to keep sending requests to the website and renew the data every time we get a new response? Yes it is, but there are some limitations:
Sending requests and parsing HTML takes time. If the whole process takes 2 or 3 minutes but the data changes in under a minute, the extracted data is already obsolete by the time we have it.
Scraping real-time data that sits behind hyperlinks is slower still, since the crawler has to send an extra request for each link, which costs more time and computing power.
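One way to see whether a scraper is fast enough is to time a full fetch-and-parse cycle and compare it with how often the data changes. The sketch below assumes a hypothetical `scrape` callable and a `freshness_budget` in seconds that you choose yourself; neither comes from a specific library.

```python
import time
from typing import Any, Callable, Tuple


def timed_cycle(
    scrape: Callable[[], Any],
    freshness_budget: float,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Any, float, bool]:
    """Run one scrape cycle and report whether it finished within the budget."""
    start = clock()
    result = scrape()
    elapsed = clock() - start
    # If a cycle takes longer than the data's update interval,
    # the result is likely stale by the time it is available.
    return result, elapsed, elapsed <= freshness_budget
```

If the returned flag is often `False`, the scraper is too slow for that source and the data it produces cannot be treated as real time.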
2. Firewall blocking
Sending too many requests to the server may alert the firewall, which then blocks them. That shouldn't be a problem, though, since we have a guide on how to scrape data from websites without getting blocked.
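A common way to react to a block is to back off and retry with growing delays. The following is a rough sketch, not a complete anti-blocking solution: `send` is a hypothetical callable returning a `(status_code, body)` pair, and treating 403 and 429 as "blocked" is an assumption that may not match every site.

```python
import time
from typing import Callable, Optional, Tuple

# Assumption: these status codes mean "blocked" or "rate limited".
BLOCKED_STATUSES = {403, 429}


def fetch_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Retry a request with exponentially growing pauses while it is blocked."""
    for attempt in range(max_attempts):
        status, body = send()
        if status not in BLOCKED_STATUSES:
            return body
        # Double the pause after every blocked attempt: 1s, 2s, 4s, ...
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None  # still blocked after all attempts
```

Note that every pause adds latency, so heavy backoff works against the goal of keeping the data fresh.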
3. It can crash the host site
Sending many requests to a website puts additional load on its host and can even crash the site.
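To avoid overloading the host, a scraper can enforce a minimum gap between its own requests. Here is a small sketch of such a throttle; the class name `Throttle` and the `min_interval` value are illustrative choices, not part of any library.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Ensure at least `min_interval` seconds pass between consecutive calls."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` right before each request keeps the request rate below one per `min_interval` seconds.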
4. Proxy failure
Proxies bring their own problems to real-time web scraping, such as downtime and blacklisted IP addresses, so it's advisable to use a reliable proxy server.
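A simple way to soften proxy failures is to keep a list of proxies and fall through to the next one when a request fails. This is a minimal sketch: `fetch` is a hypothetical callable taking a URL and a proxy address, and the proxy addresses in the usage note are placeholders.

```python
from typing import Callable, Dict, Iterable, Tuple


def fetch_via_proxies(
    url: str,
    proxies: Iterable[str],
    fetch: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try each proxy in order and return (proxy, body) for the first success."""
    errors: Dict[str, Exception] = {}
    for proxy in proxies:
        try:
            return proxy, fetch(url, proxy)
        except OSError as exc:
            # Network-level failures (timeouts, refused connections, ...)
            # are OSError subclasses; remember the error and try the next proxy.
            errors[proxy] = exc
    raise RuntimeError(f"all proxies failed: {errors}")
```

With the Requests library, `fetch` could be something like `lambda url, proxy: requests.get(url, proxies={"https": proxy}, timeout=5).text`, since Requests exceptions derive from `OSError`.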
Some websites have anti-bot systems that block web scrapers, making the site difficult to crawl and limiting our output. These anti-bot measures include rate limiting, fingerprinting, honeypots and CAPTCHAs.
How to Scrape Data in Real Time
We've gone over the basics and advantages, so it's time to do some real-time web scraping with Python. Let's try to scrape coinmarketcap.com, a website for reliable cryptocurrency prices. You can also use this method to scrape real-time data from a website like Twitter.
To get the data on the page, we can either parse the HTML and extract the data or use the real-time API of the front end. But first, let's understand how the data gets from the API to the website. The front end sends a request to the API, then the API responds with JSON data that is rendered in the table above.
What we'll do is mimic in our scraper what happens in the browser, which means that we'll get the data directly from the API as JSON. To do this, we'll inspect the page by pressing F12 and then selecting the Fetch/XHR filter under the Network tab. Reloading the page shows the API and the request sent.
As you can see, we got all the data as JSON. Simply right-click the request and copy the link address.
We'll be using Python, Pandas and Requests for this tutorial. If you haven't installed the libraries yet, you can do so with:
pip install requests pandas
Now to the Python code. Let's import the requests library and send a simple request to the address we copied earlier.
import requests

# API address copied from the browser's Network tab
url = "https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&limit=100&sortBy=market_cap&sortType=desc&convert=USD,BTC,ETH&cryptoType=all&tagType=all&audited=false&aux=ath,atl,high24h,low24h,num_market_pairs,cmc_rank,date_added,max_supply,circulating_supply,total_supply,volume_7d,volume_30d,self_reported_circulating_supply,self_reported_market_cap"

response = requests.request("GET", url)
data = response.json()
After sending the request, let's parse the JSON response into a Python object using the .json() method. We can also add a payload and request headers so the website recognizes us as a normal web browser and not a bot.
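As a rough illustration of adding a payload and request headers, the sketch below moves a few of the query parameters from the URL above into a dictionary and attaches browser-like headers through a session. The header values are assumptions for illustration only; in practice you would copy the ones your own browser sends, as shown in the Network tab. The helper names `build_session` and `build_request` are made up for this example.

```python
import requests

API_URL = "https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing"

# A subset of the query parameters from the copied URL, written as a dict.
PARAMS = {
    "start": 1,
    "limit": 100,
    "sortBy": "market_cap",
    "sortType": "desc",
    "convert": "USD,BTC,ETH",
}

# Assumption: illustrative header values, not the exact ones the site expects.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_session() -> requests.Session:
    """Create a session that sends the browser-like headers with every request."""
    session = requests.Session()
    session.headers.update(BROWSER_HEADERS)
    return session


def build_request(session: requests.Session) -> requests.PreparedRequest:
    """Prepare (but do not send) the GET request with query parameters attached."""
    request = requests.Request("GET", API_URL, params=PARAMS)
    return session.prepare_request(request)
```

To actually send it, you could run `session = build_session()` followed by `data = session.send(build_request(session), timeout=10).json()`.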
# ...
data = response.json()

# Collect every item from the list returned by the API
res = []
for p in data["data"]["cryptoCurrencyList"]:
    res.append(p)
Let's access the array that contains the data by selecting cryptoCurrencyList, which is a child of data. As we iterate over the items, we append each one to the res list.
Since the Pandas library supports both JSON and CSV files, we can use it to export our result as a CSV file:
import pandas as pd

# ...
df = pd.json_normalize(res)
df.to_csv("result.csv")
Here we use the method json_normalize, which normalizes semi-structured JSON data into a flat table, then we save the file as CSV.
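To show what this flattening looks like, here is a small self-contained sketch. The records are made up for illustration and only loosely resemble nested API items; they are not real CoinMarketCap data or its actual schema.

```python
import pandas as pd

# Hypothetical records with one nested dictionary each (not real data).
sample = [
    {"name": "CoinA", "symbol": "CNA", "quote": {"price": 1.5, "volume24h": 1000}},
    {"name": "CoinB", "symbol": "CNB", "quote": {"price": 0.25, "volume24h": 400}},
]

# Nested dictionary keys become dotted column names such as "quote.price".
df = pd.json_normalize(sample)

# index=False leaves the row index out of the CSV text.
csv_text = df.to_csv(index=False)
```

Nested dictionaries are flattened into columns, but nested lists are kept as-is in a single column unless you point `json_normalize` at them with its `record_path` argument.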
And there you have it! A table of the real time data scraped from Coinmarketcap using Python.
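A single run only gives a snapshot. To keep the data current, you would repeat the request and compare each new snapshot with the previous one. The helper below is a minimal sketch of that comparison step; it assumes you have already reduced each response to a hypothetical `{symbol: price}` dictionary, and the prices in the usage note are made up.

```python
from typing import Dict, Optional, Tuple


def changed_prices(
    previous: Dict[str, float],
    current: Dict[str, float],
) -> Dict[str, Tuple[Optional[float], float]]:
    """Return {symbol: (old_price, new_price)} for new or changed symbols."""
    changes: Dict[str, Tuple[Optional[float], float]] = {}
    for symbol, price in current.items():
        old = previous.get(symbol)  # None when the symbol was not seen before
        if old != price:
            changes[symbol] = (old, price)
    return changes
```

For example, comparing `{"BTC": 100.0, "ETH": 10.0}` with `{"BTC": 101.0, "ETH": 10.0, "SOL": 1.0}` reports a change for BTC and a new entry for SOL, while ETH is left out because its price did not move.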
In this tutorial, we went through the basics of real-time scraping and then scraped some real-time data from CoinMarketCap using Python libraries. Along the way, we saw that:
- Data becomes obsolete if the real time crawler is slow.
- Firewall and anti-bots can sometimes be a real headache for scrapers.
- A proxy failure can lead to scraper malfunction.
Is it possible to scrape a real time website without these limitations? Yes, it is, especially when you make use of ZenRows. You can take advantage of the free trial available and send thousands of requests with a simple API call and not get blocked.