Has your IP been blacklisted while scraping? In this post we’ll show you how to crawl a website without getting your IP blocked!
Data scraping is generally accepted by the web at large, as long as the safety and security of a website’s server and its users aren’t jeopardized. Given the sharing-is-caring nature of the public online community, many websites probably see it as mutually beneficial, bringing them more traffic, more hits, and possibly more exposure as well.
Websites do set limits on how much can be downloaded from their site by a single IP address, though, both to protect themselves and to prevent people from taking a little too much too quickly. This is where proxies and web scrapers come in handy, circumventing those restrictions to download as much as possible from different websites.
Theoretically, this could crash a website, but a single spider is very unlikely to do so, so it’s more a matter of moderation and setting a precedent. Web scrapers and proxies can bypass those limits without harming server security, but doing so crosses into territory that can result in an IP ban without proper caution.
Here are some of the ways to avoid that, depending on how you plan on web scraping.
Understand and adhere to a website’s robots.txt file
To avoid an IP ban, it’s probably most important to obey the guidelines laid out in a website’s robots.txt file. Proxies alone do not protect you from these restrictions; all that will happen is that the proxy’s IP address gets blocked, and then the proxy is useless to the web scraper, or spider, unless the proxy service can replace it.
Web scraping involves ensuring that, no matter what, you don’t get kicked off the server. Whether you configure a crawler to follow the guidelines or code the crawler yourself, it’s generally in the spider’s best interest to adhere to the robots.txt instructions included on almost every website.
You can usually find this file by entering the homepage URL followed by ‘/robots.txt’. Sometimes these rules can be difficult for a human to understand, since they’re mainly intended to be machine-readable. Here’s one straightforward enough to illustrate the fundamentals, via openculture.com:
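A robots.txt matching the description that follows might look something like this (the paths here are illustrative, not necessarily openculture.com’s actual rules):

```text
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: http://www.openculture.com/sitemap.xml
```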
The first line says that humans can go wherever they want on the site, as is fairly common. Social networks will have greater barriers, like going to a private profile or going to certain places without logging in, which everyone who has used those sites is already aware of.
Next come the disallow followed by the allow. These are instructions to robots, since whatever is disallowed is generally inaccessible through the site’s normal navigation anyway. The disallow, obviously, says, ‘don’t go to anything that falls under www.openculture.com/(disallow).’
The allow that follows makes a single exception to the rule, saying to robots, “you can go here and only here within this subsection.”
The last line contains the sitemap, which is essential for a spider to know all of the webpages it’s able to access. Following that link leads to a series of embedded URLs that essentially comprise the whole site. This is the easiest way to scrape a website: through the sitemap provided by the site itself.
If there’s something they’re not including, it’s probably not important anyway. No conspiracy theories here: anything a website truly wanted to exclude, it would simply take offline. What sites do hide generally protects them from various types of cyberattacks, or serves as honeypots, as explained below.
These robots.txt files should generally be adhered to in order to avoid getting blacklisted while crawling or scraping. If they’re as simple as the one above (taken, for reference, from a fairly popular site), they’re easy to follow.
Longer lists of allows and disallows, like the ones on facebook.com/robots.txt, can simply be incorporated into however you plan to do your web scraping, by omitting, or forbidding, certain URL paths.
Between taking advantage of the sitemap and avoiding the easiest way to get your IP addresses blacklisted, adhering to robots.txt is not only easy but smart if you want to keep your IP address safe from a ban. And just to mention: there are ways for the server to find out if you’re not following the rules, so disobey them at your own risk.
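If you’re scripting the crawler yourself, Python’s standard library can check these rules for you. A minimal sketch, using illustrative inline rules rather than a live fetch (in practice you’d call `set_url()` and `read()` against the real robots.txt):

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; fetch the real file with rp.set_url(...) + rp.read().
# Note: urllib.robotparser applies rules in order, so the narrow Allow is
# listed before the broader Disallow.
RULES = """\
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

page_ok = rp.can_fetch("*", "http://www.openculture.com/page/2")              # allowed
admin_ok = rp.can_fetch("*", "http://www.openculture.com/wp-admin/settings")  # disallowed
```

Call `can_fetch()` before every request and skip any URL it rejects; that alone keeps the spider inside the site’s published rules.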
Don’t fall into honeypots! Wait…
What are honeypots?
Honeypots are traps set by the web server that only robots can fall into. For example, a web scraper instructed to visit every available URL, as is generally the default setting, will reach parts of the site that a human on a browser couldn’t reach by navigating normally. The URL exists for no purpose other than to detect spiders, or web scraping pipelines.
Some honeypots are designed only to detect robots. Unless a website bans all crawlers outright in its robots.txt, this isn’t a problem in itself (almost all websites allow bots to crawl their webpages, for reasons explained below); the server just wants to know who’s a bot and who isn’t.
Other honeypots are designed to detect only robots that violate the robots.txt guidelines. This is where, whether by accident or by simply ignoring the rules, you can easily find yourself slapped with an IP block.
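One common honeypot pattern is a link that exists in the HTML but is hidden from human visitors. A crude sketch of filtering such links before following them (the HTML and the display:none heuristic are illustrative, not a complete defense, since styles can also come from external CSS or JavaScript):

```python
import re

# Illustrative page: the middle link is invisible to humans, a classic trap.
HTML = """
<a href="/articles/1">Read more</a>
<a href="/trap/do-not-follow" style="display:none">secret</a>
<a href="/articles/2">Archive</a>
"""

# Capture each anchor's href plus whatever other attributes it carries.
links = re.findall(r'<a href="([^"]+)"([^>]*)>', HTML)

# Follow only the links a human could actually see and click.
safe = [href for href, attrs in links if "display:none" not in attrs]
```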
Prepare a crawling sitemap ahead of time.
This can save you from falling into a honeypot, catching an IP block, and even save time. Most websites share their sitemap somewhere on the site. For example, here’s medium.com’s sitemap, found by looking at its robots.txt page.
With this information, the web scraper will only visit the pages it’s permitted to access, which avoids the type of honeypots that could lead to an IP ban. At the same time, while looking at the sitemap, ask yourself whether the information on each URL is actually needed. If only a handful of webpages are needed, access only those.
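Sitemaps follow a standard XML schema (sitemaps.org), so building the crawl list ahead of time is straightforward. A sketch using an inline fragment (a real crawler would fetch the sitemap URL listed in robots.txt; the URLs here are made up):

```python
import xml.etree.ElementTree as ET

# A minimal sitemap fragment for illustration; real sitemaps use this schema.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/posts/1</loc></url>
  <url><loc>https://example.com/posts/2</loc></url>
</urlset>"""

def sitemap_urls(xml_text):
    """Return every <loc> URL listed in a sitemap document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall("sm:url/sm:loc", ns)]

urls = sitemap_urls(SITEMAP)
# Crawl only the pages you actually need:
wanted = [u for u in urls if "/posts/" in u]
```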
Find the right proxy service for scraping
As always, when scraping any large amount of data in a short period of time, you need multiple proxies, especially if you’re concentrating on a single site rather than spreading requests across many different ones.
It’s also better, as always, to stay anonymous online, especially when web scraping, given the increase in internet activity it involves. Say a hacker can see the activity of the servers belonging to the website you happen to be scraping: your real IP address would be exposed far more often, and your activity made more vulnerable.
So a proxy is recommended.
Make sure the proxy you’re buying is a ‘virgin’ proxy, meaning the IP address has never been used for web scraping before.
If the proxy service does not distinguish between proxies that have and haven’t been used for scraping before, it may be that none of them have, or the service may simply not know, having bought them from a reseller without the history of some of the IP addresses it owns. Either way, the company should at least tell you whether you’re getting a ‘virgin’ proxy or not. This usually depends on your pricing plan.
Get as many proxies as you might need
Be careful not to assume you only need, say, five proxies when you really need 20. On the other hand, don’t get too many.
Backconnect rotating proxies are the best choice for web scraping
There is no doubt that backconnect rotating proxies are the best proxies for web scraping or crawling, as we discussed in a previous post. A backconnect proxy rotates the IP address across your requests while scraping, which helps prevent your IP from getting blacklisted.
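A minimal sketch of client-side rotation, assuming a pool of hypothetical proxy addresses (a true backconnect service usually gives you a single gateway address and rotates the exit IP for you, so you wouldn’t need this loop at all):

```python
from itertools import cycle

# Hypothetical proxy pool; addresses are placeholders in the TEST-NET range.
PROXIES = [
    "http://user:pass@192.0.2.10:8000",
    "http://user:pass@192.0.2.11:8000",
    "http://user:pass@192.0.2.12:8000",
]
proxy_pool = cycle(PROXIES)

def proxy_for_next_request():
    """Rotate to the next proxy and return it in requests' proxies format."""
    proxy = next(proxy_pool)
    return {"http": proxy, "https": proxy}

# Usage with the requests library (not executed here):
# requests.get(url, proxies=proxy_for_next_request(), timeout=10)
```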
For beginners: start with dedicated proxies
If you’re new to web scraping, or simply can’t afford rotating proxies, you can start with dedicated IP proxies. If you plan to use dedicated proxies to crawl a website, you can read this guide to learn how to prevent a proxy from getting blocked.
If all of your IPs are banned, you’ll just need to buy more. If after two days you realize the number of proxies needed is a fraction of what was purchased, return them if you can; but more than two days after purchasing a proxy from B, for example, refunds are not guaranteed, and there needs to be a valid reason in order to receive one.
You’re better off getting too few than too many until you know exactly how many you’ll need for what you’re doing.
Use APIs – Website APIs, Scraping APIs, Proxy Server APIs
It’s important to break down all the different APIs that can be involved in web crawling and web scraping. Some are necessary, some are helpful. Some are only necessary because of the web scraping method being used. Others are neither necessary nor helpful.
First, the main API to worry about when web scraping is the target website’s API, if it has one. Many websites have APIs partly because they want web scrapers to use them, and not using them might result in an IP ban through several different routes (like honeypots). If the website you want to scrape has an API, read the API docs. They should tell you right off the bat whether there are download restrictions, which apply to humans as well as scrapers.
APIs can also make web scraping more efficient since the API exists in order to communicate between machines – in this case, between their web server and your web crawler. The target website’s API will lead the scraper towards the information it’s looking for, leaving out the mess of other stuff. This is a win-win: the webserver gets less strain from web scrapers downloading everything, and the spider downloads less of the stuff it doesn’t need (and if you’ve ever examined a webpage, you know there’s a mountain of stuff it doesn’t need).
For example, the first section of Reddit’s API docs tells you that “Listings do not use page numbers because their content changes so frequently.” That means navigating from one page to the next across a subreddit is not as simple as on a site like openculture.com, whose URL plainly contains the page number of the next page, as in ‘http://www.openculture.com/page/2.’
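Instead of page numbers, Reddit’s listings hand back an opaque `after` token that you pass along to fetch the next page. A sketch with a trimmed, made-up response body (the field names follow Reddit’s public listing format; the values are invented):

```python
import json

# A trimmed listing response; real ones carry many more fields per child.
response_body = """{
  "data": {
    "after": "t3_abc123",
    "children": [
      {"data": {"title": "First post"}},
      {"data": {"title": "Second post"}}
    ]
  }
}"""

listing = json.loads(response_body)["data"]
titles = [child["data"]["title"] for child in listing["children"]]
next_token = listing["after"]  # pass as ?after=... to request the next page
```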
Another part of the Reddit API shows you how to update a live thread when it’s updated. Without that, the spider would get the information at the time, and move on, not knowing if or when a page is updated.
Many proxy services have their own APIs, mainly if you elect to purchase a remote server from them alongside proxies. These are the APIs in web crawling that are necessary but not necessarily helpful: if the proxy server uses an API, chances are using it is required, and the crawler would not work without it.
It doesn’t by itself add to the efficiency of the crawler, but electing to use a proxy server has its advantages, and if using the API comes along with that, it’s just thought of as one component of the server as a whole.
Web scraping tools like Scrapy offer APIs as well, but for web scraping alone, their use is generally neither necessary nor helpful. These APIs are mainly intended for developers, not users. If you’re just using these tools for web scraping, you don’t need to implement them. Have a look in case they provide a useful extension, but otherwise don’t worry about skipping them; it won’t lead to an IP ban.
If the source or website being crawled has an openly available API, using it will nearly eliminate the chance of being banned, because the API will stop you from exceeding your limits.
Once you’ve reached that limit, just refresh your API key, switch to the next proxy, make sure your IP has changed, and keep going. Many scraping tools, whether a programming script, a package of scripts, or a software program, can do this automatically, switching proxies as soon as one has reached its limit.
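A simple client-side throttle is often enough to stay under a quota in the first place. This sketch assumes a hypothetical limit of 60 requests per minute; check the target API’s documentation for its real numbers:

```python
import time

# Hypothetical quota: 60 requests per minute; adjust to the API's real limit.
REQUESTS_PER_MINUTE = 60
MIN_INTERVAL = 60.0 / REQUESTS_PER_MINUTE

_last_request = 0.0

def throttle():
    """Block just long enough that successive calls stay under the quota."""
    global _last_request
    wait = MIN_INTERVAL - (time.monotonic() - _last_request)
    if wait > 0:
        time.sleep(wait)
    _last_request = time.monotonic()

# Usage: call throttle() immediately before each API request.
```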
Note also that, if a site does, in fact, provide an API, the chances of being banned increase when you don’t use it, because the site will generally be less tolerant of crawlers that ignore its API.
Use Selenium or Puppeteer (headless browsers)
With browser-automation tools like Selenium and Puppeteer, you will literally see a web browser pop up and work its way through the crawl. I’m much more acquainted with Selenium and prefer it, so I’ll spend more words on it; if you’d rather use Puppeteer, I’d suggest reading this post before crawling.
Selenium can drive what’s called a ‘headless browser’: it will open a browser (Chrome or Firefox are recommended) and proceed to do the same web scraping you’d do otherwise. The only difference is that, to the site, it appears the browsing is being done manually, with human hands.
If a site is running PHP (many are, and most big ones are), it can detect things like ‘clickthrough rates,’ meaning it can figure out whether a button linking to a page was actually clicked, or whether the user just jumped to the URL without clicking the link. Things like that make it obvious that a robot and not a human is using the site, and Selenium can be programmed to click, type, and scroll around on any website.
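Selenium produces this click-like behavior for you, but even plain HTTP requests can look more like click navigation by sending a Referer header naming the page whose link was ‘clicked.’ A stdlib sketch (the URL and header values are illustrative):

```python
from urllib.request import Request

# Illustrative: pretend we followed a link from the homepage to page 2.
req = Request(
    "http://www.openculture.com/page/2",
    headers={
        # A browser-like User-Agent string (made up, but plausible).
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
        # The page whose link was "clicked" to get here.
        "Referer": "http://www.openculture.com/",
    },
)
# urllib.request.urlopen(req) would now send both headers with the request.
```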
Sites can’t forbid robots from crawling their pages entirely, for many reasons, but mainly because they would then not show up on any search engines (simply put, search engines crawl the web in order to find websites).
If you are code-savvy enough to write a simple Python script, or savvy enough to find and borrow one on the web, you can use packages like Selenium to make any site believe a human is accessing it. There are plenty of guides on how to incorporate this into web scraping, such as this one, which has a Python script at the bottom that would be good to start with (you can download Python here).
There are some downsides, however. As you may have guessed, using a headless browser like Selenium will generally slow down the process. The extra precaution may be worth the slowdown, depending on how slow it gets, and how fast you need to crawl between webpages.
However, it does not really affect the speed at which you scrape each page; it’s just a matter of moving between webpages. The slowdown, overall, might be minimal, and this option is definitely worth testing if it wouldn’t take too much time to figure out how to run it.
Use reliable web scraping software
The main recommendation here would be Octoparse; others out there are unreliable or cost a fair amount. Octoparse has a very reasonable, totally free plan.
It does have a limit to how much web scraping can be done on it for free over a certain amount of time, but a larger limit than most.
This software can be told to follow the guidelines on a robots.txt page and not exceed any other limits that would lead to an IP block. For users new to web scraping, it makes everything easier, which can avoid mistakes that could easily lead to an IP ban.
Be careful with lesser-known web scraping software, though. It might be deprecated or outdated, meaning mistakes on its end could get you blacklisted.
Also, obviously, don’t waste time trying to get web scraping software to work if it doesn’t seem to, especially if it’s not commonly used: it may no longer work properly, and there may be no resources to help you get it working anyway.
Check the dates of software you download for when it’s been most recently updated. Anything that hasn’t been updated in a few years should throw up a red flag.
If you do hit an IP ban, it’s not the end of the world. It may only last 24 hours, giving you some time to figure out what led to the slip-up, and preventing a repeat performance.
Better yet, if you have multiple proxies, you can just use the other ones for the time being. With a large scale web scraping project, these mistakes are likely to occur for a number of reasons, either by accident or negligence.
Sometimes the web servers are at fault, or, even more frustrating, the proxies bought from the proxy service.
No matter the case, IP addresses get blocked or blacklisted from time to time, which is why many proxy services offer IP replacements for each proxy.
It’s obviously a major hindrance, though, and if you’re not careful, or just reckless, you could accidentally get a whole lot of proxies banned in a short amount of time, which is why it’s so important to know what to watch out for when web scraping and how to avoid these pitfalls.