Frequently Asked Questions
Scraping as a service in comparison to scraping on your own
When it comes to scraping, developers always face a difficult technical challenge.
Large-scale web services, such as Google, use sophisticated detection algorithms to prevent automated access to their databases.
Such systems may include self-learning neural networks that can detect unusual activity.
In the case of Google we have seen worldwide bans on similar groups of keywords, which proves that Google's detection is independent of the selected
language, country or Google domain.
How Google detects scraping activity and automated access
Google, as well as other large websites, detects automated activity through a range of indicators:
- The IP address that requests originate from is very important.
An IP address acts much like an identity; requests that come from one IP but do not behave like a normal user's raise the chance of detection.
- The metadata sent by the browser, such as HTTP headers and cookies (especially the User-Agent), can raise or lower the chance of detection.
- The delay between requests, the time of day requests are made, their geographic origin and the number of requests within certain time frames all play a major role in detection.
- The keywords themselves also play a big role. Very long keywords or the use of search operators (site:, intitle:, etc.) can trigger detection instantly or much earlier than usual.
- Beyond passive detection, Google also changes its result layout from time to time, which can break all existing scraping solutions at once.
This list makes it sound easier than it actually is. There are many different HTTP headers, and all of them together play a role.
A few examples (a code sketch illustrating some of these factors follows the list):
- Looking for Japanese results from Switzerland is very unusual and will draw the attention of the detection system.
- Downloading thousands of results related to a large web shop using the intitle: or site: operators will significantly raise detection attention.
- Using the same User-Agent tens of thousands of times in a row would be unusual, as would using an outdated browser.
- Accessing a search engine 24 hours a day from the same IP would be very unusual; a typical user is not online through a whole day without sleep.
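To make the timing and header factors above more concrete, here is a minimal, illustrative Python sketch (assuming the third-party requests library) that rotates User-Agent strings and adds a randomized pause between requests. The URL, parameters and User-Agent pool are placeholders; such measures only avoid the most obvious machine-like patterns and do not make detection impossible.

```python
import random
import time

import requests  # third-party HTTP library, assumed for this sketch

# Placeholder pool of User-Agent strings; sending one User-Agent tens of
# thousands of times in a row is one of the patterns mentioned above.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

def fetch(url, params):
    """Send one request with a rotated User-Agent, then pause for a random interval."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    response = requests.get(url, params=params, headers=headers, timeout=30)
    # Randomized delays spread requests over time instead of producing a
    # perfectly regular, machine-like request pattern.
    time.sleep(random.uniform(5.0, 15.0))
    return response
```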
What happens when scraping is detected
Bing is usually a bit more lenient with detection than Google; Google provides the most valuable search results and is also the most difficult to scrape.
Taking Google as an example, there are different levels of detection:
- Google might warn about "Unusual traffic from your computer network".
This message should be taken seriously.
- The entire suite of Google services can be protected and blocked through a captcha page.
The redirection URL usually contains "https://ipv4.google.com/sorry/IndexRedirect" (a minimal sketch for detecting this redirect follows the list).
Such a captcha-protected page usually stays active for around 8-9 hours.
Further activity during this time can extend that period.
- Continued captcha incidents increase an internal counter within Google, and at a later point a Google employee can react to the case.
This can result in severe actions from Google, including bans of large IP subnets and possibly legal action.
Google is also known to strike back against offenders' websites by removing them from the search index.
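As a rough illustration, and again assuming the Python requests library, the following sketch checks whether a response ended up on the captcha ("sorry") page described above and stops instead of sending further requests. The marker substring and the back-off behaviour are assumptions based only on the redirect URL and the 8-9 hour figure mentioned in this section.

```python
import requests  # third-party HTTP library, assumed for this sketch

# Substring of the captcha redirect URL mentioned above.
SORRY_MARKER = "google.com/sorry/"

def is_blocked(response):
    """Return True if the request was redirected to Google's captcha page."""
    return SORRY_MARKER in response.url

def fetch_or_stop(url, params):
    """Fetch a result page, or return None if the captcha page was served."""
    response = requests.get(url, params=params, timeout=30)
    if is_blocked(response):
        # The block typically lasts several hours, and further requests can
        # extend it, so the sensible reaction is to stop for the time being.
        print("Captcha page detected - pausing all requests.")
        return None
    return response
```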
Scraping as a service instead of a self-made solution
Even with high funding, a self-made scraping solution can be a continuous source of trouble.
Instead of continuously working on your own solution, scraping.services delivers a well-working and reliable service.
Our service can be integrated into custom projects without any hassle and can even be used without any programming know-how through the web-based frontends.
We know how to manage millions of keywords a day, and our engineers keep a constant watch on service quality.