Web Scraping Cost



Fully managed enterprise-grade web scraping service provider based in the USA. We take care of web crawling, data extraction, and automated quality checks, and deliver usable structured data. Awesome customer service. Customers range from Fortune 50 companies to startups and everyone in between.

Web scraping is the solution for collecting the enormous amounts of data available on the web. Most businesses today need data, and they need it collected or updated regularly. But it's impractical to collect data manually, because the web is huge and more information is added every day. That's where data scraping can help your business. Data scraping, web crawling, and data extraction all refer to collecting industry- or topic-related data from the web for sectors including e-commerce, market research, human resources, finance, and real estate.

Machine learning has been transforming industries for decades; think of self-driving cars or intelligent smartphones. Combined, machine learning and data scraping promise a revolutionary leap for the world of data. Data scraping has become quite popular in recent years as the amount of information online keeps growing. So if you want to extract data from a website, you need to either work with a data scraping service or use a scraping tool. In the future, machine learning might make data extraction even easier and faster, but for now you'll have to choose between those two options. In this post, we'll reveal the best data scraping companies of 2019 and describe their advantages.

Top 5 Web Scraping Companies


DataHen offers advanced web crawling, scraping, and data extraction services to a range of industries, with features that help you gain a competitive advantage. At DataHen, we offer superior service and make sure you can lean back and relax while your data is being collected by our team of professional scrapers. Here are the main features and advantages of DataHen:

  • A Customized Approach – traditional data scraping techniques are limited in their capabilities and it can be hard to get customized data that corresponds to your needs. We solve such issues as we deal with difficult cases like authentication or additional coding issues, and even fill out forms.
  • No Software – software scraping solutions can be not only quite pricey but also very complicated to understand and use. At DataHen, we provide you with the service, not software. This means that you just let us know what data you need and we deliver it to you.
  • Captcha Problem Solved – CAPTCHA is a computer program that distinguishes humans from machines via challenge-response testing. Unlike most of the web scraping companies, we scrape and crawl websites that have CAPTCHA restriction.
  • Affordable Pricing – since our services are automated, the costs are thereby lower than usual. The budget won’t be a constraint if you need data because we charge for data extraction only.

  • Fast-Acting – our team is very responsive and makes sure to deliver a superior level of service to clients. If you have a question or concern about work in progress, you can count on a fast response from the team.

DataHen extracts data for you and, most importantly, delivers it in the format that suits your needs best. You get your raw data in the format you need, such as CSV, Microsoft Excel, Google Sheets, a PDF file, or a JPEG. We can present your data in these formats or any other format you prefer. The format in which you receive the data matters for further analysis, so it's important to get it in a specific, organized form. At DataHen, we scrape texts, images, or any other files of your choice. We also cover different industries, including retail, pharmaceutical, automotive, finance, mortgage, and many other industry-specific websites. We scraped over a billion pages last year.
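
As a rough illustration of the delivery step, here is how scraped records might be serialized to CSV in Python using only the standard library (the product rows are invented for the example):

```python
import csv
import io

# Hypothetical records produced by a scraping job.
records = [
    {"name": "Widget A", "price": "19.99", "in_stock": "yes"},
    {"name": "Widget B", "price": "4.50", "in_stock": "no"},
]

def to_csv(rows):
    """Serialize a list of dicts to a CSV string with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["name", "price", "in_stock"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

print(to_csv(records))
```

The same rows could just as easily be written to an Excel sheet or uploaded elsewhere; the point is that the delivery format is a thin layer on top of the extracted data.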

Scraper is a Chrome extension that can extract data from websites and put it into spreadsheets. It's very simple to use for web page data extraction. However, although Scraper is a simple data scraping tool, it is limited in how much it can scrape and which websites it can handle. It will help you speed up online research when you need data quickly, in a nicely formatted spreadsheet. Scraper is intended as an easy-to-use tool for users of all levels who are comfortable working with XPath.
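
Scraper selects page elements with XPath expressions. The same idea can be sketched in Python with the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset (the table snippet below is a made-up stand-in and must be well-formed XML for this parser):

```python
import xml.etree.ElementTree as ET

# A well-formed snippet standing in for a scraped results table.
html = """
<table>
  <tr><td class="title">Page One</td><td class="views">120</td></tr>
  <tr><td class="title">Page Two</td><td class="views">87</td></tr>
</table>
"""

root = ET.fromstring(html)
# Limited XPath: select every <td> whose class attribute is "title".
titles = [td.text for td in root.findall(".//td[@class='title']")]
print(titles)  # ['Page One', 'Page Two']
```

Real-world HTML is rarely well-formed XML, which is why browser-based tools like Scraper (or a lenient HTML parser) are used in practice; the XPath concept is the same.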

Octoparse is another web scraping company that makes the data mining process easy for everyone. You don't need any special coding knowledge to scrape pages with Octoparse. On their website, you can find a step-by-step guide that teaches you how to use the Octoparse scraper, along with information on scraping modes, different ways to get data, and how to extract and download data to your device. Octoparse offers automated scraping with the following features:

  • Cloud Service – the Cloud Service offers unlimited storage for the data you scrape. You can scrape and access data on the Octoparse cloud platform 24/7.
  • Scheduled Scraping – since the process of scraping is automated, Octoparse offers you a solution to schedule crawling for a specific time. Tasks can be scheduled to scrape at any specific time, such as weekly, daily or even hourly.
  • IP Rotation – automatic IP rotation helps prevent IP blocking. Anonymous scraping minimizes the chances of getting traced and blocked.

  • Downloads – you can download scraped data in formats such as CSV or Microsoft Excel, retrieve it via API, or save it to cloud databases.
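
Octoparse handles IP rotation for you, but the underlying idea is simple: cycle requests through a pool of proxy addresses so no single IP carries all the traffic. A minimal sketch in Python (the proxy addresses are placeholders, and the actual HTTP call is omitted):

```python
from itertools import cycle

# Placeholder proxy pool; real proxies would come from a provider.
PROXIES = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]
proxy_pool = cycle(PROXIES)

def build_request_plan(urls):
    """Pair each URL with the next proxy in the rotation."""
    return [(url, next(proxy_pool)) for url in urls]

plan = build_request_plan([
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
    "https://example.com/d",
])
# The fourth request wraps around to the first proxy again.
```

A production setup would also retire proxies that start failing and add randomized delays, but the round-robin core looks like this.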

Datahut is a cloud-based web scraping platform that aims to make the data scraping process easy. You don’t need servers, coding or expensive software. Datahut wants to help businesses grow by dealing with the chaos of data on the web by offering a simple way to extract data from websites. The work process goes in the following steps:

  • The company, first of all, gets to know the client to understand the needs and wants, in order to conduct a feasibility analysis and design a solution that works best for the client.
  • Based on the complexity of the source website and extraction volume, you decide on the pricing and the company sends you a payable invoice.
  • The company then creates an account for you in the customer support portal for further communication with data mining engineers and customer support managers.
  • After you approve the sample data, a full crawl is conducted and the results are run through a quality assurance tool to make sure there is no faulty data.
  • The data is then delivered to your preferred destination, such as Amazon S3, Dropbox, Box, FTP upload, or a custom API.
  • Customers get free maintenance of the data scrapers as part of the subscription. So if the client needs data on a recurring basis, they can schedule it on the platform and data will be gathered and shared automatically.

PromptCloud performs web data extraction using cloud computing technologies, focusing on helping enterprises acquire large-scale structured data from all over the web.

Currently, the main industries they scrape include travel, finance, healthcare, marketing, and analytics. The main features of PromptCloud include:

  • Custom Data Extraction – a data extraction solution that delivers web data exactly the way a customer wants, at the desired frequency, and via the preferred delivery channel.
  • Hosted Indexing – aims at indexing crawled data to focus only on the relevant datasets by using a logical combination in queries.
  • Live Crawls – crawling that’s done in real-time to deliver fresh data via search API.
  • DataStock – allows you to download clean and ready-to-use pre-crawled data sets available for a wide range of industries.
  • JobsPikr – a job data extraction tool that uses machine learning to intelligently crawl job data from the web.

Data Scraping Services vs Tools

We've looked through the best data scraping companies of 2019, but how do you choose the one that suits your needs best? First, you need to choose between a web scraping tool and a web scraping service. Each has its advantages and disadvantages, so we'll consider both.

Web Scraping Tools

Web scraping tools should be your top choice if you need data to support a small-scale project, especially if you are on a tight budget. However, they are less scalable and less viable at volume, so if you need to conduct comprehensive monitoring of a larger amount of information for your enterprise, the power of tools can be quite limited. Still, there are many scraping tools out there, with functionality and pricing varying vastly. Most of them offer free trial periods, so you can try a demo to check whether a tool fits your needs before subscribing to the paid version.

The main problem with this method is that the extracted data might not be immediately usable for your business needs. Most scraping tools crawl raw data from a given website without refining the information for immediate use, so be ready to spend extra time managing the lists of scraped data and organizing the massive amount of information.
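
That post-processing step usually means trimming, normalizing, and deduplicating the raw rows yourself. A small sketch of what that cleanup can look like in Python (the sample rows are invented):

```python
# Raw rows as a tool might emit them: stray whitespace, inconsistent
# casing, and near-duplicate records.
raw_rows = [
    {"name": "  Acme Corp ", "city": "NEW YORK"},
    {"name": "acme corp", "city": "new york"},
    {"name": "Beta LLC", "city": "Boston"},
]

def clean(rows):
    """Collapse whitespace and drop case-insensitive duplicate records."""
    seen, result = set(), []
    for row in rows:
        trimmed = {k: " ".join(v.split()) for k, v in row.items()}
        key = tuple(sorted((k, v.lower()) for k, v in trimmed.items()))
        if key not in seen:
            seen.add(key)
            result.append(trimmed)
    return result

print(clean(raw_rows))  # the two "Acme" variants collapse into one record
```

Multiply this by millions of rows and dozens of source sites, and the hidden cost of tool-based scraping becomes clear.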

Web Scraping Services

Web scraping service providers, also known as DaaS (data as a service) companies, provide you with clean, accurate, and structured data once you purchase the service.

Web crawling services use advanced scraping techniques to eliminate the risk of missing data on complex web pages, such as those built with Ajax, JavaScript, or other dynamic technologies. They also provide full coverage of internet sources, whereas with a tool you may need to pay for an upgrade to access new sources or features. DaaS companies should be your first choice for large-scale operations such as financial analysis, brand and media monitoring, lead generation, and more.

The Advantages of Outsourcing Data Scraping

Businesses are always on the hunt for big chunks of raw data. Getting valuable data via web scraping is a long, time-consuming process, and the tiring crawl and hunt for information ends once a company outsources data scraping to a service. Working with a reputable, professional data mining company is the solution for your data needs. Such companies provide accurate, clean data from all over the web; they are not limited in the number of web pages they can scrape and are able to extract information from websites with CAPTCHA restrictions.

If you need to extract a large amount of data for a big project, web scraping services offer significant advantages over tools in cost-efficiency, scalability, and turnaround time. Tools are less expensive, but they are limited in what and how much they can scrape. While some advanced tools provide custom extraction and parsing, those features usually come with a higher pricing tier, which affects the overall cost-effectiveness of scraping tools. So if you're undertaking a large project, consider working with a web scraping service for the overall quality of the data you'll get in the end.

Having your data provided by a professional service saves you precious time so you can focus on your daily tasks and business growth. Outsourcing lets your company concentrate on core business operations, improving overall productivity. It helps businesses manage data effectively, and thereby generate more profit. So make a wise decision for your business's future growth and choose a professional web scraping service that will handle all the data work for you!

'Come on, I worked so hard on this project! And this is publicly accessible data! There's certainly a way around this, right? Or else, I did all of this for nothing... Sigh...'

Yep - this is what I said to myself, just after realizing that my ambitious data analysis project could get me into hot water. I intended to deploy a large-scale web crawler to collect data from multiple high profile websites. And then I was planning to publish the results of my analysis for the benefit of everybody. Pretty noble, right? Yes, but also pretty risky.

Interestingly, I've been seeing more and more projects like mine lately, and even more tutorials encouraging some form of web scraping or crawling. But what troubles me is the appalling, widespread ignorance of the legal aspects involved.

So this is what this post is all about - understanding the possible consequences of web scraping and crawling. Hopefully, this will help you to avoid any potential problem.

Disclaimer: I'm not a lawyer. I'm simply a programmer who happens to be interested in this topic. You should seek out appropriate professional advice regarding your specific situation.

What are web scraping and crawling?

Let's first define these terms to make sure that we're on the same page.


  1. Web scraping: the act of automatically downloading a web page's data and extracting very specific information from it. The extracted information can be stored pretty much anywhere (database, file, etc.).
  2. Web crawling: the act of automatically downloading a web page's data, extracting the hyperlinks it contains and following them. The downloaded data is generally stored in an index or a database to make it easily searchable.

For example, you may use a web scraper to extract weather forecast data from the National Weather Service. This would allow you to further analyze it.
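
In code, a scraper targets specific fields in a downloaded page. A minimal sketch using only Python's standard library (the HTML below is a made-up stand-in for a forecast page, not the National Weather Service's actual markup):

```python
from html.parser import HTMLParser

# Stand-in for a downloaded forecast page.
PAGE = '<div><span class="temp">72°F</span><span class="cond">Sunny</span></div>'

class ForecastScraper(HTMLParser):
    """Collect the text of every <span> that has a class attribute."""
    def __init__(self):
        super().__init__()
        self._current = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._current = dict(attrs).get("class")

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] = data
            self._current = None

scraper = ForecastScraper()
scraper.feed(PAGE)
print(scraper.fields)  # {'temp': '72°F', 'cond': 'Sunny'}
```

The extracted dictionary could then be stored in a database or a file for later analysis, exactly as the definition above describes.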

In contrast, you may use a web crawler to download data from a broad range of websites and build a search engine. Maybe you've already heard of Googlebot, Google's own web crawler.
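
A crawler's defining step, by contrast, is extracting the hyperlinks from each page so it can follow them. A sketch with the standard library (the page content is invented):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

page = '<p><a href="/about">About</a> and <a href="/contact">Contact</a></p>'
extractor = LinkExtractor()
extractor.feed(page)
# A real crawler would now enqueue these URLs, fetch them in turn,
# and index what it downloads.
print(extractor.links)  # ['/about', '/contact']
```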

So web scrapers and crawlers are generally used for entirely different purposes.

Why is web scraping often seen negatively?

The reputation of web scraping has gotten a lot worse in the past few years, and for good reasons:

  1. It's increasingly being used for business purposes to gain a competitive advantage. So there's often a financial motive behind it.
  2. It's often done in complete disregard of copyright laws and of Terms of Service (ToS).
  3. It's often done in abusive ways. For example, web scrapers might send many more requests per second than a human would, causing an unexpected load on websites. They might also choose to stay anonymous and not identify themselves. Finally, they might perform prohibited operations, like circumventing the security measures put in place to prevent automated downloading of data that would otherwise be inaccessible.

Tons of individuals and companies are running their own web scrapers right now. So much so that this has been causing headaches for companies whose websites are scraped, like social networks (e.g. Facebook, LinkedIn) and online stores (e.g. Amazon). This is probably why Facebook has separate terms for automated data collection.

In contrast, web crawling has historically been used by the well-known search engines (e.g. Google, Bing, etc.) to download and index the web. These companies have built a good reputation over the years, because they've built indispensable tools that add value to the websites they crawl. So web crawling is generally seen more favorably, although it may sometimes be used in abusive ways as well.

So is it legal or illegal?

Web scraping and crawling aren't illegal by themselves. After all, you could scrape or crawl your own website, without a hitch.


The problem arises when you scrape or crawl the website of somebody else, without obtaining their prior written permission, or in disregard of their Terms of Service (ToS). You're essentially putting yourself in a vulnerable position.

Just think about it: you're using somebody else's bandwidth, and you're freely retrieving and using their data. It's reasonable to think that they might not like it, because what you're doing might hurt them in some way. So depending on many factors (and what mood they're in), they're perfectly free to pursue legal action against you.

I know what you may be thinking. 'Come on! This is ridiculous! Why would they sue me?'. Sure, they might just ignore you. Or they might simply use technical measures to block you. Or they might just send you a cease and desist letter. But technically, there's nothing that prevents them from suing you. This is the real problem.

Need proof? In LinkedIn v. Doe Defendants, LinkedIn is suing between 1 and 100 people who anonymously scraped its website. And on what grounds is it suing those people? Let's see:

  1. Violation of the Computer Fraud and Abuse Act (CFAA).
  2. Violation of California Penal Code.
  3. Violation of the Digital Millennium Copyright Act (DMCA).
  4. Breach of contract.
  5. Trespass.
  6. Misappropriation.

That lawsuit is pretty concerning, because it's really not clear what will happen to those 'anonymous' people.

Consider that if you ever get sued, you can't simply dismiss it. You need to defend yourself, and prove that you did nothing wrong. This has nothing to do with whether or not it's fair, or whether or not what you did is really illegal.

Another problem is that law isn't like anything you're probably used to. Because where you use logic, common sense and your technical expertise, they'll use legal jargon and some grey areas of law to prove that you did something wrong. This isn't a level playing field. And it certainly isn't a good situation to be in. So you'll need to get a lawyer, and this might cost you a lot of money.

Besides, based on the above lawsuit by LinkedIn, you can see that cases can undoubtedly become quite complex and very broad in scope, even though you 'just scraped a website'.


The typical counterarguments brought by people

I found that people generally try to defend their web scraping or crawling activities by downplaying their importance. And they do so typically by using the same arguments over and over again.

So let's review the most common ones:

  1. 'I can do whatever I want with publicly accessible data.'

    False. The problem is that the 'creative arrangement' of data can be copyrighted, as described on cendi.gov:

    Facts cannot be copyrighted. However, the creative selection, coordination and arrangement of information and materials forming a database or compilation may be protected by copyright. Note, however, that the copyright protection only extends to the creative aspect, not to the facts contained in the database or compilation.

    So a website - including its pages, design, layout and database - can be copyrighted, because it's considered as a creative work. And if you scrape that website to extract data from it, the simple fact of copying a web page in memory with your web scraper might be considered as a copyright violation.

    In the United States, copyrighted work is protected by the Digital Millennium Copyright Act (DMCA).

  2. 'This is fair use!'

    This is a grey area:

    • In Kelly v. Arriba Soft Corp., the court found that the image search engine Ditto.com made fair use of a professional photographer's pictures by displaying thumbnails of them.
    • In Associated Press v. Meltwater U.S. Holdings, Inc., the court found that Meltwater's news aggregator service didn't make fair use of Associated Press' articles, even though scraped articles were only displayed as excerpts of the originals.
  3. 'It's the same as what my browser already does! Scraping a site is not technically different from using a web browser. I could gather data manually, anyway!'

    False. Terms of Service (ToS) often contain clauses that prohibit crawling/scraping/harvesting and automated uses of their associated services. You're legally bound by those terms; it doesn't matter that you could get that data manually.

  4. 'The worst that might happen if I break their Terms of Service is that I might get banned or blocked.'

    This is a grey area:

    • In Facebook v. Pete Warden, Facebook's attorney threatened to sue Mr. Warden if he published his dataset of hundreds of millions of scraped Facebook profiles.
    • In LinkedIn Corporation v. Michael George Keating, LinkedIn blocked Mr. Keating from accessing LinkedIn because he had created a tool that they thought was made to scrape their website. They were wrong, yet he has never been able to restore his account. Fortunately, this case didn't go further.
    • In LinkedIn Corporation v. Robocog Inc., Robocog Inc. (a.k.a. HiringSolved) was ordered to pay $40,000 to LinkedIn for its unauthorized scraping of the site.
  5. 'This is completely unfair! Google has been crawling/scraping the whole web since forever!'

    True. But law has apparently nothing to do with fairness. It's based on rules, interpreted by people.

  6. 'If I ever get sued, I'll Good-Will-Hunting my way into defending myself.'

    Good luck! Unless you know law and legal jargon extensively. Personally, I don't.

  7. 'But I used an automated script, so I didn't enter into any contract with the website.'

    This is a grey area:

    • In Internet Archive v. Suzanne Shell, Internet Archive was found guilty of breach of contract for copying and archiving pages from Mrs. Shell's website with its web crawlers. On her website, Mrs. Shell displays a warning stating that as soon as you copy content from her site, you enter into a contract and owe her US$5,000 per page copied (!!!). The two parties apparently reached an amicable resolution.
    • In Southwest Airlines Co. v. BoardFirst, LLC, BoardFirst was found guilty of violating a browsewrap contract displayed on Southwest Airlines' website. BoardFirst had created a tool that automatically downloaded the boarding passes of Southwest's customers to offer them better seats.
  8. 'Terms of Service (ToS) are not enforceable anyway. They have no legal value.'

    False. The Bingham McCutchen LLP law firm published a pretty extensive article on this matter, in which they state:

    As is the general rule with any contract, a website's terms of use will generally be deemed enforceable if mutually agreed to by the parties. [...] Regardless of whether a website's terms of use are clickwrap or browsewrap, the defendant's failure to read those terms is generally found irrelevant to the enforceability of its terms. One court disregarded arguments that awareness of a website's terms of use could not be imputed to a party who accessed that website using a web crawling or scraping tool that is unable to detect, let alone agree, to such terms. Similarly, one court imputed knowledge of a website's terms of use to a defendant who had repeatedly accessed that website using such tools. Nevertheless, these cases are, again, intensely factually driven, and courts have also declined to enforce terms of use where a plaintiff has failed to sufficiently establish that the defendant knew or should have known of those terms (e.g., because the terms are inconspicuous), even where the defendant repeatedly accessed a website using web crawling and scraping tools.

    In other words, Terms of Service (ToS) will be legally enforced depending on the court, and if there's sufficient proof that you were aware of them.

  9. 'I respected their robots.txt and I crawled at a reasonable speed, so I can't possibly get into trouble, right?'

    This is a grey area.

    robots.txt is recognized as a 'technological tool to deter unwanted crawling or scraping'. But whether or not you respect it, you're still bound to the Terms of Service (ToS).

  10. 'Okay, but this is for personal use. For my personal research only. I won't re-publish it, or publish any derivative dataset, or even sell it. So I'm good to go, right?'

    This is a grey area. Terms of Service (ToS) often prohibit automatic data collection, for any purpose.

    According to the Bingham McCutchen LLP law firm:

    The terms of use for websites frequently include clauses prohibiting access or use of the website by web crawlers, scrapers or other robots, including for purposes of data collection. Courts have recognized causes of action for breaches of contract based on the use of web crawling or scraping tools in violation of such provisions.

  11. 'But the website has no robots.txt. So I can do what I want, right?'

    False. You're still bound to the Terms of Service (ToS), and the content is copyrighted.

General advice for your scraping or crawling projects

Based on the above, you can certainly guess that you should be extra cautious with web scraping and crawling.


Here are a few pieces of advice:

  1. Use an API if one is provided, instead of scraping data.
  2. Respect the Terms of Service (ToS).
  3. Respect the rules of robots.txt.
  4. Use a reasonable crawl rate, i.e. don't bombard the site with requests. Respect the crawl-delay setting provided in robots.txt; if there's none, use a conservative crawl rate (e.g. 1 request per 10-15 seconds).
  5. Identify your web scraper or crawler with a legitimate user agent string. Create a page that explains what you're doing and why, and link back to it in your user agent string (e.g. 'MY-BOT (+https://yoursite.com/mybot.html)').
  6. If the ToS or robots.txt prevent you from crawling or scraping, ask the site owner for written permission before doing anything else.
  7. Don't republish your crawled or scraped data, or any derivative dataset, without verifying the license of the data or obtaining written permission from the copyright holder.
  8. If you doubt the legality of what you're doing, don't do it, or seek the advice of a lawyer first.
  9. Don't base your whole business on data scraping. The website(s) that you scrape may eventually block you, just like what happened in Craigslist Inc. v. 3Taps Inc.
  10. Finally, you should be suspicious of any advice that you find on the internet (including mine), so please consult a lawyer.
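
Points 3 to 5 of the list above can be handled directly from Python's standard library: `urllib.robotparser` reads a site's robots.txt and reports whether a path is allowed and what crawl delay is requested. A sketch using an inline robots.txt, with no network call (the bot name and rules are invented):

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content; normally fetched from the site itself.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.modified()  # mark the rules as loaded (read() would do this for us)
rp.parse(ROBOTS_TXT.splitlines())

# Identify yourself, as advised above.
bot = "MY-BOT (+https://example.com/mybot.html)"

print(rp.can_fetch(bot, "https://example.com/public/page"))   # True
print(rp.can_fetch(bot, "https://example.com/private/data"))  # False
print(rp.crawl_delay(bot))  # 10: wait at least this many seconds per request
```

Checking `can_fetch` before every request and sleeping for the reported `crawl_delay` (or a conservative default when none is given) covers the technical side of polite crawling; the ToS and permission questions still have to be handled separately.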

Remember that companies and individuals are perfectly free to sue you, for whatever reasons they want. This is most likely not the first step that they'll take. But if you scrape/crawl their website without permission and you do something that they don't like, you definitely put yourself in a vulnerable position.


Conclusion

As we've seen in this post, web scraping and crawling aren't illegal by themselves. They might become problematic when you play on somebody else's turf, on your own terms, without obtaining their prior permission. The same is true in real life as well, when you think about it.

There are a lot of grey areas in law around this topic, so the outcome is pretty unpredictable. Before getting into trouble, make sure that what you're doing respects the rules.

And finally, the relevant question isn't 'Is this legal?'. Instead, you should ask yourself 'Am I doing something that might upset someone? And am I willing to take the (financial) risk of their response?'.

So I hope that you appreciated my post! Feel free to leave a comment in the comment section below!

Update (24/04/2017): this post was featured in Reddit and Lobsters. It was also featured in the Programming Digest newsletter. If you get a chance to subscribe to it, you won't be disappointed! Thanks to everyone for your support and your great feedback!