Web scraping is, once again, in the news and under attack. In the past few months, both Meta and LinkedIn have filed federal lawsuits alleging that the use of web scraping to access public data violates the Computer Fraud and Abuse Act.
As with most technologies, there’s a right and wrong way to use web scraping. Effective web scraping isn’t just about extracting data; it’s about extracting it responsibly, which means protecting user data and avoiding website crashes (among other things).
Here are three ways to scrape responsibly and avoid everything from lost data to lawsuits.
You’re unlikely to face litigation simply for overloading a server. But if a source site crashes because your scraping overloaded it, you run the risk of being held liable for trespass to chattels.
Chances are, overloading won’t be an issue if you scrape a site that has just a few listings. But overloading and crashing the site could quickly become a problem if you want to scrape a source with hundreds of listings. Here’s why.
Every time you scrape a website, you make requests. The more data you want to extract, the more requests you make. And if you make too many requests, you run the risk of overloading the server – which can cause the site to crash.
If a site owner doesn’t detect your spiders overloading their site while they’re crawling, they surely will notice after a crash. And even if the site owner doesn’t file a lawsuit, they can still block your spiders and restructure their site, which breaks your existing scrapes.
Overloading a site also makes the user experience worse for anyone using that site (think: slow response times). This irresponsible job scraping gives all forms of web scraping a bad reputation and creates additional work for you, whether that’s reconfiguring scrapes or finding new ways to bypass IP blockers.
Fortunately, there are plenty of ways to make sure you don’t overload a site. For instance, you might cache the pages you visited, which means you won’t need to reload them if you need to scrape the site again. You can program your spiders to “sleep” between different scraping tasks. Or – and this is a big one – you could adjust your spiders’ navigation scheme so they only scrape the pages you need.
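Two of these techniques, caching visited pages and sleeping between requests, can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: the `PoliteFetcher` class, its `fetch` callback, and the two-second default delay are all assumptions made for the example.

```python
import time
from typing import Callable, Dict


class PoliteFetcher:
    """Wraps a fetch function with a page cache and a pause between requests.

    Hypothetical helper for illustration; `fetch` is any function that
    takes a URL and returns the page's HTML as a string.
    """

    def __init__(
        self,
        fetch: Callable[[str], str],
        delay_seconds: float = 2.0,  # assumed value; tune per source site
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fetch = fetch
        self._delay = delay_seconds
        self._sleep = sleep
        self._cache: Dict[str, str] = {}
        self._has_fetched = False

    def get(self, url: str) -> str:
        # Serve repeat visits from the cache so the site sees no new request.
        if url in self._cache:
            return self._cache[url]
        # Pause before every real request except the very first one.
        if self._has_fetched:
            self._sleep(self._delay)
        html = self._fetch(url)
        self._has_fetched = True
        self._cache[url] = html
        return html
```

In practice, `fetch` could wrap whatever HTTP client you already use. Passing it in as a parameter also makes it easy to swap in a fake during testing, so you can check the caching and delay logic without touching a live site.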
Consider real-time monitoring a best practice that keeps your scrapes fully operational. Again, while it’s possible your spiders could attract the attention of a site owner, it’s improbable this would result in any legal action.
But even if your spiders don’t alert site owners, you still need to regularly monitor your scrapes. Why? HTML pages frequently change. And while these changes can vary in scope, even a small one, like a restyling of a page’s copy that alters the underlying markup, can break your scrapes.
When a scrape breaks, you miss out on data. For example, an older, unmonitored scrape might extract job titles but fail to get descriptions. Or it might map data incorrectly, such as putting salary information in the job listing’s title field.
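A simple automated check can catch both kinds of failure before bad data reaches your board. The sketch below is only illustrative: the field names (`title`, `description`) and the salary markers are assumptions, and a real monitor would use rules tuned to each source.

```python
from typing import Dict, List, Optional

REQUIRED_FIELDS = ("title", "description")  # hypothetical field names
SALARY_HINTS = ("$", "per hour", "salary")  # crude, assumed markers


def find_problems(listing: Dict[str, Optional[str]]) -> List[str]:
    """Return human-readable problems found in one scraped listing."""
    problems = []
    # A required field that is absent, None, or blank suggests a broken selector.
    for field in REQUIRED_FIELDS:
        if not (listing.get(field) or "").strip():
            problems.append(f"missing {field}")
    # Salary-like text in the title suggests the fields were mapped incorrectly.
    title = (listing.get("title") or "").lower()
    if any(hint in title for hint in SALARY_HINTS):
        problems.append("title looks like salary data")
    return problems
```

Running a check like this on every scraped listing, and alerting when the share of flagged listings jumps, is one way to notice that a source page has changed. The salary heuristic is deliberately rough and will flag legitimate titles that happen to contain the word "salary".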
When you’re scraping hundreds – or thousands – of job listings, these formatting errors and lost pieces of data quickly add up, amounting to a time-consuming and expensive problem for you to solve.
This is why real-time monitoring is so beneficial. It keeps your scrapes up and running. A monitored scrape results in up-to-date listings and an improved user experience for any job seekers who visit your board.
Continuously monitoring your web scrapes can also limit the odds your spiders and scrapes trigger any site-wide security measures or overload the site.
In their separate lawsuits, Meta and LinkedIn both alleged that third parties unlawfully extracted, or enabled others to extract, the personal information of Meta’s and LinkedIn’s users. Yes, people can use web scraping maliciously. But you can also pick up this information incidentally, for example, by programming your spiders to follow every link on a page.
A best practice for job scraping that also keeps you out of the courtroom: don’t scrape candidates’ personal data (names, addresses, phone numbers, etc.). We certainly don’t.
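One way to enforce this in code is to restrict which links a spider may follow and to strip personal fields from anything it does collect. The sketch below assumes job listings live under `/jobs/` or `/careers/` and that personal data arrives in fields named `name`, `address`, `phone`, or `email`; both lists are hypothetical and would need to match the actual source.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlparse

ALLOWED_PATH_PREFIXES = ("/jobs/", "/careers/")  # hypothetical listing paths
PERSONAL_FIELDS = {"name", "address", "phone", "email"}  # hypothetical field names


def links_to_follow(links: Iterable[str]) -> List[str]:
    """Keep only links whose path starts with an allowed listing prefix."""
    return [
        link
        for link in links
        if urlparse(link).path.startswith(ALLOWED_PATH_PREFIXES)
    ]


def drop_personal_fields(record: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the record without any personal-data fields."""
    return {
        key: value
        for key, value in record.items()
        if key.lower() not in PERSONAL_FIELDS
    }
```

An allow-list is used here rather than a block-list because it fails safe: a new, unexpected section of the site (such as user profiles) is skipped by default instead of being crawled until someone notices.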
Now let’s say you want to scrape a client’s job listings via its ATS. Working with a reputable web scraping firm means you don’t need to worry about that firm also extracting applicants’ personal information. In other words, you aren’t just paying for high-quality scrapes; you’re paying for peace of mind.
Yes, the court ruled against LinkedIn. But LinkedIn plans to appeal, and Meta’s case is ongoing. In other words, the debate over web scraping is far from over.
Navigating the technical and legal parameters of web scraping isn’t easy. If you’re running a smaller job board with hopes of scaling, these lawsuits and best practices can feel daunting.
That’s why partnering with a reputable web scraping firm is so valuable. A web scraping firm doesn’t just keep you out of the courtroom; it helps you increase the content on your site while ensuring your listings are updated and ready for the candidates who click them.
We noticed you mentioned scraping Indeed.com.
Just to confirm: Indeed.com prohibits spidering of its content and will block anyone trying to scrape it.
Normally, our clients ask us to spider jobs from direct employer websites and ATSes.
In some cases we can spider commercial job boards, but only if there is a formal agreement between our client and the job board that allows spidering.