A correctly configured web scrape consistently drives value for a job board. But ensuring that a scrape – never mind thousands of scrapes – stays correctly configured as websites change and occasionally fail is a lot of work.
That’s a problem: when websites change or fail, scrapes stop working. And for every day a broken scrape goes unnoticed, job boards lose money and provide a sub-par visitor experience.
So how can your dev team ensure your scrapes are always up and running? By being on the lookout for these three common reasons scrapes break – and understanding what to do when each happens.
When the structure of a career site’s web pages changes, scrapes can break. This is because a web scrape is built around a specific layout of the page: it expects each piece of data to sit in a particular place. Even a small update to the structure of a page – say, switching the location of a job’s title and its salary – can prevent a scrape from gathering data or populating it in the right place.
Listings with missing or misplaced fields, of course, can hurt click-through and application rates.
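One way to catch this kind of breakage early is to sanity-check every scraped listing before it is published. The sketch below is only an illustration: the field names (`title`, `location`, `url`), the salary pattern, and the `validate_listing` helper are all assumptions, not part of any particular scraping framework.

```python
import re

# Assumption: a title containing a currency amount or a "per year" phrase
# is probably a salary that landed in the wrong field.
SALARY_PATTERN = re.compile(r"[$€£]\s?\d|\bper (hour|year)\b", re.IGNORECASE)

# Assumption: these are the fields a listing needs before it can be published.
REQUIRED_FIELDS = ("title", "location", "url")


def validate_listing(listing: dict) -> list[str]:
    """Return a list of problems found in one scraped listing (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = (listing.get(field) or "").strip()
        if not value:
            problems.append(f"missing {field}")

    title = (listing.get("title") or "").strip()
    if title and SALARY_PATTERN.search(title):
        problems.append("title looks like a salary")
    return problems


if __name__ == "__main__":
    swapped = {
        "title": "$85,000 per year",
        "location": "Denver, CO",
        "url": "https://example.com/jobs/1",
    }
    print(validate_listing(swapped))
```

If a scrape that normally produces clean listings suddenly returns many records with problems, that is a reasonable signal that the page structure has changed and the scrape needs attention.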
When pages are deleted or moved, scraped URLs can return 404 errors, which means the scrape cannot retrieve any data.
Another common example: in a period of high growth, a company might move its job listings to a page hosted on a recruitment technology firm’s site, such as Workday or Taleo. While the scrape for the old careers page might technically still work, it will no longer capture the organization’s open positions, meaning no listings make their way to the job board.
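Both situations, a page that has disappeared and a careers page that has quietly moved to a hosted platform, can be told apart from the outcome of a single fetch. The following is a rough sketch under stated assumptions: the `classify_fetch` function, its labels, and the list of host suffixes are hypothetical, and a real list of ATS domains would need to be maintained separately.

```python
from urllib.parse import urlparse

# Assumption: hostnames ending in these suffixes belong to hosted ATS platforms.
ATS_HOST_SUFFIXES = ("myworkdayjobs.com", "taleo.net")


def is_ats_host(host: str) -> bool:
    """True if the hostname is, or is a subdomain of, a known ATS domain."""
    return any(host == s or host.endswith("." + s) for s in ATS_HOST_SUFFIXES)


def classify_fetch(status_code: int, requested_url: str, final_url: str,
                   listing_count: int) -> str:
    """Label the outcome of one scrape run.

    status_code   -- HTTP status of the final response
    requested_url -- the URL the scrape was configured with
    final_url     -- the URL actually served after any redirects
    listing_count -- number of job listings the scrape extracted
    """
    if status_code == 404:
        return "missing"  # page deleted or moved without a redirect

    requested_host = urlparse(requested_url).hostname or ""
    final_host = urlparse(final_url).hostname or ""
    if final_host != requested_host:
        return "moved_to_ats" if is_ats_host(final_host) else "redirected"

    if listing_count == 0:
        # The page still loads, but no jobs were found: the listings may
        # now live somewhere else.
        return "empty"
    return "ok"


if __name__ == "__main__":
    print(classify_fetch(200, "https://example.com/careers",
                         "https://example.wd1.myworkdayjobs.com/External", 12))
```

The "empty" label matters because a scrape can keep succeeding technically while returning nothing, which is easy to miss if only HTTP errors are monitored.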
To stay on top of common website changes and failures, dev teams need to monitor their scrapes regularly – ideally daily. That’s a lot of work, but the potential for lost revenue from missing and misconfigured listings is significant.
Websites restrict access to data in various ways. Two of the most common are…
Of course, it takes time to diagnose why a scrape isn’t pulling data and to develop a workaround. When a single developer is in charge of maintaining all of a job board’s scrapes, that time can be hard to come by.
According to Pingdom, more than 12,000 websites are down at any point in time, which means scrapes can’t populate listings from those sites. When that happens, candidates cannot click on or apply to the associated jobs, which chips away at a job board’s revenue and visitor experience.
The most effective way to monitor website failures is to write and run scripts designed to detect them. Running those scripts may also require server or cloud space beyond what a job board would otherwise maintain.
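As a minimal sketch of what such a script might look like, the example below probes a list of URLs using only Python's standard library and reports the ones that are unreachable or returning server errors. The function names `probe` and `find_failures`, the 10-second timeout, and the choice to treat only 5xx responses as failures are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 10.0):
    """Return the HTTP status code for url, or None if the site is unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx or 5xx).
        return err.code
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return None


def find_failures(urls, probe_fn=probe) -> dict:
    """Map each failing URL to a short description of the failure."""
    failures = {}
    for url in urls:
        status = probe_fn(url)
        if status is None:
            failures[url] = "unreachable"
        elif status >= 500:
            failures[url] = f"server error {status}"
    return failures


if __name__ == "__main__":
    # Hypothetical career site URLs; replace with the sites being scraped.
    print(find_failures(["https://example.com/careers"]))
```

Passing the probe in as `probe_fn` keeps the failure logic testable without network access. Run across thousands of sites on a schedule, a script like this is where the extra server or cloud capacity comes in.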
And that’s only the first part of dealing with website failures. When they happen, devs have to know how to handle the associated scrape. Pause it? Take it down altogether? Build a scrape for a new site?
The decision-making process requires a nuanced understanding of not only various types of website failures but also the business case for running various scrapes.
Maintaining web scrapes for job listings is especially important right now as Americans leave their jobs and reenter the applicant pool at higher rates than ever before.
And because of this employment environment, even relatively short scrape failures can translate to big losses for job boards. If you don’t currently have the in-house bandwidth to monitor your scrapes so you can ensure as much uptime as possible, outsourcing to a specialty web-scraping firm may be the right solution.
Your web scraping partner can help ensure you populate the most up-to-date data for your listings. Aspen Tech Labs, for example, maintains real-time monitoring and analytics, which ensures our web scrapes are continuously working as designed. When something breaks, we identify it and we fix it, often within the same day. When those fixes take longer than expected, we stay in regular communication with our clients.
The result is that our clients’ job boards are constantly driving revenue while also maintaining an excellent user experience.
If you’re looking to solve your web scraping problems, contact us here. Aspen Tech Labs offers a free trial, so throw one of your most challenging scraping problems at us, and we will show you our industry best practices.
We noticed you mentioned scraping Indeed.com.
Just to confirm: Indeed.com prohibits spidering of its content and will block anyone who tries to scrape it.
Normally, our clients ask us to spider jobs from direct employer websites and applicant tracking systems (ATSes).
In some cases we can spider commercial job boards, but only if there is a formal agreement between our client and the job board that allows spidering.