Challenge: One of our clients requested a “warm” scraping configuration for specific jobs: pre-configured, but not running on a schedule. Our team had to find the best way to import selected jobs (rather than all of them in bulk) from pre-configured sites, process them, and deliver clean data.
We established the following workflow:
1. Our team configures a standard scrape request and provides the client with a proper XML file and scrape ID.
2. The client sets everything up on their end and uploads the file to their own Amazon S3 bucket, with the scrape ID in the file name (so we know which source the file belongs to).
3. The client gives us the file with links to the exact jobs they need scraped, then either waits for the daily reprocess or calls the “Restart Scraping” command via the dashboard scraping API. The selected jobs are then processed, properly formatted, and cleaned up, with mapping rules applied.
4. The client receives the final XML feed with the selected jobs scraped and post-processed in near real time.
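The intake side of steps 2 and 3 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual implementation: the file-name pattern `scrape-<ID>`, the one-URL-per-line file layout, and the function names `scrape_id_from_key` and `parse_job_links` are all hypothetical. The workflow above only states that the scrape ID appears in the file name and that the file contains links to the jobs.

```python
import re
from typing import List
from urllib.parse import urlparse

# Hypothetical naming convention: the scrape ID appears as "scrape-<digits>"
# somewhere in the file name. The real convention is not specified.
SCRAPE_ID_PATTERN = re.compile(r"scrape-(\d+)")


def scrape_id_from_key(key: str) -> str:
    """Extract the scrape ID from an S3 object key (assumed format)."""
    filename = key.rsplit("/", 1)[-1]  # drop any folder prefix
    match = SCRAPE_ID_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"no scrape ID found in file name: {filename!r}")
    return match.group(1)


def parse_job_links(text: str) -> List[str]:
    """Return unique http(s) job URLs in first-seen order.

    Assumes one URL per line; blank lines, lines starting with '#',
    and anything that is not an http(s) URL are skipped.
    """
    seen = set()
    links = []
    for line in text.splitlines():
        url = line.strip()
        if not url or url.startswith("#"):
            continue
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links
```

With these helpers, a key such as `uploads/client_scrape-1042.txt` would map to scrape ID `1042`, and only the deduplicated links from that file would be queued for the next reprocess or restart.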
Results: The process lets the client control which jobs populate their XML feeds and keeps the data fresh in near real time for immediate job distribution.
We noticed you mentioned scraping Indeed.com.
Just to confirm: Indeed.com prohibits spidering of its content and will block anyone who tries to scrape it.
Normally, our clients ask us to spider jobs from direct employer websites and ATSes.
In some cases we can spider commercial job boards, provided there is a formal agreement between our client and the job board that allows spidering.