Challenge: A client asked us to scrape hotel and host descriptions, availability dates, pricing, conditions, and accommodation details, along with the accompanying photos, from adverts for monitoring and business intelligence purposes.
Solution: Our team created 6 separate CSV feed files containing the data the client requested:
1) Hotel listings data fields: place/hotel/room descriptions, locations, titles, IDs, URLs, coordinates;
For coordinates, we additionally enabled our GeoLocation feature to help parse this data properly.
2) Host feed data: data related to place/hotel owners;
3) Pricing feed: data related to apartment booking rates and periods;
Additionally, our team cleaned up a large amount of numerical data to handle the different variations of rates and time periods (per night, per week, per month, etc.);
4) Photos: links to images;
5) Review feed: data on reviews of apartments/places posted by users. This part was challenging, but we managed to extract the raw data and then added every single review to the feed as its own record;
6) Calendar data feed.
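The rate cleanup in the pricing feed and the one-record-per-review layout of the review feed can be illustrated with a minimal Python sketch. It is not the script used in the project: the conversion factors (7 nights per week, 30 nights per month), the field names (`id`, `reviews`, `author`, `text`), and the function names are all assumptions made for the example.

```python
import csv
import io

# Assumed conversion factors; a real feed might use calendar-aware month lengths.
NIGHTS_PER_PERIOD = {"night": 1, "week": 7, "month": 30}


def nightly_rate(amount, period):
    """Convert a rate quoted per night/week/month into a per-night rate."""
    nights = NIGHTS_PER_PERIOD[period.strip().lower()]
    return round(amount / nights, 2)


def flatten_reviews(listing):
    """Yield one flat row per review, repeating the listing ID on each row."""
    for review in listing["reviews"]:
        yield {
            "listing_id": listing["id"],
            "author": review["author"],
            "text": review["text"],
        }


def reviews_to_csv(listings):
    """Write all reviews of all listings into a single CSV string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["listing_id", "author", "text"])
    writer.writeheader()
    for listing in listings:
        writer.writerows(flatten_reviews(listing))
    return buffer.getvalue()
```

Converting every rate to a common per-night basis makes listings comparable, and repeating the listing ID on each review row keeps the review feed joinable with the listings feed.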
Results: We collected a large volume of raw data, along with additional data from other sources (separate URLs or frames). We then wrote a dedicated script that parsed, formatted, and combined all of the data into a user-readable form.
For example, the calendar data covered the booked and available dates of hotel rooms. This let us determine room availability for the first 90 days and mark each date as occupied or available. We were also able to add pricing values and calculate total prices.
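As a rough illustration of the calendar step, the sketch below marks each of the first 90 days as available or occupied and sums nightly prices for a stay. The function names, the set-of-dates input for bookings, and the per-date price mapping are assumptions for the example, not the format of the actual feed.

```python
from datetime import date, timedelta


def availability_window(start, booked, days=90):
    """Return (date, is_available) pairs for `days` consecutive days from `start`."""
    window = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        window.append((day, day not in booked))
    return window


def total_price(check_in, nights, nightly_prices, booked):
    """Sum nightly prices for a stay; return None if any night is booked or unpriced."""
    total = 0.0
    for offset in range(nights):
        day = check_in + timedelta(days=offset)
        if day in booked or day not in nightly_prices:
            return None
        total += nightly_prices[day]
    return total
```

Returning `None` for a stay that overlaps a booked or unpriced date keeps unavailable periods from showing up with a misleading partial total.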
We noticed you mentioned scraping Indeed.com.
Just to confirm: Indeed.com prohibits spidering of its content and will block anyone who tries to scrape it.
Normally, our clients ask us to spider jobs from direct employer websites and ATSes.
In some cases we can spider commercial job boards, provided there is a formal agreement between our client and the job board that allows spidering.