Multiple links to scrape - Please help!

Questions and answers about anything related to Helium Scraper
navarino
Posts: 2
Joined: Sat Aug 06, 2011 5:23 pm

Multiple links to scrape - Please help!

Post by navarino » Sat Aug 06, 2011 5:36 pm

Hello,

I really need your help! I have a list of Amazon URLs for books from which I need to scrape data such as title, price, star rating, number of customer reviews and the content of all the reviews. I have followed the basic tutorial and understand how I would do it for one link, but how do I perform the same actions for a list of many links that I have in an Excel spreadsheet?

I have a big list - here is a sample list of 10 books:

http://www.amazon.co.uk/Books/dp/0004721152
http://www.amazon.co.uk/Books/dp/0004721896
http://www.amazon.co.uk/Books/dp/0004722078
http://www.amazon.co.uk/Books/dp/0004722086
http://www.amazon.co.uk/Books/dp/0004722914
http://www.amazon.co.uk/Books/dp/0004723686
http://www.amazon.co.uk/Books/dp/0004724143
http://www.amazon.co.uk/Books/dp/0004724704
http://www.amazon.co.uk/Books/dp/0006374921
http://www.amazon.co.uk/Books/dp/0006387802

Any help would be greatly appreciated,

Thanks,
Navarino

webmaster
Site Admin
Posts: 521
Joined: Mon Dec 06, 2010 8:39 am
Contact:

Re: Multiple links to scrape - Please help!

Post by webmaster » Mon Aug 08, 2011 8:27 pm

Hi Navarino,

You can use the Navigate URLs action to navigate through a list of URLs.
Juan Soldi
The Helium Scraper Team

navarino
Posts: 2
Joined: Sat Aug 06, 2011 5:23 pm

Re: Multiple links to scrape - Please help!

Post by navarino » Tue Aug 09, 2011 3:25 pm

Hi Juan,

Thanks. I will try the 'Navigate URLs' and see if it works.

I need your help with three other issues. Say I have 100 URLs to scrape; how do I set pauses, for example a pause of 1 minute after every 20 URLs? I tried working with 'Wait' in the actions tree but could only get a pause after each URL rather than after every 20.

Also, can I use a different proxy server after, say, 20 URLs? I can add many servers using the proxy tool, but can I set it up so that after every 20 URLs it switches to a different server?

One final query: I would like to extract a slice of an HTML link on a site rather than the inner text. In the actions part I go to Extract, and from the drop-down menu I can choose OuterHTML, which gives me the full link. But how do I slice this, as I only need a number from the hyperlink? (I can do this with text using text gatherers, but how do I do it with OuterHTML?)

Appreciate your help,
Navarino

webmaster
Site Admin
Posts: 521
Joined: Mon Dec 06, 2010 8:39 am
Contact:

Re: Multiple links to scrape - Please help!

Post by webmaster » Tue Aug 09, 2011 7:39 pm

Hi Navarino,

Use the attached project to wait and rotate the proxy every 20 (or any other number of) URLs. Just add your URLs to the URLs table, double-click the Execute JS (Pause) action, and set the pauseEvery and waitSeconds variables. It will wait the number of seconds indicated by waitSeconds and rotate the proxies every pauseEvery URLs.

Note that this Execute JS action uses the Tree.UserData variable, which assumes you are not using it in another Execute JS action in the same actions tree.
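The counting logic inside such an Execute JS action can be sketched like this. This is only an illustration, not the code from the attached project: the counter is shown as a plain variable rather than Tree.UserData, and the actual waiting and proxy switching are represented by a comment.

```javascript
// Sketch of the pause-and-rotate logic described above.
// In the real project the counter would be kept in Tree.UserData;
// here it is simulated with a local loop variable.
function shouldPauseAndRotate(urlsVisited, pauseEvery) {
  // Time to pause once we have visited a whole multiple of pauseEvery URLs.
  return urlsVisited > 0 && urlsVisited % pauseEvery === 0;
}

// Simulate 100 URLs with pauseEvery = 20, recording where pauses occur.
var pauseEvery = 20;
var pauses = [];
for (var visited = 1; visited <= 100; visited++) {
  if (shouldPauseAndRotate(visited, pauseEvery)) {
    // Here the real action would wait waitSeconds seconds
    // and switch to the next proxy in the list.
    pauses.push(visited);
  }
}
// pauses is now [20, 40, 60, 80, 100]
```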

To use a text gatherer with outer HTML, first open the text gatherer tool and paste into the samples table at the bottom one or more samples of the code from which you want to extract a piece of text. This code must be the OuterHTML property. Then create your text gatherer as you normally would. Say you called it "myTextGatherer". Open Project -> JavaScript Gatherers, find the "JS_myTextGatherer" gatherer, and modify the first line from this:

Code:

var step0_result = element.innerText.replace(/\r\n/g, "\n");
to this:

Code:

var step0_result = element.outerHTML.replace(/\r\n/g, "\n");
Remember to test your newly created gatherer by selecting it in the selection panel at the bottom.
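As a concrete illustration of slicing a number out of a link's outer HTML, the gatherer's result can be run through a regular expression. This is a standalone sketch, not generated gatherer code: the element object and its href are made up for the example.

```javascript
// Sketch: extracting only the digits from a link's outer HTML.
// `element` stands in for the real page element the gatherer receives.
var element = {
  outerHTML: '<a href="http://www.amazon.co.uk/Books/dp/0004721152">Book</a>'
};

// Same normalization as the generated gatherer's first line.
var step0_result = element.outerHTML.replace(/\r\n/g, "\n");

// Slice out the first run of digits (the numeric book id in this sample).
var match = step0_result.match(/\d+/);
var number = match ? match[0] : "";
// number is now "0004721152"
```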
Attachments
Wait.hsp
(325.87 KiB) Downloaded 519 times
Juan Soldi
The Helium Scraper Team
