How to skip a selection from a search list

boochikik
Posts: 3
Joined: Wed Oct 26, 2011 5:03 pm

How to skip a selection from a search list

Post by boochikik » Wed Oct 26, 2011 5:13 pm

I am trying to scrape customer reviews from the Amazon website, for cameras, for example. But the search results also show accessories, which I don't care about. How do I navigate through each of the results, selecting only the cameras and skipping the accessories?

Here's the sample URL:
http://www.amazon.com/s/ref=nb_sb_noss? ... ony+camera

Thanks so much for this great tool.

webmaster
Site Admin
Posts: 521
Joined: Mon Dec 06, 2010 8:39 am

Re: How to skip a selection from a search list

Post by webmaster » Wed Oct 26, 2011 10:05 pm

Hi,

Interesting case. On the link you sent, I haven't noticed any clear-cut, objective way to distinguish a camera from an accessory, such as their belonging to different categories, or cameras having an icon that accessories don't. If there happens to be such a way to distinguish them, let me know. If not, I have a couple of projects that should help you, and which can also be applied to many other cases.

The first one contains a JavaScript gatherer called JS_HasWords that will return true if an element has any of a given set of words, or false otherwise. You can set these words (keywords actually, since each can be more than one word) by modifying the line of code that says var words = "case|memory stick|battery"; (as you can see I already set a few sample words).

There is also a kind called Links. If you look at its properties, you'll find the JS_HasWords property set to False, which means it will only select links that don't contain any of the words set in the JS_HasWords gatherer.
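As a rough illustration of the idea (this is not the code from HasWords.hsp), a keyword check of this sort could look like the sketch below. Only the `var words = ...` line comes from the post; the function names `hasWords` and `keepLinksWithoutWords`, the case-insensitive matching, and the sample titles are assumptions made for the example.

```javascript
// Hypothetical stand-in for the JS_HasWords gatherer; the real code ships in HasWords.hsp.
var words = "case|memory stick|battery"; // the keyword line quoted in the post

// Returns true if `text` contains any of the pipe-separated keywords.
// Assumption: matching is case-insensitive and based on plain substrings.
function hasWords(text) {
  var haystack = String(text).toLowerCase();
  return words.split("|").some(function (keyword) {
    return haystack.indexOf(keyword.trim().toLowerCase()) !== -1;
  });
}

// Mimics a kind whose JS_HasWords property is False:
// keep only the titles that contain none of the keywords.
function keepLinksWithoutWords(titles) {
  return titles.filter(function (title) {
    return !hasWords(title);
  });
}

// Made-up link titles, for illustration only.
var sampleTitles = [
  "Sony Cyber-shot 16 MP Digital Camera",
  "Leather Case for Sony Cameras",
  "Sony 8 GB Memory Stick PRO Duo"
];
console.log(keepLinksWithoutWords(sampleTitles));
```

Because this sketch matches substrings, a keyword such as "case" would also match a word like "showcase"; a stricter version might match on word boundaries instead.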

The second project is a bit more complicated. It has a JavaScript gatherer called JS_IsBiggerThanN that will return true if the element contains a number (or a price) greater than the number N, which you can set to any number by editing the line of code that says var N = 100;. So, used as it is, it will return true for prices greater than 100. There is also a kind called Prices in which the JS_IsBiggerThanN property is True, so it will only select prices greater than 100. If you modify N in the JS_IsBiggerThanN gatherer, you also change which elements the Prices kind selects, without having to recreate the kind.
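A minimal sketch of such a threshold check is shown below; it is not the code from IsBiggerThanN.hsp. Only the `var N = 100;` line comes from the post. The function name `isBiggerThanN`, the assumption that prices are formatted US-style (commas as thousands separators), and the choice to look only at the first number in the text are all made up for the example.

```javascript
// Hypothetical stand-in for JS_IsBiggerThanN; the real code ships in IsBiggerThanN.hsp.
var N = 100; // the threshold line quoted in the post

// Returns true if `text` contains a number (for example a price) greater than N.
function isBiggerThanN(text) {
  // Assumption: commas are thousands separators, as in "$1,299.00".
  var match = String(text).replace(/,/g, "").match(/\d+(\.\d+)?/);
  if (!match) {
    return false; // no number in the text at all
  }
  // Only the first number found is compared against N.
  return parseFloat(match[0]) > N;
}

console.log(isBiggerThanN("$1,299.00")); // true
console.log(isBiggerThanN("$24.99"));    // false
```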

Now, these elements are not links themselves, so we cannot navigate through them. But if you look at the test actions tree, you will find that, after an Execute JS action that clears the DummyTable, there is an Extract action that extracts the prices, followed by a Navigate Each action that navigates through the links. If you double-click this Navigate Each action, you'll notice that the Only if modified in DummyTable option is selected. This causes the Navigate Each action to navigate only through links that correspond to prices that have been extracted to the DummyTable. There is also a sample Extract action that extracts the title inside each of these description pages.

Let me know if you have any trouble with these files.
Attachments
IsBiggerThanN.hsp
(384.13 KiB) Downloaded 538 times
HasWords.hsp
(278.84 KiB) Downloaded 572 times
Juan Soldi
The Helium Scraper Team

boochikik
Posts: 3
Joined: Wed Oct 26, 2011 5:03 pm

Re: How to skip a selection from a search list

Post by boochikik » Thu Oct 27, 2011 6:48 am

Hi Juan,
Thank you so much for your quick reply. Both projects you sent were interesting. IsBiggerThanN.hsp appears to be able to resolve my issue. However, it compares items by price. What I would like is to pick the items to include based on the model name. In the project I'm working on, I would like to look for items with product names starting with "Kodak ESP" or "Kodak HERO". The other project you sent (HasWords.hsp) would have been appropriate for parsing; however, I cannot use it in the If..while action tree. I tried to put both projects together, but unfortunately I got lost. I attached the zipped project I'm working on so you can see what I'm trying to do. I had to zip it because it's bigger than the maximum allowed size. I am basically trying to scrape the customer reviews for the printers only, not the ink cartridges, etc.

Thanks in advance for your help.
Attachments
Amazon_rip_test.zip
Please unzip
(40.09 KiB) Downloaded 545 times

webmaster
Site Admin
Posts: 521
Joined: Mon Dec 06, 2010 8:39 am

Re: How to skip a selection from a search list

Post by webmaster » Fri Oct 28, 2011 6:35 pm

Hi,

I found a nice solution to your problem :). Instead of the DummyTable thing, which is not as accurate and is hard to understand, I wrote an actions tree called Reflect Kind. This actions tree takes three kinds: a source, a parent and a target. First, it selects the source kind. Then, for each selected element, it goes up the HTML tree until it finds an element of kind parent, and these parent elements are selected (which means only parent elements that contain a source element will be selected). Then it does the opposite: it goes down the HTML tree from each of the selected parent elements until it finds a target element, and these target elements are selected. From here, you can use a Navigate Each action that navigates through the currently selected items (by setting the Use current selection option).
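The source, parent, target walk described above can be sketched conceptually as follows. This is not how the Reflect Kind actions tree is implemented and it does not use Helium Scraper's API: plain objects stand in for HTML elements, each kind is modelled as a predicate function, and every name (`makeNode`, `selectDescendants`, `closestParent`, `reflectKind`) and all sample labels are hypothetical.

```javascript
// Conceptual sketch only: plain objects stand in for HTML elements.

// Builds a node and wires up the parent pointers of its children.
function makeNode(label, children) {
  var node = { label: label, parent: null, children: children || [] };
  node.children.forEach(function (child) { child.parent = node; });
  return node;
}

// Collects every descendant of `root` that satisfies the predicate `isKind`.
function selectDescendants(root, isKind) {
  var found = [];
  root.children.forEach(function (child) {
    if (isKind(child)) { found.push(child); }
    found = found.concat(selectDescendants(child, isKind));
  });
  return found;
}

// Walks up from `node` until an ancestor of the parent kind is found (or null).
function closestParent(node, isParent) {
  var current = node.parent;
  while (current && !isParent(current)) { current = current.parent; }
  return current;
}

// Source -> parent -> target, as described in the post.
function reflectKind(root, isSource, isParent, isTarget) {
  var parents = [];
  selectDescendants(root, isSource).forEach(function (source) {
    var parent = closestParent(source, isParent);
    if (parent && parents.indexOf(parent) === -1) { parents.push(parent); }
  });
  var targets = [];
  parents.forEach(function (parent) {
    targets = targets.concat(selectDescendants(parent, isTarget));
  });
  return targets;
}

// Made-up results page: two containers, each with a title link and a reviews link.
var page = makeNode("results", [
  makeNode("container", [makeNode("link: Kodak ESP printer"), makeNode("reviews: A")]),
  makeNode("container", [makeNode("link: Kodak ink cartridge"), makeNode("reviews: B")])
]);

var selected = reflectKind(
  page,
  function (n) { return n.label.indexOf("link: Kodak ESP") === 0; }, // source kind
  function (n) { return n.label === "container"; },                  // parent kind
  function (n) { return n.label.indexOf("reviews") === 0; }          // target kind
);
console.log(selected.map(function (n) { return n.label; })); // [ 'reviews: A' ]
```

Only the reviews element that shares a container with a matching link ends up selected, which is the effect the post describes.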

So in your project, first, I added a small JavaScript gatherer called JS_StartsWith, which returns true if the text of the element starts with "Kodak ESP" or "Kodak HERO" (you can look at the code from Project -> JavaScript Gatherers -> JS_StartsWith and see how you can add valid starting strings to it). Then, I created the Valid Links kind, which selects only links where the JS_StartsWith property is true (and therefore only links to "Kodak ESP" or "Kodak HERO" items).
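For readers without the project file at hand, a prefix check of this kind might look like the sketch below. It is not the JS_StartsWith code from the attached project; the function name `startsWithAny`, the `prefixes` array, the whitespace trimming, the case-sensitive comparison, and the sample titles are assumptions for illustration.

```javascript
// Hypothetical stand-in for JS_StartsWith; the real code is in the attached project.
// Add more valid starting strings to this list as needed.
var prefixes = ["Kodak ESP", "Kodak HERO"];

// Returns true if `text` starts with any of the prefixes.
// Assumption: leading whitespace is ignored and the comparison is case-sensitive.
function startsWithAny(text) {
  var trimmed = String(text).trim();
  return prefixes.some(function (prefix) {
    return trimmed.indexOf(prefix) === 0;
  });
}

console.log(startsWithAny("Kodak ESP All-in-One Printer")); // true
console.log(startsWithAny("Kodak Black Ink Cartridge"));    // false
```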

I also added the Containers kind, which selects the DIV elements that contain both the links and the elements that your Select By Reviews kind selects (the number-of-reviews link). To create this kind, I simply selected one of the links and kept hitting the Select parent button in the selection panel until I saw that the number-of-reviews link was highlighted, and then created the kind. Then I did the same thing with some other links and added them to this kind until it started selecting 16 items on a couple of pages.

Finally, I added a Reflect Kind action to your Actions tree 1 that uses Containers as a parent, Valid Links as a source, and Select By Reviews as a target. So what will happen is that Valid Links will be selected (remember, only links to "Kodak ESP" or "Kodak HERO" items), then only the Containers that contain these links will be selected, and finally the Select By Reviews items inside these containers will be selected.

Notice that I've set the Navigate Each action to Use current selection only and not to Simulate click. You cannot simulate a click while using the current selection, but even if you could, I strongly recommend using this option only when necessary (the easiest way to figure out whether it is necessary is to test without the click first).

Make sure the new kinds are selecting elements on your end. The Containers kind should select every item on any results page, and the Valid Links kind should select the links I described above.

Let me know if you have any questions.
Attachments
ReflectKind.hsp
(281.8 KiB) Downloaded 553 times
AmazonRipTest2.hsp
(407.06 KiB) Downloaded 559 times
Juan Soldi
The Helium Scraper Team

boochikik
Posts: 3
Joined: Wed Oct 26, 2011 5:03 pm

Re: How to skip a selection from a search list

Post by boochikik » Thu Nov 03, 2011 4:22 pm

Hi Juan. Thank you very much for your kind assistance. Everything works great now. I just have a few things that I have to figure out for other review websites.
