Note: This post was written for an older version of Helium Scraper.
I had a user who was having trouble with a kind that was supposed to select the “next” button on a page. It worked fine on the first page, but when he added the “next” button from the second page, his kind started selecting the “back” button as well. Given the set of properties that defined his kind, Helium Scraper couldn’t find any difference between the “back” and “next” buttons. But if he and I could tell the difference just by looking at them, then Helium Scraper should be able to do so too.
The difference was in the buttons’ images. One was a little red left arrow and the other a right arrow. So all he needed to do was activate the “SrcAttribute” gatherer from Project -> Options -> Select Active Properties. This property gatherer gets the “src” attribute of the element, which contains the URL of the element’s image. After he did this, Helium Scraper started selecting only the “next” button on every page.
This is how property gatherers work. When creating a kind, Helium Scraper gathers every active property from every element in a webpage and generates a list of the properties that are common to every element we have added to this kind. This list is the definition of the kind. So, for instance, if we tell Helium Scraper to take the color of the elements into consideration when creating kinds (by activating a gatherer that gets the color of the element, such as the “BackgroundColor” one), and we create a kind using elements that are all red, then this kind will only select red elements. But if we use elements with different colors, this property will be removed from the kind’s definition and the kind will select elements of any color.
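To make the idea concrete, here is a minimal sketch of that “common properties” logic in plain JavaScript. It is not Helium Scraper’s actual code: the functions commonProperties and matchesKind, the “TagName” property, and the representation of elements as plain objects are all assumptions made for illustration.

```javascript
// Hypothetical model: each element is a plain object that maps
// property names (as produced by the active gatherers) to values.
function commonProperties(elements) {
  if (elements.length === 0) return {};
  // Start from the first sample and drop every property
  // that any other sample disagrees on.
  const definition = { ...elements[0] };
  for (const element of elements.slice(1)) {
    for (const name of Object.keys(definition)) {
      if (element[name] !== definition[name]) delete definition[name];
    }
  }
  return definition;
}

// An element belongs to the kind if it matches every remaining property.
function matchesKind(definition, element) {
  return Object.keys(definition).every(
    (name) => element[name] === definition[name]
  );
}

// All samples are red, so the color stays in the definition.
const allRed = commonProperties([
  { TagName: "A", BackgroundColor: "red" },
  { TagName: "A", BackgroundColor: "red" },
]);

// Samples differ in color, so the color is dropped from the definition.
const mixed = commonProperties([
  { TagName: "A", BackgroundColor: "red" },
  { TagName: "A", BackgroundColor: "blue" },
]);
```

With these samples, allRed only matches red elements, while mixed matches an element of any color as long as its “TagName” is “A”.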
var index = url.indexOf("://");
return url.substring(index + 3).split(/\/+/g);
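The two lines above rely on a url variable that is defined elsewhere, and as shown they return every slash-separated segment rather than just the domain. As a rough standalone sketch, assuming the goal is only the domain part of the link, the same idea could be written as follows. The function name linkDomain is hypothetical, and this version can be run outside Helium Scraper.

```javascript
// Hypothetical helper: takes a link URL as a string and returns its domain.
function linkDomain(url) {
  var index = url.indexOf("://");
  // Skip the "scheme://" prefix when present; otherwise use the whole string.
  var rest = index === -1 ? url : url.substring(index + 3);
  // Everything before the first slash is the domain.
  return rest.split(/\/+/)[0];
}

linkDomain("http://www.example.com/some/page.html"); // "www.example.com"
```

Two links that point to different pages in the same domain produce the same value, which is what lets this property survive in a kind’s definition.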
Now, going back to my example, if I wanted to create a kind that selects only links to the “www.example.com” domain, I would select a few links that point to more than one page in that domain and create a kind called “LinksToExample”. This kind will now select links that point to any page in that domain. If I didn’t have any links to that domain to use as samples, I could always edit the kind manually by clicking on the “Edit kind” button in the kind editor. It opens an XML editor that displays the XML representation of the kind. If you know nothing about XML, don’t panic. It’s just the list of properties that define our kind. Each item in this list starts with the <Item> tag and ends with the </Item> tag.
So, if I only had links that point to domains I don’t care about, I would create a kind that selects links to any one of them and then, in the kind’s XML, find this line (remember, my gatherer is called “JS_LinkDomain”):
And right underneath, supposing I created my kind by selecting links that pointed to pages in the “www.DomainIDoNotWant.com” domain, change this line:
to this one:
Now, in order for the “JS_LinkDomain” property to be listed in my kind definition’s XML, I must have selected links that all point to the same domain when creating my kind. This is because, as I said before, only properties that are common to every element used to create a kind are listed in the kind’s definition. If, for some reason, I had been forced to select links to different domains, I would just add this code right below the <Items> (note the “s”) tag: