Suggestions ALA "Rapid Miner"

What do you suggest? What should be improved? We are all ears.
derba
Posts: 27
Joined: Tue May 15, 2012 8:03 pm

Post by derba » Wed May 16, 2012 11:35 am

Hello,

Could you please add the following ideas, taken from an open source tool (Rapid Miner, http://rapid-i.com/content/view/181/190/, which is, by the way, not very user-friendly, to say the least!):

a) Another way of crawling the web: regex crawling rules

If a given keyword is found in a URL contained in a webpage, follow that URL. ;)
See the image below:
[Image: crawl web Regex crawling rules.jpg]
b) Keep only part of a webpage's content: string matching
The user chooses the starting and ending HTML code found in a webpage. Then only the content in between is kept. ;)
(Note: to find the HTML code, right-click on the webpage in a web browser such as Internet Explorer or Firefox and choose something like "View Page Source".)
[Image: 01 string matching example.jpg]
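To make suggestion b) concrete, here is a minimal sketch of the start/end marker idea in plain JavaScript. The function name extractBetween and the sample page string are made up for illustration; this is neither Rapid Miner's nor Helium Scraper's actual implementation.

```javascript
// Hypothetical helper: returns the text between the first occurrence of
// startMarker and the next occurrence of endMarker, or null if either is missing.
function extractBetween(html, startMarker, endMarker) {
	var start = html.indexOf(startMarker);
	if (start === -1) return null;
	start += startMarker.length;
	var end = html.indexOf(endMarker, start);
	if (end === -1) return null;
	return html.substring(start, end);
}

// Made-up sample page, for illustration only.
var page = "<html><body><div id='content'>Hello world</div><p>footer</p></body></html>";
console.log(extractBetween(page, "<div id='content'>", "</div>")); // prints "Hello world"
```

Plain string search like this stops at the first matching end marker, so with nested tags the end marker has to be chosen carefully.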
c) Keep only part of a webpage's content: XPath
(Note: to find the correct XPath easily, I recommend the Firefox add-on "AutoPager" https://addons.mozilla.org/en-US/firefo ... autopager/ ; see my next post below for a mini tutorial on that add-on.)
[Image: 02 xpath queries.jpg]
Thanks in advance ;)

derba

Re: Suggestions ALA "Rapid Miner"

Post by derba » Wed May 16, 2012 4:28 pm

Here is a mini tutorial for finding an XPath with the Firefox AutoPager add-on:

Once the add-on is installed, the starting URL is the current page:
http://www.heliumscraper.com/forum/view ... =492&p=992
[Images: autopager01.jpg, autopager02.jpg, autopager03.jpg]
Hope this helps. ;)

webmaster
Site Admin
Posts: 501
Joined: Mon Dec 06, 2010 8:39 am

Re: Suggestions ALA "Rapid Miner"

Post by webmaster » Thu May 17, 2012 4:01 am

Hi,

Thanks for your suggestions.

As for point A, you can achieve the same effect by creating a JavaScript gatherer with this code:

Code:

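try
{
	// Matches any string containing "keyword" with at least one character before and after it.
	var regex = /.+keyword.+/;
	// If there is no href, getAttribute returns null, .match throws, and the catch below returns false.
	return element.getAttribute('href').match(regex) ? true : false;
}
catch(error)
{
	return false;
}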
Call the gatherer, say, IsMatch, and then manually create a kind with the property JS_IsMatch set to True. Note that element.getAttribute('href') could be replaced by element.href. The difference is that the latter resolves relative links to absolute links, so you'd likely match more links, since the regex would be tested against absolute, and therefore longer, URLs.
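To illustrate the relative vs. absolute difference outside of Helium Scraper, here is a small sketch. It uses the standard URL constructor as a stand-in for what element.href does in a browser; the page URL and link are made-up examples, and resolveHref is a hypothetical helper name.

```javascript
var regex = /.+keyword.+/;

// Stand-in for element.href: resolves the raw href attribute against the page URL.
function resolveHref(rawHref, pageUrl) {
	return new URL(rawHref, pageUrl).href;
}

var pageUrl = "http://example.com/articles/index.html"; // made-up page URL
var rawHref = "keyword-list.html";                       // what getAttribute('href') would return
var absoluteHref = resolveHref(rawHref, pageUrl);

console.log(regex.test(rawHref));      // prints false: nothing precedes "keyword"
console.log(absoluteHref);             // prints "http://example.com/articles/keyword-list.html"
console.log(regex.test(absoluteHref)); // prints true
```

The relative link fails only because the pattern demands at least one character before "keyword"; a pattern such as /keyword/ would match both forms.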

Regarding suggestion B, if I understand correctly, you should be able to do this with Text Gatherers (here is more info about these with video included ;) ).

As for suggestion C, it seems to me you'd basically be creating kinds (each kind would roughly correspond to one of the XPath lines in your autopager03.jpg picture), except that Helium Scraper identifies elements by a set of properties, which can also be extended with JavaScript gatherers in case they are not enough for a particular scenario.

It would indeed be interesting to have XPath kinds in Helium Scraper, although it would probably be overkill considering how easy it is to create kinds by selecting sample elements and letting Helium Scraper figure out what they have in common.

Let me know if there is anything I missed.
Juan Soldi
The Helium Scraper Team

derba

Re: Suggestions ALA "Rapid Miner"

Post by derba » Sat May 19, 2012 3:36 pm

Hi,
Many thanks for the detailed explanations. ;)

For A: your code worked in my testing, but after many trials I prefer using the kind system ("next page" + JavaScript in an action) already in place in Helium! ;)

For B: thanks. I was aware of "JavaScript gatherers" but never thought that they could do the trick! ;)

For C:
I have found the following simple thing very powerful (sorry, I am a newbie!):
Let's say that the HTML code is: <div id='content'>
Then my XPath is: //div[@id='content']

If I am not wrong, I can do that in Helium with:
"Define kind manually", then in the Property column click the little down arrow, select "IdAttribute" and enter "content" in the Value column.

But I am stuck on these two further examples:
1) If the @id is different, like @iduknow, how do I tell Helium to get that? Using JS_Xpath? Or NameAttribute? Is there documentation on that (or could you share a few examples), please?
2) What if the XPath is exactly one of the following?
//div[@id='content']/table/tbody
or
//div[@id='content']/*/tr

Many thanks in advance ;)

webmaster

Re: Suggestions ALA "Rapid Miner"

Post by webmaster » Sun May 20, 2012 6:20 pm

By "if the @id is different", do you mean that instead of an ID you want to use something else, such as a class name or a tag name, or that the element just has another ID? If the latter, you can create a kind manually and set the IdAttribute property to whatever the ID is. If you meant using any attribute other than an ID, use this code in a JavaScript gatherer:

Code:

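try
{
	// Return the raw value of the attribute (null if the element does not have it).
	return element.getAttribute('some_attribute');
}
catch(error)
{
	// Any error yields no value for this element.
	return null;
}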
Replace some_attribute with whatever other attribute you want to use. Note that property gatherers already exist for common attributes such as class name. When creating a JavaScript gatherer, you may want to preview its output for some elements. You can do this by clicking the Choose visible properties button on the selection panel, picking your property gatherer (its name will start with JS_), and then selecting some elements on the page.

As for your sample XPaths, you'll see that there are a bunch of properties with names like FirstParentId, SecondParentId or ParentTagName. Your second sample (//div[@id='content']/table/tbody) would be similar to a kind where the SecondParentId is "content", the ParentTagName is table, and the TagName is tbody. If you need something like the 10th parent ID, you'd have to create a JavaScript gatherer to get this property. But again, if you use the automated way of creating kinds, by selecting elements and using the Create kind from selection button, all these properties will be populated automatically without having to worry about XPath at all.
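For the "10th parent ID" case, a gatherer could walk up the parentNode chain. Below is a minimal sketch of that walk as a standalone function, tested against plain objects standing in for DOM nodes. The name nthParentId is hypothetical, and whether helper functions can be declared inside a Helium Scraper gatherer is an assumption I have not verified.

```javascript
// Hypothetical helper: walks up n levels via parentNode and returns that
// ancestor's id, or null if the chain is shorter than n or the id is empty.
function nthParentId(element, n) {
	var node = element;
	for (var i = 0; i < n; i++) {
		if (!node || !node.parentNode) return null;
		node = node.parentNode;
	}
	return node.id ? node.id : null;
}

// Plain objects standing in for DOM nodes: div#content > table > tbody
var div = { id: "content", parentNode: null };
var table = { id: "", parentNode: div };
var tbody = { id: "", parentNode: table };

console.log(nthParentId(tbody, 2));  // prints "content"
console.log(nthParentId(tbody, 10)); // prints null
```

Inside a gatherer, the same loop would presumably sit inside the try/catch pattern shown earlier, with element supplied by Helium Scraper and null returned on error.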
Juan Soldi
The Helium Scraper Team

derba

Re: Suggestions ALA "Rapid Miner"

Post by derba » Sun May 20, 2012 7:29 pm

Hi Juan,
Many thanks for the detailed and clear explanations. ;)
I appreciate it. ;)
