Could you please add the following ideas, taken from an open-source program (Rapid Miner, http://rapid-i.com/content/view/181/190/, which is, by the way, not very user-friendly, to say the least!):
a) Another way of crawling the web: regex crawling rules
If a given keyword (or regular expression) matches a URL contained in a webpage, the crawler follows that link.
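
To make the idea concrete, here is a rough sketch in Python of what such a rule could do. It is only an illustration, not how Rapid Miner or this program implements it; the names `LinkCollector` and `links_to_follow` and the example URLs are made up for the example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_to_follow(page_html, base_url, rule):
    """Return the absolute URLs found in page_html that match the regex rule."""
    collector = LinkCollector()
    collector.feed(page_html)
    pattern = re.compile(rule)
    # Resolve relative links against the page URL before testing the rule.
    absolute = (urljoin(base_url, href) for href in collector.links)
    return [url for url in absolute if pattern.search(url)]
```

For example, with the rule `/forum/` a crawler would queue only the links whose URL contains `/forum/` and ignore the rest of the page's links.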

See image below:

b) Keep only a part of a webpage's content: string
The user chooses the starting and ending HTML code found in a webpage. Only the content in between is then kept.
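
A rough sketch of that behaviour in Python, just to illustrate the request. The function name `keep_between` is made up, and returning `None` when a marker is missing is an assumption about what the program should do in that case.

```python
def keep_between(page_html, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker.

    Returns None if either marker is not found (an assumed behaviour).
    """
    start = page_html.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)  # skip past the start marker itself
    end = page_html.find(end_marker, start)
    if end == -1:
        return None
    return page_html[start:end]
```

With the start marker `<div id="post">` and the end marker `</div>`, only the text of that block would be kept and the rest of the page discarded.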

(Note: to find the HTML code, usually you right-click on a webpage inside a web browser like Internet Explorer or Firefox and choose something like "view page source".)

c) Keep only a part of a webpage's content: XPath link
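
As an illustration of the XPath idea, here is a small sketch using Python's standard `xml.etree.ElementTree`. The function name `keep_xpath` is made up. Note the assumptions: ElementTree supports only a subset of XPath and needs well-formed markup, so a real crawler would more likely rely on a lenient HTML parser (for instance the third-party lxml library).

```python
import xml.etree.ElementTree as ET


def keep_xpath(page_xhtml, xpath):
    """Return the text of every element matching xpath.

    Assumes page_xhtml is well-formed (XHTML-like) and that xpath stays
    within the limited XPath subset ElementTree understands.
    """
    root = ET.fromstring(page_xhtml)
    # itertext() gathers the text of the element and all of its children.
    return ["".join(node.itertext()).strip() for node in root.findall(xpath)]
```

An expression such as `.//div[@class='post']` would then keep the text of every `div` with class `post` and drop everything else on the page.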
(Note: to find the correct XPath easily, I recommend using the Firefox add-on "Autopager", https://addons.mozilla.org/en-US/firefo ... autopager/. Please see my next answer below for a mini tutorial on that add-on.)

Thanks in advance
