Extracting a Page URL

Posted: Wed Apr 27, 2011 8:59 pm
by dq335514
Could someone show me how to extract the URL of the page I'm extracting data from? I believe it may involve Java coding? If so, could someone show me the code I need to use and where to put it?


Re: Extracting a Page URL

Posted: Thu Apr 28, 2011 3:59 am
by webmaster
Just change the property being extracted to Url as in this picture:
[Image: url.jpg]
You do need a kind that selects some element in the page. It doesn't matter which element it selects, because the Url is always the same for every element in a given page. So normally you would reuse a kind from which you are already extracting the inner text or some other property.

If you want to extract only the Url, you should create a kind that selects one element per page. Otherwise, Helium Scraper would extract repeated URLs, since it extracts the Url once for each element the kind selects. What I usually do is create a kind that selects the BODY element, because the BODY is always present and appears exactly once per page. I'm attaching a project that contains a kind that always selects the BODY element, no matter which page you are on. You can import it into your current project from File -> Import.
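
To illustrate the point outside of Helium Scraper, here is a rough Python sketch. It is not Helium Scraper's own code or API; the function `extract_url_rows`, the class `TagCounter`, the sample HTML and the example.com URL are all made up for the illustration. It treats a "kind" as simply a tag name and produces one row containing the page URL for each element that tag matches:

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times a given tag opens in a page (hypothetical helper)."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase.
        if tag == self.tag:
            self.count += 1


def extract_url_rows(page_url, html, tag):
    """Return one row (the page URL) per element matching `tag`."""
    counter = TagCounter(tag)
    counter.feed(html)
    return [page_url] * counter.count


# Made-up page and URL, for illustration only.
PAGE_URL = "http://example.com/products"
HTML = (
    "<html><body>"
    "<a href='/1'>One</a><a href='/2'>Two</a><a href='/3'>Three</a>"
    "</body></html>"
)

link_rows = extract_url_rows(PAGE_URL, HTML, "a")     # kind matching every link
body_rows = extract_url_rows(PAGE_URL, HTML, "body")  # kind matching only BODY

print(link_rows)
print(body_rows)
```

The link-based selection yields the same URL once per link, while the BODY-based selection yields it a single time, which is why a kind that selects BODY is a convenient choice when the Url is the only thing being extracted.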

Hope it helped.

Re: Extracting a Page URL

Posted: Thu Apr 28, 2011 12:09 pm
by dq335514
Thanks for your quick response. :)