Extracting a Page URL

dq335514
Posts: 2
Joined: Wed Apr 27, 2011 8:55 pm

Extracting a Page URL

Post by dq335514 » Wed Apr 27, 2011 8:59 pm

Could someone show me how to extract the URL of the page I'm extracting data from? I believe it may involve Java coding. If so, could someone show me the code I need to use and where to put it?

Thanks

webmaster
Site Admin
Posts: 521
Joined: Mon Dec 06, 2010 8:39 am

Re: Extracting a Page URL

Post by webmaster » Thu Apr 28, 2011 3:59 am

Just change the property being extracted to Url as in this picture:
[Image: url.jpg]
You do need a kind that selects some element on the page. Whichever element it selects, the Url is the same for every element on a given page. So normally you would reuse a kind from which you are already extracting the inner text or some other property.

If you want to extract only the Url, you should create a kind that selects exactly one element per page. Otherwise, Helium Scraper would extract repeated URLs, since it extracts the Url once for each element the kind selects. What I usually do is create a kind that selects the BODY element, because the BODY is always present and appears exactly once per page. I'm attaching a project that contains a kind that always selects the BODY element, no matter which page you are on. You can import it into your current project from File -> Import.
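
To make the repeated-URL point concrete, here is a rough sketch in plain Python. It is not Helium Scraper code and does not use its API; the function extract_urls, the sample page and the example.com address are all made up for illustration. It only models the idea that a kind yields one row per selected element, and every row carries the same page URL.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the tag name of every element the parser encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lowercase.
        self.tags.append(tag)


def extract_urls(page_url, html, selected_tag):
    """Hypothetical model of a 'kind': one URL row per matching element."""
    collector = TagCollector()
    collector.feed(html)
    return [page_url for tag in collector.tags if tag == selected_tag]


# Made-up page and address, for illustration only.
PAGE_URL = "http://example.com/products"
HTML = "<html><body><p>One</p><p>Two</p><p>Three</p></body></html>"

rows_for_paragraphs = extract_urls(PAGE_URL, HTML, "p")
rows_for_body = extract_urls(PAGE_URL, HTML, "body")

print(len(rows_for_paragraphs), "rows when selecting every P element")
print(len(rows_for_body), "row when selecting only BODY")
```

Selecting the paragraphs produces the same URL three times, while selecting BODY produces it once, which is the reason for using a BODY kind when the Url is the only thing being extracted.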

Hope it helped.
Attachments
Body.hsp (289.46 KiB)
Juan Soldi
The Helium Scraper Team

dq335514

Re: Extracting a Page URL

Post by dq335514 » Thu Apr 28, 2011 12:09 pm

Thanks for your quick response. :)
