Extracting a Page URL

Post by **dq335514** » Wed Apr 27, 2011 8:59 pm

Could someone show me how to extract the URL of the page I'm extracting data from? I believe it may involve Java coding. If so, could someone show me the code I need to use and where to put it?

Thanks

Post by **webmaster** » Thu Apr 28, 2011 3:59 am

Just change the property being extracted to Url as in this picture:

(Attachment: url.jpg)

You do need to create a kind that selects some element in the page. It doesn't matter which element it selects, because the Url is the same for every element in a given page. So normally you would reuse a kind from which you are already extracting the inner text or some other property.

If you want to extract only the Url, you should create a kind that selects exactly one element per page. Otherwise, Helium Scraper would extract repeated URLs, since it extracts the Url of each element the kind selects. What I usually do is create a kind that selects the BODY element, because the BODY is always present and appears exactly once per page. I'm attaching a project that contains a kind that always selects the BODY element, no matter which page you are on. You can import it into your current project from File -> Import.
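To illustrate why a one-element-per-page kind avoids repeated URLs, here is a rough Python sketch. It is not Helium Scraper code and does not use its API; the function `extract_url_per_element`, the sample page and the example URL are all made up for illustration. It just mimics "emit the page URL once for every element the kind selects":

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the name of every start tag found in a page."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lowercase
        self.tags.append(tag)


def extract_url_per_element(page_url, html, selected_tag):
    """Hypothetical stand-in for a kind: one URL row per selected element."""
    collector = TagCollector()
    collector.feed(html)
    return [page_url for tag in collector.tags if tag == selected_tag]


# Made-up page and URL, for illustration only
page_url = "http://example.com/products"
html = (
    "<html><body>"
    "<a href='/a'>A</a><a href='/b'>B</a><a href='/c'>C</a>"
    "</body></html>"
)

# A kind that selects every link yields the same URL once per link
rows_from_links = extract_url_per_element(page_url, html, "a")

# A kind that selects BODY yields the URL exactly once
rows_from_body = extract_url_per_element(page_url, html, "body")

print(len(rows_from_links), len(rows_from_body))
```

With the three links in the sample page, the link-based selection produces three identical rows, while the BODY-based selection produces a single row.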

Hope it helped.

Post by **dq335514** » Thu Apr 28, 2011 12:09 pm

Thanks for your quick response.
