Scraping breadcrumbs from 1300 pages

Posted: Mon Oct 31, 2011 8:34 pm
by Tubcat
I am trying to figure out how to scrape my own website's breadcrumbs, along with each page's URL and page title.
The goal is an easy way to organize my website's page structure and track "page owners" in an Excel report. Every page on my site has breadcrumbs, which provide the structure I am looking to recreate.

Right now the website URLs use PageIDs, so a URL list alone won't show the structure, and there is no easy way of producing a report that captures the breadcrumbs without paying the CMS company a great deal of cash to customize one for me.

I have a list of the 1,300 page URLs. For each page I want to capture the URL, the page title and the breadcrumbs. The breadcrumbs are in a <div id="breadcrumbs">, and I would like to capture all text within that <div> for all pages.

I've been playing with Helium for the better part of today, but I still feel lost, as I'm a content/communications guy, not a dev, and I'm under pressure to get this done ASAP.
This would be a great tool that I'd likely come to rely on, if anyone could walk me through setting something like this up. Will love you long time! :|

Re: Scraping breadcrumbs from 1300 pages

Posted: Thu Nov 03, 2011 3:43 am
by webmaster
Hi,

It really depends on your site. You'd typically create a kind that selects the whole breadcrumb line, or create one kind for each hierarchical level of the breadcrumb (as in level1 -> level2 -> level3...). The second approach might confuse Helium Scraper if it cannot find any difference between the levels, so a JavaScript gatherer might need to be written to have Helium Scraper distinguish them.
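
For anyone who would rather script this outside Helium Scraper, here is a minimal Python sketch of the same idea: grab the page title and all the text inside <div id="breadcrumbs">, then split it into levels. This is not a Helium Scraper gatherer and does not use its API. The names BreadcrumbParser, extract_page_info and write_report are made up for illustration, and it assumes the levels are separated by ">" and that the pages are UTF-8, which may not be true for your site.

```python
import csv
import urllib.request
from html.parser import HTMLParser


class BreadcrumbParser(HTMLParser):
    """Collects the <title> text and all text inside <div id="breadcrumbs">."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.div_depth = 0  # greater than 0 while inside the breadcrumbs div
        self.title_parts = []
        self.crumb_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1  # a div nested inside the breadcrumbs
            elif dict(attrs).get("id") == "breadcrumbs":
                self.div_depth = 1

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        elif tag == "div" and self.div_depth > 0:
            self.div_depth -= 1

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)
        elif self.div_depth > 0 and data.strip():
            self.crumb_parts.append(data.strip())


def extract_page_info(html, separator=">"):
    """Return (title, levels). The ">" separator is an assumption."""
    parser = BreadcrumbParser()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.title_parts).split())
    crumb_text = " ".join(parser.crumb_parts)
    levels = [part.strip() for part in crumb_text.split(separator) if part.strip()]
    return title, levels


def write_report(urls, out_path):
    """Fetch each URL and write URL, title and breadcrumb trail to a CSV file."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["URL", "Page Title", "Breadcrumbs"])
        for url in urls:
            with urllib.request.urlopen(url) as response:
                # Assumes UTF-8 pages; undecodable bytes are replaced.
                html = response.read().decode("utf-8", errors="replace")
            title, levels = extract_page_info(html)
            writer.writerow([url, title, " > ".join(levels)])
```

Calling write_report with the list of 1,300 URLs and an output file name would produce a CSV that opens directly in Excel. Keeping the levels as a list also makes it easy to write one column per level instead, if that suits the report better.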

If you want, send me the project you're working on and I'll see how I can help.