Scraping breadcrumbs from 1300 pages

Tubcat
Posts: 1
Joined: Mon Oct 31, 2011 2:52 pm

Post by Tubcat » Mon Oct 31, 2011 8:34 pm

I am trying to figure out how to scrape my own website's breadcrumbs along with the page URL and Page Title.
This is in order to create an easy way to organize my website's page structure and track "page owners" in an Excel report. Every page on my site has breadcrumbs which provide the structure I am looking to recreate.

Right now the website URLs use PageIDs, so a URL list won't work and there is no easy way of reproducing a report which captures the breadcrumbs without paying the CMS company a great deal of cash to customize a report for me.

I have a list of the 1,300 page URLs. For each page I want to capture the URL, the Page Title and the breadcrumbs. The breadcrumbs sit in a <div id="breadcrumbs">, and I would like to capture all of the text within that <div>.

I've been playing with Helium for the better part of today, but still feel lost as I'm a content/communications guy, not a dev, and I'm under pressure to get this done ASAP.
This would be a great tool that I'd likely come to rely on, if anyone could walk me through setting something like this up. Will love you long time! :|

webmaster
Site Admin
Posts: 501
Joined: Mon Dec 06, 2010 8:39 am

Re: Scraping breadcrumbs from 1300 pages

Post by webmaster » Thu Nov 03, 2011 3:43 am

Hi,

It really depends on your site. You'd typically either create one kind that selects the whole breadcrumb line, or create one kind for each level of the breadcrumb hierarchy (as in level1 -> level2 -> level3...). The second approach might confuse Helium Scraper if it cannot find any difference between the levels, in which case a JavaScript gatherer might need to be written so that Helium Scraper can distinguish them.
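
To give a rough idea of what such JavaScript could look like, here is a minimal sketch. It is plain JavaScript and is not Helium Scraper's gatherer API; how it gets wired into a project is not shown. The function names splitBreadcrumbLevels and extractPageInfo are made up for illustration, and the ">" separator is only an assumption about how your breadcrumbs are laid out. The id "breadcrumbs" is taken from your post.

```javascript
// Hypothetical helper: split one line of breadcrumb text into its levels.
// The ">" separator is an assumption; change it to whatever the site uses.
function splitBreadcrumbLevels(text, separator = ">") {
  return text
    .split(separator)
    .map((part) => part.replace(/\s+/g, " ").trim()) // collapse stray whitespace
    .filter((part) => part.length > 0);              // drop empty pieces
}

// Hypothetical helper: collect the URL, page title and breadcrumbs of a page.
// `doc` is anything shaped like a browser `document` (it needs `title` and
// `getElementById`); `url` is passed in by the caller.
function extractPageInfo(doc, url, separator = ">") {
  const container = doc.getElementById("breadcrumbs");
  // textContent returns all text inside the <div>, with the markup removed.
  const breadcrumbs = container
    ? container.textContent.replace(/\s+/g, " ").trim()
    : ""; // page without a breadcrumb <div>
  return {
    url,
    title: doc.title,
    breadcrumbs,
    levels: splitBreadcrumbLevels(breadcrumbs, separator),
  };
}
```

In a browser console on one of the pages, extractPageInfo(document, location.href) would return one row's worth of data. The whole breadcrumb line and the separate levels are both kept, so either can go into the Excel report.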

If you want, send me the project you're working on and I'll see how I can help.
Juan Soldi
The Helium Scraper Team
