Scraping Widget Content or Social Plugin Possible?

aigoodaigoo
Posts: 10
Joined: Wed May 18, 2011 6:18 pm

Post by aigoodaigoo » Mon Jun 06, 2011 8:42 pm

Hi, I'm scraping the like counts shown on blog posts (e.g. Facebook, Twitter). For instance, the blog post here:
http://blog.nwf.org/wildlifepromise/201 ... countries/
has 0 Tweets and 2 Likes on Facebook. I think this info is not part of the page's HTML but dynamic content that the Twitter/FB widgets pull in externally. Is there any way to pull this info down as scrapable content? I took a look at the post on IFRAME, but it doesn't seem to work. If this is not directly accessible, saving the page as a complete web page, so that the social plugin content ends up in a local directory, would suffice as well.

webmaster
Site Admin
Posts: 521
Joined: Mon Dec 06, 2010 8:39 am

Re: Scraping Widget Content or Social Plugin Possible?

Post by webmaster » Tue Jun 07, 2011 8:03 pm

I've edited this post after the release of Helium Scraper 2.0.2.0 to make use of the new Reference Table functionality.

Hi,

It is not possible to access the contents of the IFrame directly, because of the browser's cross-domain (same-origin) restrictions: these plugins reside on a different domain, so we don't have access to them from the blog page. But it is possible to see their URL by looking at the "src" property of the IFrame, navigate there, extract the info and then go back.
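
For anyone who wants to see the idea outside of Helium Scraper, here is a minimal Python sketch of the first step (reading the "src" of each IFrame so it can be visited as a page of its own). It is not what the attached project does internally. The sample markup, the example.com URLs and the names IFrameSrcCollector and iframe_sources are all made up for illustration, and it only sees IFrames that are already present in the HTML it is given, not ones a script inserts later.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IFrameSrcCollector(HTMLParser):
    """Collects the "src" attribute of every <iframe> in a document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative and protocol-relative URLs against the page URL.
            self.sources.append(urljoin(self.base_url, src))


def iframe_sources(html, base_url):
    """Return the absolute URL of every iframe that has a src attribute."""
    collector = IFrameSrcCollector(base_url)
    collector.feed(html)
    return collector.sources


# Hypothetical markup; real widget URLs and parameters will differ.
SAMPLE_HTML = """
<div class="post">
  <iframe src="//widgets.example.com/like?href=http%3A%2F%2Fblog.example.com%2Fpost"></iframe>
  <iframe src="/local/tweet-count.html"></iframe>
  <iframe></iframe>
</div>
"""

sources = iframe_sources(SAMPLE_HTML, "http://blog.example.com/post")
for url in sources:
    print(url)
```

Each URL returned this way can then be loaded as a normal page, which is where the like count becomes ordinary, scrapable content.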

So that's what the attached project does. The kinds "FaceLike" and "TweetLike" select the IFrames. The kinds "FaceLikeContent" and "TweetLikeContent" select the number of likes after navigating to the "src" of the IFrame. If you'd like to navigate inside any of these frames, to create a kind or just to see how it works, look at the "test" actions tree. It uses the actions tree "Go To IFrame", which I created for this purpose. It currently navigates to the Facebook IFrame. You can change that by double clicking and editing the "Execute tree: Go To IFrame" node.

The "Go To IFrame And Back" actions tree is the same as "Go To IFrame", except that it also goes back to the previous page. It navigates to the "src" of the given IFrame, then executes the child nodes of the "Execute Actions Tree" action that calls it, and then goes back to the previous page. I think this will make more sense when you look at the "Main" actions tree. Try double clicking all those "Execute tree:..." nodes to see how they are set up.

Also, in several spots I'm using the "Force Select" actions tree, taken from here, which waits until some content has loaded, since most of this content loads dynamically.
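
The waiting idea is just "keep trying until the content shows up or give up". The sketch below shows that pattern in Python. It is not the "Force Select" actions tree itself; wait_for, the probe callback and the timeout values are assumptions made for this example.

```python
import time


def wait_for(probe, timeout=10.0, interval=0.5,
             sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns something other than None.

    Raises TimeoutError if nothing has loaded once `timeout` seconds pass.
    """
    deadline = clock() + timeout
    while True:
        result = probe()
        if result is not None:
            return result
        if clock() >= deadline:
            raise TimeoutError("content did not load in time")
        sleep(interval)


# Hypothetical probe: the like count is missing on the first two checks
# and present on the third, as if the widget had finished loading.
_attempts = iter([None, None, "2"])
likes = wait_for(lambda: next(_attempts), interval=0.01)
print(likes)
```

In a real scraper the probe would be whatever check tells you the element is there, for example looking up the like count in the freshly loaded page.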

And finally, since you cannot extract from multiple pages with a single "Extract" action, I'm using a Reference Table called "URLs" to correlate the Twitter and Facebook likes with the referrer URL of the page from which they are extracted, which is the URL of the blog page that contains both IFrames. To get the referrer page I'm using a JavaScript Gatherer that you can see at Project -> JavaScript Gatherers. To view the data as a single table containing the URL, Twitter likes and Facebook likes, just select the URLs table in the database panel, click on the "Quick Join Reference Table" button, and press play.
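
To make the correlation step concrete: each count is extracted from a different page (one per IFrame), so the only thing the two counts have in common is the referrer, i.e. the blog page they came from. Here is a rough Python sketch of joining on that key. The row layout, the sample values and the name join_by_referrer are invented for illustration and are not how Helium Scraper stores its tables.

```python
from collections import defaultdict


def join_by_referrer(rows):
    """Merge (referrer_url, network, count) rows into one row per blog page."""
    table = defaultdict(dict)
    for referrer, network, count in rows:
        table[referrer][network] = count
    return [
        {
            "url": url,
            "twitter": counts.get("twitter"),
            "facebook": counts.get("facebook"),
        }
        for url, counts in table.items()
    ]


# Hypothetical extracted rows: one per IFrame visit, keyed by referrer URL.
rows = [
    ("http://blog.example.com/post-1", "twitter", 0),
    ("http://blog.example.com/post-1", "facebook", 2),
    ("http://blog.example.com/post-2", "facebook", 5),
]

joined = join_by_referrer(rows)
for row in joined:
    print(row)
```

A page where only one of the widgets was found simply ends up with None in the other column.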

I know this is a lot of information, but it should make sense after playing with the project a little. Let me know if you have any other questions.
Attachments
IFrames.hsp
(956.36 KiB) Downloaded 588 times
Juan Soldi
The Helium Scraper Team
