Helium Scraper FAQ's / Troubleshooting

Post by webmaster » Wed Jun 20, 2012 3:03 am

My project is not extracting any / enough data.

To diagnose this problem, first add requirements to every action that supports them. Most actions have a drop-down where you can choose to require At Least, Exactly or At Most a given number of items. For instance, if you're using a Next button to turn the pages of a set of search results, add a requirement of exactly one item if you expect this item to be found exactly once per page. Remember that all a requirement does is notify you when it is not met. Even if you know some elements will not be present on every page, it is still a good idea to set a requirement when doing a test extraction, just to make sure your kinds are working correctly. You can then choose to ignore the "Requirements not met..." error if it doesn't apply to the current page.
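For readers who prefer to see the three requirement modes spelled out, here is a minimal Python sketch of the comparison each one implies. It is only an illustration: the function name `requirement_met` and its parameters are made up for this example and are not part of Helium Scraper.

```python
def requirement_met(mode: str, expected: int, found: int) -> bool:
    """Return True when the number of items found satisfies the requirement.

    `mode` mirrors the three drop-down options described above.
    This is a hypothetical helper, not Helium Scraper's own code.
    """
    if mode == "At Least":
        return found >= expected
    if mode == "Exactly":
        return found == expected
    if mode == "At Most":
        return found <= expected
    raise ValueError(f"unknown requirement mode: {mode}")
```

With a Next button that should appear exactly once per page, `requirement_met("Exactly", 1, found)` is false both when the button is missing (`found` is 0) and when the kind matches extra elements (`found` is 2 or more), which are the two cases worth being notified about.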

If, once the requirements are set, you run an extraction and get the error "Requirements not met when trying to select...", take note of which kind is mentioned in the error. Then click Pause and try to select the kind by clicking the Select kind in browser button inside the kind editor.

If the element that is supposed to be selected is not, make sure selection mode is ON, select the element that didn't get selected, and click Add selection to this kind inside the kind editor. Then go back to your actions trees and press Play to continue running.

If, instead, the element gets selected properly, it is most likely loading dynamically, which means it wasn't there when the page first finished loading. An easy fix is to add a Wait action just above the action that caused the error. The recommended fix, however, is to use the Force Select premade at File -> Online Premades (there is more info in the project's description), because this action waits only as long as it takes for the given kind to appear on the page.
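The difference between a fixed wait and waiting only until the element shows up can be sketched in a few lines of Python. This is a generic polling loop written under assumptions, not the Force Select premade's actual implementation: `wait_until`, `is_present`, `timeout` and `interval` are all hypothetical names chosen for the example.

```python
import time
from typing import Callable


def wait_until(is_present: Callable[[], bool],
               timeout: float = 10.0,
               interval: float = 0.1) -> bool:
    """Poll `is_present` until it returns True or `timeout` seconds pass.

    Returns True as soon as the element is found, so no time is wasted
    once it has appeared; returns False if it never shows up in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_present():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A fixed wait always costs its full duration and still fails if the element takes longer than expected. The polling version returns as soon as the check succeeds and only gives up after the timeout.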


Helium Scraper is running out of resources!

First of all, make sure you're running the latest version of Helium Scraper (Help -> Check for Updates). Then make sure you've installed the latest version of Internet Explorer. In our tests, Internet Explorer 10 fixed every memory leak we had seen in previous versions of Internet Explorer (as of January 11, Internet Explorer 10 is included in Windows 8 and a release preview for Windows 7 is available).

If this doesn't solve the problem, limit the workload a single process has to handle by distributing the extraction among multiple processes. This video shows how to use the Start Processes action and the Multi-process Navigate URLs premade to achieve this. Distributing the extraction among processes will also greatly speed it up.
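To picture what distributing the extraction means, here is a small Python sketch that splits a list of URLs into one share per process. It is an assumption-based illustration only: the post does not describe how the Multi-process Navigate URLs premade divides the work, and `split_urls` is a hypothetical helper.

```python
from typing import List, Sequence


def split_urls(urls: Sequence[str], process_count: int) -> List[List[str]]:
    """Deal URLs out round-robin so each process gets a similar share."""
    if process_count < 1:
        raise ValueError("process_count must be at least 1")
    # Process i takes URLs i, i + process_count, i + 2 * process_count, ...
    return [list(urls[i::process_count]) for i in range(process_count)]
```

Each process then only has to hold the pages for its own share, which is what keeps the load on any single process down.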


The site I want to scrape doesn't have a next button but instead only page numbers.

Essentially, the link that takes you to the next page from whatever page you're on is the next button, except that its text changes on every page. Since it is easy for Helium Scraper to confuse this "next" button with any other page number because they are so similar, we've created the Turn pages without a "Next" button online premade (File -> Online Premades). This premade adds a few extra JavaScript Gatherers that check some extra properties that help Helium Scraper distinguish between the next page link and every other page link.
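As an illustration of why extra properties are needed, here is one possible heuristic in Python: treat the link whose text equals the current page number plus one as the next page link. This is a guess made for the sake of the example. The post does not say which properties the premade's JavaScript Gatherers actually check, and `pick_next_page_link` is a hypothetical function.

```python
from typing import Optional, Sequence


def pick_next_page_link(link_texts: Sequence[str],
                        current_page: int) -> Optional[int]:
    """Return the index of the link labelled current_page + 1, or None."""
    target = str(current_page + 1)
    for index, text in enumerate(link_texts):
        if text.strip() == target:
            return index
    return None  # no such link, e.g. we are already on the last page
```

On page 2 of a pager showing 1, 2, 3 and 4, the link labelled 3 is chosen. On page 4 nothing is chosen, which is the point at which page turning should stop.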

All you need to do is import this premade and create your Next button kind as you would with a normal "Next" button, but use the link that takes you to the next page from whatever page you're currently on. Use three or four pages as samples, and on each one select the link that takes you to the next page from the current page.
Juan Soldi
The Helium Scraper Team
