Helium running out of resources!

Questions and answers about anything related to Helium Scraper
Post Reply
vandigroup
Posts: 7
Joined: Sun Mar 25, 2012 5:20 am

Helium running out of resources!

Post by vandigroup » Sun Apr 08, 2012 2:28 am

UPDATE: I just ran a couple of other saved projects and they are fine, so it must be something with the one I uploaded here. But I kept it super basic, so I'm not sure what it could be.

Also, the starting address for the attached project is /apps/any/recent/1/.

Then it asks me to save the project and restart the application, and the project is paused.

This is happening only after it gets through about the first 10 pages. Watching Helium in Task Manager, I can see the memory usage climbing, and it fails once it reaches around 1.2GB of RAM used. I'm not sure whether this started after upgrading to the latest version or what is going on. Obviously, something is wrong, because memory usage should not keep going up until it runs out of memory. I have also attached the project (which I rebuilt, just in case) so you can check that there is nothing wrong with it. You can see where the program failed.

Here are the details:

Helium: v2.3.2.8
OS: Windows 7 Ultimate 32bit - Service Pack 1
RAM: 4GB

1. Running no other programs while running Helium
2. Uninstalled/reinstalled Helium
3. Restarted my computer

Thanks for helping,

Joe
Attachments
appmatch.rar
(45.22 KiB) Downloaded 565 times

webmaster
Site Admin
Posts: 521
Joined: Mon Dec 06, 2010 8:39 am
Contact:

Re: Helium running out of resources!

Post by webmaster » Mon Apr 09, 2012 2:28 am

Hi Joe,

This site is causing a memory leak. But there is a solution, which will also make extraction a lot faster. Here is what you need to do, which I already did in the attached project. If anything is not 100% clear please let me know and I'll clarify.

You'll need a list of URLs to every page (not to every item page, but to every results page). In your case this can easily be done with the URL Variations premade (which can be imported from File -> URL Variations), since the URLs follow a simple pattern (just open the Generate URLs tree to see how I did it). I guess your homework would be to figure out what the last page is, which you'd need to do anyway ;) .

You could alternatively use a Go Through All Pages actions tree (also a premade) and extract the URL property of the BODY kind for every page, but generating the URLs with the URL Variations script is a lot faster, provided you know the last page number.

I think the most straightforward way to find it would be to try to navigate to some URL like "http://www.ziipa.com/apps/any/current/1000/" (here the page is 1000). If it works, go to something like "http://www.ziipa.com/apps/any/current/2000/"; if that one gives you an error, enter "http://www.ziipa.com/apps/any/current/1500/", and so on. Alternatively, you could create a kind for the page number to the left of the "Next" button and have Helium Scraper navigate through it until it hits a page-not-found or some other error page.
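The probing strategy above is essentially a binary search for the last page number. Here is a minimal sketch in Python, assuming a hypothetical helper `page_exists(n)` that requests the results page for page `n` and reports whether it loads without an error (the helper and the 1337-page example are made up for illustration):

```python
def find_last_page(page_exists, hi_guess=1000):
    """Find the highest page number for which page_exists(n) is True,
    assuming pages 1..last exist and every later page errors out."""
    lo, hi = 1, hi_guess
    # Grow the upper bound until we find a page that doesn't exist.
    while page_exists(hi):
        lo = hi
        hi *= 2
    # Binary-search the boundary between existing and missing pages.
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if page_exists(mid):
            lo = mid
        else:
            hi = mid
    return lo

# Example with a fake site that has 1337 results pages; in practice
# page_exists would fetch something like
# "http://www.ziipa.com/apps/any/current/%d/" % n and check for an error page.
print(find_last_page(lambda n: n <= 1337))  # → 1337
```

Each probe halves the remaining range, so even a site with thousands of pages needs only a handful of requests.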

Now that you have a list of pages, you need to use a Start Processes action, which will run more than one instance of Helium Scraper and have them extract from these URLs at the same time; this is why I said the extraction would be faster. But first you need to export your database with the Export Database -> Export and Connect button in the database panel (there is more info about this in the Start Processes documentation). Then create another actions tree and add a Start Processes action, setting the URLs table and its URLs column as the URL column (as I did in the Do Processes actions tree).

Finally, create another actions tree called Main (when each of these instances starts, it will open your project and look for an actions tree with this name) and copy all your extraction logic into it, without the navigation to the next page, since we want each instance to extract from only one page to prevent the leak. As you can see, I just duplicated your Navigate Each: links actions and moved them to Main.
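The Start Processes idea (one table of page URLs, several workers each extracting exactly one page) can be sketched in plain Python. This is only an illustration of the fan-out, not Helium Scraper's implementation: Helium launches separate OS processes, which also means memory leaked on one page is reclaimed when that instance finishes, whereas this thread-based sketch with a dummy `extract_page` only shows the URL list being split among concurrent workers:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_page(url):
    # Stand-in for the "Main" actions tree: extract the items from one
    # results page. Here we just fabricate two item rows per page.
    return [url + "item%d/" % i for i in range(2)]

def do_processes(urls, workers=4):
    # Fan the URL list out to several concurrent workers, each handling
    # one page at a time, and collect all extracted rows in order.
    rows = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for page_rows in pool.map(extract_page, urls):
            rows.extend(page_rows)
    return rows

urls = ["http://www.ziipa.com/apps/any/recent/%d/" % n for n in range(1, 11)]
print(len(do_processes(urls)))  # 10 pages x 2 items = 20 rows
```

Because each worker only ever holds one page's worth of state, total memory stays flat no matter how many pages are in the list.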

Then make sure you save your project and run Do Processes. Start with 10 or 20 URLs just as a test. Remember that since your database is external, all data will be automatically committed to it even if you don't save your project. Note that with the attached file you'll still need to export and connect to the database; I kept it in the file so I didn't have to attach two files.
Attachments
appmatch.hsp
(823.02 KiB) Downloaded 590 times
Juan Soldi
The Helium Scraper Team

vandigroup
Posts: 7
Joined: Sun Mar 25, 2012 5:20 am

Re: Helium running out of resources!

Post by vandigroup » Mon Apr 09, 2012 2:57 am

Hi Juan,

Thanks for the help. I'm not sure I completely understand, but it will make much more sense once I open it up and follow your instructions. I will identify the last page and then use Excel to generate a URL list, which will only take 10 seconds. I'm glad to know about this, because there are probably many situations where it could be used to speed up the process.
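The same URL list Excel would produce can be generated in a couple of lines. The URL pattern comes from the thread; the last page number of 250 here is just a placeholder for whatever the probing turns up:

```python
# Build one results-page URL per page number, following the site's
# /apps/any/recent/<page>/ pattern; last_page is a placeholder value.
last_page = 250
urls = ["http://www.ziipa.com/apps/any/recent/%d/" % n
        for n in range(1, last_page + 1)]
print(urls[0])    # http://www.ziipa.com/apps/any/recent/1/
print(urls[-1])   # http://www.ziipa.com/apps/any/recent/250/
print(len(urls))  # 250
```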

Also, I am running MySQL locally for other projects. My question is: would it be better to use Helium with an external DB? If so, let me know where I would set it up and whether there is anything I should know. Honestly, your software runs fairly quickly and I'm typically running projects overnight, but as you know, anything that can speed things up is always welcome.

Thanks again for your detailed answer. Your support has been wonderful...

Joe

webmaster
Site Admin
Posts: 521
Joined: Mon Dec 06, 2010 8:39 am
Contact:

Re: Helium running out of resources!

Post by webmaster » Mon Apr 09, 2012 3:46 am

Hi,

Helium Scraper uses MS Access database files, so you couldn't connect it to a MySQL one. In any case, other than having several instances extract to the same database, I don't see any reason to use an external one.
Juan Soldi
The Helium Scraper Team

Post Reply