Multi Process Extraction Doesn't END

Questions and answers about anything related to Helium Scraper
dhkim2388
Posts: 2
Joined: Thu Jan 02, 2014 6:29 pm

Multi Process Extraction Doesn't END

Post by dhkim2388 » Thu Jan 02, 2014 6:37 pm

Hey,

I created a very simple project extracting addresses from a list of URLs I create.

Action tree:

Main
Start Process
Extract from each URL
-Navigate URLs from URL table
--Extract ADDRESS KIND

Main2
Execute Action Tree: Start Process

The URL Table looks like this:
ID|URLs
1 |Sex.com
2 |Anus.com
3 |etc.com
4 |peepoo.com



So, the extraction works, BUT when I run it, it doesn't end. I was testing with a list of about 50 URLs and expecting about 200 addresses returned in total. The extraction ran and just kept racking up data. I figured 50 URLs was taking quite a long time for a multi-process extraction, but after the new project had run for about as long as the normal extraction, I looked at the results and noticed I had almost 5000. This leads me to believe that even though I had populated IDs, they were not being used to track which URL was visited, and all processes were running ALL the URLs nonstop. Please help!
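
In other words, I suspect something like this is happening (rough pseudo-JavaScript just to illustrate; the names are made up, this is not actual Helium Scraper code):

    // Every process iterates the ENTIRE URL table instead of a claimed
    // slice, so N processes each produce a full pass of results.
    const urls = loadUrlTable(); // hypothetical helper for my URL table

    function runProcess(extractAddress) {
      for (const row of urls) {
        extractAddress(row.url); // nothing ever marks the row as visited
      }
    }

    // With 50 URLs and many processes, the output multiplies instead of
    // stopping at ~200 addresses.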

crookedleaf
Posts: 38
Joined: Tue Dec 11, 2012 6:44 pm

Re: Multi Process Extraction Doesn't END

Post by crookedleaf » Fri Jan 03, 2014 10:38 pm

dhkim2388 wrote: Action tree:

Main
Start Process
Extract from each URL
-Navigate URLs from URL table
--Extract ADDRESS KIND

Main2
Execute Action Tree: Start Process

The URL Table looks like this:
ID|URLs
1 |Sex.com
2 |Anus.com
3 |etc.com
4 |peepoo.com
Your action tree breakdown isn't very clear, but my guess would be that the problem has something to do with it. Each time a new instance/process of Helium starts, it executes whatever is in your "Main" tree, so with your setup every process runs through the entire URL table. What you want would be something like this:

"Project" (or whatever you want to call your tree)
-Start Processes

"Start Processes" (do not modify anything in this tree)
-Execute JS (Make Groups)
-Execute JS (Start Processes)
--Wait 100 milliseconds

"Main" (do not modify anything in this tree, either)
-Execute JS(Read URLs)
-Execute JS(For Each URL)
--Execute tree: "Extract From Each URL"
--Execute JS(Save Done and Increment)

"Extract From Each URL" (this is where your actions will be)
-Extract (whatever it is you want to extract)

So, for example, if you are trying to extract all the link addresses on each URL you provide, then you just need to identify that kind (or those kinds) and set them as your extract action in the "Extract From Each URL" tree.

You will start with one table containing the URLs that you are navigating to in your multi-process. When you first import the "multi-process" premade, you will be prompted with a dialog asking you to choose an action tree. Choose "Start Processes" and fill in the data you would like: how many URLs per process, the maximum number of processes that will run at one time, the aforementioned table containing the URLs you want to go to, an ID column (which is required, so make sure you keep that option checked when creating the table), and the URL column (which is the column in the table that holds all the addresses).
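
If it helps to see the logic, here's roughly what those premade JS actions are doing, sketched in plain JavaScript (the names are made up for illustration; this is not Helium Scraper's actual code):

    // "Make Groups": split the URL table into fixed-size chunks,
    // so each process gets its own slice instead of the whole table.
    function makeGroups(rows, urlsPerProcess) {
      const groups = [];
      for (let i = 0; i < rows.length; i += urlsPerProcess) {
        groups.push(rows.slice(i, i + urlsPerProcess));
      }
      return groups;
    }

    // "Main" (run inside each child process): work through this
    // process's group only, marking each row done so no URL repeats.
    function runProcess(group, extractFromUrl, markDone) {
      for (const row of group) {
        if (row.done) continue;   // skip rows already finished
        extractFromUrl(row.url);  // your "Extract From Each URL" tree
        markDone(row.id);         // "Save Done and Increment"
      }
    }

Once every group's rows are marked done, the processes run out of work and the extraction ends, which is what was missing from your original tree.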
