While Loop Function in recurring actions

Post by celsiusx » Sat Sep 04, 2021 7:08 pm

Hi.

I'm trying to scrape threads from a target forum. It uses IDs as thread indexes.

Let's say that today I successfully scraped threads 100 down to 50 (they are shown in descending order).

I want to run the Helium job again tomorrow, so my SQLite database already holds thread IDs starting from threadId 50.

The forum page shows all threads, and they go from thread 200 down to 0.

I'm trying to write a function with a while loop that does, more or less, the following:

- Navigate the forum and find the most recent thread. Save its ID to a variable.
- Query the local database for the first threadId record.
- If they are not equal, execute the Scrape global; otherwise return.

Instead of using an if function to do the comparison, I'd like to use a while loop that iterates over each thread ID in the forum. If the ID is different from my first database record, proceed to scrape; otherwise return.

What would be the best way of achieving this?

Also: Is there a way to avoid duplicates in the database while scraping?

Maybe related: Is there a way, from inside the software, to do a sort of scheduled run?


Thanks! Helium is really a wonderful piece of software!

Post by webmaster » Sun Sep 26, 2021 10:22 pm

That's now easy to do (since version 3.2.7.9) with the WhileAny function. In the documentation there's an example showing how to stop the extraction when a post with a certain text is found.

In your case, you could first create a query under Project Explorer > Data Flow > Queries that gets the latest ID, and then compare each thread's ID to it. Something like this:

Code:

Query.LatestId
as (latestId)
Sequence.WhileAny
   ·  Browser.Load
         ·  "https://www.example.com"
      Browser.TurnPages
         ·  Select.NextButton
      Select.RowContainer
   ·  Select.ThreadId
      as threadId
      if
         ·  =
               ·  threadId
               ·  latestId
         ·  Sequence.Empty
         ·  Sequence.Default
Select.ThreadLink
Browser.Navigate

Note that RowContainer is selected at the top. It must select each row in the list of threads, and each row must contain both the thread ID and the thread link (the one that opens the actual thread).
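
For readers who think in general-purpose languages, the stopping rule above can be illustrated in Python. This is only an analogy for the logic in the snippet (keep taking rows until a thread ID equals the latest stored ID), not Helium Scraper code and not a description of how WhileAny is implemented. The function name and the sample IDs are made up for illustration.

```python
from itertools import takewhile


def new_thread_ids(forum_ids, latest_id):
    """Return the IDs that appear before latest_id in forum_ids.

    forum_ids is assumed to be ordered newest first, as on the forum page.
    Iteration stops at the first ID equal to latest_id, which mirrors the
    'if threadId = latestId then stop, else continue' check above.
    """
    return list(takewhile(lambda thread_id: thread_id != latest_id, forum_ids))


# Hypothetical data: the forum lists threads 105 down to 98,
# and thread 100 is the newest one already stored locally.
forum_ids = [105, 104, 103, 102, 101, 100, 99, 98]
print(new_thread_ids(forum_ids, latest_id=100))  # [105, 104, 103, 102, 101]
```

Only the threads newer than the stored one are returned; everything from the known ID onward is skipped, which is the point of stopping early instead of comparing once with a single if.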

Regarding duplicates, I wouldn't worry about them during extraction; you can just remove them in a query. If you right-click a table set and select Create Query, there's a Distinct option that will remove duplicates.
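
As a rough illustration of removing duplicates at query time rather than during extraction, here is a small Python sketch using the standard sqlite3 module. It assumes the Distinct option behaves like SQL's SELECT DISTINCT, which is an assumption on my part; the table name `threads` and its columns are hypothetical and do not reflect Helium Scraper's actual schema.

```python
import sqlite3

# In-memory database with a hypothetical table; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE threads (thread_id INTEGER, title TEXT)")

# Simulate an extraction that stored the same thread twice.
conn.executemany(
    "INSERT INTO threads VALUES (?, ?)",
    [(101, "First"), (101, "First"), (102, "Second")],
)

# Duplicates stay in the table; the query simply returns each row once.
rows = conn.execute(
    "SELECT DISTINCT thread_id, title FROM threads ORDER BY thread_id"
).fetchall()
print(rows)  # [(101, 'First'), (102, 'Second')]
```

The stored data is left untouched, so the extraction itself never has to check what is already in the database.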

And regarding scheduled runs, the best way is to use the command line, but you probably already know that. If you just need to run an extraction every X minutes, then you could add a global called LoopAction with this code:

Code:

function (action delayMinutes)
   action
   Action.Run
      ·  Browser.Load
            ·  "helium://start"
         Browser.Wait
            ·  *
                  ·  delayMinutes
                  ·  1000
                  ·  60
   LoopAction
      ·  action
      ·  delayMinutes

Then, to run an extraction every 30 minutes (or, more precisely, with delays of 30 minutes between runs), you'd do this:

Code:

LoopAction
   ·  Action.Extract
         ·  MyGlobal
         ·  "MyGlobal"
   ·  30

Note that this will keep running until you stop it. Here, MyGlobal is the name of the global you'd normally run manually to start the extraction.
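
The run-then-wait pattern of LoopAction can also be sketched in Python, mainly to show why the interval is "delays of 30 minutes" rather than exactly every 30 minutes: each cycle lasts as long as the action takes plus the delay. This is an analogy, not Helium Scraper code. The `max_runs` and `sleep` parameters are hypothetical additions so the sketch can be tried without actually waiting, and the delay is converted to seconds for `time.sleep`, whereas the snippet above multiplies by 1000 and 60 to get milliseconds.

```python
import time


def loop_action(action, delay_minutes, max_runs=None, sleep=time.sleep):
    """Run action, wait delay_minutes, and repeat.

    With max_runs=None the loop never ends, like the original LoopAction.
    Returns the number of runs performed when max_runs is given.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        action()                   # the extraction itself
        runs += 1
        sleep(delay_minutes * 60)  # the wait starts only after the action ends
    return runs


# Demonstration with a fake sleep that records the requested waits
# instead of pausing, so the example finishes immediately.
waits = []
runs = loop_action(lambda: print("extracting..."), 30, max_runs=2, sleep=waits.append)
print(runs, waits)  # 2 [1800, 1800]
```

If the extraction takes 5 minutes, successive runs start about 35 minutes apart, not 30.
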
Juan Soldi
The Helium Scraper Team