Helium Scraper 3 New Features

Post by webmaster » Thu Nov 07, 2019 5:44 pm

Since the launch of Helium Scraper 3, a bunch of new features have been added. Now I'd like to take some time to share some of the ones that may have slipped under the radar:

Features
  • Credentials: Credentials can be added and removed under the File menu, and then used in place of literal strings using the Credentials. module. This way, credentials are not stored in plain text in the project file, but in an encrypted location on your local machine. Note that this also means that if a project is shared, the same credentials need to exist on the target machine.
  • Extensions: Extensions can be written for Helium Scraper 3 to perform custom actions using any .NET language. We will provide more information about developing extensions, but for now, a few of them can be installed from the extensions section. Let us know if you have any ideas for a new extension. If enough users can benefit from it, we'll definitely add it to the list.
Global Settings (File -> Settings)
  • Skip Updates: Automatic update checks can be set to be skipped by setting this to True. This is useful when using the command line to prevent update prompts that would otherwise pause the extraction waiting for user confirmation.
Project Settings (Project -> Settings)
  • Use Main Browser: The main browser can now be used to run extractions by setting this to True. When enabled, the page currently loaded on the main browser will be used as the starting point of the extraction, so that filters and other dynamic changes can be applied to the page. Note that if Auto Parallel is enabled, off-screen browsers will still be used to, for instance, visit child pages or perform any other drilling-down operation. To only use the main browser, be sure to turn off Auto Parallel too.
  • Maximum Browsers: This setting can be used to specify the maximum number of browsers a particular project may use. This number needs to be less than or equal to File -> Settings -> Browser Count to have any effect. This is particularly useful when a project needs to use only a few browsers because many simultaneous requests are causing the site to block it.
  • Synchronize Browser Sizes: Enabling this setting will cause off-screen browsers to always have the same size as the main browser. This is useful when the size of the browser determines the style of the site being scraped and a particular style is preferred, or when selectors have already been created for a style that is different from the one shown in the off-screen browsers.
Miscellaneous
  • Extract Last Page: A Last Page Only option has been added to the Turn Pages wizard. The Turn Pages wizard doesn't only work on pages, but on anything that needs to be clicked repeatedly: it blindly clicks the first element selected by the given selector, and repeats this until the selector no longer finds anything. For instance, some pages contain a list of elements that show partial content, each with a show more button (or link) that shows the full content when clicked. If the show more button goes away when clicked, or is replaced by a show less button that is not selected by the Next Button Selector, then the Last Page Only option can be used to extract from the page only after all the show more buttons have been clicked and the full content has been displayed.
  • Automatic Database File Generation: When a project file is shared with someone or moved somewhere else without its corresponding database file, an empty database containing all the old tables will be automatically generated. This makes it easier to share the project without having to share the database file when the database is empty or the content of the database doesn't need to be shared.
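
The click-until-gone behaviour described under Extract Last Page can be sketched in plain Python. This is only an illustration of the idea, not Helium Scraper code: FakePage, find_show_more and turn_pages are hypothetical names made up for this sketch, and the exact point at which the real wizard extracts relative to each click is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FakePage:
    """Hypothetical stand-in for a browser page: one flag per list item,
    True once that item's "show more" button has been clicked."""
    expanded: list

    def find_show_more(self):
        # Index of the first item that still has a "show more" button, or None.
        for index, is_expanded in enumerate(self.expanded):
            if not is_expanded:
                return index
        return None

    def click(self, index):
        # Clicking "show more" expands the item and removes its button.
        self.expanded[index] = True

    def extract(self):
        return ["full" if flag else "partial" for flag in self.expanded]


def turn_pages(page, last_page_only=False):
    """Click the first match of the selector until nothing matches.

    With last_page_only=False an extraction runs before every click and once
    more at the end; with last_page_only=True it runs only at the end.
    """
    snapshots = []
    while True:
        target = page.find_show_more()
        if target is None:
            snapshots.append(page.extract())
            return snapshots
        if not last_page_only:
            snapshots.append(page.extract())
        page.click(target)
```

With two collapsed items, turn_pages(FakePage([False, False]), last_page_only=True) yields a single extraction in which both items are fully expanded, while the default mode also records the intermediate, partially expanded states.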
Functions
  • Action.ExportXML: Exports the given data as an XML file.
  • Action.ExportJSON: Exports the given data as a JSON file.
  • Browser.SelectDocument: This action selects the full document when some other element is selected so that other elements inside the document (and not just inside the currently selected element) can be selected.
  • Browser.DownloadAs: Downloads the file located at the given URL with the given file name.
  • Gather.OwnText: Gathers the text of the currently selected element, excluding the text of child elements.
  • Sequence.Default: Produces a sequence containing a single empty value, such as an empty string or the number zero, depending on the type of the sequence.
  • Sequence.Last: Selects the last element from the given sequence.
  • Sequence.Reverse: Returns the given sequence reversed, without reversing the input sequence. Note that this action should only be used with sequences that don't perform any action, such as selectors.
  • Sequence.Chain: Sets the browser to the state that results from sequentially applying the given action to each of the elements selected by the given selector. This action can be very useful in some scenarios. See its documentation for more details and an example.
  • State.TurnPages: Use this action if Helium Scraper is going back to page 1 and then turning the pages all the way to the current page, for each visited page. This will occur if the next button is a JavaScript link and actions are performed on the browser after each page is loaded.
  • String.IsMatch: Checks whether the given regular expression pattern finds a match in the given text and returns true or false.
  • String.Download: Downloads a string from the given URL and returns its contents.
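
For readers who think in code, here is a rough plain-Python analogue of what Gather.OwnText, String.IsMatch and Sequence.Reverse are described as doing. This is not Helium Scraper syntax: own_text, is_match and reversed_copy are hypothetical helpers written for this sketch, and treating String.IsMatch as a search anywhere in the text (rather than a full-string match) is an assumption based on the wording above.

```python
import re
import xml.etree.ElementTree as ET


def own_text(element):
    """Text that belongs directly to the element, skipping child elements.

    In ElementTree, an element's own text is its .text plus the .tail of
    each direct child (the text that follows the child's closing tag).
    """
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    return "".join(parts)


def is_match(pattern, text):
    """True if the regular expression finds a match anywhere in text."""
    return re.search(pattern, text) is not None


def reversed_copy(sequence):
    """Return a new reversed list and leave the input untouched."""
    return list(reversed(list(sequence)))


if __name__ == "__main__":
    price = ET.fromstring("<div>Price: <b>10</b> USD</div>")
    print(own_text(price))             # text of <b> is left out
    print(is_match(r"\d+", "abc 123"))
    print(reversed_copy([1, 2, 3]))
```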
Juan Soldi
The Helium Scraper Team
