Helium Scraper

Posted: **Sat Aug 10, 2019 9:20 pm**

This extension uses the Common Crawl database to find lists of URLs. To use it, first import it by downloading and double clicking the attached extension, or by installing it at File -> Extensions. A new action called CommonCrawl.Find will be added, which takes the following arguments:

urlPattern: A URL pattern for which similar URLs will be found. Wildcards can (and should) be used as described on the common crawl documentation, so more than one URL is returned.
yearFrom: The first year of archives to get. This is the year the URL was crawled, not necessarily when the page was created or updated. Years range from 2008 to the current year.
yearTo: The last year of archives to get. This is the year the URL was crawled, not necessarily when the page was created or updated. Years range from 2008 to the current year.
maxItemsPerMonth: The maximum number of URLs to return per month.

An additional action called CommonCraw.FindList will be added, which returns a list instead of a sequence. This can be used if list functions need to be applied to the results.

The following example will extract a list of Wikipedia URLs:

Code: Select all

CommonCrawl.Find
   ·  "https://en.wikipedia.org/*"
   ·  2010
   ·  2019
   ·  1000
as url
extract
   url
      url

Helium Scraper

Common Crawl (URL Finder)

Common Crawl (URL Finder)