Common Crawl (URL Finder)

Post by webmaster » Sat Aug 10, 2019 9:20 pm

This extension uses the Common Crawl database to find lists of URLs. To use it, first import it by downloading the attached extension and double-clicking it, or by installing it from File -> Extensions. A new action called CommonCrawl.Find will be added, which takes the following arguments:
  • urlPattern: A URL pattern for which matching URLs will be found. Wildcards can (and should) be used, as described in the Common Crawl documentation, so that more than one URL is returned.
  • yearFrom: The first year of archives to search. This is the year the URL was crawled, not necessarily when the page was created or last updated. Years range from 2008 to the current year.
  • yearTo: The last year of archives to search. As with yearFrom, this refers to the year the URL was crawled, not necessarily when the page was created or last updated.
  • maxItemsPerMonth: The maximum number of URLs to return per month.
An additional action called CommonCrawl.FindList will be added, which returns a list instead of a sequence. This can be used when list functions need to be applied to the results.
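
For anyone curious about what the action does behind the scenes, or who needs the same lookup outside Helium Scraper, Common Crawl exposes a public index (CDX) API that this extension presumably queries. The Python sketch below works under that assumption; find_urls is a hypothetical name, and its limit caps results per crawl index, which only approximates maxItemsPerMonth since Common Crawl publishes roughly one index per month.

Code:

import json
import requests

def find_urls(url_pattern, year_from, year_to, max_items=1000):
    """Minimal sketch of a CommonCrawl.Find-style lookup against
    Common Crawl's public index (CDX) API. Yields URLs lazily, like
    Find; wrapping the call in list() roughly mirrors FindList."""
    # collinfo.json lists every crawl index, e.g. "CC-MAIN-2019-35".
    indexes = requests.get("https://index.commoncrawl.org/collinfo.json").json()
    for index in indexes:
        # The crawl year is embedded in the index id: CC-MAIN-<year>-<week>.
        year = int(index["id"].split("-")[2])
        if not year_from <= year <= year_to:
            continue
        resp = requests.get(index["cdx-api"], params={
            "url": url_pattern,  # wildcards work as in the Common Crawl docs
            "output": "json",    # one JSON record per line
            "limit": max_items,  # per-index cap (crawls are roughly monthly)
        })
        if resp.status_code != 200:  # no captures in this index
            continue
        for line in resp.text.splitlines():
            yield json.loads(line)["url"]

# Same query as the Helium Scraper example below, capped at 10 URLs per index.
for url in find_urls("https://en.wikipedia.org/*", 2010, 2019, 10):
    print(url)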

The following example extracts a list of Wikipedia URLs:

Code:

CommonCrawl.Find
   ·  "https://en.wikipedia.org/*"
   ·  2010
   ·  2019
   ·  1000
as url
extract
   url
      url
Attachments
CommonCrawl.hsxt (200 Bytes)
Juan Soldi
The Helium Scraper Team
