Common Crawl (URL Finder)
Posted: Sat Aug 10, 2019 9:20 pm
This extension uses the Common Crawl database to find lists of URLs. To use it, first import it by downloading and double clicking the attached extension, or by installing it at File -> Extensions. A new action called CommonCrawl.Find will be added, which takes the following arguments:
The following example will extract a list of Wikipedia URLs:
- urlPattern: A URL pattern for which similar URLs will be found. Wildcards can (and should) be used as described on the common crawl documentation, so more than one URL is returned.
- yearFrom: The first year of archives to get. This is the year the URL was crawled, not necessarily when the page was created or updated. Years range from 2008 to the current year.
- yearTo: The last year of archives to get. This is the year the URL was crawled, not necessarily when the page was created or updated. Years range from 2008 to the current year.
- maxItemsPerMonth: The maximum number of URLs to return per month.
The following example will extract a list of Wikipedia URLs:
Code: Select all
CommonCrawl.Find
· "https://en.wikipedia.org/*"
· 2010
· 2019
· 1000
as url
extract
url
url