Introducing Common Crawler

Common Crawler is a free version of Helium Scraper that loads pages from the Common Crawl database instead of from the live web. Aimed at both developers and non-developers, it makes it easy to query the Common Crawl data and then create selectors and actions that extract structured data from the target HTML pages. It does this by rendering the HTML content in a web browser and letting you visually set up the extraction logic.

To get started, download and run the setup file and then follow the installation instructions. For a quick demonstration, check out the following video:

The first unique feature is the Common Crawl Editor, found under the View menu. This editor takes a list of parameters that follow the same specification as the Common Crawl Index API and compiles them into a query that is run against the full list of archives. The set of archives to be queried can be limited by adjusting the year range at the top. (The query string uses a custom format consumed by the Crawl.LoadAll action, described below.)
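For readers who want to see what such a query looks like under the hood, here is a minimal Python sketch of calling the public Common Crawl Index API directly. The archive ID and URL pattern are illustrative placeholders; the editor's own parameter handling is internal to Common Crawler.

```python
import json
import urllib.parse
import urllib.request

def build_index_query(archive_id, params):
    """Build a Common Crawl Index API URL for one archive.

    `params` uses the same parameter names as the Index API
    (url, matchType, filter, ...), plus output=json for parseable results.
    """
    query = urllib.parse.urlencode({**params, "output": "json"})
    return f"https://index.commoncrawl.org/{archive_id}-index?{query}"

def query_index(archive_id, params):
    """Run the query; the API returns one JSON record per line."""
    with urllib.request.urlopen(build_index_query(archive_id, params)) as resp:
        return [json.loads(line) for line in resp.read().splitlines()]

# Example: captures of example.com pages in one 2024 archive (hypothetical).
url = build_index_query("CC-MAIN-2024-10", {"url": "example.com/*"})
print(url)
```

Each returned record includes, among other fields, the WARC filename, byte offset, and length of the capture, which is exactly the information needed to retrieve the page itself.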

The list of results is then displayed with an additional column called CrawlUrl, a custom URL that contains all the information needed to download the HTML from the Common Crawl corpus. This URL can be copied and pasted into the web browser's address bar to load the page, where you can then visually create selectors and actions that scrape information.
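The CrawlUrl format itself is specific to Common Crawler, but the underlying retrieval technique is standard: each capture lives at a byte range inside a large gzipped WARC file, fetched with an HTTP Range request. A sketch, with a hypothetical filename:

```python
import gzip
import urllib.request

def build_range_request(filename, offset, length):
    """Build the URL and Range header for one capture.

    filename/offset/length come from a Common Crawl index record; each
    capture is an independent gzip member inside a large WARC file.
    """
    url = f"https://data.commoncrawl.org/{filename}"
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return url, headers

def fetch_capture(filename, offset, length):
    """Fetch and decompress one WARC record (WARC headers + HTTP headers + HTML)."""
    url, headers = build_range_request(filename, offset, length)
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read()).decode("utf-8", errors="replace")

# Hypothetical record values, just to show the request that gets built:
url, headers = build_range_request("crawl-data/example.warc.gz", 1000, 200)
print(url, headers["Range"])
```

Because the Range request downloads only the bytes of that one record, no crawl of the live site ever happens.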

The second unique feature is the Crawl.LoadAll action, which takes the query generated by the Common Crawl Editor. Note that this query uses a custom format that includes the year range in addition to the actual query string (after the question mark), which is passed to each of the archives within the year range.
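The exact format is internal to Common Crawler, but conceptually it splits into two parts, which a parser might handle like this (the `"2020-2023?url=example.com/*"` shape is purely an assumed illustration, not the documented format):

```python
def parse_crawl_query(query):
    """Split a Crawl.LoadAll-style query into a year range and an index query.

    Assumes a hypothetical shape like "2020-2023?url=example.com/*":
    a year range before the question mark, and a Common Crawl Index API
    query string after it, to be sent to every archive in that range.
    """
    years, _, index_query = query.partition("?")
    start, _, end = years.partition("-")
    return int(start), int(end), index_query

print(parse_crawl_query("2020-2023?url=example.com/*"))
```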

This action produces the same results as the ones displayed by the Common Crawl Editor, and it also loads the value of the CrawlUrl column in the off-screen browsers. Note that, for increased performance, the page is loaded lazily: if no data is retrieved from the HTML, the page is never actually requested or loaded. This allows you to apply filters to the results and prevent unwanted pages from being downloaded.
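The lazy-loading idea can be sketched in a few lines of Python. This is a conceptual illustration of the pattern, not Common Crawler's implementation: each result carries a thunk that performs the download only when called, so records filtered out on index metadata alone never trigger a request.

```python
def lazy_pages(records, fetch):
    """Yield (record, thunk) pairs; the page body is fetched only when the
    thunk is called, so filtered-out records cost no download."""
    for record in records:
        yield record, (lambda r=record: fetch(r))

# A fake fetcher that records which URLs were actually downloaded.
downloads = []
def fake_fetch(record):
    downloads.append(record["url"])
    return "<html>...</html>"

records = [{"url": "https://a.example/keep"}, {"url": "https://b.example/skip"}]
for record, get_html in lazy_pages(records, fake_fetch):
    if "keep" in record["url"]:   # filter on metadata only
        html = get_html()         # the request happens only here
print(downloads)                  # only the kept page was fetched
```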

The rest of the features are the same as in Helium Scraper and are detailed in the documentation; additional resources and information can be found on our Learn page. I recommend watching some of the introductory tutorials to get an idea of what the other features do.

If you have any questions or feature requests, feel free to contact me through the contact page.
