{"id":309,"date":"2019-07-26T17:40:28","date_gmt":"2019-07-26T17:40:28","guid":{"rendered":"https:\/\/www.heliumscraper.com\/wordpress\/?p=309"},"modified":"2023-08-15T11:37:21","modified_gmt":"2023-08-15T11:37:21","slug":"introducing-common-crawler","status":"publish","type":"post","link":"https:\/\/www.heliumscraper.com\/blog\/introducing-common-crawler\/","title":{"rendered":"Introducing Common Crawler"},"content":{"rendered":"\n<p>Common Crawler is a free version of Helium Scraper that loads pages from the Common Crawl database instead of from the web. Aimed at both developers and non-developers, it makes it easy to query the Common Crawl data and then create selectors and actions that extract structured data from the target HTML pages. It does this by rendering the HTML content in a web browser so that you can set up the extraction logic visually.<\/p>\n\n\n\n<p>To get started, <a href=\"https:\/\/heliumscraper.com\/commoncrawler\/setup\/setup.exe\">download<\/a> and run the setup file, then follow the installation instructions. For a quick demonstration, check out the following video:<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" width=\"678\" height=\"381\" src=\"https:\/\/www.youtube.com\/embed\/QVqbySDpdyQ?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<p>The first unique feature is the <strong>Common Crawl Editor<\/strong>, which can be found under the <strong>View<\/strong> menu. This editor takes a list of parameters that follow the same specification as the <a href=\"https:\/\/github.com\/webrecorder\/pywb\/wiki\/CDX-Server-API#api-reference\">Common Crawl Index API<\/a> and compiles them into a query that is run against the full list of archives. 
The set of archives to be queried can be limited by changing the year range at the top. (The query string uses a custom format, the one expected by the <strong>Crawl.LoadAll<\/strong> action that I&#8217;ll describe below.)<\/p>\n\n\n\n<p>The results are then displayed with an additional column called <strong>CrawlUrl<\/strong>, a custom URL that contains all the information needed to download the HTML from the Common Crawl corpus. You can copy this URL and paste it into the web browser&#8217;s address bar to load the page, and then visually create selectors and actions that scrape information.<\/p>\n\n\n\n<p>The second unique feature is the <strong>Crawl.LoadAll<\/strong> action, which takes the query generated by the Common Crawl Editor. Note that this query uses a custom format that includes the year range in addition to the actual query string (the part after the question mark), which is passed to each of the archives within the year range.<\/p>\n\n\n\n<p>This action produces the same results as the ones displayed by the Common Crawl Editor, and it also loads the value of the <strong>CrawlUrl<\/strong> column in the off-screen browsers. Note that, for better performance, the page is loaded lazily: if no data is retrieved from the HTML, the page is never requested or loaded. This allows you to apply filters to the results and prevent unwanted pages from being downloaded.<\/p>\n\n\n\n<p>The rest of the features are the same as in Helium Scraper and are detailed in the <a href=\"https:\/\/www.heliumscraper.com\/eng\/help\/\">documentation<\/a>. Additional resources and information can be found on our <a href=\"https:\/\/www.heliumscraper.com\/eng\/learn.php\">Learn<\/a> page. 
I recommend watching some of the introductory tutorials to get an idea of what the other features do.<\/p>\n\n\n\n<p>If you have any questions or feature requests, feel free to contact me using the <a href=\"https:\/\/www.heliumscraper.com\/eng\/contact.php?contact_reason=crawler_feedback\">contact page<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Common Crawler is a free version of Helium Scraper that loads pages from the Common Crawl database instead of from the web. Aimed at both developers and non-developers, it makes it easy to query the Common Crawl data and then create selectors and actions that extract structured data from the target HTML [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":319,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[11,10,6],"_links":{"self":[{"href":"https:\/\/www.heliumscraper.com\/blog\/wp-json\/wp\/v2\/posts\/309"}],"collection":[{"href":"https:\/\/www.heliumscraper.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.heliumscraper.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.heliumscraper.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.heliumscraper.com\/blog\/wp-json\/wp\/v2\/comments?post=309"}],"version-history":[{"count":13,"href":"https:\/\/www.heliumscraper.com\/blog\/wp-json\/wp\/v2\/posts\/309\/revisions"}],"predecessor-version":[{"id":736,"href":"https:\/\/www.heliumscraper.com\/blog\/wp-json\/wp\/v2\/posts\/309\/revisions\/736"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.heliumscraper.com\/blog\/wp-json\/wp\/v2\/media\/319"}],"wp:attachment":[{"href":"https:\/\/www.heliumscraper.com\/blog\/wp-json\/wp\/v2\/media?parent=309"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.heliumscraper.com\/blog\/wp-json\/wp\/v2\/categories?post=309"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.heliumscraper.com\/blog\/wp-json\/wp\/v2\/tags?post=309"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}