Common Crawler: is it possible to download the found html files?

Questions & Answers about Helium Scraper 3
mangowuvvr69
Posts: 2
Joined: Mon Jun 08, 2020 2:56 pm

Common Crawler: is it possible to download the found html files?

Post by mangowuvvr69 » Mon Jun 08, 2020 3:02 pm

Hello.

First of all, thank you for a great app.

My question is whether I can download the results of my query as HTML files. I was able to retrieve the data from Common Crawl using your tool, but I need the extracted HTML files themselves. Is that possible? If so, how?

Thank you.

webmaster
Site Admin
Posts: 521
Joined: Mon Dec 06, 2010 8:39 am

Re: Common Crawler: is it possible to download the found html files?

Post by webmaster » Tue Jun 09, 2020 9:12 pm

We've just updated Common Crawler to include the Sequence.WriteFile function. If you don't get an update prompt, this may be because we've migrated the publish location to AWS. If so, just uninstall it and reinstall it from here.

Once you have the latest version (3.2.4.9) you can do this to save the HTML into files:

Crawl.LoadAll
   ·  "2018-2019?url=https%3A%2F%2Fwww.imdb.com%2Ftitle%2F%2A&limit=100&filter==status:200&filter==mime:text%2Fhtml"
as (digest fileName length mime mimeDetected offset status timestamp url urlKey crawlUrl)
extract
   fileName
      fileName
   file
      Gather.HTML
      as html
      Sequence.WriteFile
         ·  html
         ·  +
               ·  fileName
               ·  ".html"
         ·  false
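For readers unfamiliar with the query string passed to Crawl.LoadAll, here is a minimal Python sketch that decodes it. It only uses the standard library; the constant and the function name `decode_query` are made up for illustration and are not part of Common Crawler. Reading the part before the `?` as a range of crawl years, and the leading `=` in each filter value as the index's exact-match operator, is an assumption rather than something stated in this thread.

```python
from urllib.parse import parse_qs

# The query string used in the snippet above, copied verbatim.
QUERY = ("2018-2019?url=https%3A%2F%2Fwww.imdb.com%2Ftitle%2F%2A&limit=100"
         "&filter==status:200&filter==mime:text%2Fhtml")


def decode_query(query):
    """Split 'prefix?params' and percent-decode the parameters."""
    prefix, _, params = query.partition("?")
    # parse_qs splits each pair on the first '=' only, so a value such as
    # '=status:200' keeps its leading '='.
    return prefix, parse_qs(params)


years, params = decode_query(QUERY)
print(years)
for name, values in params.items():
    print(name, values)
```

So the query asks for up to 100 index entries under https://www.imdb.com/title/ that have status 200 and MIME type text/html.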

Or, if you just want to extract the full HTML into a table, it's even simpler:

Crawl.LoadAll
   ·  "2018-2019?url=https%3A%2F%2Fwww.imdb.com%2Ftitle%2F%2A&limit=100&filter==status:200&filter==mime:text%2Fhtml"
as (digest fileName length mime mimeDetected offset status timestamp url urlKey crawlUrl)
extract
   fileName
      fileName
   file
      Gather.HTML
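If you ever need to fetch a page outside Helium Scraper, the fileName, offset and length columns are enough to pull a single record from Common Crawl. The following is a rough Python sketch of that idea, not a description of how Gather.HTML is implemented. The host https://data.commoncrawl.org/ and the helper names are assumptions that do not come from this thread, and the sketch does not handle chunked or otherwise encoded HTTP bodies.

```python
import gzip
import urllib.request

# Assumption: public Common Crawl download host (not taken from this thread).
BASE_URL = "https://data.commoncrawl.org/"


def range_request(file_name, offset, length):
    """Build the URL and Range header that cover exactly one WARC record."""
    last_byte = int(offset) + int(length) - 1
    return BASE_URL + file_name, {"Range": f"bytes={offset}-{last_byte}"}


def html_from_record(compressed):
    """Decompress one gzipped WARC record and return the HTTP body as bytes."""
    record = gzip.decompress(compressed)
    # WARC headers, HTTP headers and the body are separated by blank lines.
    _warc_headers, _, http_part = record.partition(b"\r\n\r\n")
    _http_headers, _, body = http_part.partition(b"\r\n\r\n")
    # Drop the CRLF pair that terminates the record.
    return body.rstrip(b"\r\n")


def fetch_html(file_name, offset, length):
    """Download one record over HTTP and return its HTML (needs network)."""
    url, headers = range_request(file_name, offset, length)
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return html_from_record(response.read())
```

Calling `fetch_html` with the values from one row of the table should return the raw bytes of that page, which you can then write to a file of your choice.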

Juan Soldi
The Helium Scraper Team

mangowuvvr69
Posts: 2
Joined: Mon Jun 08, 2020 2:56 pm

Re: Common Crawler: is it possible to download the found html files?

Post by mangowuvvr69 » Wed Jun 10, 2020 10:11 am

Thank you very much for your help! I've just manually updated the app, followed your guide and it worked perfectly.
