Common Crawler: is it possible to download the found html files?

Posted: Mon Jun 08, 2020 3:02 pm
by mangowuvvr69
Hello.

First of all, thank you for a great app.

My question is whether I can download the results of my query as HTML files. I was able to retrieve the data from Common Crawl using your tool, but I need the extracted HTML. Is that possible? If so, how?

Thank you.

Re: Common Crawler: is it possible to download the found html files?

Posted: Tue Jun 09, 2020 9:12 pm
by webmaster
We've just updated Common Crawler to include the Sequence.WriteFile function. If you don't get an update prompt, it may be because we've migrated the publish location to AWS. In that case, just uninstall the app and reinstall it from here.

Once you have the latest version (3.2.4.9) you can do this to save the HTML into files:

Code:

Crawl.LoadAll
   ·  "2018-2019?url=https%3A%2F%2Fwww.imdb.com%2Ftitle%2F%2A&limit=100&filter==status:200&filter==mime:text%2Fhtml"
as (digest fileName length mime mimeDetected offset status timestamp url urlKey crawlUrl)
extract
   fileName
      fileName
   file
      Gather.HTML
      as html
      Sequence.WriteFile
         ·  html
         ·  +
               ·  fileName
               ·  ".html"
         ·  false

Or if you just want to extract the full HTML into a table, it's even simpler:

Code:

Crawl.LoadAll
   ·  "2018-2019?url=https%3A%2F%2Fwww.imdb.com%2Ftitle%2F%2A&limit=100&filter==status:200&filter==mime:text%2Fhtml"
as (digest fileName length mime mimeDetected offset status timestamp url urlKey crawlUrl)
extract
   fileName
      fileName
   file
      Gather.HTML
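
For anyone who wants to see what the fileName, offset and length columns are for, here is a minimal Python sketch of fetching one page's HTML directly from Common Crawl. It is not how Common Crawler does it internally. The host https://data.commoncrawl.org/ and the helper names (range_request, html_from_record, fetch_html) are assumptions made for illustration, and the sketch assumes each index row points at one gzip-compressed WARC response record.

```python
import gzip
import urllib.request

# Assumption: the public Common Crawl data host; it is not named in this thread.
DATA_HOST = "https://data.commoncrawl.org/"


def range_request(file_name: str, offset: int, length: int) -> tuple[str, dict[str, str]]:
    """Return the URL and HTTP headers that select one record inside a WARC file."""
    last_byte = offset + length - 1  # HTTP byte ranges are inclusive at both ends
    return DATA_HOST + file_name, {"Range": f"bytes={offset}-{last_byte}"}


def html_from_record(compressed: bytes) -> bytes:
    """Decompress one gzip-compressed WARC response record and return the HTTP body."""
    record = gzip.decompress(compressed)
    # A response record is: WARC headers, blank line, HTTP headers, blank line, body.
    _warc_headers, _, http_response = record.partition(b"\r\n\r\n")
    _http_headers, _, body = http_response.partition(b"\r\n\r\n")
    # Each WARC record is terminated by two CRLFs; drop them from the body.
    return body.removesuffix(b"\r\n\r\n")


def fetch_html(file_name: str, offset: int, length: int) -> bytes:
    """Download a single record with a ranged GET and return its HTML (needs network)."""
    url, headers = range_request(file_name, offset, length)
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return html_from_record(response.read())
```

The index returns offset and length as text, so convert them with int() before calling fetch_html. The result is raw bytes; writing it to a file named after the page is left to the caller.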

Re: Common Crawler: is it possible to download the found html files?

Posted: Wed Jun 10, 2020 10:11 am
by mangowuvvr69
Thank you very much for your help! I've just manually updated the app, followed your guide and it worked perfectly.