Common Crawler: is it possible to download the found html files?
Posted: Mon Jun 08, 2020 3:02 pm
by mangowuvvr69
Hello.
First of all, thank you for a great app.
My question is whether I can download the results of my query as HTML files. I was able to retrieve the data from Common Crawl using your tool, but I need the extracted HTML pages. Is that possible? If so, how?
Thank you.
Re: Common Crawler: is it possible to download the found html files?
Posted: Tue Jun 09, 2020 9:12 pm
by webmaster
We've just updated Common Crawler to include the Sequence.WriteFile function. If you don't get an update prompt, this may be because we've migrated the publish location to AWS. If so, just uninstall it and reinstall it from here.
Once you have the latest version (3.2.4.9) you can do this to save the HTML into files:
Code:
Crawl.LoadAll
· "2018-2019?url=https%3A%2F%2Fwww.imdb.com%2Ftitle%2F%2A&limit=100&filter==status:200&filter==mime:text%2Fhtml"
as (digest fileName length mime mimeDetected offset status timestamp url urlKey crawlUrl)
extract
fileName
fileName
file
Gather.HTML
as html
Sequence.WriteFile
· html
· +
· fileName
· ".html"
· false
Or if you just want to extract the full HTML into a table, it's even simpler:
Code:
Crawl.LoadAll
· "2018-2019?url=https%3A%2F%2Fwww.imdb.com%2Ftitle%2F%2A&limit=100&filter==status:200&filter==mime:text%2Fhtml"
as (digest fileName length mime mimeDetected offset status timestamp url urlKey crawlUrl)
extract
fileName
fileName
file
Gather.HTML
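For anyone wondering what the query string in the scripts above actually asks for, here is a minimal Python sketch that decodes it. The function names (decode_query, byte_range) are made up for this illustration and are not part of Common Crawler. The leading "=" on each filter value appears to be the index server's exact-match syntax. The second helper shows the usual way the offset and length columns map to an HTTP Range header when pulling a single record out of a Common Crawl archive file; whether Common Crawler does exactly this internally is an assumption.

```python
from urllib.parse import parse_qs

# Query string copied from the scripts above. "2018-2019" is assumed to be
# the tool's crawl selector; everything after "?" is a normal query string.
QUERY = (
    "2018-2019?url=https%3A%2F%2Fwww.imdb.com%2Ftitle%2F%2A&limit=100"
    "&filter==status:200&filter==mime:text%2Fhtml"
)


def decode_query(query):
    """Split off the selector and percent-decode the parameters."""
    selector, _, query_string = query.partition("?")
    return selector, parse_qs(query_string)


def byte_range(offset, length):
    """Build a Range header value covering one record (end is inclusive)."""
    return f"bytes={offset}-{offset + length - 1}"


if __name__ == "__main__":
    selector, params = decode_query(QUERY)
    print(selector)
    for name, values in params.items():
        print(name, values)
    # Hypothetical offset and length values, for illustration only.
    print(byte_range(100, 50))
```

Decoded, the query asks for up to 100 captures under https://www.imdb.com/title/* with status 200 and MIME type text/html.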
Re: Common Crawler: is it possible to download the found html files?
Posted: Wed Jun 10, 2020 10:11 am
by mangowuvvr69
Thank you very much for your help! I've just manually updated the app, followed your guide and it worked perfectly.