December 11, 2018

The Web Scraping Dilemma

The web scraping community seems to be divided into two sub-worlds. One is the world of programmers, who often use Python or JavaScript to carefully craft their agents down to the last detail, in a time-consuming but ultimately rewarding process. The other is the world of layman users, who must choose from a plethora of existing web scrapers, all of which include a set of ready-to-go commands but are limited in scope. This often forces users to spend money on several tools in an attempt to cover a wide range of possible web scraping scenarios.

The heart of the matter is something called Turing completeness. It is a property shared by most programming languages, including the ones I mentioned before, and it is what makes them so powerful when compared to most point-and-click tools. In a few words, a language or system is Turing complete when it can simulate any other system that can in principle be simulated, including itself! This implies that the functionality of any ready-to-go web scraping software can be recreated using any Turing complete programming language, as long as it has access to the Internet and the content of websites.
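As a small illustration of that last point, here is a minimal sketch of how a typical ready-to-go command, "extract all links from this page", might be recreated in a general-purpose language. It uses only Python's standard library and parses an inline sample page rather than a live website; the names `LinkExtractor`, `extract_links` and `SAMPLE_PAGE` are made up for this example and do not come from any particular scraping tool.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the href values of all links in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page content, standing in for a downloaded website.
SAMPLE_PAGE = """
<html><body>
  <a href="/products/1">First product</a>
  <a href="/products/2">Second product</a>
  <a name="anchor-without-href">Not a link</a>
</body></html>
"""

links = extract_links(SAMPLE_PAGE)
print(links)
```

A real agent would also have to download pages, follow links and handle errors, which is exactly the time-consuming part that ready-to-go tools hide from the user.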

Here is where the dilemma comes into play. Turing completeness is an all-encompassing solution, but programming languages are hard to learn, and even programmers are often hesitant to learn new ones. Yet programming languages don’t have to be difficult. In fact, they don’t even need to be programming languages in the way we typically understand the term.

Most programming languages include a long list of keywords and symbols, but the vast majority of them are simply syntactic sugar: pieces of code used to shorten or prettify commands that then get translated by the compiler into other commands. But when I say prettify, I mean it for the eye of the programmer, not necessarily for everyone else. As an illustration, one of the oldest Turing complete languages, called Lambda calculus, consists of only four symbols plus variable names. In fact, this language is the foundation stone of many of today’s programming languages, and as modest as it seems, it could be used to simulate any conceivable computer program. But then again, despite its simplicity, it is not very user-friendly, even for programmers.
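To give a feel for how far so few ingredients can go, here is a minimal sketch of Church numerals, a classic Lambda calculus encoding in which numbers and arithmetic are built from nothing but single-argument functions. It is written with Python's `lambda` for convenience. The names `ZERO`, `SUCC`, `ADD` and `MUL` are just labels chosen for this example, and `to_int` is a helper that steps outside pure Lambda calculus only so the results can be printed.

```python
# A Church numeral n is a function that applies f to x exactly n times.
ZERO = lambda f: lambda x: x

# Successor: apply f one more time than n does.
SUCC = lambda n: lambda f: lambda x: f(n(f)(x))

# Addition: apply f n times, then m more times.
ADD = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))

# Multiplication: repeat "apply f n times" m times.
MUL = lambda m: lambda n: lambda f: m(n(f))


def to_int(n):
    """Helper (not part of Lambda calculus): count how many times f is applied."""
    return n(lambda k: k + 1)(0)


ONE = SUCC(ZERO)
TWO = SUCC(ONE)
THREE = ADD(ONE)(TWO)
SIX = MUL(TWO)(THREE)

print(to_int(THREE), to_int(SIX))  # prints: 3 6
```

It works, but it also shows the point made above: the encoding is minimal, yet hardly something a non-programmer would want to read or write.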

The solution we came up with was to build on top of Lambda calculus, but this time with the layman user in mind. With a few tweaks, Lambda calculus is versatile enough to be transformed into a beautiful, symbol-less, bullet-point kind of language. It looks like, and most importantly can be worked with like, your everyday ready-to-go web scraping software, without sacrificing computational power. The result is that there’s no need to learn a new language on top of the set of commands and features you’d have to learn with any new web scraper, and yet you can still replicate the extraction logic of any other web scraping tool. In addition, it allowed us to seamlessly integrate ready-to-go features such as the wizard and automatic parallel extraction, and to write and share ready-made snippets that can be copied and pasted into Helium Scraper for common tasks.

It is a well-known fact that programmers tend to think like programmers, and, naturally, programming languages are designed with programmers in mind. But it is not only possible but also preferable to let non-programmers benefit from the power of Turing completeness. This can be achieved by borrowing from tools we’re all familiar with, such as bullet-point lists and natural language, which, arguably, is the one language every other Turing complete language is based upon.

Tags: data extraction, helium scraper, web scraping

Helium Scraper Blog


About Author

Juan Soldi
