Programming Helium Scraper
Note: This post was written for an older version of Helium Scraper.
I’m assuming you already have a little JavaScript knowledge. If not, here is a quick JavaScript tutorial that covers all you need to know for the purpose of this tutorial. I’m also assuming you have some experience working with Helium Scraper.
In Helium Scraper’s “Execute JS” action, the JavaScript code you write is injected into the current webpage as a function, and that function is then called. This means your code has full access to all the elements inside the current page. Everything you need to know about JavaScript as it relates to Helium Scraper can be found in the documentation at Actions -> Actions List -> Execute JavaScript. You might also find Helium Scraper’s log, at Project -> View Log, useful while coding, since JavaScript errors will show up there.
So let’s do some coding. First, make sure Helium Scraper’s browser (the tab on the left with a little padlock) is on a webpage, any webpage. Go to an actions tree and add an “Execute JS” action. Remove the default line of code, and paste the following code:
currentUrl = window.location.href;
alert(currentUrl);
Now press play to get a message that shows the current URL. The window object is a global object that represents the window in which the webpage resides, and contains information about it. Here are some more details about it. One of the most frequently used objects inside the window object is the document object.
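As a quick example, you can replace the code above with the two lines below to see a couple of things the document object knows about the page (these are standard DOM properties, so they work on any page):
alert(document.title);
alert(document.links.length + " links on this page");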
Now let’s mix this with kinds. Let’s use this very page as our guinea pig. Navigate here with Helium Scraper, and create a kind called “Items” (do call it “Items” please!) that selects the following 3 elements:
- This is just random text
- Txet modnar tsuj si siht
- Hey, what’s that in number 2?
Now go back to our code editor, delete any code if present and paste this:
Global.Browser.SelectKind("Items");
var selectedItemsCount = Global.Browser.Selection.Count;
alert(selectedItemsCount);
Now press play to get a message box that shows “3”. The first line selects the “Items” kind, the second assigns the number of selected elements to the selectedItemsCount variable, and the third shows the value of that variable. Global is a built-in Helium Scraper object that is passed as a parameter to our code (which, as I said above, ends up inside a function and can therefore receive parameters). The parameters passed to our code will always be Global, Tree and Node (unless there is some major update to Helium Scraper). See Actions -> Actions List -> Execute JavaScript in the documentation for more information about them.
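To picture what happens behind the scenes, you can imagine your code being wrapped in something roughly like the function below and then called by Helium Scraper (this is only a mental model, not the actual internal code):
// Hypothetical wrapper: your code ends up as the body of a function
// that receives Global, Tree and Node as parameters.
function yourCode(Global, Tree, Node)
{
    Global.Browser.SelectKind("Items");
    var selectedItemsCount = Global.Browser.Selection.Count;
    alert(selectedItemsCount);
}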
Now delete the previous code and paste this one:
Global.Browser.SelectKind("Items");
for (i = 0; i < Global.Browser.Selection.Count; i++)
{
alert(Global.Browser.Selection.GetItem(i).innerText);
}
Press play and you will get three message boxes showing the text of each of our three items. Global.Browser.Selection.Count in this case will be 3, so the for loop will run three times and the values of the i variable will be 0, 1 and 2. The GetItem method of the Selection object gets the selected element at the specified zero-based index (this means the first item is item 0). Most elements contain a property called innerText that holds the text of the element. Here is a complete list of properties used by elements. They are under the “Property” column of the “Members Table”. Note that this is Microsoft’s table, so some of these properties might not be compatible with other browsers, but Helium Scraper uses Internet Explorer’s JavaScript interpreter, so we are good to go. Also, note that not all of these properties apply to every kind of element.
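As a small variation, you can collect the text of every selected item into a single message box instead of showing one per item, which is handier when a kind selects many elements:
Global.Browser.SelectKind("Items");
var texts = [];
for (var i = 0; i < Global.Browser.Selection.Count; i++)
{
    // Gather each item's text instead of alerting it right away.
    texts.push(Global.Browser.Selection.GetItem(i).innerText);
}
alert(texts.join("\n"));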
Now let’s play a little with the database. Create a table called “Table1” with two columns called “Column1” and “Column2” (each column in the table corresponds to a row in the “New Table” dialog that opens after you press the “Create table” button). Leave everything else as is and press OK. Now double click the table, add three rows with random values, and press the “Save changes” button. Then go back to the code editor and replace the code with this one:
results = Global.DataBase.Query("SELECT * FROM [Table1]").ToObjects();
for(i = 0; i < results.length; i++)
{
row = results[i];
col1 = row.Column1;
col2 = row.Column2;
alert("In row " + i + ", Column1 = " + col1 + " and Column2 = " + col2);
}
This will access the content of our table and show it to us. The string “SELECT * FROM [Table1]” is a simple SQL query that selects all the records in the “Table1” table. Check out this page for a quick explanation of SQL queries in case you are curious. The Query method runs the given SQL query and returns a DataBaseReader object, which contains a method called ToObjects. This method returns the results as an array of objects where each object corresponds to a row in the database, and each property of each object corresponds to a column. The documentation for the Query method is at Actions -> Actions List -> Execute JavaScript -> Class List -> DataBaseObject in Helium Scraper’s documentation.
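And because Query takes an ordinary SQL string, you can filter or sort before looping. As a sketch (assuming Column1 holds text values and that the usual WHERE and ORDER BY clauses are available), this would show only the rows whose Column1 is “abc”:
results = Global.DataBase.Query(
    "SELECT * FROM [Table1] WHERE [Column1] = 'abc' ORDER BY [Column2]"
).ToObjects();
for (var i = 0; i < results.length; i++)
{
    alert("Column1 = " + results[i].Column1 + ", Column2 = " + results[i].Column2);
}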
Now I’ll put kinds and database together. Replace the previous code with this one:
results = Global.DataBase.Query("SELECT * FROM [Table1]").ToObjects();
Global.Browser.SelectKind("Items");
for(i = 0; i < 3; i++)
{
row = results[i];
Global.Browser.Selection.GetItem(i).innerText = row.Column1;
}
This is not remarkably useful code, but it does illustrate a point. It is also not robust at all, because it won’t work properly unless our database has exactly 3 rows (don’t count the last empty row in the data table editor) and our kind selects exactly 3 items. As you can see after pressing play, it changes the text of the items that our kind selects. And this is how easily you can hack a website with… nah, just kidding, you only changed the page in your local copy. You can also try changing row.Column1 to row.Column2 to see what happens.
So the code above is useless by itself, but if your kind selected an input box in a web page, you could use similar code to write, for instance, a search query into a search engine. Note that you would need to replace innerText with value, which represents the text written in an input box.
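Putting those two notes together, here is a slightly more defensive sketch: it loops only over as many items as both the table and the selection actually contain, and writes to value (assuming, hypothetically, that the “Items” kind selected input boxes):
results = Global.DataBase.Query("SELECT * FROM [Table1]").ToObjects();
Global.Browser.SelectKind("Items");
// Don't assume there are exactly 3 rows and 3 selected items.
var count = Math.min(results.length, Global.Browser.Selection.Count);
for (var i = 0; i < count; i++)
{
    // value is the property to use for input boxes (innerText is for ordinary elements).
    Global.Browser.Selection.GetItem(i).value = results[i].Column1;
}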
Now let’s go ahead and do something not quite as useless as everything I just did. Create a new project and navigate to Google (by the way, make sure “Google instant” is off if it’s there). Select the input box (where you would type your search) and create a kind with it called “Input Box”. Then select the “Search” button and create another kind with it called “Search Button”. Type anything and press Search. Now select the input box at the top again and add it to the “Input Box” kind, then select the “Search” button and add it to the “Search Button” kind. Now go to your database and create a table called “Queries” with a single column called “Query”, and add a few rows with random values, but not so random, because these are going to be our search queries and we want Google to return results for them.
Now go to “Actions tree 1” and add an “Execute JS” action with this code:
function TreeData()
{
this.CurrentRow = -1;
this.Data = null;
}
Tree.UserData = new TreeData();
Tree.UserData.Data = Global.DataBase.Query("SELECT * FROM [Queries]").ToObjects();
The function TreeData() line and the code below it between braces define an object called TreeData that will store the current row being read in the CurrentRow property and all the rows in the Data property. In the Tree.UserData = new TreeData(); line we create an instance of this object and assign it to the UserData property of the Tree object. The Tree object is accessible from within the whole actions tree. On the last line we assign to the Data property an array of objects, each of which represents a row in our data table.
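If the constructor-function syntax looks unfamiliar, the same setup can be written with a plain object literal, which leaves Tree.UserData in exactly the same state:
// Equivalent to the code above: an object with a CurrentRow starting
// at -1 and a Data property holding all the rows from the table.
Tree.UserData = {
    CurrentRow: -1,
    Data: Global.DataBase.Query("SELECT * FROM [Queries]").ToObjects()
};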
Now create another “Execute JS” action below (not inside) the previous one. (It’s a good idea to name these “Execute JS” actions by typing in the Comment box so you know which is which. I will name the previous one “Init” and this last one “Read”.) Then paste this code:
Tree.UserData.CurrentRow++;
if(Tree.UserData.Data.length > Tree.UserData.CurrentRow) return true;
else return false;
If you were wondering why we set CurrentRow to -1 in the previous code, here is the answer. The first line of this code increases the value of CurrentRow by one, so the first value it will have on the subsequent lines will be 0. Then, if the value of CurrentRow is less than the number of rows, we return true; otherwise we return false. Returning true tells Helium Scraper to execute the child nodes (this is also explained in the documentation for the “Execute JavaScript” action). So let’s add these child nodes now. Select the last “Execute JS” action you created and add inside it a “Select Kind” action that selects the “Input Box” kind and requires at least 1 element. Underneath this action, add an “Execute JS” action (I’ll name it “Write”) with this code:
Global.Browser.Selection.GetItem(0).value = Tree.UserData.Data[Tree.UserData.CurrentRow].Query;
This will write the content of the current row to the currently selected element, which is the input box that the “Select Kind” action above just selected. Now under this action add a “Navigate” action with the “Search Button” kind. Check “Simulate click” and require at least 1 item. If you press play now, this will perform a search for each of the rows in our “Queries” data table.
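If you want the “Write” action to be a little more forgiving (say, in case the “Select Kind” requirement is ever relaxed), a hedged variant is to check that something is actually selected before writing:
// Only write if the "Select Kind" action above actually found the input box.
if (Global.Browser.Selection.Count > 0)
{
    Global.Browser.Selection.GetItem(0).value = Tree.UserData.Data[Tree.UserData.CurrentRow].Query;
}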
Now let’s create another kind called “Titles” that selects the titles in the results page so we have something to extract. The titles are the blue links at the top of each result that you would click to navigate to the page. Then go back to your actions tree and, under the “Navigate: Search Button” action, add an “Extract” action that extracts our “Titles”. In the “New Table” dialog, add another column called “Link” and set its kind to “Titles” and its property to “Link”. Set the requirements of both items to “At Least” 1.
Now, before we proceed to extract, add a “Wait” action right between the “Navigate” and the “Extract” actions (you can use the up and down arrows to move your actions up and down) and set it to 1000 ms. This is because Google uses AJAX to load its results, so we need to wait a little for them to load. You could use a smaller wait time, such as 500 ms, but I prefer to be safe. Now press play and let the magic begin!
Here is the final result. If you have any questions related to programming Helium Scraper, don’t hesitate to use our forums.