Extract to a pre-existing/imported database table

Questions and answers about anything related to Helium Scraper
Posts: 3
Joined: Fri Dec 09, 2011 8:48 pm

Extract to a pre-existing/imported database table

Post by consumerazzi » Fri Dec 09, 2011 9:07 pm


I'm demoing Helium Scraper, and the learning process has been somewhat painful. After reading through some of the help documentation and fiddling with the program, I've come so close to, yet still so far from, doing what I want to do.

After realizing I can't import a .txt file of URLs, I finally made an .mdb (Access) table with an id column, the URLs to navigate to, and an empty column for the piece of information I need to extract at each URL.

I've imported the table, made the Kind, and now I'm working on the Actions. I successfully added a "Navigate URLs" action, which points to the URL column of the imported table. I press play. It reads the table fine. I add the "Select Kind". Again, no problem.

Now here's where I get stuck, wanting to throw things and pull out my hair. I want to add the "Extract" action. (I'm extracting a website address listed in the contact section of each page.)

Add Child > Extract
And then I checkmark the associated Kind.
Then it takes me to a New Table dialog.

But I don't WANT to make a new table. I want to extract the information to the associated column in the imported table, the very same one I successfully linked to with the "Navigate URLs" action. I've tried everything that's obvious, but there's seemingly no option to extract to the imported table. I even tried renaming the New Table to the name of the imported table, but that only writes over the imported table.

I fear that I'm missing some obvious point here, but can you please help me? Why would I be able to easily link a Navigate URLs action to an imported table but not have the option to extract to it? HELP!


Re: Extract to a pre-existing/imported database table

Post by consumerazzi » Fri Dec 09, 2011 9:25 pm

I realize this may be a read-write issue that would slow down the system. That's ok if it is, but now I have a new, only semi-related question.

How do I tell the program that if there's a null/non-existent result, it should write "null" or leave the cell blank? Why is this important? I need a one-to-one match, which is why I was hoping the system would write the results to the imported file. Here's my current example.


1 http://www.pricegrabber.com/info_retailer.php/r=0/
2 http://www.pricegrabber.com/info_retailer.php/r=1/
3 http://www.pricegrabber.com/info_retailer.php/r=2/
4 http://www.pricegrabber.com/info_retailer.php/r=3/
5 http://www.pricegrabber.com/info_retailer.php/r=4/
6 http://www.pricegrabber.com/info_retailer.php/r=5/

Extracted results:

1 http://www.Buy.com
2 http://www.macmall.com
3 http://www.pcmall.com

What I TRULY need:

1 null (or blank, something to indicate no success)
2 null (or blank, something to indicate no success)
3 http://www.Buy.com
4 null (or blank, something to indicate no success)
5 http://www.macmall.com
6 http://www.pcmall.com

Thanks so much for any assistance.


Re: Extract to a pre-existing/imported database table

Post by consumerazzi » Fri Dec 09, 2011 9:51 pm

Ok, I feel silly. Sorry if you think this is spammy. I sorta' figured out my problem. (The learning curve on this is a bit steep!)

For the benefit of anyone else who has this issue in the future, I added the action "Extract", selected the Kind, and then in the New Table dialog I had to select a "Property" from the drop-down menu for one of the table columns. I selected the ID column for the table I imported, and I also added a third column with the Property "URL", which pastes the URL of the visited site. This isn't the perfect solution, though. (I'd prefer to see a "null" or empty column for failed extracts, creating a true one-to-one and easily telling me which URLs failed.) As is, with extra work I can eventually deduce which URLs failed.

So while I sorta' got around the problem...

1. I assume I can't write to the same imported table due to read-write issues. Is this correct?
2. Is there a way to get true one-to-one extraction results, utilizing a null or blank cell? This would be glorious.

Sorry if this is spammy. The program has a lot of potential, but I need to know if it can do what I truly need.

Site Admin
Posts: 501
Joined: Mon Dec 06, 2010 8:39 am

Re: Extract to a pre-existing/imported database table

Post by webmaster » Sat Dec 10, 2011 3:38 am


The only thing you cannot do is prevent an Extract action from recreating the output table after you edit it. You could use hacky workarounds, such as pasting your URLs in after creating your Extract action, but I wouldn't recommend it, and it wouldn't be as easy as it sounds anyway. I'd just keep the input and output tables separate.

I'm not 100% sure about this, but it seems like you are extracting the URL property of a kind that is present only on some pages. If so, when the kind is not found, the URL won't be extracted either. But if you extract the URL from the BODY kind instead, the URL will always be extracted, since every page has a BODY.

This should work for most cases, but if there happen to be redirects on these pages, the extracted URL won't be the same as the original one. To prevent this, you can do what I describe below, which is also a more elegant approach.

First, make sure the ID column in your imported table is a PRIMARY KEY. You can create a table with a primary key from the database editor's New Table button. Then set the first field's data type to Integer and the Primary option to Yes. If you want these IDs to be auto-generated, select the Custom data type, type AUTOINCREMENT in the Custom DT column, and again set the Primary option to Yes. If you then paste all your URLs into the URL column of this table and press Save, the IDs will be auto-generated.
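The auto-generated-ID idea can be sketched in SQLite (Helium Scraper's internal database dialect may differ, and the table and column names here are invented for the example):

```python
import sqlite3

# Hypothetical illustration only: a table whose Id column fills in
# automatically, so you can paste just the URLs and let the IDs appear.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ImportedTable ("
    "  Id INTEGER PRIMARY KEY AUTOINCREMENT,"
    "  Url TEXT"
    ")"
)

# Insert only URLs; the Id column is generated on save.
urls = [("http://www.pricegrabber.com/info_retailer.php/r=%d/" % i,)
        for i in range(6)]
conn.executemany("INSERT INTO ImportedTable (Url) VALUES (?)", urls)

ids = [row[0] for row in conn.execute("SELECT Id FROM ImportedTable ORDER BY Id")]
print(ids)  # IDs 1 through 6, generated without being typed in
```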

Then, in your Navigate URLs action, check the ID Column option and select your ID column, along with your URL column. Finally, in your Extract action, extract the ID_TableName property (where TableName is the name of your URLs table) of the BODY kind, together with the other element you need from the page. Whenever your element is not found on a page, only the ID will be extracted, with a NULL value in the other field. You can then easily join the two tables with SQL and view them as if they were a single table. In fact, you can save the result of this query as a new table just by clicking the Save into New Table button in the SQL editor.

The query would look something like this:

Code:

	SELECT *
	FROM [ImportedTable]
	INNER JOIN [OutputTable]
	ON [ImportedTable].[Id] = [OutputTable].[IdOfImportedTable]
You can then save this query, and it will behave pretty much like any other table. Let me know if you need more help with the SQL part or if you have any other questions.
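A minimal sketch of that join, again using SQLite in place of Helium Scraper's own database (table names, column names, and the sample rows are all invented), shows how the always-extracted ID plus a NULL field gives the one-to-one result asked for above:

```python
import sqlite3

# Hypothetical demonstration: the Extract action writes a row for every page
# (the BODY kind always matches), with NULL where the target element was
# not found, so the join yields one row per input URL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ImportedTable (Id INTEGER PRIMARY KEY, Url TEXT);
CREATE TABLE OutputTable (IdOfImportedTable INTEGER, Extracted TEXT);
INSERT INTO ImportedTable VALUES
  (1, 'http://www.pricegrabber.com/info_retailer.php/r=0/'),
  (2, 'http://www.pricegrabber.com/info_retailer.php/r=1/'),
  (3, 'http://www.pricegrabber.com/info_retailer.php/r=2/');
-- Rows for pages where the element was missing carry NULL.
INSERT INTO OutputTable VALUES (1, NULL), (2, NULL), (3, 'http://www.Buy.com');
""")

rows = conn.execute("""
    SELECT ImportedTable.Id, OutputTable.Extracted
    FROM ImportedTable
    INNER JOIN OutputTable
      ON ImportedTable.Id = OutputTable.IdOfImportedTable
    ORDER BY ImportedTable.Id
""").fetchall()
print(rows)  # one row per URL; None marks the failed extracts
```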
Juan Soldi
The Helium Scraper Team
