Selection Mode Doesnt Work Sometime

Let us know if anything goes wrong with our baby :)
cfardon
Posts: 9
Joined: Fri Mar 04, 2011 9:01 am


Post by cfardon » Fri Mar 04, 2011 12:47 pm

Hi, I purchased your Helium Scraper and first let me congratulate you; most of it is pretty easy to use.

I have 3 problems:

1. Sometimes the "Selection Mode" button doesn't work. I click on it and it doesn't get the blue square around it. I have to close the program and open it again for it to work. I am running Windows 7 Home Premium 64-bit with IE7 and 6 GB of RAM. Any ideas?

2. When I am using the next buttons (they are AJAX) I select "add to selection of this kind" and after 2 or 3 times it works. But then I save the project and close it down, and each time I open the project again it doesn't remember the next button, and I have to go through a few of them and add to selection of this kind again. How can I fix this?

3. The help files are not explanatory enough. I got the gist of the "kinds" easily enough, but when it comes to the actions it is a bit more daunting.
Under the properties tab on add new table there are a lot of choices such as link, outer html, parent index... but no explanation in the help files as to what each of these is used for.
Also, there are not enough instructions for when you want to follow links deep, i.e. I have a page of links and I want the program to go into each of the links, take some information and pictures, then go into the second link and do the same, and when it gets to the bottom of the page go to the next one. I have sort of worked it out but I don't think properly.

I can't see written anywhere what the structure and the child nodes etc. mean. Do you have a list of what the choices on the new table screen are? I saw that I need to have SrcAttribute clicked to download the pictures, but have no clue about the others.

The other is, what do I do when I want to download a picture and also the caption underneath it?

The site I am trying to scrape is www dot homeaway dot com. I then click on "Europe" on the map and then choose "France Rentals" from the list. From there I want to scrape data from each of the properties when you click on the main link, for all the properties, which is 58,014. Is there a maximum size for the data it can collect? Should I put the pictures, which are sometimes 10 for each property, into a separate database? And if I do this, how do I link them to the property details in the 1st database?

I await your reply and keep up the excellent work.
Chris Fardon
Fardon Webhosting and Design
http://www.fardonwebhosting.com

webmaster
Site Admin
Posts: 521
Joined: Mon Dec 06, 2010 8:39 am

Re: Selection Mode Doesnt Work Sometime

Post by webmaster » Fri Mar 04, 2011 10:26 pm

Hi,

Let me go through your problems one by one.
1. Sometimes the "Selection Mode" button doesn't work. I click on it and it doesn't get the blue square around it. I have to close the program and open it again for it to work. I am running Windows 7 Home Premium 64-bit with IE7 and 6 GB of RAM. Any ideas?
The "Selection Mode" button is disabled while a page is loading. To enable it, you can press the "Stop" button at the top right.
2. When I am using the next buttons (they are AJAX) I select "add to selection of this kind" and after 2 or 3 times it works. But then I save the project and close it down, and each time I open the project again it doesn't remember the next button, and I have to go through a few of them and add to selection of this kind again. How can I fix this?
Some of the properties of that "Next" button can change after you navigate through it. Just keep adding it to the kind and you will see that the list of properties gets shorter. Remember to save your project after doing this. I just tested it on the page you are trying to scrape and it remembered it after about 4 pages. I also tried saving and reopening the project and it still remembered it. Every time you add the selection to a kind, whatever property was different gets removed from the list of properties that define the kind (this is the list on the left panel).
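To illustrate the idea only (this is a rough Python sketch, not Helium Scraper's actual code), you can think of a kind as the set of properties that all the added elements have in common. The property names and values below are made up for the example:

```python
def add_to_kind(kind_properties, element_properties):
    """Keep only the properties whose values match the newly added element."""
    return {
        name: value
        for name, value in kind_properties.items()
        if element_properties.get(name) == value
    }


# Hypothetical property values for a "Next" button as seen on three pages.
page1 = {"TagName": "A", "InnerText": "Next", "Href": "/results?page=2", "ParentIndex": 4}
page2 = {"TagName": "A", "InnerText": "Next", "Href": "/results?page=3", "ParentIndex": 4}
page3 = {"TagName": "A", "InnerText": "Next", "Href": "/results?page=4", "ParentIndex": 5}

kind = dict(page1)  # the first selection defines the kind with every property
for element in (page2, page3):
    kind = add_to_kind(kind, element)
    print(sorted(kind))  # the list of defining properties shrinks as you add
```

Once only the stable properties are left, the kind matches the button on every page, which is why adding it a few times is enough.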
3. The help files are not explanatory enough. I got the gist of the "kinds" easily enough, but when it comes to the actions it is a bit more daunting.
Under the properties tab on add new table there are a lot of choices such as link, outer html, parent index... but no explanation in the help files as to what each of these is used for.
Most of these properties are used internally by Helium Scraper to define Kinds and you will never need to worry about them. Some of them, though, can be useful when extracting information, such as "InnerText" and "SrcAttribute". Normally, if the name doesn't make any sense to you, it is because you won't need it. But if you are in doubt, you can always test them and see what they do. To do this, click the "Choose visible properties" button in the "Selection" panel at the bottom of the screen. Then deselect all by clicking twice on "Select all" and select the one you want to test. Now select some elements on the web page and you will see the selected property for each of the selected elements.

I should probably highlight a couple of properties here:
  • The "SingleLineInnerText" property represents the InnerText of an element, but without line breaks. It is useful when extracting addresses that span 2 or 3 lines but that you want as a single line.
  • The "Url" property is the current web address in which an element is located. This means the "Url" will be the same for every single element in a particular page.
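As a rough illustration of what a single-line version of a text means, here is a small Python sketch. It is only an approximation: I am assuming that line breaks are replaced by a single space, and the `single_line` helper and the sample address are invented for the example.

```python
def single_line(inner_text):
    """Collapse line breaks (and any other runs of whitespace) into single spaces."""
    # str.split() with no arguments splits on any whitespace, including \n and \r\n.
    return " ".join(inner_text.split())


# Hypothetical address that spans three lines on the page.
address = "12 Rue de la Paix\n75002 Paris\r\nFrance"
print(single_line(address))  # 12 Rue de la Paix 75002 Paris France
```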
Also, there are not enough instructions for when you want to follow links deep, i.e. I have a page of links and I want the program to go into each of the links, take some information and pictures, then go into the second link and do the same, and when it gets to the bottom of the page go to the next one. I have sort of worked it out but I don't think properly.
To extract information that is "inside" links, all you need to do is use a "Navigate each" action. This will navigate through each of the links and perform its child actions on each of the landing pages (normally an "Extract" action, or another "Navigate each" action if you want to go deeper).
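If it helps to see the flow written out, the following Python sketch mimics the structure of such an actions tree on a made-up, in-memory site. The `SITE` dictionary and the function names are assumptions for the example and have nothing to do with how Helium Scraper is implemented:

```python
# Hypothetical in-memory "site": each URL maps to what would be found on that page.
SITE = {
    "/list?page=1": {"links": ["/property/1", "/property/2"], "next": "/list?page=2"},
    "/list?page=2": {"links": ["/property/3"], "next": None},
    "/property/1": {"title": "Cottage", "pictures": ["a.jpg", "b.jpg"]},
    "/property/2": {"title": "Villa", "pictures": ["c.jpg"]},
    "/property/3": {"title": "Studio", "pictures": []},
}


def extract(url):
    """Child action: pull the wanted fields out of one landing page."""
    page = SITE[url]
    return {"url": url, "title": page["title"], "pictures": page["pictures"]}


def navigate_each(links, child_action):
    """Visit every link in turn and run the child action on each landing page."""
    return [child_action(link) for link in links]


def scrape(start_url):
    rows = []
    url = start_url
    while url is not None:  # keep going until there is no "next" page
        page = SITE[url]
        rows.extend(navigate_each(page["links"], extract))
        url = page["next"]  # stand-in for clicking the "next" button
    return rows


rows = scrape("/list?page=1")
for row in rows:
    print(row)
```

Going one level deeper would simply mean passing another function that itself calls `navigate_each` as the child action.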
I can't see written anywhere what the structure and the child nodes etc. mean.
The meaning of the child nodes depends on the action that contains them. Most actions cannot contain child nodes, but the ones that can have an explanation in the documentation of what they do with them (see, for example, the description of "Navigate Each" under the "Actions" node in the documentation).
The other is, what do I do when I want to download a picture and also the caption underneath it?
To extract the picture and the caption underneath, just create one Kind for the captions and another Kind for the pictures, and select both when creating your "Extract" action. In most cases (as on homeaway.com), each caption/picture pair is contained under the same HTML element, which lets Helium Scraper know that they must be extracted to the same row.
Is there a maximum size for the data it can collect? Should I put the pictures, which are sometimes 10 for each property, into a separate database? And if I do this, how do I link them to the property details in the 1st database?
No need to worry about the size of the data it can collect unless you want to scrape millions of rows. Helium Scraper uses the Jet database engine, which supports databases of up to 2 GB. The pictures themselves are not stored in the database, but as separate files in the folder indicated at Project -> Options... -> Downloads Folder. See the "Download" item under "Actions -> Extract" in the documentation.

I'm attaching a file that does most of what you need. Here are some points to note about this file:
  • If you go to "Project -> Options..." you will see that I set the timeout to 1 minute. Some pages in homeaway.com take way too long to complete loading (which by the way is what causes the "Selection Mode" button to stay disabled). This timeout will cancel the navigation and continue with the execution after a minute.
  • If you double click on the "Extract" action in the "Actions tree 1", you will see that the "title" column is marked as "Page Id". This will cause the title to be repeated on every row that is extracted from a particular page. This way you can tell which page every picture has been extracted from. Of course, you can use any other element that is unique to each page as a Page Id, for example the Url. To do this, you can choose the "Url" property for the title element, or for any other element that you know will be present once per page.
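To show how a repeated Page Id lets you link pictures back to their property, here is a small Python sketch on invented data. The column names and rows are assumptions for the example; the real table will have whatever columns you defined in your "Extract" action:

```python
from collections import defaultdict

# Hypothetical extracted rows: the "title" column (the Page Id) is repeated
# on every row that came from the same property page.
rows = [
    {"title": "Cottage in Provence", "picture": "img_001.jpg", "caption": "Garden"},
    {"title": "Cottage in Provence", "picture": "img_002.jpg", "caption": "Kitchen"},
    {"title": "Villa near Nice", "picture": "img_003.jpg", "caption": "Pool"},
]

# Group the pictures by Page Id so each property gets its own list of pictures.
pictures_by_page = defaultdict(list)
for row in rows:
    pictures_by_page[row["title"]].append((row["picture"], row["caption"]))

for title, pictures in pictures_by_page.items():
    print(title, pictures)
```

The same idea applies if you keep the pictures in a separate table: the Page Id column is what you would join on.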
Attachments
France.hsp
(690.03 KiB) Downloaded 711 times
Juan Soldi
The Helium Scraper Team

cfardon
Posts: 9
Joined: Fri Mar 04, 2011 9:01 am

Re: Selection Mode Doesnt Work Sometime

Post by cfardon » Sat Mar 05, 2011 9:25 am

Hi,

Thank you very much for your speedy and explanatory reply. Everything works great, and now that I understand it a bit better it makes the scraping so much easier.

I have tried a few scrapers, and yours is up there with the best of them (maybe even the best and easiest to use), and definitely far cheaper than the rest of them. I have a blog on my web page, and I will definitely give you an awesome write-up to help you out with SEO etc.

Good luck with your project.
Chris Fardon
Fardon Webhosting and Design
http://www.fardonwebhosting.com

cfardon
Posts: 9
Joined: Fri Mar 04, 2011 9:01 am

Re: Selection Mode Doesnt Work Sometime

Post by cfardon » Sat Mar 05, 2011 9:44 am

Sorry, one other question... if everything stops working, or you want to interrupt the scraping and continue where you left off another time, how do you sort of pause it, or make it remember where it was up to in the scrape?
Chris Fardon
Fardon Webhosting and Design
http://www.fardonwebhosting.com

webmaster
Site Admin
Posts: 521
Joined: Mon Dec 06, 2010 8:39 am

Re: Selection Mode Doesnt Work Sometime

Post by webmaster » Sat Mar 05, 2011 7:08 pm

The best thing you can do there is wait for Helium Scraper to finish a cycle, stop it right there and save. It will remember the current page when you save it.

For example, on the file I sent you, I would stop it right after executing the "Navigate: next" action and save it right there. This way, the next time you start the project, it will be on a new page with a new list of links that "Navigate Each" will start navigating through.
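The general idea of stopping only at the end of a cycle can be sketched in Python as below. This is just an illustration of the principle with a made-up checkpoint file and page counter, not a description of how Helium Scraper stores its state:

```python
import json
import os
import tempfile


def save_checkpoint(path, next_page):
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"next_page": next_page}, f)


def load_checkpoint(path, default_page):
    if not os.path.exists(path):
        return default_page
    with open(path, encoding="utf-8") as f:
        return json.load(f)["next_page"]


def run(path, last_page, stop_after):
    """Process up to `stop_after` list pages, saving after each full cycle."""
    page = load_checkpoint(path, 1)
    processed = []
    while page <= last_page and len(processed) < stop_after:
        processed.append(page)       # stand-in for "Navigate each" + "Extract"
        page += 1                    # stand-in for "Navigate: next"
        save_checkpoint(path, page)  # saved only once the cycle is complete
    return processed


checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
first_run = run(checkpoint, last_page=5, stop_after=2)    # interrupted early
second_run = run(checkpoint, last_page=5, stop_after=10)  # resumes where it stopped
print(first_run, second_run)
```

Because the position is saved right after moving to the next page, a later run starts on a fresh list of links and no page is processed twice.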
Juan Soldi
The Helium Scraper Team
