Separating elements outside tag blocks

What do you suggest? What should be improved? We are all ears.
vodkasushi
Posts: 11
Joined: Wed Apr 18, 2012 2:57 pm

Separating elements outside tag blocks

Post by vodkasushi » Tue May 01, 2012 12:58 pm

I've encountered a lot of sites with the following HTML structure:

<div id="1">
<div id="2">Title</div>
<br/>
Hola!
</div>

The issue I'm seeing is that the "Hola!" part can only be selected as part of the div with ID=1. This means that the div with ID=2 is also included, which is not desirable: div ID=2 might contain the title when I only want the description ("Hola!"). Is there a way around this, or will it be fixed soon?

webmaster
Site Admin
Posts: 501
Joined: Mon Dec 06, 2010 8:39 am

Re: Separating elements outside tag blocks

Post by webmaster » Fri May 04, 2012 3:42 am

I haven't tested this particular case, but you should be able to select the inner div by selecting the parent and then using the Select first child button in the selection panel.
Juan Soldi
The Helium Scraper Team

vodkasushi
Posts: 11
Joined: Wed Apr 18, 2012 2:57 pm

Re: Separating elements outside tag blocks

Post by vodkasushi » Fri May 04, 2012 3:05 pm

Doesn't work. I've made a test page with the following:

<html>
<body>
<div id="1">
<div id="2">Title</div>
<br/>
Hola!
</div>
</body>
</html>

Try clicking on "Hola!" - it will highlight everything, and the "Select First Child" option isn't available.

webmaster
Site Admin
Posts: 501
Joined: Mon Dec 06, 2010 8:39 am

Re: Separating elements outside tag blocks

Post by webmaster » Sat May 05, 2012 4:31 am

Hi,

I just created a page with your code above, and I was able to select every element. Clicking "Hola" will select div 1, and clicking "Title" will select div 2. No need to use the Select first child button or any of those.

Is the problem that selecting div 1 will also include div 2? There's really no way around this, since div 2 is a child of div 1. And selecting "Hola!" by itself is also not possible because it is not an HTML element. What you can do is select the whole thing and then use Text Gatherers (Project -> Text Gatherers) to split the text as needed.
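
To illustrate why "Hola!" is not selectable on its own, here is a minimal sketch in Python (not Helium Scraper code) that walks the test page with the standard library's html.parser. It shows that "Hola!" is a bare text node whose parent is div 1. The names DirectTextParser and direct_text are made up for this example.

```python
from html.parser import HTMLParser

# The test page from earlier in this thread.
PAGE = """<html>
<body>
<div id="1">
<div id="2">Title</div>
<br/>
Hola!
</div>
</body>
</html>"""

# Elements that never have children, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class DirectTextParser(HTMLParser):
    """Collects text nodes that are direct children of one element (by id)."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.stack = []   # id of each currently open element (None if no id)
        self.chunks = []  # non-empty text found directly under the target

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(dict(attrs).get("id"))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Text belongs to whichever element is innermost when it appears.
        text = data.strip()
        if text and self.stack and self.stack[-1] == self.target_id:
            self.chunks.append(text)


def direct_text(html, element_id):
    parser = DirectTextParser(element_id)
    parser.feed(html)
    parser.close()
    return parser.chunks


print(direct_text(PAGE, "1"))  # text owned directly by div 1
print(direct_text(PAGE, "2"))  # text owned directly by div 2
```

The text under div 1 has no tag or id of its own, so the only handle on it is its parent element, which is why a point-and-click selection ends up on div 1.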
Juan Soldi
The Helium Scraper Team

vodkasushi
Posts: 11
Joined: Wed Apr 18, 2012 2:57 pm

Re: Separating elements outside tag blocks

Post by vodkasushi » Tue May 08, 2012 9:09 am

Yes, my problem is as you described. That's unfortunate to hear. I guess it's probably too "hacky" to interpret it as an HTML element :)

webmaster
Site Admin
Posts: 501
Joined: Mon Dec 06, 2010 8:39 am

Re: Separating elements outside tag blocks

Post by webmaster » Wed May 09, 2012 5:57 am

Well, if you think about it, it wouldn't just be hacky but impractical. A text node doesn't have any attributes or properties other than its parent's properties, so there wouldn't be a way to distinguish one text node from another when they are children of the same HTML element. You'd end up having to identify them by line number or some other text separator, which is basically what text gatherers do.

For your particular example, you could use a text gatherer that extracts the second line, which would be the text "Hola!".
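
The idea of picking out the second line can be sketched in plain Python. This is only an illustration of the approach, not how text gatherers are implemented or configured in Helium Scraper. The function name nth_line is hypothetical, and the sample string assumes the selected text of div 1 comes out as "Title", a blank line, then "Hola!".

```python
def nth_line(text, n):
    """Return the n-th non-empty line of text (1-based), or None if absent."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if 1 <= n <= len(lines):
        return lines[n - 1]
    return None


# Assumed text of the whole div 1 selection (title, line break, description).
selected_text = "Title\n\nHola!"

print(nth_line(selected_text, 1))  # the title line
print(nth_line(selected_text, 2))  # the description line
```

Skipping blank lines keeps the line count stable if the page adds or removes extra <br/> tags between the title and the description.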
Juan Soldi
The Helium Scraper Team
