Emails scraping

Post by **webmaster** » Sun Feb 13, 2011 8:05 am

Several people have asked whether it is possible to extract emails with Helium Scraper. It is perfectly possible, as long as the user specifies where to extract them from. The attached file contains a Kind named Emails. If you click on the "Select kind in browser" button, it will select all the emails in the current page. You can use this Kind to extract emails from any page. The file also contains a sample actions tree called Extract emails from current page. All it does is extract all emails from the current page to the Results table whenever you press the Play button. Of course, in a real project you would need to configure Helium Scraper to navigate through a set of pages and then extract the emails from them (if you don't know how to do this, see this tutorial).

The attached file also serves as an example of how to use the JavaScript Gatherers in Helium Scraper. If you open the file and click on "Project -> JavaScript Gatherers..." you will see a Gatherer called JS_IsEmail. This is the function that returns true if the element is an email or false otherwise.
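For readers who have not opened the project yet, here is a rough sketch of what a check like JS_IsEmail might look like. It is not copied from Emails.hsp: the function name `isEmailElement`, the pattern, and the assumption that the element exposes `innerText` and `href` are all mine, and the real gatherer may be written differently.

```javascript
// Hypothetical stand-in for a gatherer such as JS_IsEmail.
// Deliberately simple pattern: something@domain.tld with no whitespace.
var EMAIL_PATTERN = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

// Returns true if the element looks like an email, false otherwise.
// `element` is assumed to expose `innerText` and, for links, `href`.
function isEmailElement(element) {
    // A mailto: link counts as an email regardless of its visible text.
    var href = element.href || "";
    if (href.toLowerCase().indexOf("mailto:") === 0) {
        return true;
    }
    // Otherwise the whole visible text must be a single address.
    var text = (element.innerText || "").trim();
    return EMAIL_PATTERN.test(text);
}
```

The pattern is intentionally loose. It will accept some strings that are not valid addresses, which is usually acceptable for scraping but not for validation.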

Emails.hsp: (323.7 KiB) Downloaded 7364 times

Post by **briggyd** » Fri Mar 11, 2011 9:05 am

Hello Helium Scraper,

I have a task that I need to accomplish for my company and I think that your scraper can help us. I think others will benefit from this as well, so I am posting it in the forum.

Task: Extract Email Addresses from conference lists.

Our company needs to find email addresses off of websites from companies attending exhibitions that we market at. I need Helium Scraper to do the following:

1. Use an Excel file with a list of names of exhibitors
2. Google the names of the exhibitors to find their respective websites (taking the first page of Google results, or even just the first 5 links, is normally sufficient)
3. Extract the URL of the website
4. Run an e-mail scraper on the respective website to extract their email address.
5. Display the email addresses and company names in one location.

This is the basis of what I need to happen. I know Helium Scraper can help us, but I cannot for the life of me figure out how to make it work.

Please help!

Thanks,

Briggyd

Post by **webmaster** » Fri Mar 11, 2011 8:46 pm

This should get you started. It extracts the URLs from the first page of Google results for a list of search terms. Check the "READ ME" JavaScript action in the project.

Now, the email extraction will be difficult if there is no common structure in the web pages to extract from. In that case, the best I can think of is to program Helium Scraper to first search for emails on the first page, then look for a link with the text "Contact Us", click on it, and search for email addresses on the landing page. Of course, this is not the most reliable thing to do.

Another option would be to find those same companies on some structured web site such as Yellow Pages. This would be a lot more accurate than the previous option. What do you think?
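To make the fallback idea concrete, here is a minimal sketch in plain JavaScript. It is not a Helium Scraper project or API: the function names, the `{ text, href }` shape of the links, and the `fetchText` callback that returns the text of a page are assumptions made for illustration only.

```javascript
// Loose pattern for finding addresses inside free text (illustrative only).
var EMAIL_REGEX = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Returns the unique email addresses found in `text`, in order of appearance.
function findEmails(text) {
    var matches = text.match(EMAIL_REGEX) || [];
    var seen = new Set();
    return matches.filter(function (address) {
        var key = address.toLowerCase();
        if (seen.has(key)) {
            return false;
        }
        seen.add(key);
        return true;
    });
}

// Returns the href of the first link whose text looks like "Contact Us",
// or null if there is none. `links` is an array of { text, href } objects.
function findContactLink(links) {
    for (var i = 0; i < links.length; i++) {
        if (/contact\s*us/i.test(links[i].text)) {
            return links[i].href;
        }
    }
    return null;
}

// Tries the first page, then falls back to the "Contact Us" page.
// `fetchText` is a hypothetical callback: href -> text of that page.
function collectEmails(firstPageText, links, fetchText) {
    var emails = findEmails(firstPageText);
    if (emails.length > 0) {
        return emails;
    }
    var contactHref = findContactLink(links);
    return contactHref ? findEmails(fetchText(contactHref)) : [];
}
```

As noted, this is only as reliable as the sites are consistent: a site that labels the page "Get in touch", or shows the address as an image, will come back empty.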

Post by **briggyd** » Tue Mar 15, 2011 8:50 am

Dear Juan,

Thank you for the help with the program.

The option of scraping Yellow Pages is not a bad idea. Unfortunately, it won't apply in our case, since our market is worldwide. However, we have an e-mail scraping program that we can use once we get the URLs.

I modified the program you sent a bit. The JavaScript code is where I get stuck, but I feel like I am really close...

I am able to load the information into the table and it now searches in Google. However, it's having some difficulty retrieving the results into the URL table. I might be doing something wrong, but would you mind taking a look at the code I have?

Thanks Juan!

Post by **webmaster** » Wed Mar 16, 2011 3:30 am

I just tested it and I'm getting the URLs just fine. What exactly is the problem? Is it not extracting the URLs? If that's the problem, try this:

Go to the Kinds panel, expand the "urls" Kind, and click on the "Select kind in browser" button. If it doesn't select anything, then that is the problem. To fix it, just activate "Selection Mode" and select one URL. Then click on the "Add selection to this kind" button under the "urls" Kind. Now test it again by clicking on "Select kind in browser".

To be safe, you can also change the "Extract to table: 'Urls'" action by setting the first row's "Req. Mode" to "At Least" and "Req. Amount" to 5. This will ensure that URLs are being selected and will show a message if they are not.

Post by **briggyd** » Wed Mar 16, 2011 10:18 am

Hi Juan,

I hope you are well.

Although I am now getting the URLs to extract, I'm getting the wrong URLs. May I troubleshoot with you?

Kinds:
I checked all kinds and they are working properly. Thank you for the advice.

Actually, the search was not working because I had the 'instant' search on. Once I turned this off, the search started working fine.

HOWEVER, a new problem is occurring:

The links retrieved are not the links of the websites, but the Google link for the search itself. For instance, when I type IDRR into the Google search, I get 10 links, all exactly the same:

http://www.google.com/search?hl=en&biw= ... =&aql=&oq=

The group of links for the next search term in the list is ten copies of the next Google search link. For instance, once Infodent is searched for, I receive 10 links of:

http://www.google.com/search?hl=en&biw= ... =&aql=&oq=

Possible reason?

Kinds: URL
When a Google results page appears, each result has the blue link (header) and the green link underneath. I've created Kinds for both links and I'm getting the same extracted data (the 10 Google links of the search result).

When you manually click on a site in the Google results, the link takes a second to change from the Google search link to the real link of the website. Could this wait time be the problem?

Juan thank you again for your help. Your product will be very important to our company.

Post by **webmaster** » Wed Mar 16, 2011 10:06 pm

Hi,

It seems to me that you are setting the "Property" column of the "Extract" action to "Url". This will extract the URL of the current page.

In order to extract the text of the element, it needs to be set to "InnerText". If you want to extract the URL that a link (such as the blue links in the Google results you mentioned) points to, you would need to set it to "Link".
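If it helps, the difference between the three settings can be pictured with a small plain-JavaScript model. This only illustrates the behavior described above and is not how Helium Scraper is implemented; the `page` and `link` objects, the example.com URLs, and the `extractProperty` function are made up.

```javascript
// Hypothetical model of a results page and one selected link on it.
var page = { url: "http://search.example.com/?q=IDRR" };
var link = { innerText: "Example Company", href: "http://www.example.com/" };

// Returns what each "Property" setting would yield for a selected element.
function extractProperty(property, page, element) {
    switch (property) {
        case "Url":
            return page.url;           // URL of the current page, same for every element
        case "InnerText":
            return element.innerText;  // visible text of the element
        case "Link":
            return element.href;       // address the link points to
        default:
            throw new Error("Unknown property: " + property);
    }
}
```

With ten result links selected, "Url" gives the same search page address ten times, which matches the symptom reported above, while "Link" gives one target address per result.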

I understand that this can be confusing. We are updating the documentation to make it clearer.

If this was not the problem, please let me know and, if possible, send me your project with the wrong URLs.

Post by **briggyd** » Thu Mar 17, 2011 8:51 am

Juan,

You're a genius my man. The program works like a charm. Many thanks. That was the problem that I was having.

Briggs

Post by **luisantafe** » Tue May 03, 2011 9:22 pm

Hi Juan, I just downloaded the trial version and so far it is fantastic. I am scraping the Yellow Pages for my country and my problem is the email field. It says "contact us", and I can only get the real email by looking in the source of the page. Can I do that easily with Helium Scraper? Thanks

Post by **webmaster** » Wed May 04, 2011 2:52 am

Hi,

With Helium Scraper you have full access to the source code of the web page. There are a few built-in gatherers that might fit your needs. See "Kinds -> Common Properties" in the documentation (the one you need is probably the "Link" gatherer). Also, see "Kinds -> JavaScript Gatherers" in case you need to create your own property gatherers.

These property gatherers can then be selected when creating an "Extract" action under the "Property" column.
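As an example of the kind of logic a custom gatherer could contain, here is a sketch that assumes the "contact us" text is a link whose target is a `mailto:` address in the page source. That is only a guess about your site, and `emailFromHref` is a made-up helper name, not a built-in gatherer.

```javascript
// Extracts the address from a mailto: link target, or returns null if the
// target is not a mailto: link. Query parts such as ?subject=... are dropped.
function emailFromHref(href) {
    if (typeof href !== "string") {
        return null;
    }
    var match = /^mailto:([^?]+)/i.exec(href.trim());
    if (!match) {
        return null;
    }
    try {
        // Addresses may be percent-encoded, e.g. info%40example.com.
        return decodeURIComponent(match[1]).trim();
    } catch (e) {
        // Malformed encoding: fall back to the raw text.
        return match[1].trim();
    }
}
```

If the link target is an ordinary page rather than a `mailto:` address, this returns null and the email would have to be extracted from that page instead.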

Let me know if you need any more help.
