Scraping from LinkedIn
We’ve created a ready-made template for extracting people and company information from LinkedIn. A LinkedIn account is required for the extraction to work. Check with support to see how many profiles or companies you’re allowed to view per day; exceeding that limit could get your account banned.
To get started, download the template, place it in an empty folder, and open it with Helium Scraper. This project can extract both people and companies, and from each it can extract either the top-level information or the details. The top-level information is what’s directly available on search results pages, such as these:
The detail information is what’s available on specific people and company pages, such as this:
After loading the project file, open the Settings global to configure the project.
Top-level extraction settings
- pageTurnDelayAvg: The average delay, in seconds, between page turns when extracting from search results. The actual delay is a random number between N – (N/4) and N + (N/4), where N is the value you enter; for instance, a value of 40 seconds yields delays between 30 and 50.
- maximumPages: The maximum number of pages to visit when extracting from search results.
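The delay rule described above can be sketched as follows. This is a minimal illustration of the stated formula, not the template’s actual code; the uniform distribution is an assumption, since the documentation only says the delay is “a random number” in that range.

```python
import random

def next_delay(avg_seconds: float) -> float:
    """Pick a delay between avg - avg/4 and avg + avg/4.

    Mirrors the rule described for pageTurnDelayAvg (and
    profileVisitDelayAvg); the uniform distribution is assumed.
    """
    return random.uniform(avg_seconds - avg_seconds / 4,
                          avg_seconds + avg_seconds / 4)

# With an average of 40 seconds, every delay falls between 30 and 50.
print(next_delay(40))
```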
Details extraction settings
- maximumProfilesPerExtraction: The maximum number of profile/company pages to visit when running a details extraction.
- profileVisitDelayAvg: The average delay in seconds between profile/company page visits. The actual delay is calculated the same way as pageTurnDelayAvg.
After logging into LinkedIn, run a filtered search in the main browser, such as the one in the screenshot above, and run either the ProfileLinks or the CompanyLinks global, depending on the kind of search. The extraction runs in the main browser, so it’s best to avoid interacting with it while the extraction is in progress. Once the extraction completes, the corresponding ProfileLinks or CompanyLinks table will be populated. Both tables have a column called url, which is used in the next step. Alternatively, if you already have a list of profile or company URLs, you can paste them into this column and skip the top-level extraction.
To extract details, run the ProfileDetails or CompanyDetails global. These globals take the URLs from the corresponding links table and visit up to the number of pages specified by the maximumProfilesPerExtraction setting. The next time a details extraction runs, already-extracted profiles or companies won’t be visited again, as long as the details table is not cleared. This lets you run daily extractions in smaller chunks sized by the maximumProfilesPerExtraction setting. In general, the only globals you’ll need to deal with are the ones that don’t start with an @ symbol.
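The skip-already-extracted behavior amounts to a simple set difference plus a cap. The sketch below illustrates that logic under stated assumptions: the URLs and the `next_batch` helper are hypothetical, not part of the template.

```python
def next_batch(all_urls, already_extracted, maximum_profiles):
    """Return the next chunk of URLs to visit: everything in the
    links table that is not yet in the details table, capped at
    maximum_profiles (mirroring maximumProfilesPerExtraction)."""
    done = set(already_extracted)
    pending = [u for u in all_urls if u not in done]
    return pending[:maximum_profiles]

# Hypothetical example: three collected links, one already extracted.
links = ["https://example.com/p1",
         "https://example.com/p2",
         "https://example.com/p3"]
extracted = ["https://example.com/p1"]
print(next_batch(links, extracted, 2))
```

Running this daily with a fresh `extracted` list each time works through the full links table in chunks, which is the pattern the template uses to stay under a per-day viewing limit.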
Since the ProfileDetails table set contains several tables, you can right-click the table set and select Join Tables to see them all as one. Note that if you do this you’ll see many rows per profile. Alternatively, use the query at Data Flow → Queries → Profile Contact, which shows one row per profile, with contact details organized into separate columns.
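The difference between the two views can be demonstrated with a tiny in-memory database. The table and column names below are invented for illustration and are not the template’s actual schema; the point is only why a plain join repeats rows while a grouped query does not.

```python
import sqlite3

# Hypothetical schema: one profile row, several contact rows per profile.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Profile (id INTEGER PRIMARY KEY, name TEXT, url TEXT);
CREATE TABLE Contact (profile_id INTEGER, kind TEXT, value TEXT);
INSERT INTO Profile VALUES (1, 'Ada', 'https://example.com/ada');
INSERT INTO Contact VALUES (1, 'email', 'ada@example.com'),
                           (1, 'phone', '555-0100');
""")

# A plain join, like Join Tables, yields one row per contact detail,
# so a profile with two contact details appears twice:
print(con.execute(
    "SELECT name, kind, value FROM Profile JOIN Contact ON id = profile_id"
).fetchall())

# Grouping collapses the result to one row per profile, with the
# contact details combined, similar in spirit to the Profile Contact query:
print(con.execute(
    "SELECT name, group_concat(kind || ': ' || value, '; ') "
    "FROM Profile JOIN Contact ON id = profile_id GROUP BY id"
).fetchall())
```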
You’ll likely want to use proxies when extracting from LinkedIn, and you’ll need to make sure they work with LinkedIn. Luminati offers a type of proxy called gIP that can be specifically configured to work with LinkedIn. You can then follow these instructions to set them up in Helium Scraper.