Text In Between Help

Posted: Mon Feb 11, 2013 5:26 am
by acecorp
I tried to use the TextInBetween premade but it doesn't seem to work the way I need it to.

There is a single field on each record that appears on all the pages I am extracting data from.

That field contains the year, make, model, and some extra text, as shown below, but it's a single field with all of this text in it.

Text Field 1
2009 Buick Enclave AWD 4dr CXL SUV
2003 Buick Rendezvous CXL AWD SUV
2010 Acura MDX AWD 4dr Technology Pkg SUV
2008 Acura MDX 4WD 4dr Tech/Pwr Tail Gate SUV

I need to split the text so that, instead of all sitting in a single field after extraction, it lands in separate fields/columns and looks like this:

Year | Make | Model | Extra
2009 | Buick | Enclave | AWD 4dr CXL SUV
2003 | Buick | Rendezvous | CXL AWD SUV
2010 | Acura | MDX | AWD 4dr Technology Pkg SUV
2008 | Acura | MDX | 4WD 4dr Tech/Pwr Tail Gate SUV

How exactly can I do this with the Helium Scraper software?

I have uploaded my project file in case anyone can look at the details and make specific suggestions on how to achieve my goal.

Re: Text In Between Help

Posted: Mon Feb 11, 2013 5:35 pm
by webmaster
Hi

You can use regular expressions with Text Gatherers (Project -> Text Gatherers) to do things like extract only the first word, or everything after the third word. The attached project has samples of how you'd do this; you can see them at Project -> Text Gatherers. To extract them, create an Extract action that extracts four times from whatever kind selects the full text you want to split, using a different property each time: JS_Extra, JS_Model, JS_Make and JS_Year.
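As a rough illustration of the regex approach described above, here is a plain JavaScript sketch. It is not the contents of the attached project and does not use any Helium Scraper API; the function name splitVehicleTitle is made up for this example. It assumes the year is four digits and that the make and the model are each exactly one word.

```javascript
// Hypothetical helper, not part of Helium Scraper.
// Assumes the layout "<4-digit year> <make> <model> <extra...>",
// with make and model each being a single word.
function splitVehicleTitle(text) {
  const match = /^\s*(\d{4})\s+(\S+)\s+(\S+)(?:\s+(.*\S))?\s*$/.exec(text);
  if (!match) {
    return null; // the text does not follow the expected layout
  }
  return {
    year: match[1],
    make: match[2],
    model: match[3],
    extra: match[4] || "", // empty when nothing follows the model
  };
}
```

Each of the four capture groups corresponds to one of the four columns, which is roughly what four separate gatherers would each pick out on their own.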

Now, it seems to me that what you require is more complex. If the model has more than one word, this won't work. In situations like this you usually want to ask yourself: how do you know which part of the text is the model? If it is because it is the third word, then you just need a regex that extracts the third word. If it is because you know a bunch of models and you recognize one of them in the text, then you'll definitely need a database of models and an algorithm that recognizes them inside the text, basically emulating what your brain is doing. If you require such an algorithm, we could set it up as a custom project if you wish. You can contact us regarding this here.
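To make the "database of models" idea a bit more concrete, here is a minimal JavaScript sketch. Everything in it is an assumption for illustration: the function name splitWithKnownModels, the tiny in-memory model list standing in for a real database, and the simplification that the make is always a single word. It is not the custom algorithm offered above and does not use any Helium Scraper API.

```javascript
// Hypothetical sample list; a real project would load models from a database.
const KNOWN_MODELS = ["Grand Cherokee", "Enclave", "Rendezvous", "MDX"];

// Hypothetical helper: finds a known model right after the year and make.
// Assumes a 4-digit year and a single-word make.
function splitWithKnownModels(text, knownModels) {
  const match = /^\s*(\d{4})\s+(\S+)\s+(.*\S)\s*$/.exec(text);
  if (!match) {
    return null; // the text does not start with "<year> <make> ..."
  }
  const [, year, make, rest] = match;
  const lowerRest = rest.toLowerCase();
  // Try longer names first so a multi-word model wins over a shorter overlap.
  const candidates = [...knownModels].sort((a, b) => b.length - a.length);
  for (const model of candidates) {
    const lowerModel = model.toLowerCase();
    if (lowerRest === lowerModel || lowerRest.startsWith(lowerModel + " ")) {
      return { year, make, model, extra: rest.slice(model.length).trim() };
    }
  }
  return null; // no known model recognized in the text
}
```

The lookup is case-insensitive and returns null when no listed model follows the make, so unrecognized rows can be flagged for review instead of being split incorrectly.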

If you're interested in learning more about regexes, here is a pretty good tutorial.