
Extract adjacent word? (Names, streets, creeks, rivers)

Hi, I am looking for a function that I can run over a massive list of paragraphs to extract the word preceding ‘creek’, so that the creek names can be isolated.

For example a given paragraph might read:

“The site was located up stream three miles from the bridge along Clark Creek.”

The ideal output would be simply

Clark Creek

It would have to be something that looks up the word ‘creek’ as a criterion and extracts the preceding word; even just ‘Clark’ would work for me.

I have been playing around with the RSQLite package and gsub, but no luck so far… I am sure this is a common procedure.

j0k
Matthew Bayly
  • Uhh, what language? This may be as "simple" as the applicable version of `/\w+\s+Creek/i` - but that won't work for things like "Walking Man Creek" or "Adam's Creek" (and will match "the creek"), and good luck with cases such as "Breath of the Gods Creek" (notice how introducing lowercase words throws off heuristics that could have been applied to the former examples). NL is a PITA in general and regular expressions don't make it easier. – user2864740 Nov 15 '13 at 07:39
  • Anyway, I guess, besides specifying the language, also specify *all* the applicable input cases. – user2864740 Nov 15 '13 at 07:41
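
To make the caveats in these comments concrete, here is a minimal Python sketch. The thread never settles on a language, so Python is an assumption, as are the helper names `naive_matches` and `capitalized_matches` and the capitalized-run heuristic itself. It compares the simple case-insensitive pattern with a stricter one that accepts only a run of capitalized words before a capitalized "Creek".

```python
import re

# Naive pattern from the comment: any single word followed by "creek",
# matched case-insensitively.
NAIVE = re.compile(r"\w+\s+Creek", re.IGNORECASE)

# Heuristic (an assumption, not from the thread): one or more capitalized
# words, apostrophes allowed, followed by a capitalized "Creek".
CAPITALIZED = re.compile(r"(?:[A-Z][\w']*\s+)+Creek\b")


def naive_matches(text):
    """Return every 'word + creek' pair, regardless of capitalization."""
    return NAIVE.findall(text)


def capitalized_matches(text):
    """Return runs of capitalized words that end in 'Creek'."""
    return CAPITALIZED.findall(text)


text = "We camped by the creek near Walking Man Creek."
print(naive_matches(text))        # ['the creek', 'Man Creek']
print(capitalized_matches(text))  # ['Walking Man Creek']

# Lowercase words inside a name still defeat the heuristic.
print(capitalized_matches("They hiked Breath of the Gods Creek."))  # ['Gods Creek']
```

The stricter pattern drops "the creek" and keeps multi-word names, but it truncates names containing lowercase words and would also swallow a capitalized sentence-initial word that happens to sit directly before the name (for example "The Clark Creek").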

1 Answer

If you're extracting actual addresses, there are services which do this intelligently and can even verify the results: http://smartystreets.com/products/liveaddress-api/extract (To be fair, you should know I helped develop that, although I no longer work there.)

For place names, assuming the place is just one word, you could try a simple regex:

/(?<=\s)(\S+\s+(Creek|Street|River))/ig

Granted, I've never used RSQLite or gsub, but I imagine something like this would do the trick.
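
As a rough illustration of how that pattern might be applied, here is a small Python sketch; the language choice and the helper name `extract_places` are assumptions, not something from the question. It swaps the `(?<=\s)` lookbehind for `\b` so that a name at the very start of a paragraph is also found, and it adds a trailing `\b` so that words such as "Streetcar" are not treated as keywords.

```python
import re

# One non-space token followed by a keyword, matched case-insensitively.
# \b is used in place of the (?<=\s) lookbehind (a deliberate variation).
PLACE = re.compile(r"\b(\S+)\s+(Creek|Street|River)\b", re.IGNORECASE)


def extract_places(paragraph):
    """Return 'Name Keyword' strings found in one paragraph (hypothetical helper)."""
    return [f"{m.group(1)} {m.group(2)}" for m in PLACE.finditer(paragraph)]


paragraph = ("The site was located up stream three miles "
             "from the bridge along Clark Creek.")
print(extract_places(paragraph))  # ['Clark Creek']
```

Running `extract_places` over each paragraph in the list yields one list of candidates per paragraph. Because the match is case-insensitive, phrases such as "the creek" will also be returned; dropping `re.IGNORECASE` is one way to filter those out, at the cost of missing names written in lowercase.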

Matt