Is there any better way of extracting proper nouns (e.g. "London", "John Smith", "Gulf of Carpentaria") from free text?

That is, a function like

proper_nouns <- function(text_input) {
  # ...
}

that would extract a list of proper nouns from the text input(s).

Examples

Here is a set of 7 text inputs (some easy, some harder):

text_inputs <- c("a rainy London day",
  "do you know John Smith?",
  "sail the Adriatic",
  
  # tougher examples
  
  "Hey Tom, where's Fred?" # more than one proper noun in the sentence
  "Hi Lisa, I'm Joan." # more than one proper noun in the sentence, separated by capitalized word
  "sail the Gulf of Carpentaria", # proper noun containing an uncapitalized word
  "The great Joost van der Westhuizen." # proper noun containing two uncapitalized words
  )

And here's what such a function, set of rules, or AI should return:

proper_nouns(text_inputs)

[[1]]
[1] "London"

[[2]]
[1] "John Smith" 

[[3]]
[1] "Adriatic"

[[4]]
[1] "Tom"    "Fred"

[[5]]
[1] "Lisa"    "Joan"

[[6]]
[1] "Gulf of Carpentaria"

[[7]]
[1] "Joost van der Westhuizen"

Problems: simple regexes are imperfect

Consider some simple regex rules, which have obvious imperfections (a code sketch follows this list):

  • Rule: take capitalized words, unless they're the first word in the sentence (which would ordinarily be capitalized anyway). Problem: this misses proper nouns at the start of a sentence.

  • Rule: assume successive capitalized words are parts of the same proper noun (multi-part proper nouns like "John Smith"). Problem: "Gulf of Carpentaria" would be missed since it has an uncapitalized word in between.

    • Similar problem with people's names containing uncapitalized words, e.g. "Joost van der Westhuizen".
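
For concreteness, here is a minimal sketch of those two rules in base R (the function name proper_nouns_regex is just for illustration); it reproduces the failures described above:

proper_nouns_regex <- function(text_input) {
  lapply(text_input, function(x) {
    words <- unlist(strsplit(x, "[[:space:]]+"))
    caps  <- grepl("^[A-Z]", words)
    caps[1] <- FALSE  # Rule 1: skip the (ordinarily capitalized) first word
    if (!any(caps)) return(character(0))
    # Rule 2: runs of successive capitalized words form one proper noun
    runs   <- rle(caps)
    ends   <- cumsum(runs$lengths)
    starts <- ends - runs$lengths + 1
    keep   <- runs$values
    mapply(function(s, e) gsub("[[:punct:]]", "", paste(words[s:e], collapse = " ")),
           starts[keep], ends[keep])
  })
}

proper_nouns_regex(text_inputs)
# finds "London", "John Smith", "Tom"/"Fred", but splits "Gulf of Carpentaria"
# into "Gulf" and "Carpentaria", and merges "Hi Lisa, I'm Joan." into
# "Lisa Im Joan" because the capitalized "I'm" bridges the two names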

Question

The best approach I currently have is to use the regex rules above and make do with a low success rate. Is there a better or more accurate way to extract proper nouns from text in R? Even 80-90% accuracy on real text would be great.

stevec
  • Using spacyr like below or udpipe will give you a start with upos code PROPN or xpos NNP, but names like Joost are Dutch and tend not to be in the English dictionaries used by spacyr or udpipe, so they will not be recognized, to say nothing of the word "of" between geographic names. You might have some success with geographical features if you can get your hands on lists of seas, rivers, places etc., and then use something like crfsuite or nametagger to build your own entity recognition models. – phiver Apr 25 '21 at 09:56
  • thanks @phiver - I will give those a try. I never anticipated getting close to 100% in any case. It may be a bit early, but only by a few years, to look to R interfaces to tools like [GPT-x](https://openai.com/blog/openai-licenses-gpt-3-technology-to-microsoft/) to "understand" which segments of sentences are proper nouns, rather than using lists of known proper nouns. I am not sure if the current versions of those models are capable of doing that though (and even if they are, whether they're available to the public yet) – stevec Apr 25 '21 at 10:02
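
For reference, a minimal udpipe sketch along the lines of phiver's comment (spacyr is covered in the answer below); udpipe_download_model fetches an English treebank model on first use:

library(udpipe)

m     <- udpipe_download_model(language = "english")  # one-time model download
ud_en <- udpipe_load_model(m$file_model)
anno  <- as.data.frame(udpipe_annotate(ud_en, x = text_inputs))

# keep tokens whose universal POS tag is PROPN, grouped by document
propn <- subset(anno, upos == "PROPN")
split(propn$token, propn$doc_id)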

1 Answer


You can start by taking a look at the spacyr library.

library(spacyr)
spacy_initialize()  # requires a spaCy installation with an English model

result <- spacy_parse(text_inputs, tag = TRUE, pos = TRUE)
proper_nouns <- subset(result, pos == 'PROPN')
split(proper_nouns$token, proper_nouns$doc_id)

#$text1
#[1] "London"

#$text2
#[1] "John"  "Smith"

#$text3
#[1] "Adriatic"

#$text4
#[1] "Hey" "Tom"

#$text5
#[1] "Lisa" "Joan"

#$text6
#[1] "Gulf"        "Carpentaria"

This treats every word separately, hence "John" and "Smith" are not combined. You may need to add some rules on top of this and do some post-processing if you need multi-word names; a rough sketch is below.
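
For illustration, a rough post-processing sketch (merge_propn is a made-up helper, assuming spacy_parse()'s standard columns doc_id, sentence_id, token_id, token and pos, with token_id the integer position within the sentence). It joins PROPN tokens that occupy consecutive positions in the same sentence into a single name:

# merge PROPN tokens that are adjacent within a sentence into one name
merge_propn <- function(parsed) {
  propn <- subset(parsed, pos == "PROPN")
  lapply(split(propn, propn$doc_id), function(d) {
    d <- d[order(d$sentence_id, d$token_id), ]
    # a new run starts whenever a token is not adjacent to the previous one
    new_run <- c(TRUE, diff(d$token_id) != 1 | diff(d$sentence_id) != 0)
    unname(tapply(d$token, cumsum(new_run), paste, collapse = " "))
  })
}

merge_propn(result)
# $text2 becomes "John Smith", but "Gulf of Carpentaria" still comes out as
# "Gulf" and "Carpentaria", because "of" is not tagged PROPN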

Ronak Shah
  • +1 because it's a great start. But note `"John" "Smith"` should be `"John Smith"` (vector length 1 not 2), and also note `"Hey" "Tom"` should be `"Tom" "Fred"`. Also, `"Gulf" "Carpentaria"` should be `"Gulf of Carpentaria"` (length 1, not 2) – stevec Apr 25 '21 at 05:04
  • @stevec, You could try using the spaCy model "en_core_web_trf". It scores higher on named entity recognition. You do have to download it first via the command line: `python -m spacy download en_core_web_trf`. Note that this is a 438 MB file. The issue of the name "John Smith" will still be there. – phiver Apr 25 '21 at 13:20
  • If you want to use this kind of parsing from spaCy but do prefer to have output in dataframes/tibbles, check out the [cleanNLP](https://statsmaths.github.io/cleanNLP/) package. – Julia Silge Apr 26 '21 at 16:02