When analysing text, it can be useful to identify the names of people that appear in it.

Objects prepackaged in tidytext include:

  • English negators, modals, and adverbs (nma_words)
  • Parts of Speech (parts_of_speech)
  • Sentiments (sentiments), and
  • Stop Words (see: ?stop_words)

Is there a similar object in R (or in an accessible format elsewhere) containing a canonical list of names?

For reference, here are the existing data frames supplied with tidytext:

nma_words
# # A tibble: 44 x 2
#    word      modifier
#    <chr>     <chr>   
#  1 cannot    negator 
#  2 could not negator 
#  3 did not   negator 
#  4 does not  negator 
#  5 had no    negator 
#  6 have no   negator 
#  7 may not   negator 
#  8 never     negator 
#  9 no        negator 
# 10 not       negator 
# # … with 34 more rows


parts_of_speech
# # A tibble: 208,259 x 2
#    word    pos      
#    <chr>   <chr>    
#  1 3-d     Adjective
#  2 3-d     Noun     
#  3 4-f     Noun     
#  4 4-h'er  Noun     
#  5 4-h     Adjective
#  6 a'      Adjective
#  7 a-1     Noun     
#  8 a-axis  Noun     
#  9 a-bomb  Noun     
# 10 a-frame Noun     
# # … with 208,249 more rows


sentiments
# # A tibble: 6,786 x 2
#    word        sentiment
#    <chr>       <chr>    
#  1 2-faces     negative 
#  2 abnormal    negative 
#  3 abolish     negative 
#  4 abominable  negative 
#  5 abominably  negative 
#  6 abominate   negative 
#  7 abomination negative 
#  8 abort       negative 
#  9 aborted     negative 
# 10 aborts      negative 
# # … with 6,776 more rows


stop_words
# # A tibble: 1,149 x 2
#    word        lexicon
#    <chr>       <chr>  
#  1 a           SMART  
#  2 a's         SMART  
#  3 able        SMART  
#  4 about       SMART  
#  5 above       SMART  
#  6 according   SMART  
#  7 accordingly SMART  
#  8 across      SMART  
#  9 actually    SMART  
# 10 after       SMART  
# # … with 1,139 more rows

stevec
  • There are tens of millions of unique names in the world, and people create new ones every day. Any such list would be massive, incomplete, frequently out-of-date and never canonical. – Allan Cameron Apr 27 '20 at 00:01
  • @AllanCameron Yep, it would be. As with the other lists, it would be a case of user beware. – stevec Apr 27 '20 at 00:05
  • You might start with the `babynames` package, although this is built solely from US data. – Ritchie Sacramento Apr 27 '20 at 00:07
  • Check the [lexicon package](https://github.com/trinker/lexicon). It contains lists of common names, first names, last names and a whole bunch more. – phiver Apr 27 '20 at 09:17
  • `library(lexicon); lexicon::freq_first_names` has ~5500 names from the 1990 US census. `library(genderdata); ssa_national` comes from the Social Security Administration and covers 1880 through 2012, showing some ~9500 female names and a similar number of male names in total. – stevec May 01 '20 at 23:18

1 Answer

Datasets like these are super complicated and must be used with care. One source of such data is the genderdata package, which includes several name datasets, some of them from the US Social Security Administration.

library(genderdata)

head(ssa_national)
#>    name year female male
#> 1 aaban 2007      0    5
#> 2 aaban 2009      0    6
#> 3 aaban 2010      0    9
#> 4 aaban 2011      0   11
#> 5 aaban 2012      0   11
#> 6 aabha 2011      7    0

Created on 2020-04-27 by the reprex package (v0.3.0)
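
A minimal sketch of how a names list might then be used to flag names in text. It uses base R only, and the tiny `name_list` vector and the example sentence are hypothetical stand-ins. In practice the list would come from a dataset such as the `name` column of `ssa_national`.

```r
# Hypothetical stand-in for a real names list; a real one would be far larger
name_list <- c("julia", "allan", "ritchie")

# Example sentence, made up for illustration
text <- "Julia and Allan discussed the data with Ritchie."

# Lowercase the text, then split on runs of characters that are not
# letters or apostrophes to get word tokens
tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
tokens <- tokens[nzchar(tokens)]

# Keep only the tokens that appear in the names list
found_names <- tokens[tokens %in% name_list]
print(found_names)
```

Simple lookup like this matches any token that happens to be on the list, so ordinary words that are also names (for example "will" or "grace") would be flagged too. In a tidytext workflow the same idea would usually be expressed as `unnest_tokens()` followed by a join against the names table.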

Julia Silge