What approach can I use to predict the nationality of a person from the surname?

I have a huge list of texts and the surnames of their authors. I would like to identify which texts were written by Latin-language speakers and which by native English speakers, in order to study whether certain writing-style patterns differ between the two groups.

I have searched Google and PubMed for a database of surnames, but I could not find one that is freely accessible. Another approach is to use regexes, for example ".*ez" to identify some Hispanic surnames such as 'rodriguez', but that doesn't get me very far.
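To make the regex idea concrete, here is a minimal sketch of a suffix-rule table. Only the ".*ez" rule comes from the text above; the other two patterns, the labels, and the function name `guess_origin` are assumptions added for illustration. It guesses where a name may originate, not what language its bearer speaks.

```python
import re

# Illustrative suffix rules. Only ".*ez" comes from the question; the other
# two patterns are assumptions added to show the shape of the table.
SUFFIX_RULES = [
    (re.compile(r".*ez"), "hispanic"),
    (re.compile(r".*(ini|etti)"), "italian"),
    (re.compile(r".*(son|ton)"), "english"),
]


def guess_origin(surname):
    """Return a coarse origin label for a surname, or None if no rule matches."""
    name = surname.strip().lower()
    for pattern, label in SUFFIX_RULES:
        # fullmatch requires the whole surname to match the pattern
        if pattern.fullmatch(name):
            return label
    return None


if __name__ == "__main__":
    for s in ["Rodriguez", "Rossini", "Johnson", "Walker"]:
        print(s, guess_origin(s))
```

Names that match no rule come back as `None`, so they can be left blank for the manual pass instead of being assigned a wrong default.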

Do you have any suggestions? Since I will manually review all the associations after making the prediction, I don't need great accuracy, but any help or ideas are welcome.

dalloliogm
  • 4
    Someone at the TSA might know. – awm Sep 27 '11 at 13:48
  • 1
    Wow. That seems like quite a task. I doubt you'd be able to achieve any great accuracy as surnames can obviously change from generation to generation and people don't always consider themselves of a specific nationality even if their surname is from that nation. What kind of accuracy would you need on this anyway? I suppose if you had access to data such as phone books / census from different nations you could certainly look for common names and similarities to such common surnames. For example a difference of 1 character is basically the same name. – Vala Sep 27 '11 at 13:54
  • Having a Spanish surname does not imply that you are not a native English speaker, nor does it work in the other direction. – bitmask Sep 27 '11 at 13:56
  • Thanks everybody. I forgot to say that this is just a spare-time project of mine, so I don't need great accuracy. Moreover, since I am going to manually review everything, I only need this as a support, to make the manual review step easier. – dalloliogm Sep 27 '11 at 15:01
  • 1
    We used to have a utility that did this when I worked at Experian. Obviously it couldn't guess where someone was born, but given a surname, it suggested where that family name originated. As you've already noted though, this wasn't a free resource unfortunately. – Robbie Dee May 22 '13 at 10:10
  • It would be interesting to know what accuracy is actually possible. When applied to my own family, it fails miserably with 0% due to migration, marriage, etc. – Cee McSharpface Mar 29 '17 at 12:35
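
The suggestion in the comments above, to look up surnames in per-country lists and treat names that differ by one character as the same name, could be sketched as follows. The `REFERENCE` lists are tiny placeholders standing in for phone-book or census data, and the names `edit_distance` and `closest_origin` are hypothetical; as the comments point out, the result hints at a name's origin at best.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Placeholder reference lists (assumption): real ones would come from
# phone books or census data, which this sketch does not include.
REFERENCE = {
    "hispanic": ["rodriguez", "garcia", "martinez"],
    "english": ["smith", "johnson", "taylor"],
}


def closest_origin(surname, max_distance=1):
    """Return the label of the nearest reference name within max_distance, else None."""
    name = surname.strip().lower()
    best = None  # (label, distance) of the closest match so far
    for label, names in REFERENCE.items():
        for ref in names:
            d = edit_distance(name, ref)
            if d <= max_distance and (best is None or d < best[1]):
                best = (label, d)
    return best[0] if best else None
```

With `max_distance=1`, a spelling variant such as "Rodrigues" still lands on the same list as "Rodriguez", while unrelated names return `None` and stay unlabelled.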

4 Answers

4

I don't think you can do this with any degree of reliability. A Rodriguez may well have a name of Spanish origin, but could have been born and brought up anywhere. They could be second-generation British, never have had Spanish spoken around them, and so fall into the category of native English speaker.

Schroedingers Cat
3

If these are actual published authors, maybe you could spider Amazon and check their 'Author information' details?

I don't think you can guess. Take Irish last names: there are an estimated 80,000,000 people with Irish heritage, but only 4.5 million of them live in Ireland or went through Irish education.

Dave Walker
2

There is no meaningful way to do this. There is no reason why people with Hispanic names cannot be native English speakers.

If you are going to revise it anyway, why not use the data you have?

Mathias Schwarz
  • I need to do this for a huge list of texts, so I need this to set up the default values and make the work easier. – dalloliogm Sep 28 '11 at 08:20
1

Assuming you intend to compare the texts programmatically, you have to categorize them manually. Incorrect guesses would likely lead you to build a broken algorithm for textual analysis. This is especially problematic with machine learning methods such as artificial neural networks.

mikerobi