Swift NSLinguisticTagger results for languages other than English

Question

I'm currently trying out NSLinguisticTagger in Swift. For testing I used the code from the AppCoda tutorial "Introduction to Natural Language Processing".

For English it works as expected and as described in the tutorial. But when I use NSLinguisticTagger on languages other than English, lemmatization, part-of-speech tagging and named entity recognition produce no useful results. I can understand this for named entity recognition, but for the first two I thought at least a basic result should be possible. Did I miss a language-specific setting, or is NSLinguisticTagger only good for language detection and tokenization when used with languages other than English?

Here's the code Sai Kambampati uses in his tutorial:

import Foundation

let quote = "Here's to the crazy ones. The misfits. The rebels. The troublemakers. The round pegs in the square holes. The ones who see things differently. They're not fond of rules. And they have no respect for the status quo. You can quote them, disagree with them, glorify or vilify them. About the only thing you can't do is ignore them. Because they change things. They push the human race forward. And while some may see them as the crazy ones, we see genius. Because the people who are crazy enough to think they can change the world, are the ones who do. - Steve Jobs (Founder of Apple Inc.)"

let tagger = NSLinguisticTagger(tagSchemes:[.tokenType, .language, .lexicalClass, .nameType, .lemma], options: 0)
let options: NSLinguisticTagger.Options = [.omitPunctuation, .omitWhitespace, .joinNames]

func determineLanguage(for text: String) {
  tagger.string = text
  let language = tagger.dominantLanguage
  print("The language is \(language!)")
}

determineLanguage(for: quote)

func tokenizeText(for text: String) {
  tagger.string = text
  let range = NSRange(location: 0, length: text.utf16.count)
  tagger.enumerateTags(in: range, unit: .word, scheme: .tokenType, options: options) { tag, tokenRange, stop in
      let word = (text as NSString).substring(with: tokenRange)
      print(word)
  }
}

tokenizeText(for: quote)

func partsOfSpeech(for text: String) {
  tagger.string = text
  let range = NSRange(location: 0, length: text.utf16.count)
  tagger.enumerateTags(in: range, unit: .word, scheme: .lexicalClass, options: options) { tag, tokenRange, _ in
      if let tag = tag {
          let word = (text as NSString).substring(with: tokenRange)
          print("\(word): \(tag.rawValue)")
      }
  }
}

partsOfSpeech(for: quote)

func namedEntityRecognition(for text: String) {
  tagger.string = text
  let range = NSRange(location: 0, length: text.utf16.count)
  let tags: [NSLinguisticTag] = [.personalName, .placeName, .organizationName]
  tagger.enumerateTags(in: range, unit: .word, scheme: .nameType, options: options) { tag, tokenRange, stop in
      if let tag = tag, tags.contains(tag) {
          let name = (text as NSString).substring(with: tokenRange)
          print("\(name): \(tag.rawValue)")
      }
  }
}

namedEntityRecognition(for: quote)

For the English sentence the result is exactly as expected.

e.g. for the Parts of Speech Tagging and the Named Entity Recognition:

The: Determiner

troublemakers: Noun

The: Determiner

round: Noun

pegs: Noun

...

Apple Inc.: Noun

Steve Jobs: PersonalName

Apple Inc.: OrganizationName

But for a German sentence

let quote = "Apple führt die Hitliste der Silicon-Valley-Unternehmen an, bei denen sich Ingenieure das Wohnen in der Nähe nicht mehr leisten können. Dahinter folgen das Portal Reddit (San Francisco), der Suchriese Google (Mountain View) und die sozialen Netzwerke Twitter (San Francisco) und Facebook (Menlo Park)"

only the language detection and the tokenization seem to work correctly. Part-of-speech tagging returns only "OtherWord" for every token, and named entity recognition returns no result at all:

Apple: OtherWord

führt: OtherWord

die: OtherWord

Hitliste: OtherWord

...

Has anyone tried to use the class with languages other than English, or is it only seriously usable for English text? I couldn't find any Apple documentation explaining the language capabilities, apart from a list of languages that should be supported. Or am I doing something wrong?

Any comment pointing me to a solution is greatly appreciated.

Krid

It seems I have the same issue with Russian lemmatization using the native Swift tools. It works perfectly for English, but not for Russian. — Oleh Veheria, Dec 22 '18 at 08:07
I finally solved the issue by simply repeating the import of the Foundation module. Somehow it wasn't recognized during the first runs. After repeating the import the whole thing started working, and it still does. You can also try pasting the code into a new Playground. — Krid, Dec 24 '18 at 15:09
Nope, in a new Playground this code example still doesn't lemmatize Russian text correctly for me. It recognizes some words and not others (65%). If I understand your comment correctly, you just repeated "import Foundation"? — Oleh Veheria, Dec 26 '18 at 07:20
You're right. I tried it with some Russian text and it did not produce any useful results. It appears that the Russian language model is incomplete with regard to NER and POS. — Krid, Dec 29 '18 at 14:09
When I use it with Chinese, POS tagging does not work either. — sixsixsix, Dec 05 '19 at 13:23

jz_ · Answer 1 · 2019-03-08T12:58:24.087

I have not tested your situation above, but I am attaching the following snippet, which I use in developing a part-of-speech tagger. It includes a setLanguage call and a setOrthography call. (I have not experimented with the latter yet.)

My understanding is that the tagger is supposed to recognize the language and switch languages if needed, or the language can be set explicitly. The logic it uses for this does not appear to be fully documented. I have concluded that the best practice is to set the language whenever I can. In this code the language is stored in the string language. (BTW, in my case it is determined by reading a larger document that is also available.)
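Since the selection logic is not spelled out, one way to see what the tagger claims to support for a given language is to query the available tag schemes. The following is a minimal sketch, assuming an Apple platform where NSLinguisticTagger is available (iOS 11 / macOS 10.13 or later for this overload); the list of language codes and the `schemesByLanguage` dictionary are only illustrative choices matching the languages mentioned in this thread.

```swift
import Foundation

// Illustrative language codes (assumption): English, German, Russian, Simplified Chinese.
let languageCodes = ["en", "de", "ru", "zh-Hans"]

// Maps each language code to the names of the word-level tag schemes reported for it.
var schemesByLanguage: [String: [String]] = [:]

for code in languageCodes {
    // Ask Foundation which tag schemes it reports for word-level tagging in this language.
    let schemes = NSLinguisticTagger.availableTagSchemes(for: .word, language: code)
    schemesByLanguage[code] = schemes.map { $0.rawValue }.sorted()
}

for code in languageCodes {
    let names = schemesByLanguage[code] ?? []
    print("\(code): \(names.joined(separator: ", "))")
}
```

If a scheme such as LexicalClass or Lemma is missing from the list for a language, that would be consistent with the tagger returning only "OtherWord" or no tags for that language. The reported list can differ between OS versions.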

Lastly, I had a chance to see this in action this week. I was in an Apple store (in the U.S.) on another matter and observed another customer testing a phone and discussing wanting to message in French. The tech demonstrated that if iMessage keeps seeing French, it will begin to understand it. My takeaway from watching this is that automatic detection did work, but it is better to make the switch externally if possible.

    if let language = language {
        // If language has a value, it is taken as a specification for the language of the text and set on the tagger.
        let orthography = NSOrthography.defaultOrthography(forLanguage: language)
        POStagger.setOrthography(orthography, range: range)
        POStagger.setLanguage(NLLanguage(rawValue: language), range: range)
    }
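For context, here is a minimal self-contained sketch of how the fragment above might be used end to end. It assumes POStagger is an NLTagger from the NaturalLanguage framework (the use of NLLanguage suggests this, but the fragment does not show it), and the sample sentence, the "de" language code, and the `tagged` array are hypothetical values chosen for illustration. Whether pinning the language actually improves the tags depends on the language models shipped with the OS.

```swift
import Foundation
import NaturalLanguage

// Hypothetical inputs for illustration: a German sentence and its language code.
let text = "Apple führt die Hitliste der Silicon-Valley-Unternehmen an."
let language: String? = "de"

// Assumption: the tagger is an NLTagger configured for part-of-speech tags.
let posTagger = NLTagger(tagSchemes: [.lexicalClass])
posTagger.string = text
let range = text.startIndex..<text.endIndex

if let language = language {
    // Pin the language and its default orthography instead of relying on auto-detection.
    let orthography = NSOrthography.defaultOrthography(forLanguage: language)
    posTagger.setOrthography(orthography, range: range)
    posTagger.setLanguage(NLLanguage(rawValue: language), range: range)
}

// Collect (token, tag) pairs; a missing tag is recorded as "no tag".
var tagged: [(token: String, tag: String)] = []
let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace]
posTagger.enumerateTags(in: range, unit: .word, scheme: .lexicalClass, options: options) { tag, tokenRange in
    tagged.append((String(text[tokenRange]), tag?.rawValue ?? "no tag"))
    return true  // keep enumerating
}

for entry in tagged {
    print("\(entry.token): \(entry.tag)")
}
```

Running the same text with and without the `if let` block is a simple way to compare auto-detection against an explicitly set language.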
