I'm currently trying out Foundation's NSLinguisticTagger from Swift. For testing I used the code from the AppCoda tutorial "Introduction to Natural Language Processing".
For English it works as expected and as described in the tutorial. But when I use NSLinguisticTagger on languages other than English, lemmatization, part-of-speech tagging and named entity recognition produce no useful results. I can understand this for named entity recognition, but for the first two I thought at least a basic result should be possible. Did I miss a language-specific setting, or is NSLinguisticTagger only good for language detection and tokenization when used with languages other than English?
Here's the code Sai Kambampati uses in his tutorial:
import Foundation
let quote = "Here's to the crazy ones. The misfits. The rebels. The troublemakers. The round pegs in the square holes. The ones who see things differently. They're not fond of rules. And they have no respect for the status quo. You can quote them, disagree with them, glorify or vilify them. About the only thing you can't do is ignore them. Because they change things. They push the human race forward. And while some may see them as the crazy ones, we see genius. Because the people who are crazy enough to think they can change the world, are the ones who do. - Steve Jobs (Founder of Apple Inc.)"
let tagger = NSLinguisticTagger(tagSchemes:[.tokenType, .language, .lexicalClass, .nameType, .lemma], options: 0)
let options: NSLinguisticTagger.Options = [.omitPunctuation, .omitWhitespace, .joinNames]
func determineLanguage(for text: String) {
    tagger.string = text
    let language = tagger.dominantLanguage
    print("The language is \(language!)")
}
determineLanguage(for: quote)
func tokenizeText(for text: String) {
    tagger.string = text
    let range = NSRange(location: 0, length: text.utf16.count)
    tagger.enumerateTags(in: range, unit: .word, scheme: .tokenType, options: options) { tag, tokenRange, stop in
        let word = (text as NSString).substring(with: tokenRange)
        print(word)
    }
}
tokenizeText(for: quote)
func partsOfSpeech(for text: String) {
    tagger.string = text
    let range = NSRange(location: 0, length: text.utf16.count)
    tagger.enumerateTags(in: range, unit: .word, scheme: .lexicalClass, options: options) { tag, tokenRange, _ in
        if let tag = tag {
            let word = (text as NSString).substring(with: tokenRange)
            print("\(word): \(tag.rawValue)")
        }
    }
}
partsOfSpeech(for: quote)
func namedEntityRecognition(for text: String) {
    tagger.string = text
    let range = NSRange(location: 0, length: text.utf16.count)
    let tags: [NSLinguisticTag] = [.personalName, .placeName, .organizationName]
    tagger.enumerateTags(in: range, unit: .word, scheme: .nameType, options: options) { tag, tokenRange, stop in
        if let tag = tag, tags.contains(tag) {
            let name = (text as NSString).substring(with: tokenRange)
            print("\(name): \(tag.rawValue)")
        }
    }
}
namedEntityRecognition(for: quote)
For the English text the result is exactly as expected, e.g. for part-of-speech tagging and named entity recognition:
The: Determiner
troublemakers: Noun
The: Determiner
round: Noun
pegs: Noun
...
Apple Inc.: Noun
Steve Jobs: PersonalName
Apple Inc.: OrganizationName
But for a German sentence
let quote = "Apple führt die Hitliste der Silicon-Valley-Unternehmen an, bei denen sich Ingenieure das Wohnen in der Nähe nicht mehr leisten können. Dahinter folgen das Portal Reddit (San Francisco), der Suchriese Google (Mountain View) und die sozialen Netzwerke Twitter (San Francisco) und Facebook (Menlo Park)"
only the language detection and the tokenization seem to work correctly. Part-of-speech tagging returns only "OtherWord" for every token, and named entity recognition returns no result at all:
Apple: OtherWord
führt: OtherWord
die: OtherWord
Hitliste: OtherWord
...
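One language-specific setting I have not ruled out is the orthography. The sketch below is my own variation, not code from the tutorial: partsOfSpeech(for:language:) is a hypothetical helper that pins the tagger to a language code via setOrthography(_:range:) and NSOrthography.defaultOrthography(forLanguage:) (which, as far as I can tell, requires iOS 11 / macOS 10.13 or later). Whether this changes the German output at all is only an assumption I want to test, not something I can confirm.

```swift
import Foundation

// Hypothetical helper (not from the tutorial): tags `text` with the
// lexical-class scheme after pinning the tagger to a language code.
func partsOfSpeech(for text: String, language: String) -> [(word: String, tag: String?)] {
    let tagger = NSLinguisticTagger(tagSchemes: [.lexicalClass], options: 0)
    tagger.string = text
    let range = NSRange(location: 0, length: text.utf16.count)

    // Assumption: an explicit orthography might help the tagger pick a
    // language-specific model. This is the setting under test.
    tagger.setOrthography(NSOrthography.defaultOrthography(forLanguage: language), range: range)

    var result: [(word: String, tag: String?)] = []
    let options: NSLinguisticTagger.Options = [.omitPunctuation, .omitWhitespace]
    tagger.enumerateTags(in: range, unit: .word, scheme: .lexicalClass, options: options) { tag, tokenRange, _ in
        let word = (text as NSString).substring(with: tokenRange)
        // Keep tokens without a tag too, so missing tags stay visible.
        result.append((word: word, tag: tag?.rawValue))
    }
    return result
}

for entry in partsOfSpeech(for: "Apple führt die Hitliste an.", language: "de") {
    print("\(entry.word): \(entry.tag ?? "no tag")")
}
```

Returning the untagged tokens as well makes it easy to tell "no tag" apart from "OtherWord" when comparing languages.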
Has anyone tried to use the class with languages other than English, or is it only seriously usable when working with English text? I couldn't find any Apple documentation explaining the language capabilities, apart from a list of languages that should be supported. Or am I doing something wrong?
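In the absence of documentation, the per-language capabilities can at least be queried at runtime. This is a minimal sketch, assuming NSLinguisticTagger.availableTagSchemes(forLanguage:) reflects what the tagger can actually deliver on the current system. missingSchemes(wanted:available:) is a hypothetical helper I added for illustration, and the language codes "en" and "de" are my choice.

```swift
import Foundation

// Hypothetical helper (not from the tutorial): returns the schemes from
// `wanted` that do not appear in `available`, keeping the order of `wanted`.
func missingSchemes(wanted: [NSLinguisticTagScheme],
                    available: [NSLinguisticTagScheme]) -> [NSLinguisticTagScheme] {
    return wanted.filter { !available.contains($0) }
}

// The same schemes the tagger in the tutorial code is created with.
let wanted: [NSLinguisticTagScheme] = [.tokenType, .language, .lexicalClass, .nameType, .lemma]

for language in ["en", "de"] {
    // Ask Foundation which schemes it reports for this language code.
    let available = NSLinguisticTagger.availableTagSchemes(forLanguage: language)
    let missing = missingSchemes(wanted: wanted, available: available)
    print("\(language): missing \(missing.map { $0.rawValue })")
}
```

If .lexicalClass or .lemma shows up as missing for "de", that would at least explain the "OtherWord" results above.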
Any comment pointing me to a solution is greatly appreciated.
Krid