Let's consider this string:
let text = """
She was young the way an actual young person is young.
"""
You could use a linguistic tagger:
import NaturalLanguage
let options = NSLinguisticTagger.Options.omitWhitespace.rawValue
let tagger = NSLinguisticTagger(tagSchemes: NSLinguisticTagger.availableTagSchemes(forLanguage: "en"), options: Int(options))
To count the multiplicity of each word I'll be using a dictionary:
var dict = [String : Int]()
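As a quick illustration of how such a dictionary accumulates counts, here is a minimal sketch that does not involve the tagger at all. The names `sampleTokens` and `tally` are made up for this example, and the token list is hand-written sample data rather than tagger output:

```swift
// Hypothetical tokens standing in for what a tagger might yield.
let sampleTokens = ["young", "way", "young", "person", "young"]

var tally = [String: Int]()
for token in sampleTokens {
    // The default-value subscript reads 0 when the key is missing,
    // then stores the incremented count back under that key.
    tally[token, default: 0] += 1
}

print(tally["young", default: 0])   // 3
```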
Let's define the accepted linguistic tags (you can change these to your liking):
let acceptedtags: Set = ["Verb", "Noun", "Adjective"]
Now let's parse the string, using the linguistic tagger:
let range = NSRange(location: 0, length: text.utf16.count)
tagger.string = text
tagger.enumerateTags(
    in: range,
    scheme: .nameTypeOrLexicalClass,
    options: NSLinguisticTagger.Options(rawValue: options),
    using: { tag, tokenRange, sentenceRange, stop in
        guard let range = Range(tokenRange, in: text)
            else { return }
        let token = String(text[range]).lowercased()
        if let tagValue = tag?.rawValue,
            acceptedtags.contains(tagValue)
        {
            dict[token, default: 0] += 1
        }
        // print(String(describing: tag) + ": \(token)")
    })
Now dict has the desired words with their multiplicities:
print("dict =", dict)
As you can see, a Dictionary is an unordered collection. Now let's introduce some law and order:
let ordered = dict.sorted {
($0.value, $1.key) > ($1.value, $0.key)
}
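The closure above relies on Swift comparing tuples element by element, from left to right. Because the keys are swapped between the two sides, entries end up ordered by count descending, with ties broken alphabetically. A small sketch with made-up counts (`sampleCounts` and `sampleOrdered` are hypothetical names, and the data is not tagger output):

```swift
// Hypothetical sample data, not produced by the tagger.
let sampleCounts = ["young": 3, "person": 1, "way": 1, "was": 1]

// Counts are compared first, and the larger count wins.
// On a tie the keys are compared with the sides swapped,
// so the alphabetically smaller key comes first.
let sampleOrdered = sampleCounts.sorted {
    ($0.value, $1.key) > ($1.value, $0.key)
}

print(sampleOrdered.map { $0.key })   // ["young", "person", "was", "way"]
```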
Now let's get the keys only:
let mostFrequent = ordered.map { $0.key }
And print the three most frequent words:
print("top three =", mostFrequent.prefix(3))
To get only the most frequent words, it would be more efficient to use a Heap (or a Trie) data structure, instead of hashing every word, sorting them all by frequency, and then taking a prefix. It should be a fun exercise.
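For the heap variant, one possible starting point is a bounded min-heap that never holds more than `k` entries. This is only a sketch under some assumptions: `topWords` is a hypothetical helper name, the word counts are assumed to be already collected in a dictionary (so the hashing step is still there), and only the full sort is replaced. With `n` distinct words, selection then takes roughly O(n log k) instead of O(n log n).

```swift
// Hypothetical helper, not part of the code above: returns the `k` words with
// the highest counts (ties broken alphabetically), using a bounded min-heap.
func topWords(_ counts: [String: Int], _ k: Int) -> [String] {
    typealias Entry = (word: String, count: Int)
    guard k > 0 else { return [] }

    // `a` ranks below `b`: lower count, or equal count and later alphabetically.
    func ranksBelow(_ a: Entry, _ b: Entry) -> Bool {
        a.count != b.count ? a.count < b.count : a.word > b.word
    }

    // Binary min-heap stored in an array: heap[0] is the lowest-ranked entry kept.
    var heap: [Entry] = []

    func siftUp(_ start: Int) {
        var child = start
        while child > 0 {
            let parent = (child - 1) / 2
            guard ranksBelow(heap[child], heap[parent]) else { break }
            heap.swapAt(child, parent)
            child = parent
        }
    }

    func siftDown(_ start: Int) {
        var parent = start
        while true {
            let left = 2 * parent + 1
            let right = left + 1
            var lowest = parent
            if left < heap.count && ranksBelow(heap[left], heap[lowest]) { lowest = left }
            if right < heap.count && ranksBelow(heap[right], heap[lowest]) { lowest = right }
            if lowest == parent { return }
            heap.swapAt(parent, lowest)
            parent = lowest
        }
    }

    for (word, count) in counts {
        let entry: Entry = (word, count)
        if heap.count < k {
            // Still room: add the entry and restore the heap order.
            heap.append(entry)
            siftUp(heap.count - 1)
        } else if ranksBelow(heap[0], entry) {
            // The new entry beats the weakest one kept so far: replace it.
            heap[0] = entry
            siftDown(0)
        }
    }

    // At most k entries remain, so this final sort is O(k log k).
    return heap.sorted { ranksBelow($1, $0) }.map { $0.word }
}

// Usage with made-up counts:
let sampleFrequencies = ["young": 3, "person": 1, "way": 1, "was": 1, "is": 1]
print(topWords(sampleFrequencies, 3))   // ["young", "is", "person"]
```

The tie-breaking rule mirrors the sort used earlier (count descending, then alphabetical), so for the same dictionary the result should match `mostFrequent.prefix(k)`. A Trie-based version would additionally replace the hashing step, which this sketch does not attempt.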