How to put and sort words in NSCountedSet in Swift?

Question

I'm trying to get the most frequently repeated words from a string with this code.

import Foundation

let text = """
  aa bb aa bb aa bb cc dd dd cc zz zz cc dd zz
  """
// Split on every scalar that is not alphanumeric, dropping empty pieces.
let words = text.unicodeScalars
  .split(omittingEmptySubsequences: true, whereSeparator: { !CharacterSet.alphanumerics.contains($0) })
  .map { String($0) }
let wordSet = NSCountedSet(array: words)
// Most frequent first; ties are left in arbitrary order.
let sorted = wordSet.sorted { wordSet.count(for: $0) > wordSet.count(for: $1) }
print(sorted.prefix(3))

The result is:

[cc, dd, aa]

Currently, it adds every word, even if it is a single character.

What I want to do is:

only add a word to the NSCountedSet if it has more than one character.
if words in the NSCountedSet have the same count, sort them alphabetically (the desired result is aa, cc, dd).

And, if possible:

omit certain parts of speech from the string, such as 'and', 'a', 'how', 'of', 'to', 'it', 'in', 'on', 'who', etc.

What is your question? You have listed some things you want to do, but it is hard to understand what your issue is. — Joakim Danielson, Mar 26 '20 at 15:02

score 1 · Answer 1 · answered Mar 26 '20 at 15:37

Let's consider this string:

let text = """
      She was young the way an actual young person is young.
      """

You could use a linguistic tagger:

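import Foundation       // NSLinguisticTagger is declared in Foundation
import NaturalLanguage  // provides the NLLanguage constants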

let options = NSLinguisticTagger.Options.omitWhitespace.rawValue
let tagger = NSLinguisticTagger(tagSchemes: NSLinguisticTagger.availableTagSchemes(forLanguage: "en"), options: Int(options))

To count the multiplicity of each word I'll be using a dictionary:

var dict = [String : Int]()
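As a minimal sketch of the counting step on its own, without the tagger, the snippet below fills such a dictionary from space-separated words. It also applies the two filters from the question: skipping one-character words and skipping a small stop-word list. The names `stopWords`, `sample`, and `counts`, as well as the sample sentence, are illustrative assumptions and not part of the original code.

```swift
// Hypothetical stop-word list; extend it to suit your text.
let stopWords: Set<String> = ["and", "a", "how", "of", "to", "it", "in", "on", "who"]

// Made-up sample sentence.
let sample = "a cat and a dog saw a cat in it"

var counts = [String: Int]()
for word in sample.lowercased().split(separator: " ").map(String.init) {
    // Skip single-character words and stop words.
    guard word.count > 1, !stopWords.contains(word) else { continue }
    // Start at 0 the first time a word is seen, then increment.
    counts[word, default: 0] += 1
}

print(counts)  // key order is unspecified
```

Filtering by a fixed word list is cruder than filtering by linguistic tag, but it needs no tagger and works the same way for any language you can supply a list for.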

Let's define the accepted linguistic tags (you can change these to your liking):

let acceptedtags: Set = ["Verb", "Noun", "Adjective"]

Now let's parse the string, using the linguistic tagger:

let range = NSRange(location: 0, length: text.utf16.count)
tagger.string = text

tagger.enumerateTags(
    in: range,
    scheme: .nameTypeOrLexicalClass,
    options: NSLinguisticTagger.Options(rawValue: options),
    using: { tag, tokenRange, sentenceRange, stop in
        guard let range = Range(tokenRange, in: text)
            else { return }

        let token = String(text[range]).lowercased()

        if let tagValue = tag?.rawValue,
            acceptedtags.contains(tagValue)
        {
            dict[token, default: 0] += 1
        }

        // print(String(describing: tag) + ": \(token)")
})

Now the dict has the desired words with their multiplicity:

print("dict =", dict)

As you can see, a Dictionary is an unordered collection. Now let's introduce some law and order:

let ordered = dict.sorted {
    // Higher count first; for equal counts, alphabetical order by key.
    ($0.value, $1.key) > ($1.value, $0.key)
}
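To see why the keys are swapped inside the tuples, here is a small sketch with a hand-made dictionary (the `sampleCounts` values are made up for illustration). Swift compares tuples element by element, so the counts are compared in descending order, and ties fall through to the keys, which end up in ascending order because their operands are reversed.

```swift
// Made-up counts with a three-way tie at 3.
let sampleCounts = ["dd": 3, "aa": 3, "zz": 1, "cc": 3]

let byCountThenName = sampleCounts.sorted {
    // Higher count first; on equal counts, the smaller key first.
    ($0.value, $1.key) > ($1.value, $0.key)
}

print(byCountThenName.map { $0.key })  // ["aa", "cc", "dd", "zz"]
```

Because every key in a dictionary is unique, the comparison never sees two fully equal entries, so the resulting order is deterministic even though the dictionary's own iteration order is not.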

Now let's get the keys only:

let mostFrequent = ordered.map { $0.key }

and print the three most frequent words:

print("top three =", mostFrequent.prefix(3))

To get only the few most frequent words, it would be more efficient to use a heap (or a trie) data structure instead of hashing every word, sorting them all by frequency, and then taking a prefix. It should be a fun exercise.
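One possible take on the heap idea is sketched below. It assumes the words have already been counted into a dictionary, so it only replaces the full sort: a binary heap that never holds more than `k` entries keeps the best candidates seen so far, which costs roughly O(n log k) instead of O(n log n). The names `Entry`, `ranksAhead`, `topEntries`, and the `frequencies` data are hypothetical and chosen for this example only.

```swift
typealias Entry = (key: String, value: Int)

/// True when `a` should rank ahead of `b`:
/// higher count first, then alphabetical order.
func ranksAhead(_ a: Entry, _ b: Entry) -> Bool {
    (a.value, b.key) > (b.value, a.key)
}

/// Picks the `k` best entries using a heap of at most `k` items.
func topEntries(_ counts: [String: Int], k: Int) -> [Entry] {
    guard k > 0 else { return [] }
    var heap: [Entry] = []  // heap[0] is the worst-ranked entry kept so far

    func siftUp(_ start: Int) {
        var child = start
        while child > 0 {
            let parent = (child - 1) / 2
            // The parent must not rank ahead of its child.
            guard ranksAhead(heap[parent], heap[child]) else { break }
            heap.swapAt(parent, child)
            child = parent
        }
    }

    func siftDown(_ start: Int) {
        var parent = start
        while true {
            let left = 2 * parent + 1
            let right = left + 1
            var worst = parent
            if left < heap.count, ranksAhead(heap[worst], heap[left]) { worst = left }
            if right < heap.count, ranksAhead(heap[worst], heap[right]) { worst = right }
            if worst == parent { return }
            heap.swapAt(parent, worst)
            parent = worst
        }
    }

    for entry in counts {
        if heap.count < k {
            heap.append(entry)
            siftUp(heap.count - 1)
        } else if ranksAhead(entry, heap[0]) {
            // The new entry beats the worst one kept: replace it.
            heap[0] = entry
            siftDown(0)
        }
    }
    // Only the k survivors are sorted, best first.
    return heap.sorted(by: ranksAhead)
}

// Made-up frequencies for illustration.
let frequencies = ["apple": 4, "pear": 2, "fig": 2, "kiwi": 1, "plum": 2]
print(topEntries(frequencies, k: 3).map { $0.key })  // ["apple", "fig", "pear"]
```

For a handful of words the full sort is simpler and just as fast; the bounded heap only pays off when the dictionary is large and `k` is small.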

How about in different languages? Is there a list of languages that support NSLinguisticTagger? — alphonse, Mar 28 '20 at 10:19
@alphonse [Here](https://developer.apple.com/documentation/naturallanguage/nllanguage) is a list of supported languages. To get the associated String value of a language, all you have to do is get the `rawValue` property. For example, for French that would be: `NLLanguage.french.rawValue` — ielyamani, Mar 28 '20 at 11:16
Sorry, but I didn't understand. Where should I put this `rawValue`? — alphonse, Mar 29 '20 at 05:18
