How to get the N most common words in a string in Typescript?

Question

Complete noob to JavaScript/TypeScript here. How can I get the n most common words found in a sample text that contains punctuation, such as the following?

const sampleText = "hello world this is taco here is some foo bar text to say hello to my world of tacos in the world of text and it is very cool thanks stackoverflow for it's my birthday. This text also contains punctuation and my mom's car and periods and such. I like apples, pie, and apple pie. Case should be ignored so case and Case are the same. It's and its are two different words!"

I think punctuation can be filtered out of the resulting list after the fact, if that makes it easier.

Welcome to Stack Overflow! Questions here are best when you show your attempt, and we can help you with where you got stuck, or any errors you are getting. Please read [how to ask](https://stackoverflow.com/help/how-to-ask) and then [edit] this question with more information including your code so far. We are here to help you learn, but not to do your task for you. — Alex Wayne, Oct 14 '22 at 19:05
You should try to solve the problem yourself first. If you get stuck on the algorithm, you can share it in the question so we can help you. — ipikuka, Oct 14 '22 at 19:10
I think the answer depends on your specific requirements around punctuation and case and how to break ties, but [this approach](https://tsplay.dev/mq3eQw) is how I'd interpret what you're asking for. If that meets your needs I could write up an answer; otherwise, what am I missing? (Pls mention @jcalz in a comment to notify me if you reply) — jcalz, Oct 14 '22 at 19:17

score 0 · Accepted Answer · answered Oct 14 '22 at 19:16

Obviously you'll have to break that chunk of text up into words.

Then you'll need to count the occurrences of each (unique) word.

What is a "word"? Well, most straightforwardly, it's the characters between spaces.

You mention that you want to ignore punctuation.

Also, you probably want to ignore lettercase: "Hello" is the same word as "hello".

Step by step:

Convert the entire string to lowercase

let lowerText = sampleText.toLowerCase()

Remove punctuation from the string

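This is easiest to do with a regular expression. This one replaces every character that's not a letter, digit, or dash with a space. Because the text is already lowercase, the character class only needs a-z. Note that apostrophes are replaced too, so "it's" becomes the two words "it" and "s"; add ' to the character class if you want to keep contractions intact.

let stringWithoutPunct = lowerText.replace(/[^a-z0-9-]/g, ' ')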

Separate that chunk of text into separate words

let rawWords = stringWithoutPunct.split(' ')

Note that this will result in some "words" that are the empty string wherever the string has two consecutive spaces, or a leading or trailing space (for example, where punctuation was replaced next to an existing space). We'll make sure to ignore those items in subsequent steps.
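The first three steps can also be folded into one helper. This is only a sketch: the name tokenize is made up for illustration, and it assumes the same definition of a "word" as above (letters, digits, and dashes only). Splitting on runs of whitespace and filtering out empty strings means later steps don't have to skip them.

```typescript
// Hypothetical helper (not part of the original answer): lowercases the text,
// replaces anything that is not a letter, digit, or dash with a space,
// then splits on whitespace and drops empty strings.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9-]/g, ' ')
    .split(/\s+/) // one split point per run of whitespace
    .filter((word) => word.length > 0) // leading/trailing spaces still yield ''
}

const words = tokenize("I like apples, pie, and apple pie.")
// words is ['i', 'like', 'apples', 'pie', 'and', 'apple', 'pie']
```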

Produce a list of unique words

let uniqueWords: Array<string> = []
for(let word of rawWords) {
  // if this word is the empty string, ignore it
  if(word === '') continue
  // if this word is already on the list, ignore it
  if(uniqueWords.includes(word)) continue
  // otherwise, add this word to the list
  uniqueWords.push(word)
}
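If you'd rather not write the loop by hand, a Set does the same de-duplication, since it keeps only one copy of each value and remembers insertion order. A minimal sketch, with uniqueNonEmpty as a made-up name:

```typescript
// Hypothetical helper: returns each non-empty word once,
// in the order it first appears.
function uniqueNonEmpty(words: string[]): string[] {
  const nonEmpty = words.filter((word) => word !== '')
  return Array.from(new Set(nonEmpty))
}

const example = uniqueNonEmpty(['case', '', 'and', 'case', 'are', '', 'the', 'same'])
// example is ['case', 'and', 'are', 'the', 'same']
```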

Count the occurrences of each word

We'll convert the list of unique words into a dictionary/hash whose keys are the words and whose values are their counts.

let countedWords: Record<string, number> = {}
for(let word of uniqueWords) {
  let count = 0
  // loop through the list of raw words, counting occurrences of this word
  for(let rawWord of rawWords) {
    if(rawWord === word) count += 1
  }
  
  // now store this word+count pair in the dictionary
  countedWords[word] = count
}
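To get from the counts to the n most common words, one option is to turn the dictionary into [word, count] pairs, sort them by count in descending order, and keep the first n. The sketch below assumes a dictionary shaped like countedWords; the name topWords and the alphabetical tie-break are my own choices, since the question doesn't say how ties should be handled.

```typescript
// Hypothetical helper: returns the n most frequent [word, count] pairs.
// Assumption: ties are broken alphabetically so the result is deterministic.
function topWords(counts: Record<string, number>, n: number): Array<[string, number]> {
  return Object.keys(counts)
    .map((word): [string, number] => [word, counts[word]])
    .sort((a, b) => {
      if (b[1] !== a[1]) return b[1] - a[1] // higher count first
      return a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0 // then alphabetical
    })
    .slice(0, n)
}

// Made-up counts, just to show the shape of the result
const sampleCounts: Record<string, number> = { world: 3, hello: 2, text: 3, taco: 1 }
const top2 = topWords(sampleCounts, 2)
// top2 is [['text', 3], ['world', 3]]
```

With the variables from the steps above, the call would be topWords(countedWords, n).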

very clear explanation! how would you filter countedWords to remove all values under a certain frequency? — xXx_emo_girl_xXx, Oct 14 '22 at 20:27
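One way to do the filtering asked about here is to build a new dictionary containing only the entries whose count meets a threshold. This is a sketch under my own assumptions: filterByMinCount and minCount are illustrative names, and "under a certain frequency" is read as "count strictly below minCount".

```typescript
// Hypothetical helper: keeps only words whose count is at least minCount.
// The input dictionary is left unchanged.
function filterByMinCount(
  counts: Record<string, number>,
  minCount: number
): Record<string, number> {
  const result: Record<string, number> = {}
  Object.keys(counts).forEach((word) => {
    if (counts[word] >= minCount) result[word] = counts[word]
  })
  return result
}

const frequent = filterByMinCount({ world: 3, hello: 2, taco: 1 }, 2)
// frequent is { world: 3, hello: 2 }
```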
