Extracting and counting unique word frequency from a range

Question

I have a column where each row is a sentence. For example:

COLUMN1

R1: -Do you think they'll come, sir?

R2: -Oh they'll come, they'll come all right.

R3: Here. Stamp those and mail them.

R4: It's ringing.

R5: Would you walk Myron the other way?

From this range, I want to extract a list of unique words (COLUMN2), and a count of how often they appeared in the range (COLUMN3).

The trick is to remove punctuation marks like commas, periods, etc..

So the desired result for the above would be:

COLUMN2    COLUMN3

Do          1

you         2

think       1

they'll     3

come        2

sir         1

Oh          1

all         1

right       1

Here        1

Stamp       1

those       1

and         1

mail        1

them        1

It's        1

ringing     1

Would       1

walk        1

Myron       1

the         1

other       1

way         1

I tried parsing each row with the SPLIT function, separating each word into their own cells, but I'm stuck removing the punctuation, and building the list of unique words (which I know will involve the UNIQUE function). The count I'm guessing will also involve the COUNTUNIQUE function.

Any guidance will be appreciated!

Thanks everyone! I selected @player0's solution as the most useful to me because I understood it best and like the built-in sorting. I'm now trying to omit common words like "the", "and", etc.. I tried adding "the" as one of the REGEXREPLACE parameters but that returned an error. So then I tried reapplying REGEXREPLACE , but I can't get it to work on more than one of these common words. What I have: =REGEXREPLACE((REGEXREPLACE( TEXTJOIN(" ", 1, F7:F2048), "\.|\,|\?|\!|\--|\:", ))," the "," ") This removes "the" but I need to it remove a bunch of such words. — Shian Han, Dec 28 '19 at 06:42
Changing most helpful answer to me to @JPV 's , as it added a way to exclude certain words from a list. Thanks! — Shian Han, Dec 28 '19 at 09:46
UPDATE: I created a Google Sheet we can use: https://docs.google.com/spreadsheets/d/1bkso50ENLAU5ARd5pZ5DkmH-BvMaoM1LAqFsFPlrByg/edit?usp=sharing — Shian Han, Dec 29 '19 at 05:02

JPV · Accepted Answer · 2019-12-29T11:23:15.053

3

You could try something like

=query(ArrayFormula(transpose(split(query(regexreplace(A1:A5, "[^A-Za-z\s/']" ,""),,50000)," "))), "Select Col1, Count(Col1) where Col1 <>'' group by Col1 label Count(Col1)''")

Change range to suit.

If you want to exclude a list of words (ex. in the range J1:J20) you can try

=ArrayFormula(query(transpose(split(query(regexreplace(A1:A5, "[^A-Za-z\s/']" ,""),,50000)," ")), "Select Col1, Count(Col1) where not UPPER(Col1) matches '\b"&textjoin("|", 1, UPPER(J1:J20))&"\b'  group by Col1 order by Count(Col1) desc label Count(Col1)''"))

Alternatively, you can also add the list of exclusions to the regex pattern...

=query(ArrayFormula(transpose(split(query(regexreplace(A1:A5, "[^A-Za-z\s/']|\b((?i)the|oh|or|and)\b" ,""),,50000)," "))), "Select Col1, Count(Col1) where Col1 <>''  group by Col1 order by Count(Col1) desc label Count(Col1)''")

UPDATED:

=ArrayFormula(substitute(query(transpose(split(query(regexreplace(substitute(C11:C, char(39), "_"), "[^A-Za-z\s_]" ,""),,50000)," ")), "Select Col1, Count(Col1) where not UPPER(Col1) matches '\b"&textjoin("|", 1, UPPER(substitute(G11:G,char(39),"_")))&"\b'  group by Col1 order by Count(Col1) desc label Count(Col1)''", 0), "_", char(39)))

or, using a different approach

=query(filter(regexreplace(transpose(split(query(regexreplace(C11:C, "[^A-Za-z\s'-]" ,""),,50000)," ")), "^-",), isna(match(upper(regexreplace(transpose(split(query(regexreplace(C11:C, "[^A-Za-z\s'-]" ,""),,50000)," ")), "^-",)), upper(filter(G11:G, len(G11:G))),0))), "Select Col1, count(Col1) group by Col1 order by count(Col1) desc label count(Col1)''", 0)

edited Dec 29 '19 at 11:23

answered Dec 27 '19 at 17:07

JPV

26,499
4
33
48

Finally I ended up using your original solution, because I realized there were too many strings to exclude :) Question: How do I exclude quotation marks (the " symbol" from the resulting list of words? Also, there appears to be a problem with words containing an apostrophe (the ' symbol), for example words like "it's" and "don't". They seem to cause an error. – Shian Han Dec 28 '19 at 13:53
See this Google Sheet to see what I mean: https://docs.google.com/spreadsheets/d/1bkso50ENLAU5ARd5pZ5DkmH-BvMaoM1LAqFsFPlrByg/edit?usp=sharing – Shian Han Dec 29 '19 at 05:01
1

Hi to work around the single quotes, try the updated formula. I could not see the problem with the double quotation mark with this formula. – JPV Dec 29 '19 at 08:23
thank you. That seems to get rid of the double-quotes too, but the problem is that it returns "theyll" in the list, instead of "they'll". Ideally I can exclude certain words which contain the apostrophe (single quote), but keep other words that also contain it. – Shian Han Dec 29 '19 at 09:02
wow, nicely done. I've updated the Google Sheet to capture your solution. – Shian Han Dec 29 '19 at 10:30
One more thing... how would we keep hyphenated words intact? For example in the Google Sheet, line 44 there is the string "dinner-and-a-murder", which becomes "dinnerandamurder" in the results. Ideally we can keep words like week-end, good-looking, etc.. intact in the results. – Shian Han Dec 29 '19 at 10:52
1

Please try the last (updated) formula (with filter) and see if that fixes the issue with hyphenated words. – JPV Dec 29 '19 at 11:24
Thank you! I made a couple of mods after some tinkering around: regexreplace(upper(C11:C), "[^A-Za-z\s'-]|\n|--" ," ") ; I was trying to make it so that 1) the same words are treated as one word regardless of capitalization, 2) "--" are removed, and 3) line breaks in the source text are replaced by spaces . Not sure if there's a better way to do it but it seems to work. – Shian Han Dec 29 '19 at 18:21

score 1 · Answer 2 · edited Jun 20 '20 at 09:12

1

try:

=ARRAYFORMULA(QUERY(TRANSPOSE(SPLIT(REGEXREPLACE(
 TEXTJOIN(" ", 1, LOWER(A:A)), "\.|\,|\?", ), " ")), 
 "select Col1,count(Col1) 
  group by Col1 
  order by count(Col1) desc 
  label count(Col1)''", 0))

or:

=ARRAYFORMULA(QUERY(TRANSPOSE(SPLIT(REGEXREPLACE(
 QUERY(LOWER(A:A),,999^99), "[^a-z0-9а-я ]", ), " ")), 
 "select Col1,count(Col1) 
  group by Col1 
  order by count(Col1) desc 
  label count(Col1)''", 0))

UPDATE:

=ARRAYFORMULA(QUERY(TRANSPOSE(SPLIT(REGEXREPLACE(
 QUERY(LOWER(A:A),,999^99), "[^a-z0-9 ]", ), " ")), 
 "select Col1,count(Col1) 
  where not Col1 matches 'the|and|i|you|its'
  group by Col1 
  order by count(Col1) desc 
  label count(Col1)''", 0))

edited Jun 20 '20 at 09:12

Community

1
1

answered Dec 27 '19 at 18:02

player0

124,011
12
67
124

Thanks! Appreciate the sorting built-in. I never would have figured out the Query function parameters. Question: I'm now trying to ignore common words like "the","you","to","a","I","in","of","is","it","and","that", etc.. I tried re-applying REGEXREPLACE: =REGEXREPLACE((REGEXREPLACE( TEXTJOIN(" ", 1, F7:F2048), "\.|\,|\?|\!|\--|\:", ))," the "," ") but I can't figure out how to omit more than one of these common words... – Shian Han Dec 28 '19 at 06:12

user11982798 · Answer 3 · 2019-12-28T00:51:39.710

You can use Mid, RegexReplace, Query, Split, etc, Like this:

= query 
  ( 
    transpose 
    ( 
      split 
      ( 
        regexreplace ( textjoin ( " ", true,filter(mid(A11:A,4, len(A11:A)),A11:A<>"") ) , "[>,.?/!-]"," " ) ," ",true,true 
      ) 
    ) 
    ,"Select Col1, Count(Col1) group by Col1 label Col1 'Column2', Count(Col1) 'Column3' " 
  )

or if without prefix R1: ~ R5, use like this:

= query 
  ( 
    transpose 
    ( 
      split 
      ( 
        regexreplace ( textjoin ( " ", true,filter(A11:A,A11:A<>"")) , "[>,.?/!-]"," " ) ," ",true,true 
      ) 
    ) 
    , "Select Col1, Count(Col1) group by Col1 label Col1 'Column2', Count(Col1) 'Column3' " 
 )

Extracting and counting unique word frequency from a range

3 Answers3

UPDATE:

Linked