1

Please see: Find every Unique 2 word pairings from a column of terms in Google Sheets

In the above question the asker was was looking for unique 2 word pairings and gave the example of having 3 words in a cell.

The formula given works if there is only 1 to 3 words in a cell, but as soon as you go over (4+ words) the formula returns a #value Error "in ARRAY_LITERAL, an Array Literal was missing values for one or more rows"

=arrayformula(query(flatten({query({if(A:A<>"",iferror(trim(split(regexreplace(regexreplace(regexreplace(lower(A:A),"[^A-Za-z\ \']+",""),"([\w\']+\ [\w\']+)","$1,"),"(.*\,)(.*)","$1"),",")),),)},"where Col1 is not null",0);query({if(A:A<>"",iferror(trim(split(regexreplace(regexreplace(regexreplace(lower(if(len(A:A)=len(regexreplace(A:A," ",""))+1,,A:A)),"[^A-Za-z\ \']+",""),"(\w*)(\ )([\w\']+\ [\w\']+)","$3,"),"(.*\,)(.*)","$1"),",")),),)},"where Col1 is not null",0)}),"select Col1,count(Col1) where Col1 is not null group by Col1 label Col1 'Word pairs',Count(Col1) 'Qty' ",0))

Is it possible to make the formula work with cells that include 25, 50, 100+ words and then be able to exclude common words from a column like done in Extracting and counting unique word frequency from a range

   =ArrayFormula(query(transpose(split(query(regexreplace(A1:A5, "[^A-Za-z\s/']" ,""),,50000)," ")), "Select Col1, Count(Col1) where not UPPER(Col1) matches '\b"&textjoin("|", 1, UPPER(J1:J20))&"\b'  group by Col1 order by Count(Col1) desc label Count(Col1)''"))

I'm trying to find unique word pairings in sentences and then count how many times they appear.

I've edited my question to try to make it more readable (sorry still don't completely now how to use stack-overflow formatting or respond to comments etc.)

TEXT column and EXCLUSIONS is the input

UNIQUE PAIRS and COUNT would be the Output

TEXT UNIQUE PAIRS COUNT EXCLUSIONS
I wake up with night sweats and struggle with staying asleep. night sweats 2 I wake
I wake up from night sweats and struggle with dry hair. staying asleep 1 wake up
dry hair 1 up with
with night
sweats and
and struggle
with staying
struggle with
up from
from night
with dry

Purpose is to be able to find common word pairings like

bed head, dry hair, night sweats, etc.

That you can't find just by looking at the frequency of single words

player0
  • 124,011
  • 12
  • 67
  • 124
bham
  • 13
  • 3
  • Kindly add input table and expected output table as **text table** (NOT as IMAGE) to the question. [Click here](https://webapps.stackexchange.com/a/161855/) to create a table easily. Adding such tables greatly increases your chances of getting a elegant answer, as **it is easier to copy/paste**. If you share spreadsheets, your question maybe closed, as questions here must be [self contained](https://meta.stackoverflow.com/a/260455). Your table should be a [mre].[Your email address can also be accessed by the public](https://meta.stackoverflow.com/questions/394304/), if you share Google files. – TheMaster Oct 14 '22 at 18:00
  • Sharing your research helps everyone. Tell us what you've tried and why it didn’t meet your needs. This demonstrates that you’ve taken the time to try to help yourself, it saves us from reiterating obvious answers, and most of all it helps you get a more specific and relevant answer! See [How much research effort is expected of stackoverflow users?](https://meta.stackoverflow.com/questions/261592) – TheMaster Oct 14 '22 at 18:06
  • `the formula breaks`. Define breaks – TheMaster Oct 14 '22 at 18:08

1 Answers1

0

try:

=INDEX(LAMBDA(a, LAMBDA(x, QUERY(QUERY(x&" "&{QUERY(x, 
 "offset 1", ); ""}, "where not Col1 contains '×××'", ), 
 "select Col1,count(Col1) group by Col1 order by count(Col1) desc label count(Col1)''"))
 (FLATTEN(SPLIT(QUERY(FLATTEN({REGEXREPLACE(LOWER(a), "[\.,\?!\(\)]", ), 
 IFERROR(a/0, "×××")}),,9^9), " "))))(A1:A2))

enter image description here


update:

=INDEX(LAMBDA(a, LAMBDA(x, QUERY(QUERY(x&" "&{QUERY(x, 
 "offset 1", ); ""}, "where not Col1 contains '×××'", ), 
 "select Col1,count(Col1) 
  where not Col1 matches '"&TEXTJOIN("|", 1, UNIQUE(LOWER(E2:E)))&"' 
  group by Col1 order by count(Col1) desc label count(Col1)''"))
 (FLATTEN(SPLIT(QUERY(FLATTEN({REGEXREPLACE(LOWER(a), "[\.,\?!\(\)]", ), 
 IFERROR(a/0, "×××")}),,9^9), " "))))(A1:A2))

enter image description here

player0
  • 124,011
  • 12
  • 67
  • 124