3

given these two tables (documents and labels) how can I find match counts for all the label's patterns from the labels table, found in the document field of documents table (count exact matches, using regex is optional)

WITH documents  AS (
  SELECT 1 AS id, "foo bar, foo baz" AS document UNION ALL
  SELECT 2, "foo bar bar qux" UNION ALL
  SELECT 3, "etc blah blah"
),

labels as (
  select 'FOO_LABEL' as label, 'foo' as pattern UNION ALL
  select 'FOO_LABEL', 'qux' UNION ALL
  select 'BAR_LABEL', 'bar' UNION ALL
  select 'ETC_LABEL', 'etc'
)

the expected matches counts by document:

id, label, cnt
1,  FOO_LABEL, 2
1,  BAR_LABEL, 1
2,  FOO_LABEL, 2
2,  BAR_LABEL, 2
3,  ETC_LABEL, 1

the difference from this question is that I need actual match counts

and unlike this question my patterns are coming from a separate table

there are ~100M documents, and ~1000 rows in labels table

alex
  • 1,757
  • 4
  • 21
  • 32

2 Answers2

1

Consider below approach

select id, label, count(*) cnt
from documents, unnest(regexp_extract_all(document, r'[\w]+')) pattern
join labels
using(pattern)
group by id, label       

if applied to sample data in your question - output is

enter image description here

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • just realized i actually needed substring match vs. word match, slightly modified your answer in https://stackoverflow.com/a/73522957/165130 , thanks again – alex Aug 28 '22 at 23:29
0

Mikhail Berlyant's answer above is perfect, I just realized I needed substring match instead of exact word match, so i slightly modified it by replacing using(pattern) with ON STRPOS(word, pattern)>0 in the JOIN:

WITH documents  AS (
  SELECT 1 AS id, "foo bar foobar qux" AS document UNION ALL
  SELECT 2, "foooooo barbar"
),

labels as (
  select 'FOO_LABEL' as label, 'foo' as pattern UNION ALL
  select 'FOO_LABEL', 'qux' UNION ALL
  select 'BAR_LABEL', 'bar' 
)

select id, label, count(*) cnt
from documents, unnest(regexp_extract_all(document, r'[\w]+')) word
join labels
ON STRPOS(word, pattern)>0 --faster that regexp_contains(word, pattern)
group by id, label 

enter image description here

Edit: another minor change to work with phrases/sentences instead of matching individual words (the doc is split into phrases delimited by commas, etc), so that it works for multi-word patterns. this misses some counts (i.e. repeated substring matches within phrase) but faster overall

WITH documents  AS (
  SELECT 1 AS id, "foo 123 test foo 123" AS document UNION ALL
  SELECT 2, "all bars, all bars all bars "
),

labels as (
  select 'FOO_LABEL' as label, 'foo 123' as pattern UNION ALL
  select 'BAR_LABEL', 'all bar' 
)

select id, label, count(*) cnt
from documents, UNNEST(REGEXP_EXTRACT_ALL(document , r'([^.!;(),\n~|]+)')) AS phrase
join labels
ON STRPOS(phrase, pattern)>0 
group by id, label 

enter image description here

alex
  • 1,757
  • 4
  • 21
  • 32