
I have a table of ~4.7M documents stored in BigQuery. Some are plaintext, some HTML. They're around 2k tokens each, with wide variation. I'm mainly using DataPrep to do my processing.

I want to extract those tokens and calculate TF-IDF values.

Token counting

One of the more time-intensive steps is taking this:

id, document
1, "foo bar foo baz"
2, "foo bar bar qux"

And turning it into this:

id, word, count
1, foo, 2
1, bar, 1
1, baz, 1
2, foo, 1
2, bar, 2
2, qux, 1

One way to do it is this:

  1. extractlist on document by {alphanum-underscore}+

     id, wordlist
     1, ["foo", "bar", "foo", "baz"]
     2, ["foo", "bar", "bar", "qux"]

  2. flatten wordlist

     id, word
     1, foo
     1, bar
     1, foo
     1, baz
     2, foo
     2, bar
     2, bar
     2, qux

  3. aggregate by group: id, word, value: count()

     id, word, count
     1, foo, 2
     1, bar, 1
     1, baz, 1
     2, foo, 1
     2, bar, 2
     2, qux, 1

However, steps 2 & 3 are very slow, especially with large documents.

Ideally, I'd be able to have a function that converts ["foo", "bar", "foo", "baz"] into {"foo":2, "bar":1, "baz":1}. That wouldn't require the flatten-then-group operation to extract the count, and the subsequent flatten would be smaller (since it's operating on unique terms rather than each term).

I've not figured out any way to do that in DataPrep, however. :-/

What's a more efficient way to do this?

HTML to plaintext

My source data is a combination of plaintext and HTML. Only about 800k of the 3.7M documents have plaintext available.

I'd like to convert the HTML to plaintext in some reasonable way (e.g. the equivalent of Nokogiri #content) that would work at this scale, so that I can then do token extraction on the result.

I could spin up a cluster that runs a bq query, ingests the HTML, processes it with Nokogiri, and writes the result out to a processed table. But that's kinda complicated and requires a lot of I/O.

Is there an easier / more efficient way to do this?

Sai

1 Answer


I think you can do it all within BigQuery.
The query below should give you a good start.
It gives you word frequencies per document and across the whole corpus.
HTML is stripped out, as are words that are just digits.
You can then add any extra processing here, including TF-IDF.

#standardSQL
WITH removed_html AS (
  -- replace HTML tags with spaces so only the text content remains
  SELECT id, REGEXP_REPLACE(document, r'<[^>]*>', ' ') AS document
  FROM `yourTable`
),
words_in_documents AS (
  -- per-document word counts, kept as an array of (word, cnt) structs
  SELECT id, 
    ARRAY(
      SELECT AS STRUCT word, COUNT(1) AS cnt 
      FROM UNNEST(REGEXP_EXTRACT_ALL(document, r'[\w_]+')) AS word 
      GROUP BY word
      HAVING NOT REGEXP_CONTAINS(word, r'^\d+$')  -- drop digit-only tokens
    ) AS words
  FROM removed_html
),
words_in_corpus AS (
  -- corpus-wide word counts, summed across all documents
  SELECT word, SUM(cnt) AS cnt
  FROM words_in_documents, UNNEST(words) AS words
  GROUP BY word
)
SELECT * 
FROM words_in_corpus
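
If you also want the per-document id, word, count rows from your question (rather than just corpus totals), you could point the final SELECT at words_in_documents instead, e.g.:

SELECT id, word, cnt
FROM words_in_documents, UNNEST(words) AS words
ORDER BY id, word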

You can test / play with this using the dummy data from your question:

#standardSQL
WITH `yourTable` AS (
  SELECT 1 AS id, "foo bar, foo baz" AS document UNION ALL
  SELECT 2, "foo bar bar qux" UNION ALL
  SELECT 3, '''
<h5 id="last_value">LAST_VALUE</h5>
<pre class="codehilite"><code>LAST_VALUE (value_expression [{RESPECT | IGNORE} NULLS])</code></pre>
  '''
),
removed_html AS (
  SELECT id, REGEXP_REPLACE(document, r'<[^>]*>', ' ') AS document
  FROM `yourTable`
),
words_in_documents AS (
  SELECT id, 
    ARRAY(
      SELECT AS STRUCT word, COUNT(1) AS cnt 
      FROM UNNEST(REGEXP_EXTRACT_ALL(document, r'[\w_]+')) AS word 
      GROUP BY word
      HAVING NOT REGEXP_CONTAINS(word, r'^\d+$')
    ) AS words
  FROM removed_html
),
words_in_corpus AS (
  SELECT word, SUM(cnt) AS cnt
  FROM words_in_documents, UNNEST(words) AS words
  GROUP BY word
)
SELECT * 
FROM words_in_corpus
ORDER BY cnt DESC
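
Once the counts are in place, a TF-IDF step can be built on the same CTEs. Below is just a minimal sketch assuming one common weighting (term count normalized by document length for TF, natural-log inverse document frequency for IDF); the doc_word, doc_count, and idf CTE names are purely illustrative:

#standardSQL
WITH
-- removed_html and words_in_documents exactly as in the query above, then:
doc_word AS (
  SELECT id, word, cnt,
    SUM(cnt) OVER(PARTITION BY id) AS doc_total  -- total tokens in this document
  FROM words_in_documents, UNNEST(words) AS words
),
doc_count AS (
  SELECT COUNT(*) AS total_docs FROM words_in_documents  -- number of documents
),
idf AS (
  SELECT word, LN(total_docs / COUNT(DISTINCT id)) AS idf
  FROM doc_word CROSS JOIN doc_count
  GROUP BY word, total_docs
)
SELECT d.id, d.word, (d.cnt / d.doc_total) * i.idf AS tf_idf
FROM doc_word AS d
JOIN idf AS i ON d.word = i.word
ORDER BY id, tf_idf DESC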
Mikhail Berlyant