I have a table of ~4.7M documents stored in BigQuery. Some are plaintext, some HTML. They're around 2k tokens each, with wide variation. I'm mainly using DataPrep to do my processing.
I want to extract those tokens and calculate TF-IDF values.
Token counting
One of the more time-intensive steps is taking this:
id, document
1, "foo bar foo baz"
2, "foo bar bar qux"
And turning it into this:
id, word, count
1, foo, 2
1, bar, 1
1, baz, 1
2, foo, 1
2, bar, 2
2, qux, 1
One way to do it is this:
- extractlist on document by {alphanum-underscore}+

  id, wordlist
  1, ["foo", "bar", "foo", "baz"]
  2, ["foo", "bar", "bar", "qux"]

- flatten wordlist

  id, word
  1, foo
  1, bar
  1, foo
  1, baz
  2, foo
  2, bar
  2, bar
  2, qux

- aggregate by group: id, word, value: count()

  id, word, count
  1, foo, 2
  1, bar, 1
  1, baz, 1
  2, foo, 1
  2, bar, 2
  2, qux, 1
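(For reference, since the documents already live in BigQuery, that recipe is essentially equivalent to the following Standard SQL; the table and column names here are placeholders for my real schema.)

```sql
-- Tokenize each document, explode to one row per token, then count per (id, word).
-- This mirrors the extractlist -> flatten -> aggregate recipe above.
SELECT
  id,
  word,
  COUNT(*) AS count
FROM `myproject.mydataset.documents`,
     UNNEST(REGEXP_EXTRACT_ALL(document, r'[A-Za-z0-9_]+')) AS word
GROUP BY id, word;
```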
However, steps 2 & 3 are very slow, especially with large documents.
Ideally, I'd be able to have a function that converts ["foo", "bar", "foo", "baz"] into {"foo": 2, "bar": 1, "baz": 1}. That wouldn't require the flatten-then-group operation to extract the counts, and the subsequent flatten would be smaller, since it would operate on unique terms rather than every token occurrence.
I've not figured out any way to do that in DataPrep, however. :-/
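To make the idea concrete, here's roughly the shape of function I have in mind, sketched as a BigQuery Standard SQL UDF instead (the function name, table, and regex are placeholders; I'm not claiming DataPrep exposes anything like this):

```sql
-- Hypothetical helper: turn ["foo", "bar", "foo", "baz"] into
-- [("foo", 2), ("bar", 1), ("baz", 1)] without flattening every token
-- occurrence across the whole table.
CREATE TEMP FUNCTION word_counts(words ARRAY<STRING>)
RETURNS ARRAY<STRUCT<word STRING, cnt INT64>>
AS ((
  SELECT ARRAY_AGG(STRUCT(word, cnt))
  FROM (
    SELECT w AS word, COUNT(*) AS cnt
    FROM UNNEST(words) AS w
    GROUP BY w
  )
));

-- Usage: the final flatten only touches unique terms per document.
SELECT id, wc.word, wc.cnt AS count
FROM (
  SELECT
    id,
    word_counts(REGEXP_EXTRACT_ALL(document, r'[A-Za-z0-9_]+')) AS wcs
  FROM `myproject.mydataset.documents`
),
UNNEST(wcs) AS wc;
```

With something like that, the only full-size pass is the tokenization itself, and the flatten at the end only sees unique terms per document.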
What's a more efficient way to do this?
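(For context on where this feeds: once an id, word, count table exists, the TF-IDF arithmetic itself seems easy to express in SQL. A rough sketch, with a placeholder table name and a plain log(N/df) IDF:)

```sql
-- Rough TF-IDF over a per-document term count table (schema: id, word, count).
-- tf = count / total terms in the document; idf = ln(N / docs containing the word).
WITH
  doc_totals AS (
    SELECT id, SUM(count) AS doc_len
    FROM `myproject.mydataset.doc_term_counts`
    GROUP BY id
  ),
  doc_freq AS (
    SELECT word, COUNT(DISTINCT id) AS df
    FROM `myproject.mydataset.doc_term_counts`
    GROUP BY word
  ),
  corpus AS (
    SELECT COUNT(DISTINCT id) AS n
    FROM `myproject.mydataset.doc_term_counts`
  )
SELECT
  t.id,
  t.word,
  (t.count / d.doc_len) * LN(c.n / f.df) AS tf_idf
FROM `myproject.mydataset.doc_term_counts` AS t
JOIN doc_totals AS d USING (id)
JOIN doc_freq AS f USING (word)
CROSS JOIN corpus AS c;
```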
HTML to plaintext
My source data is a combination of plaintext and HTML. Only about 800k of the 3.7M documents have plaintext available.
I'd like to convert the HTML to plaintext in some reasonable way (e.g. the equivalent of Nokogiri's #content) that would work at this scale, so that I can then do token extraction on the result.
I could spin up a cluster that runs bq query, ingests the HTML, processes it with Nokogiri, and writes the result to a processed table. But that's kinda complicated and requires a lot of I/O.
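The crudest in-place alternative I can think of is regex tag-stripping inside BigQuery itself. A sketch is below (placeholder table name); it's nowhere near the fidelity of Nokogiri's #content, but it might be passable as a pre-tokenization step:

```sql
-- Very rough HTML -> text: drop script/style blocks, then all remaining tags,
-- then collapse character entities to spaces. Not a real HTML parser.
SELECT
  id,
  REGEXP_REPLACE(
    REGEXP_REPLACE(
      REGEXP_REPLACE(
        REGEXP_REPLACE(document, r'(?is)<script[^>]*>.*?</script>', ' '),
        r'(?is)<style[^>]*>.*?</style>', ' '),
      r'<[^>]+>', ' '),
    r'&[a-zA-Z]+;|&#[0-9]+;', ' ') AS text
FROM `myproject.mydataset.documents`;
```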
Is there an easier / more efficient way to do this?