How to count words frequency in Teradata

Question

For example if I have 1000 rows of data that has customer ID (e.g. 123) and their comments on our product (e.g. great product easy use)

How do I use Teradata (version 15) to do a word frequency count so that the output has two columns one with the word and the other with frequency e.g. (Great: 20, Product: 10)?

Thank you

The final output is a word cloud that has the size of the word based on frequency. — user2525645, Apr 01 '15 at 16:08

JNevill · Accepted Answer · 2015-04-02T12:59:36.730

You could use strtok_split_to_table to pull this off.

Something like the following:

SELECT d.token, SUM(d.outkey)
FROM TABLE (strtok_split_to_table(1, <yourtable>.<yourcommentsfield>, ' ')
        RETURNS (outkey integer, tokennum integer, token varchar(20)character set unicode) ) as d 
GROUP BY 1

This will split each word in your comments field into individual records, then it counts the occurrence of each word. Just stick your own <yourtable>.<yourcommentsfield> in there and you should be good to go.

More information on strtok_split_to_table: http://www.info.teradata.com/HTMLPubs/DB_TTU_14_00/index.html#page/SQL_Reference/B035_1145_111A/String_Ops_Funcs.084.242.html

Here is the SQL and results for a test on my system:

CREATE SET TABLE db.testcloud ,NO FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO
     (
      customer VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
      comments VARCHAR(1000) CHARACTER SET LATIN NOT CASESPECIFIC)
PRIMARY INDEX ( customer );


INSERT INTO testcloud (1, 'This is a test comment');
INSERT INTO testcloud (2, 'This is also comment of something');

SELECT d.token, SUM(d.outkey)
FROM TABLE (TD_SYSFNLIB.strtok_split_to_table(1, testcloud.comments, ' -/')
        RETURNS (outkey integer, tokennum integer, token varchar(20)character set unicode) ) as d 
GROUP BY 1

--token Sum(outkey)
--is    2
--also  1
--This  2
--of    1
--test  1
--a 1
--comment   2
--something 1

SELECT d.token, SUM(d.outkey) FROM TABLE (strtok_split_to_table(1,., ' ') RETURNS (outkey integer, tokennum integer, token varchar(20)character set unicode) ) as d GROUP BY 1 Thanks so much, but I got a 3707 error saying expected something like '.' between the word strtok_split.. .and ( — user2525645, Apr 01 '15 at 16:34
Don't actually put your Table and Field name in those `<` and `>` symbols. I was just marking them with those. Also, in case you run into more problems use `TD_SYSFNLIB.strtok_split_to_table` as the function name if it still balks. Lastly, I've added the bits from my own system I used to quickly test this. — JNevill, Apr 01 '15 at 16:50

How to count words frequency in Teradata

1 Answers1

Linked