1

For example if I have 1000 rows of data that has customer ID (e.g. 123) and their comments on our product (e.g. great product easy use)

How do I use Teradata (version 15) to do a word frequency count so that the output has two columns one with the word and the other with frequency e.g. (Great: 20, Product: 10)?

Thank you

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
user2525645
  • 73
  • 3
  • 8

1 Answers1

3

You could use strtok_split_to_table to pull this off.

Something like the following:

SELECT d.token, SUM(d.outkey)
FROM TABLE (strtok_split_to_table(1, <yourtable>.<yourcommentsfield>, ' ')
        RETURNS (outkey integer, tokennum integer, token varchar(20)character set unicode) ) as d 
GROUP BY 1

This will split each word in your comments field into individual records, then it counts the occurrence of each word. Just stick your own <yourtable>.<yourcommentsfield> in there and you should be good to go.

More information on strtok_split_to_table: http://www.info.teradata.com/HTMLPubs/DB_TTU_14_00/index.html#page/SQL_Reference/B035_1145_111A/String_Ops_Funcs.084.242.html

Here is the SQL and results for a test on my system:

CREATE SET TABLE db.testcloud ,NO FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO
     (
      customer VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
      comments VARCHAR(1000) CHARACTER SET LATIN NOT CASESPECIFIC)
PRIMARY INDEX ( customer );


INSERT INTO testcloud (1, 'This is a test comment');
INSERT INTO testcloud (2, 'This is also comment of something');

SELECT d.token, SUM(d.outkey)
FROM TABLE (TD_SYSFNLIB.strtok_split_to_table(1, testcloud.comments, ' -/')
        RETURNS (outkey integer, tokennum integer, token varchar(20)character set unicode) ) as d 
GROUP BY 1

--token Sum(outkey)
--is    2
--also  1
--This  2
--of    1
--test  1
--a 1
--comment   2
--something 1
JNevill
  • 46,980
  • 4
  • 38
  • 63
  • SELECT d.token, SUM(d.outkey) FROM TABLE (strtok_split_to_table(1,., ' ') RETURNS (outkey integer, tokennum integer, token varchar(20)character set unicode) ) as d GROUP BY 1 Thanks so much, but I got a 3707 error saying expected something like '.' between the word strtok_split.. .and ( – user2525645 Apr 01 '15 at 16:34
  • Don't actually put your Table and Field name in those `<` and `>` symbols. I was just marking them with those. Also, in case you run into more problems use `TD_SYSFNLIB.strtok_split_to_table` as the function name if it still balks. Lastly, I've added the bits from my own system I used to quickly test this. – JNevill Apr 01 '15 at 16:50