
I'm trying to write a query and I'm not sure if it's possible. I have a table called sentencess which contains ID, Sentences, and verify, as shown in the picture below.

[screenshot of the sentencess table]

I have another table called word count which contains ID, words, and their frequency. Whenever a sentence is inserted, updated, or deleted, I want this table to be updated accordingly, or updated once a day, because there might be a lot of sentences.

My expected output is something like the picture below.

[screenshot of the expected word count table]

Any ideas? Is this doable? Can anyone help please?


5 Answers


If you are running MySQL 8.0, I would recommend a recursive common table expression for this. The idea is to iteratively walk each message, splitting it into words along the way. All that is then left to do is to aggregate.

with recursive cte as (
    -- anchor: split off the first word of each sentence; a trailing space is
    -- appended so the last word can be matched later
    select 
        substring(concat(sent, ' '), 1, locate(' ', sent)) word,
        substring(concat(sent, ' '), locate(' ', sent) + 1) sent
    from messages
    union all
    -- recursive part: keep peeling off the next word until no space remains
    select 
        substring(sent, 1, locate(' ', sent)) word,
        substring(sent, locate(' ', sent) + 1) sent
    from cte
    where locate(' ', sent) > 0
)
select row_number() over(order by count(*) desc, word) wid, word, count(*) freq
from cte 
group by word
order by wid

In earlier versions, you could emulate the same behavior with a numbers table.
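For illustration only (my sketch, not part of the original answer), that emulation could look like the following, assuming a helper table numbers(n) pre-filled with consecutive integers 1, 2, 3, ... at least as large as the word count of the longest sentence, and words separated by single spaces:

select substring_index(substring_index(m.sent, ' ', n.n), ' ', -1) word,
       count(*) freq
from messages m
join numbers n
  on n.n <= char_length(m.sent) - char_length(replace(m.sent, ' ', '')) + 1  -- word count of the sentence
group by word
order by freq desc, word;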

Demo on DB Fiddle

Sample data:

sent                       | verif
:------------------------- | ----:
hello my name is alex      |  null
hey alin and alex I'm tom  |  null
hello alex my name is alin |  null

Results:

wid | word   | freq
--: | :----- | ---:
  1 | alex   |    3
  2 | alin   |    2
  3 | hello  |    2
  4 | is     |    2
  5 | my     |    2
  6 | name   |    2
  7 | and    |    1
  8 | hey    |    1
  9 | I'm    |    1
 10 | tom    |    1

When it comes to maintaining the results of the query in a separate table, it is probably more complicated than you think: you need to be able to insert, delete or update the target table depending on the changes in the original table, which cannot be done in a single statement in MySQL. Also, keeping a flag up to date in the original table creates a race condition, where changes might occur while you are updating the target table.

A simpler option would be to put the query in a view, so you get an always-up-to-date perspective on your data. For this, you can just wrap the above query in a create view statement, like:

create view words_view as < above query >;

If performance becomes a problem, then you could also truncate and refill the words table periodically.

truncate table words;
insert into words < above query >;
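Since you mentioned refreshing once a day, one possible way to schedule that truncate-and-refill is a MySQL event. This is a sketch only: it assumes the target table is called words, the event name refresh_words is made up, and the event scheduler must be enabled (SET GLOBAL event_scheduler = ON) for it to run.

DELIMITER //
create event refresh_words
on schedule every 1 day
do
begin
    -- rebuild the word counts from scratch once a day
    truncate table words;
    insert into words < above query >;
end//
DELIMITER ;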
GMB
  • I don't want an up-to-date target table, but I want it to be updated at the end of each day based on the source table. – programming freak May 10 '20 at 06:17
  • And I'm using Linux Ubuntu 18.04 with XAMPP 7.4.5 configured in it, so I hope your idea works for my case – programming freak May 10 '20 at 06:19
  • @programmingfreak: it is about your MySQL version, not the version of your OS or XAMPP. You can run `select version()` to see which MySQL version you are on. – GMB May 10 '20 at 14:04
  • OK, that worked. But can I do something: print the WID of the least frequent word in the sentence in the main table, next to verification? So add another field which will print the word ID of the least frequent word used inside that sentence? I think it's a bit tricky @GMB – programming freak May 10 '20 at 16:22
  • @programmingfreak: this looks like a different question than the one you originally asked. I would suggest asking a [new question](https://stackoverflow.com/questions/ask) rather than adding to the existing (and bountied) question. – GMB May 10 '20 at 22:12
  • There might be extra spaces at the beginning or at the end of the words, and your code does not trim them. Is there a way to trim the extra spaces? – programming freak May 13 '20 at 10:44
  • @programmingfreak: to remove extra spaces, you can replace `word` with `trim(word)` in the `select` and `group by` clauses of the outer query. – GMB May 14 '20 at 00:06
  • I found a bug in your code that I can't fix. Can you please help me with it so I can award the bounty to you? If the field of the sentence has one word in it, it will count something as well: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=85f45f8cc180351b1a6d7f0c183ab385 check here – programming freak May 14 '20 at 07:08
  • How can I solve or work around this issue? Because this will be a problem for me – programming freak May 14 '20 at 07:08
  • @programmingfreak: just `UPDATE messages set sent = TRIM(REPLACE(sent, '\n', ''))`. Here is [your edited fiddle](https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=1173a01081cd79d90f3ba16c0b6a0a9e). – GMB May 14 '20 at 23:56
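Following GMB's trimming suggestion in the comments, a sketch of the adjusted outer query (the recursive CTE above stays unchanged; only the final select is swapped):

select row_number() over(order by count(*) desc, trim(word)) wid,
       trim(word) word,
       count(*) freq
from cte
group by trim(word)
order by wid;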

Perl and PHP and others have much more robust regexp engines for splitting. I would use one of them, not SQL.

I would use batch inserts, using

INSERT INTO words (word, ct)
    VALUES ('this', 1), ('that', 1), ...   -- about 100 words at a time
    ON DUPLICATE KEY UPDATE ct = ct + VALUES(ct);   -- add to the existing count rather than overwriting it

CREATE TABLE words (
    word VARCHAR(66) NOT NULL,
    ct MEDIUMINT UNSIGNED NOT NULL,
    PRIMARY KEY(word)
) ENGINE=InnoDB;

I see no need for having words and counts in separate tables, nor any need for an AUTO_INCREMENT for a "word_id". The word is a perfectly good "natural PK". However, you should decide what to do about case folding and accent stripping.
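For example (my sketch, not part of the original answer), case and accent folding can be pushed onto the column's collation: in MySQL 8.0 the utf8mb4_0900_ai_ci collation compares case- and accent-insensitively, so 'Alex', 'alex' and 'aléx' all hit the same primary key and accumulate into one count.

CREATE TABLE words (
    -- case- and accent-insensitive collation, so spelling variants collapse into one row
    word VARCHAR(66) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
    ct MEDIUMINT UNSIGNED NOT NULL,
    PRIMARY KEY(word)
) ENGINE=InnoDB;

With a binary or case-sensitive collation you would instead apply LOWER() (and any accent stripping) to the words before the INSERT.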

As for splitting into words... double-quotes and some other characters are clearly word boundaries. But some characters are ambiguous:

' -- part of a contraction or a quote?
. -- abbreviation or end of a sentence?

Etc.
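If you do stay in SQL, one partial workaround (a sketch assuming MySQL 8.0's REGEXP_REPLACE, using the question's sentencess table and the sent column as in the other answers; it does not resolve the ambiguous cases above) is to normalize punctuation to spaces before splitting:

-- collapse runs of anything other than letters, digits and apostrophes into a single space
select trim(regexp_replace(sent, '[^[:alnum:]'']+', ' ')) as cleaned
from sentencess;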

Rick James
  • I agree. A non-SQL solution (including the use of regexes) would probably be simpler, more robust, and more efficient. – FoggyDay May 10 '20 at 05:39

Based on this DBA Stack Exchange post I could imagine something like the following.

Basic steps:

  1. Create a table to contain a list of words (word_index in my example)
  2. Create a table to contain the word counts (word_count in my example)
  3. Create a Stored Procedure to split the sentences into words based on SPACE (might have to be tweaked to also allow other whitespace like line wraps) and write it into the word_index table
  4. Calculate statistics and write them into word_count

Step by step in code:

Create word_index:

CREATE TABLE IF NOT EXISTS `word_index` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `word` varchar(150) DEFAULT NULL,
  PRIMARY KEY (`id`)
);

Create word_count:

CREATE TABLE IF NOT EXISTS `word_count` (
  `word` varchar(150) NOT NULL,
  `occurrences` int(11) DEFAULT NULL,
  PRIMARY KEY (`word`)
);

Create the SPLIT_STRING helper function and the transfer_cell procedure that transfers the split words into the target table:

DELIMITER //
CREATE FUNCTION `SPLIT_STRING`(val TEXT, delim VARCHAR(12), pos INT) RETURNS text CHARSET latin1
BEGIN
        DECLARE output TEXT;
        -- return the pos-th delimiter-separated token of val, or '' (mapped to NULL below) when pos is past the last word
        SET output = REPLACE(SUBSTRING(SUBSTRING_INDEX(val, delim, pos), CHAR_LENGTH(SUBSTRING_INDEX(val, delim, pos - 1)) + 1), delim, '');
        IF output = '' THEN
            SET output = null;
        END IF;
        RETURN output;
    END//
DELIMITER ;

-- Dumping structure for procedure test.TRANSFER_CELL
DELIMITER //
CREATE PROCEDURE `transfer_cell`()
BEGIN
        DECLARE i INTEGER;
        SET i = 1;
        REPEAT
            -- insert the i-th word of every sentence; sentences with fewer
            -- than i words return NULL and are skipped
            INSERT INTO word_index (word)
            SELECT SPLIT_STRING(sent, ' ', i)
            FROM sentences
            WHERE SPLIT_STRING(sent, ' ', i) IS NOT NULL;
            SET i = i + 1;
        UNTIL ROW_COUNT() = 0   -- stop once no sentence yielded a word at position i
        END REPEAT;
    END//
DELIMITER ;

That's the basic setup. Please note that I used table sentences rather than sentencess with double s.

Update statistics:

TRUNCATE TABLE word_index;
TRUNCATE TABLE word_count;

CALL transfer_cell();
INSERT INTO word_count
  SELECT word, COUNT(1) occurrences FROM word_index 
  GROUP BY word;

Result:

Here's a screenshot of the results as documented above:

Result image

SaschaM78
  • Sorry for asking this question, but I'm inside phpMyAdmin and went into SQL to execute the first 2 SQL codes, but the third one did not get executed, neither in phpMyAdmin nor in PHP. How can I test it? – programming freak May 07 '20 at 11:01
  • @programmingfreak do you get any errors when passing it into the query window of phpMyAdmin? I would recommend using a database IDE (HeidiSQL is free). Depending on the user that is being used to execute DDL statements, creating functions might be prohibited. – SaschaM78 May 07 '20 at 11:16
  • Yes, my system is Ubuntu and I'm using XAMPP localhost to access phpMyAdmin, and it gives me an error when passing the query – programming freak May 07 '20 at 11:19
  • HeidiSQL is not available for ubuntu 18.04 :( – programming freak May 07 '20 at 11:21
  • @programmingfreak could you share the error message you got? – SaschaM78 May 07 '20 at 11:31
  • I downloaded something called Vstudio and I'm getting this error: 14:31:45 Kernel error: Error( 1064 ) 42000: "You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'FUNCTION `SPLIT_STRING`(val TEXT, delim VARCHAR(12), pos INT) RETURNS text CHARS' at line 1" – programming freak May 07 '20 at 11:32
  • @programmingfreak from the error you receive I would say the problem lies in the part "DELIMITER // CREATE". Could you try creating the function and procedure without the "DELIMITER" in it? That would mean you execute the part between `CREATE ...` and `END`. – SaschaM78 May 07 '20 at 13:02
  • For stored routines in phpmyadmin see: [how-to-write-a-stored-procedure-in-phpmyadmin](https://stackoverflow.com/questions/17481890/how-to-write-a-stored-procedure-in-phpmyadmin) – Paul Spiegel May 12 '20 at 09:48

For recent MySQL versions (8.0.4 and later) you may use

SELECT ROW_NUMBER() OVER (ORDER BY COUNT(word) DESC, word) wid, word, COUNT(word) freq 
FROM sentencess 
-- build a JSON array of words from each sentence ('["hello","my",...]') and unnest it with JSON_TABLE
CROSS JOIN JSON_TABLE( CONCAT('["', REPLACE(sentencess.sent, ' ', '","'), '"]'),
                       "$[*]" COLUMNS( word VARCHAR(254) PATH "$" )
                     ) AS jsontable
GROUP BY word
ORDER BY freq DESC, word;

fiddle

PS. I cannot reproduce the output ordering because I cannot understand the ordering criteria within the freq groups.
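One caveat from my side (not part of the original answer): if a sentence contains leading, trailing or repeated spaces, the REPLACE trick produces empty array entries that get counted as a word. A sketch that filters them out:

SELECT ROW_NUMBER() OVER (ORDER BY COUNT(word) DESC, word) wid, word, COUNT(word) freq 
FROM sentencess 
CROSS JOIN JSON_TABLE( CONCAT('["', REPLACE(TRIM(sentencess.sent), ' ', '","'), '"]'),
                       "$[*]" COLUMNS( word VARCHAR(254) PATH "$" )
                     ) AS jsontable
WHERE word <> ''   -- drop the empty entries produced by extra spaces
GROUP BY word
ORDER BY freq DESC, word;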

Akina

Caveat: This is T-SQL, not MySQL.

-- 1. Create a function that splits the sentence into words and returns a words table
-- 2. Insert all of the split words into your result table
-- 3. Calculate the Frequency

----------[ The Split Function ]
CREATE FUNCTION dbo.udf_SplitString 
        (
                @Sentence   varchar(max)
            ,   @Separator  char(1) 
        )
        RETURNS @WordList TABLE (Word varchar(50)) 
    AS
        BEGIN
            SET @Separator  =   ISNULL(@Separator, ' ') 

            DECLARE @Word   varchar(50)

            SET @Sentence = LTRIM(@Sentence) + @Separator -- Make sure last word has a separator after. Also get rid of leading spaces.

            WHILE   (CHARINDEX(@Separator, @Sentence) > 0)
                BEGIN
                    SET @Word = SUBSTRING(@Sentence, 1, CHARINDEX(@Separator, @Sentence) - 1)
                    INSERT INTO @WordList   SELECT LTRIM(@Word)
                    -- Remove word added to the List from the sentence.
                    SET @Sentence =  SUBSTRING(@Sentence,   CHARINDEX(@Separator, @Sentence) + 1,   LEN(@Sentence))
                    SET @Sentence =  LTRIM(@Sentence)           
            END                 
            RETURN
        END 

----------[ The Script ]
DECLARE     @SentenceList   TABLE   (Sentence varchar(max))
INSERT INTO @SentenceList   VALUES
            ('hello my name is alex')
        ,   ('hey alin and alex I''m tom')  
        ,   ('hello alex my name is alin')

DECLARE     @WordList   TABLE   (Word varchar(50))

INSERT INTO @WordList   
SELECT  
        W.Word
FROM        @SentenceList   S
CROSS APPLY (
                SELECT Word FROM dbo.udf_SplitString(S.Sentence, ' ')
            ) W 

SELECT 
        ID  =   ROW_NUMBER() OVER(ORDER BY SUM(1) DESC, Word)
    ,   Word
    ,   Frequency   =   SUM(1)
FROM @WordList
GROUP BY Word
Andy3B