1

I had a database of paragraph document. I want to split each sentence in the paragraph on table "master_data" and store it into different table "splittext".

master_data table :

id | Title | Paragraph

splittext table

id_sen | sentences | doc_id 

I tried using this query to select every sentences in Paragraph.master_data

SELECT Paragraph FROM pyproject.master_data  where REGEXP_SUBSTR '[^\.\!\* 
[\.\!\?]'; 

But it yields bracket error. So i tried using brackets, and yield error Incorrect Parameter Count

SELECT Paragraph FROM pyproject.master_data  where REGEXP_SUBSTR '([^\.\!\* 
[\.\!\?])'; 

My expected result is that the paragraph got splitted into sentences and stored to new table. And return the original id of the paragraph and stored into doc_id.

As example :

master_data :

id | Title | Paragraph  |
 1 | asds..| I want. Some. Coconut and Banana !! |
 2 | wad...| Milkshake? some Nice milk.          |

splittext_table :

id| sentences | doc_id  |

 1|   I want   |    1    |
 2|   Some     |    1    |
           .
           .
           . 
 5| Some Nice milk |   2   |
FH337
  • 69
  • 7
  • 1
    That's a very naive way of splitting sentences. How are you planning on handling something like `That will be £15.25 Mr. Jones. Here is your new IP address 10.5.123.45` – JGNI Aug 28 '19 at 07:43
  • No problem sir, each sentence will be separated by delimiter. The sentences still will be processed on my next function. – FH337 Aug 28 '19 at 08:07
  • What version of MySql are you using? – Andrei Odegov Aug 28 '19 at 10:01

1 Answers1

1

For MySQL 8.0, you can use a recursive CTE, given its limitations.

with
  recursive r as (
      select
        1 id,
        cast(regexp_substr(
               Paragraph, '[^.!?]+(?:[.!?]+|$)'
             ) as char(256)) sentences,
        id doc_id, Title, Paragraph
      from master_data
    union all
      select id + 1,
        regexp_substr(
          Paragraph, '[^.!?]+(?:[.!?]+|$)',
          1, id + 1
        ),
        doc_id, Title, Paragraph
      from r
      where sentences is not null
  )
select id, sentences, doc_id, Title
from r
where sentences is not null or id = 1
order by doc_id, id;

Output:

| id |       sentences       | doc_id | Title  |
+----+-----------------------+--------+--------+
|  1 | I want.               |      1 | asds.. |
|  2 | Some.                 |      1 | asds.. |
|  3 | Coconut and Banana !! |      1 | asds.. |
|  1 | Milkshake?            |      2 | wad... |
|  2 | some Nice milk.       |      2 | wad... |
|  1 | bar                   |      3 | foo    |

Demo on DB Fiddle.

Andrei Odegov
  • 2,925
  • 2
  • 15
  • 21
  • Works like a charm sir,i want to add the Title column from master_data to splittext_table.i just add simply ,Title from the select after Paragraph but it yields null in my splittext table – FH337 Aug 28 '19 at 15:19
  • I think that if we simply add the `Title` column to the column list of the `r` recursive CTE, we will get the result you want. Look at my corrected answer. – Andrei Odegov Aug 28 '19 at 16:13