1

Because of speed I'm thinking about a bad word filter written totally in MySQL but on my search I only found the MySQL replace function.

REPLACE(string_column, 'search', 'replace')

But with this function I only can replace word by word. Is there a String function in MySQL which can check the whole string and replace and search multiple values from a table? (with php I'm totally clear how to do this easy task)

Is a MySQL loop a reasonable solution?

I'm happy about every hint.

Fritz
  • 831
  • 7
  • 23
  • I don't think there is a solution, at least not a reasonable one... I think it's better if you use php here – fthiella Sep 21 '15 at 22:45
  • I had the same thought this is why I asked this question. I want to learn something from solving this problem native in mysql ;) – Fritz Sep 21 '15 at 22:51
  • better than written in PHP, because if you have a huge filter list mysql would be faster and getting of some load from apache, however its primary for learing – Fritz Sep 21 '15 at 23:01
  • I was thinking about loops, do you have something other/faster in mind? – Fritz Sep 21 '15 at 23:09
  • it wouldn't be a loop let me show you something. So php has a form, naughty words get posted in a string, and a clean one is saved out, with the original one too? – Drew Sep 21 '15 at 23:11
  • thats right in an table(id,bad,good) to easily add new ones – Fritz Sep 21 '15 at 23:13
  • the post is only bad, in a table(id, good,bad) are the words to be replaced. I think it should be the same if the bad think comes from an select or should save as an insert. More sense it would make if its in the insert cause then you have to do it only once. I'm not sure if i understood your question right, but i hope – Fritz Sep 21 '15 at 23:19
  • alright, back in a bit, have to do an errand – Drew Sep 21 '15 at 23:21
  • you don't need to hurry ;) – Fritz Sep 21 '15 at 23:25

3 Answers3

3

I'm posting it as a new answer, since I'm using a different technique here. I'm thinking we can just use a MySQL function and a BEFORE INSERT trigger. The function to split a string is taken from this other answer.

CREATE FUNCTION strSplit(x VARCHAR(1000), delim VARCHAR(12), pos INTEGER) 
RETURNS VARCHAR(1000)
BEGIN
  DECLARE output VARCHAR(1000);
  SET output = REPLACE(SUBSTRING(SUBSTRING_INDEX(x, delim, pos)
                 , CHAR_LENGTH(SUBSTRING_INDEX(x, delim, pos - 1)) + 1)
                 , delim
                 , '');
  IF output = '' THEN SET output = null; END IF;
  RETURN output;
END

and the INSERT trigger would be like this:

CREATE TRIGGER change_words
BEFORE INSERT ON sentences
FOR EACH ROW
BEGIN
  DECLARE i INT;
  DECLARE s VARCHAR(1000);
  DECLARE r VARCHAR(1000);
  SET i = 1;
  SET s = '';
  REPEAT
    SET s = (
      SELECT
        REPLACE(split, COALESCE(bad, ''), good)
      FROM
        (SELECT strSplit(new.sentence, ' ', i) AS split) s
        LEFT JOIN words w ON s.split = w.bad
      LIMIT 1
      );
    SET r = CONCAT_WS(' ', r, s);
    SET i = i + 1;
    UNTIL s IS NULL
  END REPEAT;
  SET new.sentence = r;
END

this will be faster, since the sentence will be converted only once, when you insert it in the database. Still there are some improvements that we need, same as before:

LEFT JOIN words w ON s.split = w.bad

it won't match words containing separator , . ! ? and the replace function

REPLACE(split, COALESCE(bad, ''), good)

will be case sensitive. It can be fixed qute easily if you wish. Please see a fiddle here.

Community
  • 1
  • 1
fthiella
  • 48,073
  • 15
  • 90
  • 106
  • if you understand the trigger this is a lot better,shorter,faster answer so I accepted it. Real thanks for your effort. I hope also other people will find this question and can learn from it. – Fritz Sep 22 '15 at 22:31
1

I think it's better to use PHP here, unfortunately MySQL doesn't support substitutions using regular expressions.

I'm answering anyway since in your comment you said that you want to learn something from MySQL but I would not suggest to use this solution, unless you have no other choice :)

First we can split our sentence into rows. We need a numbers table that contains a sequence of numbers 1,2,3,..., etc.

CREATE TABLE numbers (n INT PRIMARY KEY);
INSERT INTO numbers VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);

SELECT
  s.id,
  n.n,
  SUBSTRING_INDEX(SUBSTRING_INDEX(s.sentence, ' ', n.n), ' ', -1) word
FROM
  numbers n INNER JOIN sentences s
  ON CHAR_LENGTH(s.sentence)
     -CHAR_LENGTH(REPLACE(s.sentence, ' ', ''))>=n.n-1
ORDER BY
  s.id, n.n

then we can join this query to the table that contains bad words that have to be substituted:

SELECT
  id,
  n,
  REPLACE(word, COALESCE(bad, ''), good) AS new_word
FROM (
  SELECT
    s.id,
    n.n,
    SUBSTRING_INDEX(SUBSTRING_INDEX(s.sentence, ' ', n.n), ' ', -1) word
  FROM
    numbers n INNER JOIN sentences s
    ON CHAR_LENGTH(s.sentence)
       -CHAR_LENGTH(REPLACE(s.sentence, ' ', ''))>=n.n-1
  ORDER BY
    s.id, n.n
  ) w LEFT JOIN words
  ON w.word = bad

note the LEFT JOIN and the COALESCE(..., '') - and finally with a GROUP BY and GROUP_CONCAT you can have your string back:

SELECT
  id,
  GROUP_CONCAT(new_word ORDER BY n SEPARATOR ' ') AS new_sentence
FROM (
  SELECT
    id,
    n,
    REPLACE(word, COALESCE(bad, ''), good) AS new_word
  FROM (
    SELECT
      s.id,
      n.n,
      SUBSTRING_INDEX(SUBSTRING_INDEX(s.sentence, ' ', n.n), ' ', -1) word
    FROM
      numbers n INNER JOIN sentences s
      ON CHAR_LENGTH(s.sentence)
         -CHAR_LENGTH(REPLACE(s.sentence, ' ', ''))>=n.n-1
    ORDER BY
      s.id, n.n
    ) w LEFT JOIN words
    ON w.word = bad
  ) s

Please see it working here. I would not suggest you to use this solution, it won't be very performant and it's more like a "hack" than a real solution, better to use PHP here, but I hope you can learn something new from this answer :)

Some easy improvements that can be made:

ON w.word = bad

this will match only words that are exactly the same (might be case insensitive, but it depends how's the table defined) and it won't support separators like , . ! ? etc.

And this:

REPLACE(word, COALESCE(bad, ''), good) AS new_word

will be case sensitive. Can be improved but I would suggest you to do those improvements in PHP :)

Community
  • 1
  • 1
fthiella
  • 48,073
  • 15
  • 90
  • 106
  • really good and descriptive answer, learned a lot about string handling, I will accept the better answer if/after drew posted his – Fritz Sep 21 '15 at 23:30
1

As this was about learning, and doing it in MYSQL only (and not about implementing a PHP better way to do it, and boy are there way better ways to do it), I present something to learn a few mysql techniques.

About the only remotely-interesting line to me is the set @sql1 line. Perhaps several large paragraphs could be written about it. For now, I just present it.

The Schema

-- drop table badGood;
create table badGood
(   -- maps bad words to good
    id int auto_increment primary key,
    bad varchar(100) not null,
    good varchar(100) not null,
    dtAdded datetime not null
);
-- truncate table badGood;
insert badGood(bad,good,dtAdded) values ('god','gosh',now()),('rumpus','rear section',now());

-- drop table posts;
create table posts
(   postId int auto_increment primary key,
    orig varchar(1000) not null,
    cleanified varchar(1000) not null,
    dtAdded datetime not null, -- when it was inserted into system, ready for cleaning
    dtCleaned datetime null,    -- when it was cleaned
    isViewable int not null -- or bool, whatever. 0=No (not ready yet), 1=Yes (clean)
);
-- truncate table posts;

-- drop table xxx;
create table xxx
(   -- this table will contain one row for every word passed to stored proc,
    -- ordered by word sequence left to right in sentence
    -- order by meaning column "id" (auto_inc). Note, there is no guarantee, in fact expect it not to happen,
    -- that for any given postId, that the id's will be consecutive, but they will be in order
    --
    -- Reason being, multiple concurrent access of posts coming index
    --
    -- Decided against making this a temp table inside stored proc, but it was considered
    id int auto_increment primary key,
    postId int not null,    -- avoid FK for now due to speed
    word varchar(50) not null,  -- word as presented by poster guy
    word2 varchar(50) null, -- a more rated-G version of the word that is substituted
    isDirty int not null,   -- or bool, whatever. 0=clean, 1=naughty
    key(postId)
);
-- truncate table xxx;

The Stored Procedure

DROP PROCEDURE IF EXISTS cleanAndInsert;
delimiter $$
CREATE PROCEDURE cleanAndInsert
(   suspectTxt varchar(255) # this text is suspect. Might contain non G-rated words
    # other parameters too probably
)
BEGIN
    declare insertedId int; -- this will house the PK value of the postId

    insert posts(orig,cleanified,dtAdded,dtCleaned,isViewable) values (suspectTxt,'',now(),null,0); -- insert the passed string
    set @insertedId:=LAST_INSERT_ID();  # now we have the PK id just inserted
    -- the concat routine below is VERY FRAGILE to write, so as the sql string is slowly tweaked into perfection, with one working at that moment
    -- I rem it out and create a new version under it, so the slightest error does not set me back 10 minutes (at least)
    -- SET @sql1 = CONCAT("INSERT INTO xxx (word) VALUES ('",REPLACE((SELECT GROUP_CONCAT(orig) AS colx FROM posts where id=1), " ", "',null,0),('"),"');");
    -- SET @sql1 = CONCAT("INSERT INTO xxx (postId,word) VALUES (",@insertedId,",'",REPLACE((SELECT GROUP_CONCAT(orig) AS colx posts where postId=@insertedId), " ", "',null,0),('"),"',null,0);");
    SET @sql1 = CONCAT("INSERT INTO xxx (postId,word,word2,isDirty) VALUES (",@insertedId,",'",REPLACE((SELECT GROUP_CONCAT(orig) as colx FROM posts where postId=@insertedId), " ", "',null,0),(¿^?fish╔&®,'"),"',null,0);");
    -- select @sql1;    -- debugging purposes, rem'd out

    -- Ideally @insertedId is inserted in the SET @sql1 line a few above, and NOT with the fish hard-coded bizareness, but it was too fragile
    -- and time consuming. So this is an ugly hack and nothing to be proud of. So fixing it is a "TO DO"
    set @sql2=replace(@sql1,'¿^?fish╔&®',@insertedId); -- This is the insert statement to run to blast out the words
    -- select @sql2; -- debugging purposes, rem'd out.

    PREPARE stmt FROM @sql2;    -- you now have a prepared stmt string to execute (which inserts words into table xxx)
    EXECUTE stmt;

    -- now the clean word section
    update xxx x
    join badGood bg
    on bg.bad=x.word
    set x.isDirty=1,x.word2=bg.good
    where postId=@insertedId;

    -- I know, this is lame, but it allows us to use word2 simply as the final word and simplify our UPDATE posts after this block
    update xxx
    set word2=word
    where postId=@insertedId and isDirty=0;

    -- now the update section, to save the much cleaner string out to the posts table
    update posts
    set cleanified=
    (  select group_concat(word2 ORDER BY id SEPARATOR ' ') as xyz
       from xxx where postId=@insertedId
    ), isViewable=1, dtCleaned=now()
    where postId=@insertedId;

    -- one could do a "delete from xxx where postId=@insertedId" if they wanted to. I kept it for debugging. Others delete the rows

    select @insertedId as id;   -- useful for calling routine, telling it the PK value
END
$$

Test (call the stored procedure)

in PHP, you would just call it with a normal query, starting with the $sql beginning with "call ..."

call cleanAndInsert('I type acceptable sentences'); -- returns 1 row, id is 1
call cleanAndInsert('Stan you can kiss my rumpus'); -- returns 1 row, id is 2
-- note this is very easy to trick, such as a naughty word not surrounded by whitespace, or broken out with spaces like "r u m p u s"

The Results

select * from posts order by postId desc;
+--------+-----------------------------+-----------------------------------+---------------------+---------------------+------------+
| postId | orig                        | cleanified                        | dtAdded             | dtCleaned           | isViewable |
+--------+-----------------------------+-----------------------------------+---------------------+---------------------+------------+
|      2 | Stan you can kiss my rumpus | Stan you can kiss my rear section | 2015-09-22 11:08:29 | 2015-09-22 11:08:29 |          1 |
|      1 | I type acceptable sentences | I type acceptable sentences       | 2015-09-22 11:08:23 | 2015-09-22 11:08:23 |          1 |
+--------+-----------------------------+-----------------------------------+---------------------+---------------------+------------+

Wrapping up

It was for learning. Take it for what it is.

Drew
  • 24,851
  • 10
  • 43
  • 78