Build a more intuitive chat filter in PHP

Question

Profanity API

I have built a basic profanity API that echoes a 1 if it identifies any, and a 0 if the message is okay. I run into some silly problems though.

For example, if the word hell is on my swear list it'll also identify words like hello as profanity.

Each word is in a txt file in this format

badword
badword
badword
lolanotherbadword
naughtyword

LeetSpeak

1 4l50 w4n7 70 1mpl3m3n7 50m3 50r7 0f l337 func710n, 50 7h47 1 d0n'7 h4v3 70 l157 3v3ry p0551bl3 v4r14710n 0f 7h3 w0rd. (I also want to implement some sort of leet function, so that I don't have to list every possible variation of the word.)
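
One way to avoid listing every variation is to normalize the message before it is compared with the word list. The sketch below is only an illustration: the function name `normalizeLeet` and the digit-to-letter mapping are assumptions, and real leet speak has many more substitutions.

```php
<?php
// Hypothetical helper: map common leet digits back to letters before
// the word list is consulted. The mapping is an assumption, not a standard.
function normalizeLeet(string $text): string
{
    $map = [
        '0' => 'o',
        '1' => 'i',
        '3' => 'e',
        '4' => 'a',
        '5' => 's',
        '7' => 't',
    ];
    // strtr() with an array replaces every key with its value in one pass.
    return strtr(strtolower($text), $map);
}

echo normalizeLeet('50m3 50r7 0f l337 func710n'), "\n";
// some sort of leet function
```

The normalized copy should probably be used only for matching, since the same mapping also rewrites genuine numbers in a message.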

Bypassing the Chat Filter

Whether you access the API from

api.domain.tld/chat/profanity.php?access_token=whatever&filter_string=whatever

or

api.domain.tld/chat/profanity/access_token/filter_string

the same problem occurs. If people put an & or ? before their message it allows them to bypass the filter (and echoes a 0). When checking the logs I've noticed that messages that begin with an & or ? are logged as blank messages, so I'm guessing it's just messing up a variable or something.
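
The blank log entries are consistent with how query strings are parsed: an unencoded `&` ends the `filter_string` value and starts a new parameter. The sketch below uses `parse_str`, which parses a string as if it were a URL query string; the token `abc` and the message `&hell` are made-up sample values.

```php
<?php
// What the server sees when the message "&hell" is sent unencoded:
parse_str('access_token=abc&filter_string=&hell', $raw);
var_dump($raw['filter_string']); // string(0) ""

// The same message after the client encodes it (%26 is the encoded "&"):
$query = 'access_token=abc&filter_string=' . urlencode('&hell');
parse_str($query, $encoded);
var_dump($encoded['filter_string']); // string(5) "&hell"
```

In the rewritten path form, a literal `?` ends the path and starts the query string, which would likely explain the blank messages in that case as well.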

Spacing

People think they are clever by saying h e l l or h e l l, etc. An intuitive chat filter would likely be able to identify this sort of thing.
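
A possible approach is to join runs of single letters before the word list is checked. This is a rough sketch with assumptions: the name `collapseSpacedLetters`, the set of separator characters, and the minimum run length of three letters are all illustrative choices.

```php
<?php
// Hypothetical helper: join runs of three or more single letters that are
// separated only by spaces, dots, underscores, asterisks or hyphens.
function collapseSpacedLetters(string $text): string
{
    $result = preg_replace_callback(
        '/\b(?:[a-z][\s._*-]+){2,}[a-z]\b/i',
        // Strip the separators from the matched run, keeping the letters.
        function (array $m): string {
            return preg_replace('/[^a-z]/i', '', $m[0]);
        },
        $text
    );
    // Fall back to the untouched text if the regex engine reports an error.
    return $result ?? $text;
}

echo collapseSpacedLetters('what the h e l l is this'), "\n";
// what the hell is this
```

Ordinary words are left alone because each letter in a run must stand on its own, but legitimate spelled-out text such as initials would be joined too.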

Data Storage and Retrieval

I've also been wondering whether a txt file is really a suitable storage and retrieval mechanism. Right now I've only got 400 words, but the list will keep growing and I expect it to get slow. What is better: an inline PHP array, a txt file, or a database?
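
Whichever storage is chosen, the way the list is searched per request may matter as much as where it lives. As a sketch under assumptions (the helper name `buildWordSet` and the sample lines are made up), the lines can be turned into an array keyed by word so that each lookup is a single `isset` call instead of a scan over the whole list.

```php
<?php
// Hypothetical helper: turn the lines of the word list into a lookup table
// whose keys are the lowercased words, so membership is a hash lookup.
function buildWordSet(array $lines): array
{
    $words = array_map('strtolower', array_map('trim', $lines));
    $words = array_filter($words, 'strlen'); // drop empty lines
    return array_fill_keys($words, true);
}

// In the real script the lines would come from file('curse-list.txt').
$set = buildWordSet(["badword\n", "NaughtyWord\n", "\n"]);

var_dump(isset($set['naughtyword'])); // bool(true)
var_dump(isset($set['hello']));       // bool(false)
```

This only helps when whole words are compared; substring checks like `stripos` still have to visit every entry.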

The Code

<?php
require('conn.php');

$date = gmdate('Y-m-d');
$time = gmdate('H:i:s'); // 24-hour clock so log times are unambiguous

$access_token  = $_GET["access_token"];
$filter_string = $_GET["filter_string"];

function wordsExist(&$string, $words)
{
    foreach ($words as &$word) {
        if (stripos($string, $word) !== false) {
            return true;
        }
    }
    return false;
}

if (isset($access_token)) {
    $sql  = "SELECT * FROM api WHERE access_token='" . $access_token . "'";
    $sql2 = "UPDATE api SET calls = calls + 1 WHERE access_token='" . $access_token . "'";
    $sql3 = "UPDATE api SET last_query = CURRENT_TIMESTAMP WHERE access_token='" . $access_token . "'";
    $sql4 = "UPDATE api SET profanity_api_calls = profanity_api_calls + 1 WHERE access_token='" . $access_token . "'";
    $sql5 = "UPDATE api SET last_profanity_query = CURRENT_TIMESTAMP WHERE access_token='" . $access_token . "'";

    $sql6 = "UPDATE api SET profanity_detected = profanity_detected + 1 WHERE access_token='" . $access_token . "'";

    $result  = mysqli_query($conn, $sql);
    $result2 = mysqli_query($conn, $sql2);
    $result3 = mysqli_query($conn, $sql3);
    $result4 = mysqli_query($conn, $sql4);
    $result5 = mysqli_query($conn, $sql5);
    if (mysqli_num_rows($result) >= 1) {
        if (wordsExist($filter_string, file('curse-list.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES))) {
            $result6 = mysqli_query($conn, $sql6);
            file_put_contents('logs/profanity/' . $date . '-log.txt', "1 [$time] $filter_string\n", FILE_APPEND);
            echo '1';
        } else {
            file_put_contents('logs/profanity/' . $date . '-log.txt', "0 [$time] $filter_string\n", FILE_APPEND);
            echo '0';
        }
    }
}

// Close the connection opened in conn.php.
mysqli_close($conn);
?>

My .htaccess

RewriteEngine On
RewriteRule ^profanity/(.*)/(.*)$ profanity.php?access_token=$1&filter_string=$2
RewriteRule ^advertising/(.*)/(.*)$ advertising.php?access_token=$1&filter_string=$2

Escaping User input

As is - how secure is my above code implementation? If it's vulnerable could I have specific examples as of how hackers could abuse it?
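
To make the risk concrete, the sketch below builds the lookup query the same way the script does and prints what a crafted token turns it into. No database is touched; `buildLookupQuery` is a hypothetical wrapper around the existing string concatenation, and the token value is an example.

```php
<?php
// The query string built the same way as $sql in the script above.
function buildLookupQuery(string $accessToken): string
{
    return "SELECT * FROM api WHERE access_token='" . $accessToken . "'";
}

// A crafted "token" that closes the quote and adds a condition that is always true.
echo buildLookupQuery("x' OR '1'='1"), "\n";
// SELECT * FROM api WHERE access_token='x' OR '1'='1'
```

That WHERE clause matches every row, so the `mysqli_num_rows($result) >= 1` check passes without a valid token, and the UPDATE statements built from the same value would touch every row in the table.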

You mention that `chat/profanity/access_token/filter_string` could be used to access the api. Is there a .htaccess file involved? — Steve E., Oct 20 '15 at 19:00
As for where to store the list, it doesn't really matter. IMO, I would put it in a database. I would probably also turn each "bad word" into a regex pattern to look for common substitutions for letters and test with that. You could also set up a pattern to check for spaces between characters. As for preventing `?` or `&` from bypassing the filter, in either URL style, URL-encode the string (`urlencode` in PHP) before submitting it to the filter URL. There should be some version of it in any language. – Jonathan Kuhn, Oct 20 '15 at 19:09
[Obscenity Filters: Bad Idea, or Incredibly Intercoursing Bad Idea?](http://blog.codinghorror.com/obscenity-filters-bad-idea-or-incredibly-intercoursing-bad-idea/) — ceejayoz, Oct 20 '15 at 19:13
So if I'm sending the message to the API I'd have to include the url_encode function in whatever is sending it (a form or something), but if people directly enter the ? or & into the URL I'm guessing it'll still show up. — Dalton Edwards, Oct 20 '15 at 19:14
@ceejayoz I hate profanity filters, but this is being implemented on gaming servers that little kids play on. — Dalton Edwards, Oct 20 '15 at 19:14
I'm going to give myself access to your API. Or better yet, I could destroy access for anyone else. `$access_token = "'; DROP TABLE api;--"`. Fortunately `mysqli::query` protects you from more than one query being run at once, however you should protect yourself from these kinds of vulnerabilities. — sjagr, Oct 20 '15 at 19:15
@DaltonEdwards Doesn't matter. People will find creative ways around it. — ceejayoz, Oct 20 '15 at 19:16
Yeah I get that @ceejayoz, but any implementation is better than none. — Dalton Edwards, Oct 20 '15 at 19:18
Thanks for your example @sjagr I patched it with help from Steve E. I've always wondered about MySQL injection, and wow that would be really bad. — Dalton Edwards, Oct 20 '15 at 19:28

score 3 · Accepted Answer · edited May 23 '17 at 11:52

Here are a few quick changes you could make to the code which would solve some but not all issues.

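1) Your code is vulnerable to SQL injection attacks, where an attacker crafts URLs whose contents become part of your SQL queries and perform all kinds of unintended operations on your database. Fix those ASAP by escaping the token once it has been read from $_GET (add this line after the existing assignment rather than replacing it):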

  $access_token = mysqli_real_escape_string($conn, $access_token);

2) Split your filter strings up into individual words; this will solve the hello issue. A client could use characters other than spaces between words, and preg_split lets you specify a set of characters to split on.

$filter_words = preg_split("/[\s,\-_]+/", $string);

3) Test out fuzzy matching by using the soundex of words rather than exact text. In PHP soundex is a 4 character representation of the pronunciation of the input string. Anticipate that any fuzzy matching could generate some false positives.

if(soundex($filter_word) == soundex($word)) ...
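
As a quick illustration of points 2 and 3, the snippet below shows what the split produces and what soundex returns for two similar words. The sample sentence is made up.

```php
<?php
// Split on whitespace, commas, hyphens and underscores, as in point 2.
$filter_words = preg_split("/[\s,\-_]+/", "hello there, you_all");
print_r($filter_words); // hello, there, you, all

// soundex() keeps the first letter plus up to three digit codes,
// so words that merely start alike can collide.
var_dump(soundex('hell'));  // string(4) "H400"
var_dump(soundex('hello')); // string(4) "H400"
```

Because both words share the code H400, a soundex comparison treats hello as a match for hell, which is one source of the false positives mentioned in point 3.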

Additional example of how to split a message into words on whitespace, commas, hyphens and underscores and compare them with a list of words:

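function wordsExist($filter_string, $words)
{
    // Break the message into words on whitespace, commas, hyphens and underscores.
    $filter_words = preg_split("/[\s,\-_]+/", $filter_string, -1, PREG_SPLIT_NO_EMPTY);

    foreach ($words as $word) {
        foreach ($filter_words as $filter_word) {
            if (
                // Exact match, ignoring case (the original stripos check was case-insensitive too).
                (strcasecmp($filter_word, $word) == 0) ||
                // Fuzzy: at most one character inserted, deleted or replaced.
                (levenshtein($filter_word, $word) < 2) ||
                // Fuzzy: same phonetic code.
                (soundex($filter_word) == soundex($word))
            ) {
                return true;
            }
        }
    }
    return false;
}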

I've added in soundex and levenshtein as different ways of comparing words. In the few quick tests I did, I got some false positives so it is up to you to decide whether to keep those lines or not.
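
If the false positives are a problem, a stricter variant can keep only the whole-word comparison. This is a sketch, not a drop-in replacement: `wordsExistExact` is a hypothetical name, and it will miss misspellings that the fuzzy checks would have caught.

```php
<?php
// Hypothetical stricter variant: whole-word, case-insensitive comparison only.
function wordsExistExact(string $filter_string, array $words): bool
{
    $filter_words = preg_split("/[\s,\-_]+/", strtolower($filter_string), -1, PREG_SPLIT_NO_EMPTY);
    // Key the list by word so each check is a single lookup.
    $lookup = array_fill_keys(array_map('strtolower', $words), true);

    foreach ($filter_words as $filter_word) {
        if (isset($lookup[$filter_word])) {
            return true;
        }
    }
    return false;
}

var_dump(levenshtein('hello', 'hell'));               // int(1), so "< 2" flags hello
var_dump(wordsExistExact('hello world', ['hell']));   // bool(false)
var_dump(wordsExistExact('what the HELL', ['hell'])); // bool(true)
```

Punctuation attached to a word, such as a trailing exclamation mark, is not split off by this pattern, so the character class may need extending for real chat messages.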

I also noticed you used the '&' operator to pass and iterate variables by reference. This is not the same as '&' in C, which takes the address of a variable so that a pointer can be passed. There is usually no performance benefit to using references here, since PHP postpones copying a variable until one of the copies is written to. There is a good SO question on it: In PHP (>= 5.0), is passing by reference faster?

answered Oct 20 '15 at 19:15

Steve E.


In regards to 2 if I split into individual words will I still be able to input entire sentences into the API? – Dalton Edwards Oct 20 '15 at 19:17
Yes, just do the split in your wordsExist function. This will split a sentence into words for checking and will not affect anything else. – Steve E. Oct 20 '15 at 19:21
Also whenever I try to replace `$access_token = $_GET["access_token"];` with `$access_token = mysqli_real_escape_string($conn, $access_token);` the page echoes nothing. Normally the page echoes nothing if no access_token is inputted, so this is somehow breaking the access_token function presumably. – Dalton Edwards Oct 20 '15 at 19:21
Replace it with `$access_token = mysqli_real_escape_string($conn, $_GET["access_token"]);` – Steve E. Oct 20 '15 at 19:24
Thank you so much for your help Steve. You gave a very detailed answer. – Dalton Edwards Oct 20 '15 at 19:25
Could you show me an example of how to implement $filter_words = preg_split("/[\s,\-_]+/", $string); ? – Dalton Edwards Oct 20 '15 at 21:32
I have updated my answer. It is far from perfect and may not scale well, I get the impression in your use case that something is better than nothing. – Steve E. Oct 21 '15 at 11:48
Why won't it scale well? What if hundreds will be querying this API every minute? – Dalton Edwards Oct 21 '15 at 17:33
I've also noticed that the preg_split still blocks words like hello. If it helps the words are each on their own line in a text file. – Dalton Edwards Oct 21 '15 at 17:35
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/93001/discussion-between-dalton-edwards-and-steve-e). – Dalton Edwards Oct 21 '15 at 17:46