
I have a site which allows users to create a 'unique URL' that they can pass along to colleagues, in the form of www.site.com/customurl.

I, of course, run a check to make sure the input is actually unique, but I also want to filter out things like large company names (copyrighted names, etc.) and curse words. To do this, my thought was to build a txt file with a list of every possible name/word that came to mind. The size of the test txt file we have is not a concern, but I am curious whether this is the best way to go about it. I do not think a DB call is as efficient as reading in the text file.

My code is:

$filename = 'badurls.txt';
$path = $_SERVER['DOCUMENT_ROOT'] . '/' . $filename;

$array = array(); // fall back to an empty list if the file cannot be read
$fp = fopen($path, 'r');
if ($fp) {
  $array = explode("\n", fread($fp, filesize($path)));
  fclose($fp);
}

if (in_array($url, $array)) {
  echo 'You used a bad word!';
} else {
  echo 'URL would be good';
}

NOTE

I am talking about possibly a list of the top 100-200 companies and maybe 100 curse words. I could be wrong, but I do not anticipate this list ever growing beyond 500 words total, let alone 1000.

NikiC
JM4
  • I would actually say that using DB is more efficient - especially if the file is getting bigger and bigger. – Vilius Sep 07 '11 at 17:35
  • I believe you should use a table in your database: it will make searching faster and make adding new forbidden names easier within the back-end of your script, even if it is only 500. Logs can also be created counting or recording which users are using which URLs; one table could serve many purposes... Don't be lazy looking for a quick solution. Also, what about words that use different charsets and caps? – Lawrence Cherone Sep 07 '11 at 17:43
  • I removed the last paragraph as it was off topic and offensive. – NikiC Sep 07 '11 at 17:46
  • @NikiC - I sincerely appreciate the help of certain SO users and apologize if offensive to all but it is garbage (IMO) when users get on, downvote something (likely because their given answer was downvoted), and don't even mention why. The point of a collaborative website is to share information and *constructive* opinions. If somebody disagrees with the content of my question, I have no problem with that, but take the time to state why as opposed to simply clicking a mouse and leaving the page. – JM4 Sep 07 '11 at 17:50

1 Answer


You may not think that a DB call is as efficient, but it is much more efficient. The database keeps an index on the data, so it doesn't actually have to iterate through each item (as in_array does internally) to see if it exists. Your code will be O(n), while the DB lookup will be O(log n) (see B-Tree indexes)... Not to mention the memory savings from not having to load the file in its entirety on each page load.

Sure, 500 elements isn't a whole lot. It wouldn't be a huge deal to just stick that in a file, would it? Actually, it would. It's not so much a performance issue (the overhead of the DB call will cancel out the efficiency loss of the file, so they should be roughly even in terms of time). But it is an issue of maintainability. You say today that 500 words is the maximum. What happens when you realize that you need duplicate detection, i.e. checking whether a submitted URL already exists on your site? That will require a DB query anyway, so why not just take care of it all in one place?

Just create a table with the names, index it, and then do a simple SELECT. It will be faster. And more efficient. And more scalable... Imagine if you reach 1 GB of data. A database can handle that fine. A file read into memory cannot (you'll run out of RAM)...
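
As a rough sketch of what that could look like with PDO against MySQL (the badnametable/badname names are just placeholders matching the query below, and $pdo is assumed to be an existing connection):

// One-time setup: a small table with a unique index on the name column.
$pdo->exec('
  CREATE TABLE IF NOT EXISTS badnametable (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    badname VARCHAR(100) NOT NULL,
    UNIQUE KEY idx_badname (badname)
  )
');

// Adding a new forbidden name later is a one-line INSERT.
$stmt = $pdo->prepare('INSERT IGNORE INTO badnametable (badname) VALUES (?)');
$stmt->execute(array('somecompany'));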

Don't try to optimize like this; premature optimization should be avoided. Instead, implement the clean, correct solution, and then optimize only if necessary after the application is finished (and you can identify the slow parts)...

One other point worth considering: the code as-is will fail if $url = 'FooBar' and foobar is in the file. Sure, you could simply call strtolower on the URL, but why bother? That's another advantage of the database: it can do case-insensitive matching. So you can do:

SELECT id FROM badnametable WHERE badname LIKE 'entry' LIMIT 1

And just check whether any row matched. There's no need to do a COUNT(*) or anything else. All you care about is whether there is a matching row (0 rows is good, anything else is not).
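
For example, here is a minimal sketch of that check, assuming a PDO connection in $pdo (the placeholder keeps the submitted value out of the SQL string, and with MySQL's default case-insensitive collations 'FooBar' will match 'foobar'):

// Look up the submitted value against the reserved-name table.
$stmt = $pdo->prepare('SELECT id FROM badnametable WHERE badname LIKE ? LIMIT 1');
$stmt->execute(array($url));

if ($stmt->fetch()) {
  echo 'You used a bad word!';
} else {
  echo 'URL would be good';
}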

ircmaxell
  • Good point on the file being read into memory. I am talking about an assumed maximum file size of 10 kB, however. – JM4 Sep 07 '11 at 17:43
  • Databases are designed and optimized for this purpose (well, this is one of their usages). You could implement your own algorithm, but why bother? Just use a database and be done with it. You already have a connection for the rest of the application, right? Go with the easy and simple solution (and querying the DB is simpler, especially since you need to do error checking and edge-cases on your code, but the DB does that for you)... – ircmaxell Sep 07 '11 at 17:47
  • Thanks for the update. As I mention above, I already do check for duplicate URLs (among other things) and have a DB connection active to allow for the followup 'insert' for a new URL. – JM4 Sep 07 '11 at 17:48
  • Well, then there's really nothing stopping you from putting that data in the database... If you weren't using a DB at all, I could see an argument, but since you are, just re-use the connection. Besides, if the file is as small as you're saying, either way will be quite fast, so the short answer: *don't worry about it*... – ircmaxell Sep 07 '11 at 17:50