80

Looking for a regex/replace function to take a user-entered string, say "John Smith's Cool Page", and return a filename/URL-safe string like "john_smith_s_cool_page.html", or something to that effect.

A-Sharabiani
ndmweb
  • 1
    Define "filename/url safe string". Browsers will do URL encoding of strings in addresses, modern computers have very few restrictions on file name characters. – RobG Dec 13 '11 at 06:09
  • 1
    I'd use something like `" aAbc1290!@#$%^&*()-=_+;:[]{}'\"|,./<>? ".replace(/[\\\/:\*\?"<>\|]/g, "").trim() + ".html"` – loxaxs Apr 06 '19 at 14:04

5 Answers

181

Well, here's one that replaces anything that's not a letter or a number, and makes it all lower case, like your example.

var s = "John Smith's Cool Page";
var filename = s.replace(/[^a-z0-9]/gi, '_').toLowerCase();

Explanation:

The regular expression is /[^a-z0-9]/gi. Well, actually the gi at the end is just a set of options that are used when the expression is used.

  • i means "ignore upper/lower case differences"
  • g means "global", which really means that every match should be replaced, not just the first one.

So what we're really looking at is just [^a-z0-9]. Let's read it step-by-step:

  • The [ and ] define a "character class", which is a list of single-characters. If you'd write [one], then that would match either 'o' or 'n' or 'e'.
  • However, there's a ^ at the start of the list of characters. That means it should match only characters not in the list.
  • Finally, the list of characters is a-z0-9. Read this as "a through z and 0 through 9". It's a short way of writing abcdefghijklmnopqrstuvwxyz0123456789.

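So basically, what the regular expression says is: "Find every character that is not between 'a' and 'z' or between '0' and '9'". Because of the i flag, uppercase letters count as letters too, so they are left in place and lowercased afterwards by toLowerCase().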
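The suggestions from the comments on this answer can be folded into one small helper. This is only a sketch: the function name `toSafeFilename` and the default `.html` extension are assumptions for illustration, not part of the original answer. It also keeps `_` and `-`, and collapses runs of underscores.

```javascript
// Hypothetical helper combining the answer's regex with two comment suggestions.
function toSafeFilename(input, extension = '.html') {
  const base = input
    // Replace everything that is not a letter, digit, underscore, or dash.
    // The dash is last in the class, so it needs no escaping.
    .replace(/[^a-z0-9_-]/gi, '_')
    // Collapse consecutive underscores into one.
    .replace(/_{2,}/g, '_')
    .toLowerCase();
  return base + extension;
}

console.log(toSafeFilename("John Smith's Cool Page")); // john_smith_s_cool_page.html
```

Note that the extension is appended after sanitizing, so its dot is not turned into an underscore.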

Shalom Craimer
  • this is really close, how do I add in a few more individual safe characters like `_` and `-` ? – ndmweb Dec 13 '11 at 06:21
  • 1
    can i do this? `var filename = s.replace(/[^a-z0-9_-]/gi, '_').toLowerCase()` – ndmweb Dec 13 '11 at 06:22
  • 18
    Ooh, that's so close! You're just missing one bit of information - the `-` is a reserved character inside `[]`. You'll need to escape it. So instead of writing `-` for the dash ('-'), you need to use `\-`. In other words, the regular expression would be `/[^a-z0-9_\-]/gi` – Shalom Craimer Dec 13 '11 at 06:31
  • This is not really a full solution. Take a look at MS's page on filenames, there are more requirements involved to make the page safe (e.g., max length): https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx – speedplane Apr 15 '15 at 02:31
  • 12
    I will add a `.replace(/_{2,}/g, '_')` to eliminate consecutive `_` chars in the result which are very ugly. – fguillen Jun 29 '16 at 13:35
  • what about extension e.g. `.png` is valid part of filename right? Here dot is not taken into consideration. – jay shah Oct 08 '18 at 14:02
  • 3
    Ooh, @ShalomCraimer! So, so close! ;-) `-` is a special character inside `[]`, but it's unnecessary to escape it as long as it's the last character in the brackets. This is also `eslint's` preference (`no-useless-escape`). So: `/[^a-z0-9_-]/gi`! – Arel Dec 02 '18 at 17:46
  • Thanks @Arel for pointing it out - TIL! For those curious where this was in the standard at the time, see https://www.ecma-international.org/ecma-262/5.1/#sec-15.10.2.16 and look at NOTE 3 which says: "A - character [...] is treated literally if it is the first or last character of ClassRanges [...]" – Shalom Craimer Dec 06 '18 at 08:00
  • if this is a file name, why are you converting `.` to `_`? Dots are perfectly ok in filenames, thus I'd use `/[^a-z0-9\.]/gi` – João Pimentel Ferreira Jan 08 '21 at 13:35
  • thanks for taking the time to explain the regex! @ShalomCraimer – Alberto Camargo Mar 18 '21 at 23:51
  • Is the lasting `toLowerCase()` required to get a filename-safe string ? – mbesson Sep 07 '21 at 09:25
22

I know the original poster asked for a simple regular expression; however, there is more involved in sanitizing filenames, including filename length, reserved filenames, and, of course, reserved characters.

Take a look at the code in node-sanitize-filename for a more robust solution.
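To give a feel for what those extra checks involve, here is a rough sketch. It is not the node-sanitize-filename implementation; the function name, the constants, and the choice of `_` as the replacement are all assumptions made for illustration. It covers characters Windows rejects, control characters, Windows reserved device names, trailing dots and spaces, and a length cap.

```javascript
// Hypothetical sketch; names and limits below are assumptions, not a library API.
const ILLEGAL_CHARS = /[<>:"\/\\|?*]/g;        // rejected by Windows, or path separators
const CONTROL_CHARS = /[\x00-\x1f\x7f]/g;      // ASCII control characters
const WINDOWS_RESERVED = /^(con|prn|aux|nul|com[1-9]|lpt[1-9])(\..*)?$/i;
const TRAILING_DOTS_SPACES = /[. ]+$/;         // Windows strips these silently
const MAX_LENGTH = 255;                        // counts UTF-16 code units here, not bytes

function sanitizeFilename(input, replacement = '_') {
  let name = input
    .replace(ILLEGAL_CHARS, replacement)
    .replace(CONTROL_CHARS, replacement)
    .replace(TRAILING_DOTS_SPACES, '');
  // Empty names, "." and "..", and device names such as CON or NUL.txt get a prefix.
  if (name === '' || name === '.' || name === '..' || WINDOWS_RESERVED.test(name)) {
    name = replacement + name;
  }
  return name.slice(0, MAX_LENGTH);
}

console.log(sanitizeFilename('report: final?.txt')); // report_ final_.txt
```

Many filesystems limit names to 255 bytes rather than characters, so non-ASCII input may need a stricter cut than the simple `slice` used here.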

speedplane
  • When they do `truncate` at the end, wouldn't that cut the extension file name? ex: .jpg – denislexic Jun 07 '21 at 21:40
  • 1
    they truncate to get under 255 chars which is a limitation in some filesystems. It'd only truncate the extension if the filename was already too long – phette23 Jun 28 '22 at 19:03
6

For more flexible and robust handling of Unicode characters etc., you could use the slugify library in conjunction with a regex that removes unsafe URL characters. Note that the characters must be wrapped in a character class ([...]); without the brackets the pattern would only match that exact sequence of characters.

const urlSafeFilename = slugify(filename, { remove: /["<>#%\{\}\|\\\^~\[\]`;\?:@=&]/g });

This produces nice kebab-case filenames in your URL and allows for more characters outside the a-z0-9 range.

Adam D
3

Here's what I did. It works to convert full sentences into a decently clean URL.

First it trims the string, then it converts runs of spaces to dashes (-), lowercases the result, and finally gets rid of anything that's not a letter, number, or dash.

function slugify(title) {
  return title
    .trim()
    .replace(/ +/g, '-')
    .toLowerCase()
    .replace(/[^a-z0-9-]/g, '')
}

slug.value = slugify(text.value);
text.oninput = () => { slug.value = slugify(text.value); };
<input id="text" value="Foo: the old @Foobîdoo!!  " style="font-size:1.2em">

<input id="slug" readonly style="font-size:1.2em">
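One side effect of the function above is that accented letters are deleted outright, so the î in the sample input disappears. A possible variant, sketched here under the name `slugifyAscii` (a hypothetical name, not from the answer), decomposes the string with `String.prototype.normalize('NFD')` and strips the combining marks first, so that accented letters fall back to their base letters.

```javascript
// Hypothetical variant of the slugify function above.
function slugifyAscii(title) {
  return title
    .normalize('NFD')                  // split "î" into "i" + combining circumflex
    .replace(/[\u0300-\u036f]/g, '')   // drop the combining diacritical marks
    .trim()
    .replace(/ +/g, '-')
    .toLowerCase()
    .replace(/[^a-z0-9-]/g, '');
}

console.log(slugifyAscii('Foo: the old @Foobîdoo!!  ')); // foo-the-old-foobidoo
```

This only helps with letters that decompose into a base letter plus a combining mark; characters such as ß or non-Latin scripts are still removed.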
caub
Russell Beattie
0

I think your requirement is to replace whitespace and the apostrophe in 's with _, and to append .html at the end. Try to find a regex that does that.

Refer to:

http://www.regular-expressions.info/javascriptexample.html

Hemant Metalia
  • That string was just an example, but ultimately looking for something that would replace ALL non safe characters. basically anything that's not filename safe.. [a-z][A-Z][0-9]["_","-"] – ndmweb Dec 13 '11 at 06:19