83

I have files with invalid characters in their names, like this one:

009_-_�%86ndringshåndtering.html

It should be an Æ, but something has gone wrong in the filename.

Is there a way to just remove all invalid characters?

Or could tr be used somehow?

echo "009_-_�%86ndringshåndtering.html" | tr ???
Sandra
  • 6
    The characters probably aren't "invalid", else the filesystem wouldn't store them (unless you did something _really_ nasty to the FS). Have you tried changing your locale (e.g. to UTF8) to display the names correctly? – James O'Gorman Jan 10 '12 at 14:29
  • Something _really nasty_ like `cp -r /mnt/broken_but_mountable_old_flash_disk/ /some/dir` can actually happen very easily leading to _undeletable_ files. To save time trying, the perl answer below does work on those: https://serverfault.com/a/348496/327691 – kub1x Sep 15 '21 at 21:24

11 Answers

71

I had some Japanese files with broken filenames, recovered from a broken USB stick, and the other solutions here didn't work for me.

I recommend the detox package:

The detox utility renames files to make them easier to work with. It removes spaces and other such annoyances. It'll also translate or cleanup Latin-1 (ISO 8859-1) characters encoded in 8-bit ASCII, Unicode characters encoded in UTF-8, and CGI escaped characters.

Example usage:

detox -r -v /path/to/your/files
-r Recurse into subdirectories
-v Be verbose about which files are being renamed 
-n Can be used for a dry run (only show what would be changed)
H. Hess
  • 8
    This should be much higher, I urge everyone to have a look at `detox` before essentially reinventing the wheel. If you look at the man page, you will see that it covers all the other proposed solutions here because of its flexibility. – emk2203 Apr 10 '18 at 08:04
  • 5
    Ezekiel 25:17 - Blessed is he who, in the name of charity and good will upvotes this solution, for he is truly his brother's keeper and the finder of lost children. – Jan Sila Feb 14 '19 at 15:18
  • 6
    Unintuitively, the path can not be '.' in debian. If you use a '.' it finds nothing. – isaaclw Sep 10 '19 at 18:19
  • 3
    I wonder if it really works, it seems remove/replace Chinese characters, e.g. `的节奏啊`, but those characters are valid filename. – 林果皞 Sep 11 '19 at 19:50
  • 2
    Use absolute path! /home/you/some - relative "." doesnt work with it, otherwise great tool! :) – jave.web Mar 26 '20 at 18:48
  • 2
    be careful with this tool. it's pretty aggressive. it even changes spaces into underscores :/ it also renames `__init__.py` to init_.py – jaksco Dec 20 '20 at 16:35
  • 1
    BTW If you need to setup a custom remap table (e.g. I needed to remove commas as well, for upload to a pesky web application), you can create a ~/.detoxrc file that defines a custom sequence, and then in that sequence, override the table that's used for one of the prefixes (e.g. replace the `safe` prefix with a custom file in your home folder that changes the remap table). Run `man detoxrc` for info on the format. – GuyPaddock Aug 11 '21 at 20:17
60

One way would be with sed:

mv 'file' "$(echo 'file' | sed -e 's/[^A-Za-z0-9._-]/_/g')"

Replace file with your filename, of course. This will replace anything that isn't a letter, number, period, underscore, or dash with an underscore. You can add or remove characters to keep as you like, and/or change the replacement character to anything else, or nothing at all.
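
A slightly more defensive variant, as a sketch only: it forces the C locale so that ranges like A-Z cover plain ASCII bytes and nothing else, quotes every expansion, and refuses to overwrite an existing file. The helper names sanitize_name and rename_sanitized are made up for illustration, and mv -n is assumed to be available (it is in GNU and BSD mv).

```shell
# sanitize_name NAME  (hypothetical helper, not part of the original answer)
# Prints NAME with every byte outside A-Z, a-z, 0-9, ".", "_" and "-"
# replaced by an underscore.
sanitize_name() {
  # LC_ALL=C makes the bracket ranges byte-based, so accented letters
  # are not silently treated as part of a-z.
  printf '%s\n' "$1" | LC_ALL=C sed -e 's/[^A-Za-z0-9._-]/_/g'
}

# rename_sanitized PATH  (hypothetical helper)
# Renames PATH so that only its last component is sanitized.
rename_sanitized() {
  local old=$1 dir base new
  dir=$(dirname -- "$old")
  base=$(basename -- "$old")
  new=$(sanitize_name "$base")
  # Nothing to do if the name is already clean.
  [ "$base" = "$new" ] && return 0
  # -n: do not overwrite an existing file; --: names may start with a dash.
  mv -n -- "$old" "$dir/$new"
}
```

Calling rename_sanitized 'my file?.txt' would rename the file to my_file_.txt. In the C locale a multibyte character becomes one underscore per byte.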

James Sneeringer
  • 7
    I used: `f='file'; mv 'file' ${f//[^A-Za-z0-9._-]/_}` – Louis Oct 07 '15 at 15:05
  • 2
    Look for the best solution by H. Hess below... (and my funny comment alongside :) ) – Jan Sila Feb 14 '19 at 15:20
  • 3
    This _will_ fail miserably on accented characters. Also on anything else than ascii. Definitely _not_ the solution for the original question. – grin Jan 03 '21 at 12:00
  • @grin what do you mean by fail miserably? It appears to simply ignore characters like `ä`. Can anyone explain why it does this? – Big McLargeHuge Sep 12 '21 at 19:46
  • 1
    This is a great observation by @grin. The solution I offered naively assumes the C locale, which uses the literal byte values of characters for collating. ASCII tends to form the basis of most western character sets, and it was adopted into Unicode with the same byte values. In ASCII, the byte values of the letters `A` through `Z` are sequential, as are `a` to `z` and `0` to `9`. However, other character sets have different collating rules. UTF-8, which is now a pretty common default, includes accented characters in those ranges, so `a-z` would include `ä`. – James Sneeringer Sep 13 '21 at 03:40
  • Unfortunately this doesn't handle corrupted characters (filename copied from broken filesystem, looks something like `''$'\265''0ADE9~3.JPG`). I got it sorted only by using perl from answer below: https://serverfault.com/a/348496/327691 – kub1x Sep 15 '21 at 21:19
46

I assume you are on a Linux box and the files were made on a Windows box. Linux typically uses UTF-8 as the character encoding for filenames, while Windows uses something else. I think this is the cause of the problem.

I would use "convmv". This is a tool that can convert filenames from one character encoding to another. For Western Europe one of these normally works:

convmv -r -f windows-1252 -t UTF-8 .
convmv -r -f ISO-8859-1 -t UTF-8 .
convmv -r -f cp-850 -t UTF-8 .
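
To guess which source encoding is the right one before renaming anything, it can help to preview a single name. This is a rough sketch using iconv rather than convmv itself; preview_utf8 is a hypothetical helper name, and the sample name is built from the single byte 0xC6 (octal 306), which is Æ in ISO-8859-1 and windows-1252.

```shell
# preview_utf8 ENCODING NAME  (hypothetical helper)
# Prints NAME as it would look if its bytes were ENCODING, converted to UTF-8.
preview_utf8() {
  printf '%s\n' "$2" | iconv -f "$1" -t UTF-8
}

# Build a sample name whose first byte is 0xC6, then preview it.
sample=$(printf '\306ndring.html')
preview_utf8 ISO-8859-1 "$sample"
```

If the preview looks right for a given encoding, that encoding is a reasonable candidate for the -f option of convmv.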

If you need to install it on a Debian based Linux you can do so by running:

sudo apt-get install convmv

It works for me every time and it does recover the original filename.

Source: LeaseWebLabs

mevdschee
  • 1
    this looks promising, but any idea how to tell what the encoding is? I have a directory called `Save the current file in Word 97-2004 format\sco.workflow` that got created on my Mac (via Microsoft Office) and the above encodings don't have any effect. – Sridhar Sarnobat Dec 07 '16 at 06:49
  • 2
    It's worth pointing out that by default convmv runs in "test" mode, where it just performs a dry run and tells you which files it would move. It will then tell you to run it again with the `--notest` option to actually rename the files. – Kenny Rasschaert Jan 28 '19 at 10:47
23

I assume you mean you want to traverse the filesystem and fix all such files?

Here's the way I'd do it:

find /path/to/files -type f -print0 | \
perl -n0e '$new = $_; if($new =~ s/[^[:ascii:]]/_/g) {
  print("Renaming $_ to $new\n"); rename($_, $new);
}'

That finds all files with non-ASCII characters in their names and replaces those characters with underscores (_). Use caution, though: if a file with the new name already exists, it will be overwritten. The script can be modified to check for such a case, but I didn't put that in, to keep it simple.
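
One way such a collision check could look, sketched in bash instead of perl. The function name ascii_rename_tree is hypothetical, and unlike the perl one-liner this version only rewrites the last path component, one underscore per non-ASCII byte.

```shell
# ascii_rename_tree DIR  (hypothetical helper)
# Renames regular files under DIR so their names contain only ASCII bytes,
# skipping any rename that would overwrite an existing file.
ascii_rename_tree() {
  local path dir base new
  find "$1" -depth -type f -print0 |
  while IFS= read -r -d '' path; do
    dir=$(dirname -- "$path")
    base=$(basename -- "$path")
    # In the C locale tr works on bytes: everything outside 0-127 becomes "_".
    new=$(printf '%s' "$base" | LC_ALL=C tr -c '\000-\177' '_')
    [ "$base" = "$new" ] && continue
    if [ -e "$dir/$new" ]; then
      printf 'Skipping %s: %s already exists\n' "$path" "$dir/$new" >&2
      continue
    fi
    printf 'Renaming %s to %s\n' "$path" "$dir/$new"
    mv -- "$path" "$dir/$new"
  done
}
```

Run it as ascii_rename_tree /path/to/files. Skipped files are reported on stderr so they can be handled by hand.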

phemmer
17

Following the answers at https://stackoverflow.com/questions/2124010/grep-regex-to-match-non-ascii-characters, you can use:

rename 's/[^\x00-\x7F]//g' *

where * matches the files you want to rename. If you want to do it over multiple directories, you can do something like:

find . -exec rename 's/[^\x00-\x7F]//g' "{}" \;

You can use the -n argument to rename to do a dry run, and see what would be changed, without changing it.
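
To avoid starting rename for every single file, the selection can be pushed into find. A sketch, assuming a find whose -name patterns are byte-based in the C locale (GNU find behaves this way); list_non_ascii is a hypothetical helper name. Note that the pattern also matches control characters, since it selects any byte outside the printable ASCII range from space to tilde.

```shell
# list_non_ascii DIR  (hypothetical helper)
# Prints every path under DIR whose last component contains a byte
# outside the printable ASCII range " " to "~".
list_non_ascii() {
  LC_ALL=C find "$1" -depth -name '*[! -~]*' -print
}

# The same test can feed rename directly, in batches, as a dry run:
#   LC_ALL=C find . -depth -name '*[! -~]*' -exec rename -n 's/[^\x00-\x7F]//g' {} +
```

Reviewing the output of list_non_ascii first shows exactly which names would be touched.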

naught101
  • Is there a way to modify this to keep foreign characters such as ü and ä for example? – Elder Geek Feb 06 '16 at 22:51
  • Only the second one worked for me. Everything was in the same directory so I'm not sure what's the difference..? – Shautieh Mar 09 '17 at 14:51
  • 1
    @Shautieh: the -n stops it from actually running. I'll clarify the answer. – naught101 Mar 13 '17 at 05:38
  • rename can be slow when dealing with lots of files. If you want to speed this up, push the check into find. I'm not sure how to do that though. – isaaclw Sep 10 '19 at 18:13
  • This was the one to help me - `detox`, as nice as it sounded, just errored out with "unsupported unicode length" exactly on the files I wished it fixed :) – Tomáš M. Jan 29 '22 at 10:49
7

This shell script sanitizes a directory recursively, to make files portable between Linux/Windows and FAT/NTFS/exFAT. It removes control characters, /:*?"<>\| and some reserved Windows names like COM0.

sanitize() {
  shopt -s extglob;

  filename=$(basename "$1")
  directory=$(dirname "$1")

  filename_clean=$(echo "$filename" | sed -e 's/[\\/:\*\?"<>\|\x01-\x1F\x7F]//g' -e 's/^\(nul\|prn\|con\|lpt[0-9]\|com[0-9]\|aux\)\(\.\|$\)//i' -e 's/^\.*$//' -e 's/^$/NONAME/')

  if (test "$filename" != "$filename_clean")
  then
    mv -v "$1" "$directory/$filename_clean"
  fi
}

export -f sanitize

sanitize_dir() {
  find "$1" -depth -exec bash -c 'sanitize "$0"' {} \;
}

sanitize_dir '/path/to/somewhere'

Linux is less restrictive in theory (/ and \0 are strictly forbidden in filenames) but in practice several characters interfere with bash commands (like *...) so they should also be avoided in filenames.

Great sources for file naming restrictions:

KrisWebDev
  • 1
    This is what I was looking for! But add quotes to support dirs with spaces: find "$1" -depth -exec bash -c 'sanitize "$0"' {} \; – mmv-ru May 22 '17 at 14:02
3

I use this one-liner to remove invalid characters in subtitle files:

for f in *.srt; do nf=$(echo "$f" |sed -e 's/[^A-Za-z0-9.-]/./g;s/\.\.\././g;s/\.\././g'); test "$f" != "$nf" && mv "$f" "$nf" && echo "$nf"; done
  1. Processes only *.srt files (* could be used in place of *.srt to process every file)
  2. Replaces every character other than letters A-Za-z, digits 0-9, periods "." and dashes "-" with a period
  3. Collapses any resulting double or triple periods into one
  4. Checks whether the filename needs changing
  5. If so, renames the file with the mv command, then prints the new name with the echo command

A variant works for normalizing movie directory names:

for f in */; do nf=$(echo "$f" |sed -e 's/[^A-Za-z0-9.-]/./g' -e 's/\.\.\././g' -e 's/\.\././g' -e 's/\.*$//'); test "$f" != "$nf" && mv "$f" "$nf" && echo "$nf"; done

Same steps as above, but with one more sed command to remove any periods at the end of the directory name.

X-Men Days of Future Past (2014) [1080p]
Modified to:
X-Men.Days.of.Future.Past.2014.1080p
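
To check what a name would become without renaming anything, the sed steps can be wrapped in a small function. A sketch with a hypothetical name, normalize_title; it keeps the dash, as in the example above, and collapses a run of periods of any length in one step instead of handling triples and doubles separately.

```shell
# normalize_title NAME  (hypothetical helper)
# Prints NAME with disallowed characters turned into periods,
# runs of periods collapsed, and one trailing period removed.
normalize_title() {
  printf '%s\n' "$1" |
    LC_ALL=C sed -e 's/[^A-Za-z0-9.-]/./g' -e 's/\.\.*/./g' -e 's/\.$//'
}

normalize_title 'X-Men Days of Future Past (2014) [1080p]'
# prints: X-Men.Days.of.Future.Past.2014.1080p
```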

1

I know this is a bit old, but I recently discovered that translate-shell, a command-line translator that uses Google Translate, really helps with foreign-named files whose Unicode names make other tools choke. It allows batch renaming with translation from the shell.

$ echo скачать  | trans -b
download

https://github.com/soimort/translate-shell

[UPDATE] The Google Translate API tends to block you if you hit it too many times, but I also found a convenient local option called uconv that converts between alphabets. It transliterates phonetically rather than translating:

echo скачать | uconv -x 'Any-Latin;Latin-ASCII'
skacat'
BoeroBoy
1

This is loosely based on @KrisWebDev's search string.

  • doesn't touch files/dirs; creates a batch list to review instead
  • goes via a two-stage temp file (faster on my machine)
  • handles more edge cases for Samba (trailing/leading spaces)
  • shows a basic progress indicator

Note: "already exists" problems may occur when doing the actual rename. These have to be solved manually.

# tested on: bash linux
# needs: bc
# this function doesn't change files on its own
sanitize_dir() {

    rm -f /tmp/filenames_toreview_$$.txt
    touch /tmp/filenames_toreview_$$.txt
    
    echo "
    Batch mv review file is gonna be
    /tmp/filenames_toreview_$$.txt
    "

    # find... and reverse list it, to prevent "file disappeared" (parent dirs are changed last)
    find "$1" -depth | sort | tac >/tmp/filenames$$.txt
    
    FOUNDNUM=$(cat /tmp/filenames$$.txt | wc | awk '{ print $1 }')
    echo "# found $FOUNDNUM filenames or dirnames to check."
    echo "# found $FOUNDNUM filenames or dirnames to check."  >> /tmp/filenames_toreview_$$.txt
    
    IFS=$'\n'
    shopt -s extglob;
    
    COUNT=1
    PROC_OLD=N

    for THISLINE in $(cat /tmp/filenames$$.txt);do
    
        # Some percentage info
        PROC=$(printf %1.f $(echo "($COUNT/$FOUNDNUM)*100" | bc -l))
        
        if [ "$PROC" != "$PROC_OLD" ];then
            echo "# $PROC%"
            echo "# $PROC%" >> /tmp/filenames_toreview_$$.txt
            PROC_OLD=$PROC
        fi
        
        filename=$(basename "$THISLINE")
        directory=$(dirname "$THISLINE")

        filename_clean=$(echo "$filename" | sed -E -e 's/[\\/:\*\?"\|\x01-\x1F\x7F]//g' -e 's/^(nul|prn|con|lpt[0-9]|com[0-9]|aux)$/_\1/' -e 's/^$/NONAME/')
        
        # multi spaces => single spaces
        filename_clean=$(echo "$filename_clean" | sed -E -e 's/\s+/ /g' )

        # leading and trailing spaces
        filename_clean=$(echo "$filename_clean" | sed -E -e 's/^\s+//; s/\s+$//;' )

        if (test "$filename" != "$filename_clean")
        then
            echo "mismatch: '$filename' != '$filename_clean'"
            
            if [ -d "$THISLINE" ] || [ -f "$THISLINE" ];then
                
                echo mv -v "'$THISLINE'" "'$directory/$filename_clean'" >> /tmp/filenames_toreview_$$.txt
            
            else
                
                echo "File or dir disappeared. This shouldn't happen."
                
            fi
        fi
        COUNT=$((COUNT+1))

    done
    rm -f /tmp/filenames$$.txt
    
    echo "
    
    please review batch rename execution:
    cat /tmp/filenames_toreview_$$.txt
    
    "
}


sanitize_dir /goto/dir

Manu
1

If you want to handle embedded newlines, multibyte characters, spaces, leading dashes and backslashes, you are going to need something more robust. See this answer:
https://superuser.com/a/858671/365691

I put the script up on code.google.com if anyone is interested: r-n-f-bash-rename-script

Adam D.
-3

for file in *; do mv "$file" $(echo "$file" | sed -e 's/[^A-Za-z0-9.-]//g'); done &

  • 2
    You should explain what your code does and use proper formatting. Your code can cause files to be deleted by introducing collisions in the names. And running the entire thing in the background is kind of silly. – kasperd Jul 04 '17 at 23:19