8

I have a number of files hiding in my LANG=en_US.UTF-8 filesystem that were uploaded with unrecognisable characters in their filenames.

I need to search the filesystem and return all filenames that have at least one character that is not in the standard range (a-zA-Z0-9 and .-_ etc.)

I have been trying the following, but with no luck.

find . | egrep [^a-zA-Z0-9_\.\/\-\s]

I'm using Fedora Core 9.

Kalle Richter
  • 1
    Why should they only allow those characters? Others are perfectly fine as well, as long as they are correctly encoded – Joachim Sauer Mar 08 '09 at 16:04

4 Answers

16

convmv might be interesting to you. It doesn't just find those files, but also supports renaming them to correct file names (if it can guess what went wrong).

Joachim Sauer
  • 1
    I had 1000+ files with Windows 1252 encoding and lots of umlauts. "convmv -r -f cp1252 -t utf8 --notest ." worked for me. Didn't know there was such a program. Thanks! – sl0815 Oct 05 '16 at 11:58
8
find . | perl -ne 'print if /[^[:ascii:]]/'
Bklyn
Fedir RYKHTIK
  • 4
    If something is not ASCII, that doesn't mean it is not UTF-8. – Emiter Feb 06 '20 at 20:10
  • Example: in `/tmp/expermients`, `ls` lists `laka.txt` and `łąka.txt`; running `find . | perl -ane '{ if(m/[[:^ascii:]]/) { print } }'` there prints `./łąka.txt`, even though "łąka.txt" is a properly UTF-8-encoded name. – Emiter Feb 16 '20 at 21:31
2

find . | egrep [^a-zA-Z0-9_./-\s]

Danger, shell escaping!

bash will be interpreting that last parameter, removing one level of backslash-escaping. Try putting double quotes around the "[^group]" expression.
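One way to sidestep the shell-quoting question entirely is to run the whitelist check outside the shell. The following is only a minimal sketch in Python 3: the function name `names_outside_whitelist` and the exact whitelist (letters, digits, underscore, dot, hyphen, space) are assumptions adapted from the question, not something the original command guarantees.

```python
import os
import re

# Assumed whitelist, adapted from the question: letters, digits,
# underscore, dot, hyphen and space. Anything else is flagged.
DISALLOWED = re.compile(r"[^a-zA-Z0-9_.\- ]")

def names_outside_whitelist(root):
    """Yield paths under root whose final component has a disallowed character."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            # Only the name itself is checked, so "/" never needs whitelisting.
            if DISALLOWED.search(name):
                yield os.path.join(dirpath, name)
```

Calling `list(names_outside_whitelist("."))` would collect the flagged paths below the current directory. Like the egrep version, this flags every non-ASCII name, including correctly encoded ones.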

Also, of course, this rejects far more than invalid UTF-8: any correctly encoded non-ASCII name is flagged too. It is possible to construct a regex that matches valid UTF-8 strings, but it's rather ugly. If you have Python 2.x available you could take advantage of that:

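import os.path

def walk(dir):
    # Recursively yield every path below dir (children before their parent).
    for child in os.listdir(dir):
        child = os.path.join(dir, child)
        # Do not descend into symlinked directories, to avoid loops.
        if os.path.isdir(child) and not os.path.islink(child):
            for descendant in walk(child):
                yield descendant
        yield child

for path in walk('.'):
    try:
        # In Python 2 path is a byte string; decoding fails if it is not valid UTF-8.
        u = unicode(path, 'utf-8')
    except UnicodeError:
        # Report the offending path (or attempt to rename the file here).
        print path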
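On Python 3, where `unicode()` no longer exists, the same idea might look roughly like the sketch below. It is not part of the original answer: the names `is_valid_utf8` and `find_non_utf8_names` are made up for illustration, and it relies on `os.walk` returning raw byte names when it is given a bytes path.

```python
import os

def is_valid_utf8(name):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

def find_non_utf8_names(root):
    """Yield paths (as bytes) whose final component is not valid UTF-8."""
    # A bytes root makes os.walk hand back undecoded byte names.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root)):
        for name in dirnames + filenames:
            if not is_valid_utf8(name):
                yield os.path.join(dirpath, name)
```

Something like `for bad in find_non_utf8_names("."): print(bad)` would then list the offending paths as byte strings. A correctly encoded name such as "łąka.txt" is not reported, unlike with the ASCII-only checks.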
bobince
-1

I had a similar problem to the OP, for which I was given a solution on Superuser (see also the further comments there) that I found more satisfactory than the "convmv solution", although I appreciate having discovered convmv too.

asoundmove
  • You should always write the solution in your answer, don't simply link to it. I think you're referring to `LANG=C find . -regex '.*[^a-zA-Z./-].*'` which IMHO isn't great since it will "detect" any filename that contains a space, a number, an underscore, or an ASCII symbol (like a $). – bobpaul Jul 11 '23 at 21:10