
I tried adding a late comment to a question similar to mine (Find Non-UTF8 Filenames on Linux File System) to elicit further replies, with no luck so far, so here goes again...

I have the same problem as the OP in the linked question, and convmv is a great tool for fixing one's own filesystem. My question is therefore academic, but I find it unsatisfactory (in fact I can't believe) that find is unable to match non-standard ASCII characters in filenames.

Does anyone know what combination of options will find filenames containing non-standard characters on what seems to be a Unicode filesystem? In my case the offending characters are 8-bit extended ASCII (ISO-8859-1) rather than Unicode; the files come from a Windows machine, and I regularly need to fetch them. I'd love to see find and/or grep do the same job as convmv.

Sample files:

> ls
Abc�def ÉÈéèáà-rest everest éverest

> ls -b
Abc\251def  ÉÈéèáà-rest  everest  éverest

The first file comes from Windows (or can be simulated with touch $(printf "Abc\xA9def")).
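To reproduce the full listing, here is a minimal sketch (assuming a bash-style printf that understands \xHH escapes):

> touch "$(printf 'Abc\xA9def')"       # raw ISO-8859-1 byte, invalid as UTF-8
> touch 'ÉÈéèáà-rest' everest éverest  # ordinary UTF-8 names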

> find . -regex '.*[^a-zA-Z./].*'
./ÉÈéèáà-rest

> ls | egrep '[^a-zA-Z]'
ÉÈéèáà-rest

These miss almost all of them (the hyphen is what saved that one file, as coloured grep output shows). Whatever is happening here is not what I would expect: neither find nor grep treats an accented letter as falling outside the character class [^a-zA-Z./].
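My working guess (unverified) is that this is a locale effect: in a UTF-8 locale, ranges in bracket expressions are interpreted via the locale's collation order, which can place accented letters inside a-z. Forcing the byte-oriented C locale should make the class mean literal ASCII letters only. A sketch of what I have in mind:

> ls | LC_ALL=C egrep '[^a-zA-Z.-]'    # byte-wise match: anything outside plain ASCII letters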

> find . -regex '.*é.*'
./éverest
./ÉÈéèáà-rest

> ls | egrep 'é'
ÉÈéèáà-rest
éverest

> ls | egrep '[é]'
ÉÈéèáà-rest
éverest

> find . -regex '.*[é].*'
./éverest
./ÉÈéèáà-rest

Bizarrely, both are able to pick up an accented letter when it is given explicitly (including inside a bracket expression), yet any find or grep attempt with \xA9, \0251 or \o251 fails (no match).
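I suspect the escape attempts fail simply because BRE/ERE patterns do not interpret \xA9-style sequences at all; the byte has to be placed into the pattern literally. A minimal sketch of what I mean, using the shell's printf to produce the byte and the C locale so the matcher treats it as a plain byte (untested beyond GNU tools):

> ls | LC_ALL=C grep "$(printf '\xA9')"
> LC_ALL=C find . -regex ".*$(printf '\xA9').*"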

> ls | fgrep e
Abc�def
ÉÈéèáà-rest
everest
éverest

Looking for a non-controversial character shows all files with grep, as I would have expected.

> find . -regex '.*e.*'
./éverest
./ÉÈéèáà-rest
./everest

> find . -name '*e*'
./éverest
./ÉÈéèáà-rest
./everest

find, however, is more discriminating: even when searching for a plain ASCII character, it appears to drop filenames that contain bytes which are invalid in the current locale's filename encoding.

As far as I am concerned, if the file is in the filesystem, then find should find it, right? Or is there a feature I don't know about?
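If that is the culprit, then the "feature" is presumably locale-dependent pattern matching: the UTF-8 matcher refuses to match names containing byte sequences that are invalid in the current encoding. A sketch of the check I would try, forcing byte-wise matching with the C locale (this anticipates the answer below):

> LC_ALL=C find . -regex '.*e.*'   # should now also report ./Abc?def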

Any insights would be very much appreciated.

  • possible duplicate of [(grep) Regex to match non-ascii characters?](http://stackoverflow.com/questions/2124010/grep-regex-to-match-non-ascii-characters) – moinudin Dec 05 '10 at 17:38
  • I’ve seen really nasty things happen to filesystems because of conflicting ideas about the encoding of filenames. I do not think that just looking for non-ASCII actually addresses this very well, either, because there are too many other issues lurking on the edges. Was there nothing on Superuser about this? – tchrist Dec 06 '10 at 03:13
  • @marcog: definitely not a duplicate. @tchrist: I posted the same question to Superuser and Jander came back with an answer, see my self-reply to this post. – asoundmove Dec 06 '10 at 03:54

1 Answer


Jander answered the same question that I posted on Super User.

Jander's answer does the job perfectly. For those interested in getting more out of it, here is one more tip.

With LANG=C, find displays non-ASCII characters as question marks (when writing to a terminal). To get the names back in their normal display for this filesystem, just pipe the output through cat.

LANG=C find . -regex '.*[^a-zA-Z./-].*'
./??verest
./????????????-rest
./Abc?def

LANG=C find . -regex '.*[^a-zA-Z./-].*' | cat
./éverest
./ÉÈéèáà-rest
./Abc�def
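
As a footnote, for the narrower goal of flagging names that are not valid UTF-8 at all (the original question), I believe GNU grep offers another idiom: stay in the UTF-8 locale and invert a match-everything pattern, since the matcher only accepts well-formed sequences. A sketch, not tested on other grep implementations:

find . -print | grep -axv '.*'

(-a treats the input as text, -x '.*' matches every line that is valid in the current encoding, and -v keeps the rest.)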