I have a text file encoded in UTF-16. Each line contains a number of columns separated by tabs. For those who care, the file is a playlist TXT export from iTunes. Column #27 contains a filename.

I am reading it using Perl 5.8.8 on Linux with code similar to:

binmode STDIN, ":encoding(UTF-16)";
while(<>)
{
    chomp;
    my @cols = split /\t/, $_;
    my $filename = $cols[26];   # Column #27 contains the filename
    print "File exists!" if (-e "$filename");
}

(Please note: I've shortened this code snippet. In my actual code I do some substitutions to convert the absolute Windows filename used by iTunes into a filename valid on my Linux box.)

Even though the files exist, the -e file test does not return true. I believe it has something to do with the string being in UTF-16, but I cannot figure out what the problem is. The actual filename uses only ASCII characters, and it prints correctly when I print the $filename variable.

Can filenames in Perl be in UTF-16? Any ideas on how to get this code snippet to work?

brian d foy
blt04
  • Before I spend any time on this, what is `my $filename =~ $cols[26];`? – Sinan Ünür Aug 22 '09 at 20:15
  • Sorry, that was a typo. It should have been `=`. The typo was in my Stack Overflow post, not in my original code. The problem still exists. – blt04 Aug 22 '09 at 20:17
  • Filenames can't natively be UTF-16, because UTF-16 is full of zero bytes. Many Linux distros these days are using UTF-8, so that would be the first encoding to try. – bobince Aug 22 '09 at 20:46

3 Answers

The UTF-16 text is processed by the :encoding layer. By the time it gets into $_, there's no way to tell that it was ever UTF-16. I don't think that's your issue.
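
As a rough illustration of that point, the sketch below decodes a small UTF-16 byte string with the Encode module, which is roughly what the :encoding(UTF-16) layer does while reading. The path /tmp/a.txt is a made-up example value, not something from the question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-16 bytes: a little-endian BOM followed by "/tmp/a.txt".
# The path is a made-up example value.
my $bytes = "\xFF\xFE" . join '', map { "$_\0" } split //, '/tmp/a.txt';

# decode() turns the bytes into Perl characters, much as the
# :encoding(UTF-16) layer does on input.
my $text = decode('UTF-16', $bytes);

printf "raw bytes: %d\n", length $bytes;   # BOM plus two bytes per character
printf "decoded:   %d\n", length $text;    # one character per letter
print  "no NUL characters remain\n" if $text !~ /\0/;
```

Once decoded, the string holds ordinary characters, so stripping zero bytes afterwards should have nothing left to remove.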

My guess would be that you've either got some whitespace in your filename (that you didn't notice when you tried printing it out) or you're not in the directory you think you are.

Try

if (-e $filename) { print "File exists!" } 
else { print "File <$filename> not found" }

and check the filename carefully. You might also `use Cwd;` and print out the current directory.
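
If the angle brackets alone don't make the problem obvious, a slightly more verbose check might look like the sketch below. The helper name show_hidden and the sample filename are hypothetical, chosen only for illustration; getcwd comes from the core Cwd module.

```perl
use strict;
use warnings;
use Cwd qw(getcwd);

# Hypothetical helper: render every character outside visible ASCII
# (including spaces and control characters) as \x{..} so it shows up.
sub show_hidden {
    my ($str) = @_;
    (my $copy = $str) =~ s/([^\x21-\x7E])/sprintf '\\x{%02X}', ord $1/ge;
    return $copy;
}

my $filename = "/tmp/a.txt ";   # made-up value with a trailing space

print "cwd:      ", getcwd(), "\n";
print "filename: <", show_hidden($filename), ">\n";
```

Anything that appears as \x{..} at the end of the name is a character that -e is being asked to match on disk.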

cjm

I figured out the solution:

Column 27 is the last column, and the file uses 0d0a (\r\n) line endings. chomp was removing only the 0a (\n), so the filename still ended in a carriage return. I'm not sure why I didn't see this before, but it has nothing to do with UTF-16.

Adding:

s/\r$//;

after chomp fixes the problem.
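
A minimal, self-contained sketch of the fix, assuming a made-up three-column record in place of the real 27-column iTunes export:

```perl
use strict;
use warnings;

# A made-up tab-separated record with a Windows line ending, as it would
# look after the :encoding(UTF-16) layer has decoded it.
my $line = "Title\tArtist\t/tmp/a.txt\r\n";

chomp $line;          # removes "\n" only (with the default $/ on Linux)
$line =~ s/\r$//;     # removes the carriage return left behind

my @cols = split /\t/, $line;
my $filename = $cols[-1];   # last column, like column 27 in the export

print "filename is clean\n" if $filename eq '/tmp/a.txt';
```

Without the substitution, the last column would be "/tmp/a.txt\r", which names a different (and most likely nonexistent) file.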

Thanks for your help - sorry to send you down a rabbit trail.

blt04
  • You could also try `:crlf:encoding(UTF-16)`, although I've never tried using :crlf with UTF-16, so I'm not sure if that works. I've only used :crlf with single-byte encodings. – cjm Aug 22 '09 at 20:56

If, as you say, the actual filename uses only ASCII characters, wouldn't

$filename =~ s/\0//g;

work? Anyway, xxd should help the next time you run into something like this:

[sinan@archardy ~]$ xxd /mnt/c/Documents\ and\ Settings/sinan/Desktop/test.txt
0000000: fffe 2f00 6800 6f00 6d00 6500 2f00 7300  ../.h.o.m.e./.s.
0000010: 6900 6e00 6100 6e00 2f00 7400 6500 7300  i.n.a.n./.t.e.s.
0000020: 7400 6d00 6500 2e00 7400 7800 7400 0d00  t.m.e...t.x.t...
0000030: 0a00                                     ..

I see that you have solved your problem in the time it took me to create a test file and reboot into Linux. Oh well.

Sinan Ünür
  • You would think. But it does not. -e still returns false. Just to test the rest of my code, I tried hardcoding a filename inside the Perl file, and it worked. Reading from the iTunes UTF-16 file (even with your null substitution suggestion) does not work. – blt04 Aug 22 '09 at 20:24
  • Try `utf8::downgrade($filename)` before the null substitution. – Inshallah Aug 22 '09 at 20:30
  • Well let's see some debugging then, what's actually inside $filename, byte by byte? – bobince Aug 22 '09 at 20:48
  • Thanks again Sinan. I finally saw the 0d0a when looking more closely via xxd. – blt04 Aug 22 '09 at 20:59