
I have more than 5000 text files, generated on Windows from PDF files, that I need to process on a Mac OS X machine. I run dos2unix on all of them to fix the line endings and to convert the encoding from UTF-16LE to UTF-8.

In 4949 cases everything goes fine, but dos2unix skips the other 320 files, reporting that they are binary files.

This is consistent with the output of file -c, which reports data for the 320 skipped files and text for the others. However, on visual inspection they all look like text ...

How can I repair the 320? At first I suspected the BOM was the cause, but it also appears in the files that convert without problems.

Furthermore, both the data and the text files start with:

0000000 ff fe 3d 00 20 00 70 00 61 00 67 00 65 00 20 00
0000010 31 00 20 00 3d 00 0a 00 0d 00 0d 00 0a 00

Any hint? Thanks in advance.

agaved

3 Answers


I have found that text files sometimes contain unprintable ASCII characters. In such cases, even though the files are "text" files, dos2unix treats them as binary. If this is the case, you can use the tr command as follows:

tr -cd '\11\12\15\40-\176' < file.txt

This is the basic command: it strips the unprintable characters and writes the cleaned ASCII text to stdout. To save the result, redirect the output to a file:

tr -cd '\11\12\15\40-\176' < file.txt > newfile.txt

Now newfile.txt is your text file on which you can run dos2unix.

The -c option takes the complement of the string '\11\12\15\40-\176', so together with -d the tr command deletes everything except the characters defined in that string, which are:

  • octal \11: tab
  • octal \12: line feed (newline)
  • octal \15: carriage return
  • octal \40-\176: the printable ASCII characters, from space through tilde
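
As a minimal sketch of the filter above, the helper below wraps the same tr call in a function and tries it on a throwaway file. The function name clean_ascii and the temporary file names are hypothetical, and LC_ALL=C is an added assumption so that tr works on raw bytes regardless of the locale. Note that on UTF-16LE input such as the files in the question, this filter also deletes the BOM bytes (ff fe) and every NUL byte, since none of them are in the kept set; for pure ASCII-range text that leaves plain ASCII behind, but any non-ASCII characters would be lost or mangled.

```shell
# Hypothetical helper: keep only tab, LF, CR and printable ASCII (octal 40-176).
# LC_ALL=C (an assumption, not in the original command) makes tr treat the
# input as raw bytes rather than locale-dependent characters.
clean_ascii() {
    LC_ALL=C tr -cd '\11\12\15\40-\176' < "$1" > "$2"
}

# Demo on a throwaway file with a NUL and a BEL byte between the letters.
tmpdir=$(mktemp -d)
printf 'ab\000c\007d\r\n' > "$tmpdir/in.txt"
clean_ascii "$tmpdir/in.txt" "$tmpdir/out.txt"

# Dump the result byte by byte to inspect what survived the filter.
od -c "$tmpdir/out.txt"
```

Writing to a separate output file keeps the original available for comparison if the cleaned text turns out to have lost something that mattered.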

According to dos2unix --help, you can pass the argument --force to dos2unix to "force conversion of binary files". So in your shell, while inside a directory with just the 320 skipped files, you might type dos2unix --force *.
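
If forcing the conversion seems worth a try, a more cautious variant is to write the forced output into a separate directory so the originals stay untouched and the results can be inspected first. The sketch below assumes dos2unix's -f (force) and -n (new-file mode) options; the function name force_convert_copies and the directory arguments are hypothetical.

```shell
# Hypothetical helper: force-convert every regular file in SRC into DEST,
# leaving the files in SRC unmodified.
# -f forces conversion of files dos2unix considers binary;
# -n writes the result to a new file instead of converting in place.
force_convert_copies() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    for f in "$src"/*; do
        [ -f "$f" ] || continue
        dos2unix -f -n "$f" "$dest/$(basename "$f")"
    done
}
```

A call such as force_convert_copies skipped converted (both directory names being placeholders) would then leave the 320 originals in place for comparison.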

Rory O'Kane
  • Rory, thanks, but in the end this would just create more garbled files that I cannot process further. – agaved May 29 '13 at 20:59

You could try the latest version of dos2unix (6.0.3). It will print the line number of the first binary symbol. This may help you to analyse the problem.
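
Where upgrading is not an option, a rough approximation of that diagnostic can be sketched with standard tools. The helper below is not what dos2unix does internally; it is a swapped-in approach that decodes the file with iconv and reports the first line containing a control byte other than tab or carriage return. The function name first_control_line is hypothetical, the input is assumed to be UTF-16 with a BOM as in the question's hex dump, and dos2unix's own definition of a binary symbol may differ.

```shell
# Hypothetical helper: print the number of the first line that contains a
# control byte other than tab or carriage return (prints nothing if none).
# Assumes the input is UTF-16 with a BOM, as in the question's hex dump.
first_control_line() {
    iconv -f UTF-16 -t UTF-8 "$1" |
        LC_ALL=C tr -d '\11\15' |
        LC_ALL=C grep -a -n '[[:cntrl:]]' |
        head -n 1 |
        cut -d: -f1
}
```

Running it over both a skipped file and a file that converted cleanly should show whether stray control characters are what separates the two groups.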

Best regards,

  • Version 6.0.4-beta will also print the value of the binary symbol. Get the beta version from http://waterlan.home.xs4all.nl/dos2unix.html – Erwin Waterlander Jun 24 '13 at 15:27