Total beginner to .bat programming, so please bear with me: I've been trying to convert a massive database of Unicode files collected from scientific instruments to ANSI format. On top of that, I need to convert all of these files to .txt files.

Now, the second part is pretty trivial -- I've been doing it with the "Bulk Rename Utility", and I've been able to make it work so far, I think.

The first part should be pretty straightforward, and I've found multiple similar questions, but they all seem to be for PowerShell, cover only a single file, or end in long discussions about the specific encoding being used. One question seems to match mine exactly, but having tried its suggested code, only half of each file transfers fine; the other half comes through as nonsense characters. I've been using the code:

for %%F in (*.001) do ren "*SS.001" "*SS1.001"

for %%F in (*.001) do type "%%F" >"%%~nF.txt"

and then deleting/moving the extra files.

I've converted the files by hand successfully in the past (left), but the current encoding seems to be failing (right): [screenshot: side-by-side comparison of files encoded by hand vs. by the script]

My questions are:

  1. Is it possible that a single file I get from my instrument is in multiple encodings (part UTF-8, part UTF-16), and that this is messing up my program (or, more likely, I'm using an encoding that is too small)? If this is the case, I'd understand why special characters like the superscript twos and the degree symbol are breaking, but not the data, which is just numbers.
  2. Is there some obvious typo in my code that is causing this bizarre error?
  3. If the problem lies in which Unicode encoding (UTF-8 vs UTF-16 vs UTF-32) or which ANSI code page (1252 vs ???) I'm using, how would I check?
  4. How would I fix this code to work?

If there are any better questions I should be asking or additional information I need to add, please let me know. Thank you!!

  • ANSI encodings cannot support encoding all Unicode characters. Most can represent only 256 Unicode characters each. So if you have Russian characters but encode to ANSI code page 1252 (Western European), you will lose information. – Mark Tolonen Mar 13 '17 at 23:03
  • Is it possible that the values stored in the file are binary rather than Unicode? If so, there's no standard utility that will be able to help you. – Mark Ransom Mar 13 '17 at 23:14
  • How do you know they are "Unicode" files? That isn't really a thing. Files need to be encoded. What Microsoft Notepad calls "Unicode" is really little-endian UTF-16-encoded. Your screenshot looks like Notepad, so just select "File, Save As..." and see what Microsoft thinks the file format is by default. Do you know what the encoding of your target format is? "ANSI" is what Microsoft calls the default localized encoding. On the U.S. version of Windows, it is `Windows-1252`. – Mark Tolonen Mar 14 '17 at 05:59

1 Answer

Is it possible that a single file I get from my instrument is in multiple encodings (part UTF-8, part UTF-16), and that this is messing up my program (or, more likely, I'm using an encoding that is too small)?

A single text file should contain only one encoding. A file is ultimately just bytes, so mixed encodings are technically possible, but I don't believe any mainstream software writes files that way; it's far more likely that each file is in one encoding and the conversion is mangling the characters that don't fit.
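
You can check this for yourself by looking at the raw bytes. Here's a minimal sketch (it assumes PowerShell 5+ for Format-Hex, and sample.001 is a hypothetical file name): the first bytes reveal any BOM, and a consistent byte pattern throughout the dump means a single encoding.

# Dump the first rows of bytes from a file (hypothetical name sample.001).
# FF FE at offset 0 means a UTF-16 LE BOM; EF BB BF means a UTF-8 BOM.
Format-Hex -Path .\sample.001 | Select-Object -First 4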

Is there some obvious typo in my code that is causing this bizarre error?

The cmd environment handles different code pages easily enough, but it struggles with multi-byte encodings and byte order marks. Indeed, this is a common problem when reading WMI results, which come back encoded as UCS-2 LE. Although there is a pure-batch workaround for sanitizing WMI results, it unfortunately doesn't generalize to every other encoding.
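
For reference, that workaround looks something like the sketch below (the wmic query is just an example). Each line of UCS-2 output captured by for /f picks up a phantom trailing carriage return, and re-parsing the captured line through a second for /f strips it.

@echo off & setlocal
rem wmic emits UCS-2 LE; the outer for /f leaves a stray <CR> on each line,
rem and re-reading each captured line with an inner for /f discards it.
for /f "delims=" %%I in ('wmic os get caption /value ^| find "="') do (
    for /f "delims=" %%J in ("%%I") do echo(%%J
)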

If the problem lies in which Unicode encoding (UTF-8 vs UTF-16 vs UTF-32) or which ANSI code page (1252 vs ???) I'm using, how would I check? How would I fix this code to work?

.NET is much better at dealing sanely with files of unknown encodings. The StreamReader class, when it reads its first character, will consume the BOM and detect the file's encoding automatically. I know you were hoping to avoid a PowerShell solution, but PowerShell really is the easiest way to reach the .NET IO methods that handle these files transparently.
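
As a quick diagnostic, you can let StreamReader tell you what it detects. Here's a minimal sketch, with the caveat that it assumes your files carry a BOM (without one, StreamReader just reports its UTF-8 default):

# Read one character so StreamReader inspects the BOM, then report
# the detected encoding for each .001 file in the current directory.
gci *.001 | %{
    $reader = new-object IO.StreamReader($_.FullName)
    [void]$reader.Read()
    "{0}: {1}" -f $_.Name, $reader.CurrentEncoding.WebName
    $reader.Dispose()
}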

There is a simple way to incorporate PowerShell hybrid code into a batch script though. Save this with a .bat extension and see whether it does what you want.

<# : batch portion (cmd treats this line as a no-op; PowerShell sees the start of a comment block)
@echo off & setlocal

rem re-invoke this script with PowerShell: the ${full-path} provider syntax reads
rem this file's own contents, and iex executes them as PowerShell code
powershell -noprofile "iex (${%~f0} | out-string)"
goto :EOF
: end batch / begin PowerShell hybrid #>

function file2ascii ($infile, $outfile) {

    # construct IO streams for reading and writing
    $reader = new-object IO.StreamReader($infile)
    $writer = new-object IO.StreamWriter($outfile, [Text.Encoding]::ASCII)

    # copy infile to ASCII encoded outfile
    while (!$reader.EndOfStream) { $writer.WriteLine($reader.ReadLine()) }

    # output summary
    $encoding = $reader.CurrentEncoding.WebName
    "{0} ({1}) -> {2} (ascii)" -f (gi $infile).Name, $encoding, (gi $outfile).Name

    # Garbage collection
    foreach ($stream in ($reader, $writer)) { $stream.Dispose() }
}

# loop through all .001 files and apply file2ascii()
gci *.001 | %{
    $outfile = "{0}\{1}.txt" -f $_.Directory, $_.BaseName
    file2ascii $_.FullName $outfile
}
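
When run from the folder containing your .001 files, you should see one summary line per file in the format built by the script's output line (for example, something like somefileSS.001 (utf-16) -> somefileSS.txt (ascii)). One design note: [Text.Encoding]::ASCII silently replaces any character it can't represent (such as ° or ²) with ?, so if your real target is Windows-1252 "ANSI", you could pass [Text.Encoding]::GetEncoding(1252) to the StreamWriter instead and those characters would survive.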

While it's true that this process could be simplified using the get-content and out-file cmdlets, the IO stream methods demonstrated above avoid loading an entire data file into memory -- a benefit if any of your data files are large.
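
If memory isn't a concern, though, the cmdlet version is a one-liner; a rough sketch:

# Simpler cmdlet equivalent (loads each file fully through the pipeline):
gci *.001 | %{ get-content $_.FullName | out-file ($_.FullName -replace '\.001$','.txt') -encoding ASCII }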

rojo