14

Consider:

public static void ConvertFileToUnicode1252(string filePath, Encoding srcEncoding)
{
    try
    {
        StreamReader fileStream = new StreamReader(filePath);
        Encoding targetEncoding = Encoding.GetEncoding(1252);

        string fileContent = fileStream.ReadToEnd();
        fileStream.Close();

        // Saving file as ANSI 1252
        Byte[] srcBytes = srcEncoding.GetBytes(fileContent);
        Byte[] ansiBytes = Encoding.Convert(srcEncoding, targetEncoding, srcBytes);
        string ansiContent = targetEncoding.GetString(ansiBytes);

        // Now writes contents to file again
        StreamWriter ansiWriter = new StreamWriter(filePath, false);
        ansiWriter.Write(ansiContent);
        ansiWriter.Close();
        //TODO -- log success  details
    }
    catch (Exception e)
    {
        throw e;
        // TODO -- log failure details
    }
}

The code above throws an OutOfMemoryException for large files and only works for small ones.

  • Can you not do it line by line? – BugFinder Mar 02 '17 at 09:16
  • You don't need to read the whole contents with ReadToEnd. Read a chunk, convert, write, repeat. – Evk Mar 02 '17 at 09:18
  • Use `foreach(string line in File.ReadLines(filePath)) ... process line ...` – Matthew Watson Mar 02 '17 at 09:18
  • Side note: don't write `throw e;` but rather only `throw;`; you'll keep your stack trace intact this way. And please, `Dispose` your disposables (the `Stream`s); see the sketch after these comments. – pinkfloydx33 Mar 02 '17 at 11:19
  • When OutOfMemoryException is seen on a machine with plenty of available memory, it's a sign that the .NET runtime could not allocate a single contiguous block of memory large enough to satisfy the request. As containers such as List grow, the underlying arrays double in size each time. I've seen this happen when running x86 (32-bit) code because the address space is limited to 4GB. – sevzas Mar 02 '17 at 12:07
  • This code doesn't look like it will work even for small files, because you're reading and writing data as a `string` without specifying an encoding. Any time you do that, C# will pick some encoding for you, and that's not what you want. If you want to read and write bytes to and from files, I think you'll want to use `BinaryReader` and `BinaryWriter`. – Tanner Swett Mar 02 '17 at 16:38
  • or at least block-by-block instead of line-by-line – phuclv Mar 02 '17 at 16:48
  • there is nothing to suggest that the file is not just one big line – njzk2 Mar 02 '17 at 22:45
  • what is the source encoding? – njzk2 Mar 02 '17 at 22:45
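To illustrate pinkfloydx33's comment before moving on to the answers, here is a minimal sketch of the original method with just those two suggestions applied (`using` blocks for disposal, and a bare `throw;`). It deliberately keeps the ReadToEnd call, so it does not fix the memory problem itself:

public static void ConvertFileToUnicode1252(string filePath, Encoding srcEncoding)
{
    try
    {
        Encoding targetEncoding = Encoding.GetEncoding(1252);

        // 'using' guarantees the reader/writer are disposed even on exceptions
        string fileContent;
        using (var reader = new StreamReader(filePath, srcEncoding))
        {
            fileContent = reader.ReadToEnd(); // still loads the whole file -- see the answers below
        }

        using (var writer = new StreamWriter(filePath, false, targetEncoding))
        {
            writer.Write(fileContent);
        }
        // TODO -- log success details
    }
    catch (Exception)
    {
        // TODO -- log failure details
        throw; // unlike 'throw e;', this preserves the original stack trace
    }
}

Note that letting StreamReader decode with srcEncoding and StreamWriter encode with the target encoding also addresses Tanner Swett's point about unspecified encodings, and makes the manual GetBytes/Convert/GetString round-trip unnecessary.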

3 Answers

12

I think the most elegant solution is still to use a StreamReader and a StreamWriter, but to read blocks of characters instead of everything at once or line by line. That way the code doesn't arbitrarily assume the file consists of lines of manageable length, and it also doesn't break with multi-byte character encodings.

public static void ConvertFileEncoding(string srcFile, Encoding srcEncoding, string destFile, Encoding destEncoding)
{
    using (var reader = new StreamReader(srcFile, srcEncoding))
    using (var writer = new StreamWriter(destFile, false, destEncoding))
    {
        char[] buf = new char[4096];
        while (true)
        {
            // Read() returns decoded characters, so a multi-byte sequence
            // in the source file is never split across reads
            int count = reader.Read(buf, 0, buf.Length);
            if (count == 0)
                break; // end of file

            writer.Write(buf, 0, count);
        }
    }
}

(I wish StreamReader had a CopyTo method like Stream does; if it did, this would essentially be a one-liner!)
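For what it's worth, such a helper is easy to add yourself. A hypothetical TextReaderExtensions class (my own sketch, not part of the framework) could look like this:

public static class TextReaderExtensions
{
    // Copies all remaining characters from the reader to the writer in blocks
    public static void CopyTo(this TextReader reader, TextWriter writer, int bufferSize = 4096)
    {
        char[] buf = new char[bufferSize];
        int count;
        while ((count = reader.Read(buf, 0, buf.Length)) > 0)
            writer.Write(buf, 0, count);
    }
}

With that in scope, the body of ConvertFileEncoding really does collapse to reader.CopyTo(writer);.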

Matti Virkkunen
1

Don't use ReadToEnd; read the file line by line or X characters at a time instead. If you read to the end, you load your whole file into memory at once.
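A minimal sketch of the line-by-line variant (srcFile, destFile, and srcEncoding are placeholders here, and the approach assumes the file actually contains line breaks of manageable length, as other commenters have pointed out):

Encoding targetEncoding = Encoding.GetEncoding(1252);

using (var writer = new StreamWriter(destFile, false, targetEncoding))
{
    // File.ReadLines streams lazily, so only one line is in memory at a time.
    // Note that WriteLine normalizes line endings to Environment.NewLine.
    foreach (string line in File.ReadLines(srcFile, srcEncoding))
    {
        writer.WriteLine(line);
    }
}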

Dimitri Bosteels
-1

Try this:

using (FileStream fileStream = new FileStream(filePath, FileMode.Open))
{
    int size = 4096;
    Encoding targetEncoding = Encoding.GetEncoding(1252);
    byte[] byteData = new byte[size];

    using (FileStream outputStream = new FileStream(outputFilepath, FileMode.Create))
    {
        int byteCounter = 0;

        do
        {
            byteCounter = fileStream.Read(byteData, 0, size);

            if (byteCounter > 0)
            {
                // Convert only the bytes actually read in this chunk
                byte[] converted = Encoding.Convert(srcEncoding, targetEncoding,
                                                    byteData, 0, byteCounter);
                outputStream.Write(converted, 0, converted.Length);
            }
        }
        while (byteCounter > 0);
    }
}

It might have some syntax errors, as I've written it from memory, but this is how I work with large files: read a chunk at a time, do some processing, and save the chunk back. Streaming is really the only way to do it without the massive I/O overhead of reading everything at once and the huge RAM consumption of storing it all, converting it all in memory, and then saving it all back.

You can always adjust the buffer size.

If you want your old method to work without throwing an OutOfMemoryException, you need to tell the garbage collector to allow very large objects (arrays larger than 2 GB, on 64-bit platforms only).

In App.config, under `<runtime>`, add the following element, shown here with its surrounding structure (you shouldn't need it with my code, but it's worth knowing):

<configuration>
  <runtime>
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>
Daniel Wardin
  • That just won't work with all input. The input is in UTF8, and there's no guarantee that by reading exactly 4K bytes you won't read in a partial character that's been encoded in more than one byte. If that happens, it won't be read correctly and you'll have invalid data. – Matthew Watson Mar 02 '17 at 09:29
  • I can't see anything in the question referring to UTF8; isn't the source encoding passed in as a parameter? Yeah, it will need tweaking for UTF8, but if your file is all in a single line (to save space by not using unnecessary whitespace or new lines, e.g. XML), then doing it line by line won't work, and the only way I'm aware of is streaming the file. The buffer size can always be adjusted in each iteration based on partial data being read. – Daniel Wardin Mar 02 '17 at 09:33
  • The [`StreamReader(string path)`](https://msdn.microsoft.com/en-us/library/f2ke0fzy(v=vs.110).aspx) constructor that the OP is using opens the input stream as UTF8. See the linked documentation. In the extremely unlikely event that all the text is on one line, then the correct approach is to use the [`StreamReader.Read()`](https://msdn.microsoft.com/en-us/library/9kstw824(v=vs.110).aspx) overload that reads a specified number of characters from a file. NEVER read a fixed sized buffer to read from a file where the characters may have variable-length encoding. It's almost always a bug. – Matthew Watson Mar 02 '17 at 09:37
  • As an experiment, try your code with a file produced like this: `File.WriteAllText(filePath, new string('x', 4095) + "ÿ");` – Matthew Watson Mar 02 '17 at 09:47
  • You'd be surprised how many HUGE files are in a single line, provided the format allows it (of course tab- or comma-separated wouldn't work), but most of the XML files I process are saved to a single line to save on storage and transfer costs (especially the indentation). It's also possible to check whether a byte is a single UTF8 character or part of a multi-byte character (see the sketch after this thread). The answer posted here obviously doesn't do that, and the question never asked for it explicitly. Therefore it's NOT the wrong approach, and with UTF8 byte checking it will be a good way to handle HUGE single-line XML files. – Daniel Wardin Mar 02 '17 at 09:52
  • All you need to do is keep a remainder buffer with partial characters and process them in the next loop when the characters are complete and keep going. I completely appreciate your concern about the example you posted but keep in mind that the solution posted here is not general case (question has the source encoding as a parameter) and I'm well aware of the downside of UTF8 size variations, it's difficult to work with but not impossible. All problems blown to scale will pose a challenge. Imagine a data stream which is a few TB in size, flowing through the server, streaming is the only way. – Daniel Wardin Mar 02 '17 at 09:57
  • And how are you going to detect and process the partial characters without writing special code to parse UTF8? Streaming is indeed the only way if you have a single line that doesn't fit in memory, but why would you do all that when you can use `StreamReader.Read()` to read a fixed number of characters in order to solve that issue? The simple fact is that the code you've posted doesn't work, which is easy to demonstrate by using the file created by `File.WriteAllText(filePath, new string('x', 4095) + "ÿ");`. – Matthew Watson Mar 02 '17 at 10:00
  • I've agreed to the fact that the code I've posted won't work in this scenario. I'm just not agreeing with your accusation that it's ALWAYS wrong. It's a bold statement that can easily be disproved. How do you think StreamReader does the checking internally? By having a fixed-size buffer with a size based on the encoding (use ILSpy to inspect it if interested) and then figuring out how much to read based on the bit values in the bytes for UTF8. And to confirm, yes, you could simply use StreamReader.Read() in this case, which clearly I've missed. – Daniel Wardin Mar 02 '17 at 10:12
  • It's a bold statement that I didn't make. I said `It's *almost* always a bug`. And to clarify, when I said "Never read a fixed size buffer to read from a file with a file where the characters may have a variable-length encoding", I should have appended "unless you are prepared to handle all the variable-length character encoding yourself". – Matthew Watson Mar 02 '17 at 10:16
  • `*NEVER* read a fixed sized buffer to read from a file where the characters may have variable-length encoding` - this is the one I am referring to, by the way. I don't see any value in continuing this debate any further, Matthew; don't know about you. It's moved away from software to nitpicking the wording in the answers. – Daniel Wardin Mar 02 '17 at 10:20
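For anyone curious what the "remainder buffer" idea from this thread might look like, the sketch below (my own illustration, not code from either commenter) scans backwards past UTF-8 continuation bytes (bit pattern 10xxxxxx) to find a safe split point, so a multi-byte sequence is never cut in half; the leftover bytes would be prepended to the next chunk.

// Returns how many bytes at the start of 'buffer' (of 'count' valid bytes)
// form only complete UTF-8 sequences; the caller carries the rest over
// into the next read.
static int CompleteUtf8Length(byte[] buffer, int count)
{
    int i = count;
    // Step backwards over continuation bytes (10xxxxxx)
    while (i > 0 && (buffer[i - 1] & 0xC0) == 0x80)
        i--;

    if (i == 0)
        return count; // no lead byte found; let the decoder report the error

    byte lead = buffer[i - 1];
    int expected =
        (lead & 0x80) == 0x00 ? 1 :  // 0xxxxxxx: ASCII, one byte
        (lead & 0xE0) == 0xC0 ? 2 :  // 110xxxxx: two-byte sequence
        (lead & 0xF0) == 0xE0 ? 3 :  // 1110xxxx: three-byte sequence
        (lead & 0xF8) == 0xF0 ? 4 :  // 11110xxx: four-byte sequence
        1;                           // invalid lead byte; pass it through

    // Keep everything if the trailing sequence is complete,
    // otherwise cut just before its lead byte
    return (count - i + 1) >= expected ? count : i - 1;
}

In practice, though, a Decoder obtained from srcEncoding.GetDecoder() keeps this partial-character state across calls for you, which is essentially what StreamReader does internally, as discussed above.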