
Update:

I can confirm that the behavior described below was caused by something I had not mentioned: I was manually manipulating the reader's private charPos field. So the question could be renamed "How to break your perfectly working Read(buffer, int, int) method", and the answer is: manually set the position of the reader (SR1) outside the underlying stream's (FSr) buffer size (not to be confused with the read-operation buffer):

before the loop (in the code from the original question)

 System.Reflection.FieldInfo charPos_private = typeof(StreamReader).GetField("charPos", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.DeclaredOnly);

and within the loop (in the code from the original question)

charPos_private.SetValue(SR1, string_index);

The reader actually reads up to position 1024, and charPos then resets to 0 when the FileStream buffers the next 1024 chars. I was setting the position manually (as part of some pattern matching I'm doing) and had not noticed that it can never reach 1025.
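For context, the wrap-around can be observed without ever writing to the field. This is a minimal sketch (assuming the .NET Framework StreamReader, where the private field is named charPos; the field name is an implementation detail and may differ in other runtimes, and "big.txt" is a placeholder path):

```csharp
using System;
using System.IO;
using System.Reflection;
using System.Text;

class CharPosDemo
{
    static void Main()
    {
        // Any text file larger than 1024 chars will show the wrap-around.
        using (var reader = new StreamReader("big.txt", Encoding.UTF8))
        {
            // charPos is a private implementation detail of StreamReader;
            // reading it is fine for observation, but writing it corrupts state.
            FieldInfo charPos = typeof(StreamReader).GetField(
                "charPos",
                BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.DeclaredOnly);

            var buffer = new char[1];
            while (reader.Read(buffer, 0, 1) > 0)
            {
                int pos = (int)charPos.GetValue(reader);
                // pos cycles within 0..bufferSize (e.g. 0..1024) and resets
                // each time the reader refills from the underlying stream,
                // which is why it must never be set past bufferSize by hand.
            }
        }
    }
}
```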

And that's how you break simple stuff. Thanks a lot to everyone who commented, much appreciated! I'll accept the answer that shows how to do it correctly; the code I posted also works fine, had it not been for those couple of lines I had not mentioned.


Original question

First time around here,

I'm teaching myself C#. I'm trying to use a StreamReader to read a big UTF-8, Linux LF-terminated (\n) XML file char by char (or block by block), perform some operations on it, and then write it into a new file char by char (or block by block). I have a StreamReader and a StreamWriter.

I will explain in words and add some code at the end:

I'm finding that the StreamReader Read() and Read(char[] buffer, int index, int count) methods behave differently on big files. I know they are essentially two ways of calling the same method (I have also tried ReadBlock), but here is the situation: Read() automatically refills the StreamReader's internal byte buffer; that is, when the reader's position reaches the bufferSize (usually 1024 or 4096), the method automatically begins buffering the next 1024 or 4096 chars, or whatever the buffer size is.

But Read(char[] buffer, int index, int count) doesn't seem to do that, so it throws an exception when the reader's position reaches bufferSize + 1, i.e. position 1025 or 4097 (a System.IndexOutOfRangeException in System.Buffer.InternalBlockCopy(Array src, Int32 srcOffsetBytes, Array dst, Int32 dstOffsetBytes, Int32 byteCount)), or when I try to Peek() at what's next (a System.IndexOutOfRangeException in System.IO.StreamReader.Peek()). My test file is 300 MB.

*The question is: how do I get Read(char[] buffer, int index, int count) to automatically refill the internal byte buffer (the StreamReader's non-public ByteBuffer member) so as to effectively read a file bigger than the buffer size? In other words: how do I actually read a big file with Read(buffer_search, 0, x_number_of_chars)?*

I don't know whether I'd need to modify ByteBuffer manually via reflection, or how I'd do that. It should be automatic; rebuffering manually seems like far too much work for such a simple thing.

In code (I'm paraphrasing some of it):

doing something like:

int current_char;
using (System.IO.FileStream FSw = new FileStream(destinationPath, FileMode.Create))
{
    using (System.IO.StreamWriter SW1 = new StreamWriter(FSw, System.Text.Encoding.UTF8))
    {
        using (FileStream FSr = new FileStream(sourcePath, FileMode.Open))
        {
            using (StreamReader SR1 = new StreamReader(FSr, System.Text.Encoding.UTF8))
            {
                // Read() returns -1 at end of stream, so test before casting to char.
                while ((current_char = SR1.Read()) != -1)
                {
                    SW1.Write((char)current_char);
                }
            }
        }
    }
}

That code works without problems: the big file is read and written into a new file.

But when I try to specify the number of chars to read (I actually need to read a user-defined number of chars; I'm showing code that reads just one char to keep it simple), I need to use Read(char[] buffer, int index, int count), like this:

char[] buffer_search = new char[1];
using (System.IO.FileStream FSw = new FileStream(fePath, FileMode.Create))
{
    using (System.IO.StreamWriter SW1 = new StreamWriter(FSw, System.Text.Encoding.UTF8))
    {
        using (FileStream FSr = new FileStream(fPath, FileMode.Open))
        {
            using (StreamReader SR1 = new StreamReader(FSr, System.Text.Encoding.UTF8))
            {
                while (SR1.Peek() != -1)
                {
                    SR1.Read(buffer_search, 0, 1);
                    SW1.Write(buffer_search[0]);
                }
            }
        }
    }
}

That code ends with an exception (System.IndexOutOfRangeException in System.IO.StreamReader.Peek()) when the reader's position reaches and passes the buffer size (i.e. 1025, 4097, etc.). It is obviously peeking at what is in the buffer, not at the file itself, and since the buffer is not refilled automatically, it ends up peeking outside the internal char[] buffer.

If I do something like this:

char[] buffer_search = new char[1];
bool end_of_file = false;
using (System.IO.FileStream FSw = new FileStream(fePath, FileMode.Create))
{
    using (System.IO.StreamWriter SW1 = new StreamWriter(FSw, System.Text.Encoding.UTF8))
    {
        using (FileStream FSr = new FileStream(fPath, FileMode.Open))
        {
            using (StreamReader SR1 = new StreamReader(FSr, System.Text.Encoding.UTF8))
            {
                while (!end_of_file)
                {
                    try { SR1.Read(buffer_search, 0, 1); }
                    catch { end_of_file = true; }
                    SW1.Write(buffer_search[0]);
                }
            }
        }
    }
}

Then I end up with a file that contains only 1024 chars (or whatever the buffer size is), and the exception that gets caught is: System.IndexOutOfRangeException in System.Buffer.InternalBlockCopy(Array src, Int32 srcOffsetBytes, Array dst, Int32 dstOffsetBytes, Int32 byteCount), called from System.IO.StreamReader.Read(Char[] buffer, Int32 index, Int32 count).

So in both cases the result is the same: the buffer does not get new data from the file, something that is handled automatically by the Read() and ReadLine() methods.

Simple workarounds like increasing the buffer size won't work, as my file is in the hundreds of MB and I'm trying to be memory-efficient (and neither will simply using Read() instead, as I need Read(buffer, 0, x_number_of_chars)). This should be a simple thing and is taking longer than expected.

Thanks for your help,

c123
  • What is the exception message you get? I never have any problem with streams... – Phil1970 Nov 23 '16 at 16:27
  • The exception is a System.IndexOutOfRangeException on System.IO.StreamReader.Peek() when the StreamReader position is 1025 (I updated the body of the question). Remember that I'm using Read(char[] buffer, int index, int count) and not Read() or ReadLine(); I have no problems with those two last methods, but for my purposes I have to specify the number of chars to read. When the StreamReader position is 1025 and I take a look at byteBuffer, it still has the same old 1024 chars (the first 1024), as opposed to having already buffered the next 1024 chars in the stream. – c123 Nov 23 '16 at 16:37
  • A character in the .NET library can be either one or two bytes; a character has a protected property which indicates which. It looks like your StreamWriter is using UTF8 but not the StreamReader. I would set the stream reader to UTF8. UTF8 encoding will read a character one byte at a time. When you don't specify encoding, ASCII is normally the default, which will ignore non-printable characters. So the byte count may differ between the reader and writer. – jdweng Nov 23 '16 at 16:50
  • > StreamReader is designed for character input in a particular encoding, whereas the Stream class is designed for byte input and output. Use StreamReader for reading lines of information from a standard text file. https://msdn.microsoft.com/en-us/library/system.io.streamreader(v=vs.110).aspx – McNets Nov 23 '16 at 16:56
  • @jdweng current .NET I'm using 4.6.1 from ms documentation https://msdn.microsoft.com/en-us/library/system.io.streamreader(v=vs.110).aspx: StreamReader defaults to UTF-8 encoding unless specified otherwise. I have checked my FileStream (opened file) and it is UTF8 encoded. – c123 Nov 23 '16 at 17:16
  • I know what msdn says and it is wrong. I left a note at the msdn webpage last month. I guess they never fixed the documentation. – jdweng Nov 23 '16 at 18:41
  • @jdweng Like I said, my stream and reader were both checked to be UTF8, be it the default or the detected encoding. However, I have now made sure the encoding is enforced from the beginning with: using (StreamReader ofile_temp_chars = new StreamReader(fsr, System.Text.Encoding.UTF8)), with the same results... I can read up to 1024 chars with no problem, but from there on the _buffer in the FileStream and the ByteBuffer and charBuffer in the StreamReader fail to get new data and advance, and thus Peek() will fail. I have updated the question body code to include the explicit UTF8 encoding. – c123 Nov 23 '16 at 19:35
  • Why use `Peek` when you can check how many characters `StreamReader.Read` read (the return value)? – Peter Ritchie Nov 23 '16 at 20:25
  • @PeterRitchie I could get rid of the Peek() and not have an exception altogether but then I would end up in the same situation of just having read the first 1024 chars of my big file. The Question is not about how to get rid of the Peek exception but about How to actually read a big file with Read(buffer_search, 0, x_number_of_chars) because in that way I can read in blocks instead of Read(). I'm going to make that clearer in the question. – c123 Nov 23 '16 at 20:47
  • It's still unclear what you're asking. e.g. what is `ofile_temp_chars` for, you never read from it? – Peter Ritchie Nov 23 '16 at 21:22
  • Related, where is `SR1` defined? – Eris Nov 23 '16 at 21:25
  • I'm sorry, I just copy-pasted; it should be SR1. – c123 Nov 23 '16 at 21:47
  • You need to use two buffers if you get more data than the buffer size. The 1st buffer is the char array where the StreamReader puts its output; the 2nd array is for storing older data. For example, if you had a paragraph of 2000 bytes: you would read 1024 bytes from the StreamReader into the 1st buffer, move these bytes into the 2nd buffer, then read another 1024 bytes. Next you would search for the end of the paragraph, which would be at 2000 - 1024 = 976. So you would end up with 1024 + 976 = 2000 bytes in the 2nd buffer. – jdweng Nov 23 '16 at 21:52

1 Answer


It's really unclear what you're asking. But, if you want to read an arbitrary number of characters from one stream reader and write them to a writer, this works:

int charsRead;
do
{
    charsRead = SR1.Read(buffer_search, 0, buffer_search.Length);
    if (charsRead > 0)
    {
        // TODO: process buffer_search in some way.
        SW1.Write(buffer_search, 0, charsRead);
    }
} while (charsRead > 0);

That will read new characters into the stream reader's internal buffer when needed.
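For completeness, here is a sketch of that loop placed inside the question's using structure, with a user-defined block size (the paths, the Copy method name, and blockSize are placeholders for this sketch, not part of the answer above):

```csharp
using System.IO;
using System.Text;

class BlockCopy
{
    // Copies the file at sourcePath to destPath in blocks of blockSize chars.
    static void Copy(string sourcePath, string destPath, int blockSize)
    {
        var buffer = new char[blockSize];
        using (var reader = new StreamReader(sourcePath, Encoding.UTF8))
        using (var writer = new StreamWriter(destPath, false, Encoding.UTF8))
        {
            int charsRead;
            // Read refills the reader's internal buffer as needed, so this
            // handles files of any size; charsRead may be less than blockSize
            // near the end of the file.
            while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                writer.Write(buffer, 0, charsRead);
            }
        }
    }
}
```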

Peter Ritchie
  • Thank you for the help! But no, it won't read the entire file if the file is bigger than the buffer size; that's the problem. I want to read a 300MB file with Read(buffer_search, 0, buffer_search.Length). In your code you are writing only if there's something to read, which is nice coding, but the question is "How do I actually read a big file with Read(buffer_search, 0, x_number_of_chars)?" Or maybe you could confirm that if you throw a big UTF8 file at it, it'll go through it? Maybe there's something strange in my system that is preventing me from doing so. – c123 Nov 23 '16 at 21:37
  • Yes, that reads all of the source file, regardless of the size. – Peter Ritchie Nov 23 '16 at 21:39
  • It even works if buffer_search array is bigger than the stream/reader buffer. – Peter Ritchie Nov 23 '16 at 21:41
  • Excuse me, what size of a file did you use? – c123 Nov 23 '16 at 21:44
  • That reads the whole file; using the length of the buffer instead of something like `x_number_of_chars` is a shortcut, because you need a buffer at least `x_number_of_chars` big anyway. I've confirmed this works on a UTF8 file that is 300MB (314,572,800 bytes). – Peter Ritchie Nov 23 '16 at 21:45
  • I have found the real reason why my code (which is just as functional as yours) was not working, posted it as an update before the original question. I'm labeling yours as the answer as seeing that it could be easily done pointed me to the real problem. Thank you a lot ! – c123 Nov 24 '16 at 00:50
  • Obviously, with the kind of hack you have done to access a private member, **the first thing you should have tried is not messing with internal data**! – Phil1970 Nov 24 '16 at 15:55