3

I am reading in a file, attempting to check if it is a binary file by checking the first n bytes for a NUL byte, and if it is not determined to be binary that way, it is manipulated as a string. I tried to loop over a string and check the first n indices for a NUL, but that would give false positives that checking a TBytes does not.

I use TFile.ReadAllBytes, which returns a TBytes, and perform the NUL check on that. Then, if no NUL is found, I use StringOf on the TBytes to get a string. Does StringOf have to make a copy of the data to build the string (these are large files, so I want to avoid that)? If so, what is a better way to do what I am trying to do?

Seth Carnegie
  • *what is a better way to do what I am trying to do* What is your goal in the first place? – Kromster Dec 23 '11 at 09:09
  • @Krom sorry, my goal is to read a file and check if a NUL byte is in the first _n_ bytes of the file. If not, convert it into a string. A cast would be the best because it would not require any duplication of the data but I don't know if strings work that way. – Seth Carnegie Dec 23 '11 at 09:12
  • How is the data encoded? StringOf does a conversion from the system ANSI locale to Unicode AFAICS and that can only be done using a copy. – Jens Mühlenhoff Dec 23 '11 at 09:12
  • @JensMühlenhoff It is not expected to be encoded in any particular encoding, I was using `TFile.ReadAllText` before (not worrying about encoding) but it didn't work to check if a 0 byte was in the first _n_ indices of the string – Seth Carnegie Dec 23 '11 at 09:14
  • IOW: If the data is ANSI you can only convert it to an AnsiString/RawByteString, but I don't know if you can use existing data to create a Delphi-managed string that way. You could append NUL to the end and treat it as PChar/PAnsiChar/PWideChar. – Jens Mühlenhoff Dec 23 '11 at 09:15
  • @JensMühlenhoff would it work to use `ReadAllText` (does it do any manipulations on the data so that you don't get exactly the data that was in the file?) and cast it to a `PChar` and check that for a 0 byte? – Seth Carnegie Dec 23 '11 at 09:17
  • I don't have a copy of Delphi XE here, so I can't tell whether `ReadAllText` does any manipulation. Loading into `TBytes` and checking for NUL then appending a NUL and casting the `TBytes` to `PWideChar` or `PAnsiChar` depending on the presence of a BOM should work fine. – Jens Mühlenhoff Dec 23 '11 at 09:21
  • You should really read the source code of the RTL to find out how things work internally. – Jens Mühlenhoff Dec 23 '11 at 09:22
  • After all the discussion I suggest that you rename your question to "How to do memory efficient search and replace on a large file?". – Jens Mühlenhoff Dec 23 '11 at 10:17
  • Related: http://stackoverflow.com/questions/5012664/fast-search-to-see-if-a-string-exists-in-large-files-with-delphi – Jens Mühlenhoff Dec 23 '11 at 10:21

3 Answers

3

Does StringOf make a copy of the data passed to it?

Yes, according to the docs: 'Converts a byte array into a Unicode string using the default system locale.'

If you just want to access the TBytes as a string, why not cast it to a PChar (if it's Unicode) or PAnsiChar if it's an AnsiString?

Example code:

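var
  MyBuffer: TBytes;
  BufferLength: Integer;
  BufferAsString: PChar;
  BufferAsAnsiString: PAnsiChar;
begin
  MyBuffer := TFile.ReadAllBytes(Filename);
  // Length gives the byte count; SizeOf(MyBuffer) would only return the size
  // of the array reference (a pointer).
  BufferLength := Length(MyBuffer);
  // Both pointers alias the same bytes, so no copy is made.
  // @MyBuffer[0] assumes the file is not empty.
  BufferAsString := PChar(@MyBuffer[0]);         // if the data is UTF-16
  BufferAsAnsiString := PAnsiChar(@MyBuffer[0]); // if the data is ANSI
  // If there's no #0 at the end, make sure not to read past the end of the buffer!
end;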

EDIT
I'm a bit puzzled why you're not just using TFile.OpenRead to get a TFileStream.
Let's assume you've got gigabyte(s) of data and you're in a hurry.
The TFileStream lets you read just a small chunk of the data, which speeds things up.

This example code reads the whole file, but can easily be modified to only get a small part:

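var
  MyData: TFileStream;
  MyString: AnsiString; // 1 char = 1 byte; for UTF-16 data use string and divide the length by SizeOf(Char)
  FileSize: Integer;
begin
  MyData := TFile.OpenRead(Filename);
  try
    FileSize := MyData.Size; // GetSize is not publicly accessible; use the Size property
    SetLength(MyString, FileSize); // Preallocate the string; the terminating #0 is added implicitly
    if FileSize > 0 then
      MyData.Read(MyString[1], FileSize); // strings are 1-based, so the data starts at index 1
  finally
    MyData.Free;
  end;
  // Do stuff with your newly read string.
end;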

Note that the last example still reads all data from disk first (which may or may not be what you want). However, you can also read the data in chunks.
All of this is simpler with AnsiStrings because 1 char = 1 byte there :-).
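
As a rough illustration of the chunked approach, the sketch below reads only the first few bytes of a file through a TFileStream and scans them for a NUL byte, so a binary file is rejected before anything large is loaded. The names `LooksBinary` and `MaxBytes` and the probe size of 512 are assumptions made for this example, not part of the original answer. The unscoped unit names `SysUtils` and `Classes` are also an assumption; on newer Delphi versions they may need the `System.` prefix.

```delphi
program NulProbe;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Returns True if a #0 byte occurs within the first MaxBytes bytes of the file.
// LooksBinary and MaxBytes are hypothetical names chosen for this sketch.
function LooksBinary(const FileName: string; MaxBytes: Integer): Boolean;
var
  Stream: TFileStream;
  Buffer: TBytes;
  BytesRead, I: Integer;
begin
  Result := False;
  if MaxBytes <= 0 then
    Exit;
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Buffer, MaxBytes);
    // Read returns the number of bytes actually read, which may be
    // smaller than MaxBytes for short files.
    BytesRead := Stream.Read(Buffer[0], MaxBytes);
    for I := 0 to BytesRead - 1 do
      if Buffer[I] = 0 then
        Exit(True);
  finally
    Stream.Free;
  end;
end;

begin
  if ParamCount > 0 then
    Writeln(LooksBinary(ParamStr(1), 512));
end.
```

Only when this check passes would the rest of the file need to be read, using whichever of the approaches above suits the encoding.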

Johan
  • Why would it be better to use a `FileStream` rather than `ReadAllText`? – Seth Carnegie Dec 23 '11 at 09:55
  • @SethCarnegie, If you have 1GB of data, it will take a while to read all the data in. If you're not interested in all that data, a FileStream allows you to inspect the data only up to the point that you're interested in. You can even skip data, which can speed things up, and speed is one of the concerns of the OP. – Johan Dec 23 '11 at 10:01
  • @seth so that you minimise the number of copies of the file that are in memory – David Heffernan Dec 23 '11 at 10:02
  • @Johan I need to look at all the text at once, that is why I was using "Read **All** Text". @DavidHeffernan why does `ReadAllText` make more copies than reading it in by `FileStream`? – Seth Carnegie Dec 23 '11 at 10:03
  • @SethCarnegie, ReadAllText produces a Unicode string. If your data is **not** Unicode Delphi will have to translate, causing it to spend time processing. – Johan Dec 23 '11 at 10:06
  • @seth when you subsequently copy the byte array to a string you will have to create a new buffer for the string. You will have two copies of the file in memory whilst you copy from byte array to string. So read directly into string type and cut out the middle man. – David Heffernan Dec 23 '11 at 10:07
  • Johan, `TFileStream` doesn't have a member named `GetSize`, and when I use the member `Size`, it always returns 0. Am I doing something wrong? `fileh := TFile.OpenRead(filename); ShowMessage(IntToStr(fileh.Size));` – Seth Carnegie Dec 24 '11 at 00:58
  • @SethCarnegie, Strange, try using `MyData:= TFileStream.Create(filename, fmOpenRead); ASize:= MyData.Size;` See also the code sample: http://docs.embarcadero.com/products/rad_studio/delphiAndcpp2009/HelpUpdate2/EN/html/delphivclwin32/SysUtils_TEncoding.html – Johan Dec 24 '11 at 11:01
  • Johan actually it was my mistake, somehow I accidentally opened it once with write permissions so it erased the file... – Seth Carnegie Dec 24 '11 at 11:10
  • @SethCarnegie, you had me puzzled, at least now you don't have to worry about making copies anymore :-) – Johan Dec 24 '11 at 11:18
1
  1. Use TFile.ReadAllBytes
  2. Do your checking for NUL bytes (be aware that UTF-16 will contain lots of NULs)
  3. If it is a string use SetLength to grow the TBytes by 1 or 2 bytes (depending on the encoding)
  4. Append 1 or 2 NUL at the end (depending on the encoding again)
  5. Cast @Bytes[0] to PAnsiChar/PWideChar (depending on the encoding)

You could find the encoding by looking at the BOM. This depends on the way your input files are encoded of course.

However SetLength may make a copy of the data.
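
The five steps might be sketched roughly as follows. Using `TEncoding.GetBufferEncoding` for the BOM lookup is one possible choice rather than something the steps prescribe, and `ProbeLen` and the other identifiers are hypothetical names for this example. The unscoped unit names `SysUtils` and `IOUtils` are an assumption that may need the `System.` prefix on newer Delphi versions.

```delphi
program BytesAsPChar;

{$APPTYPE CONSOLE}

uses
  SysUtils, IOUtils;

const
  ProbeLen = 512; // hypothetical number of leading bytes to inspect

var
  Bytes: TBytes;
  Encoding: TEncoding;
  BomLen, OldLen, TermLen, I: Integer;
  HasNul: Boolean;
begin
  if ParamCount = 0 then
    Exit;

  // Step 1: read the whole file once.
  Bytes := TFile.ReadAllBytes(ParamStr(1));
  OldLen := Length(Bytes);

  // Detect the encoding from the BOM; Encoding must start out as nil.
  Encoding := nil;
  BomLen := TEncoding.GetBufferEncoding(Bytes, Encoding);

  // UTF-16 needs a two-byte terminator, byte-oriented encodings need one.
  // (Big-endian data would still need byte swapping before use.)
  if (Encoding = TEncoding.Unicode) or (Encoding = TEncoding.BigEndianUnicode) then
    TermLen := 2
  else
    TermLen := 1;

  // Step 2: a NUL byte only signals "binary" for byte-oriented text.
  HasNul := False;
  if TermLen = 1 then
    for I := BomLen to OldLen - 1 do
    begin
      if I >= ProbeLen then
        Break;
      if Bytes[I] = 0 then
      begin
        HasNul := True;
        Break;
      end;
    end;
  if HasNul then
  begin
    Writeln('binary');
    Exit;
  end;

  // Steps 3 and 4: grow the array and write the terminator explicitly.
  SetLength(Bytes, OldLen + TermLen); // may reallocate and copy
  for I := OldLen to OldLen + TermLen - 1 do
    Bytes[I] := 0;

  // Step 5: view the bytes after the BOM as characters without another copy.
  if TermLen = 2 then
    Writeln('UTF-16 text, first char code: ', Ord(PWideChar(@Bytes[BomLen])^))
  else
    Writeln('byte text, first char code: ', Ord(PAnsiChar(@Bytes[BomLen])^));
end.
```

The pointer views stay valid only as long as `Bytes` is neither resized nor released, so any further `SetLength` call has to happen before the cast.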

Jens Mühlenhoff
  • Won't `setlength` force Delphi to make a copy of the data? – Johan Dec 23 '11 at 09:30
  • Yes, that could be a problem. Using a `TStream` to do all the processing would be an alternative, but that really depends on what Seth wants to do with the data. Clearly we need some more information here. – Jens Mühlenhoff Dec 23 '11 at 09:33
  • @JensMühlenhoff please ask any questions you need to that will give you the info that you need – Seth Carnegie Dec 23 '11 at 09:36
  • @Johan: AFAIK `SetLength` only makes a copy if necessary, it depends on how the memory manager did the allocation. – Jens Mühlenhoff Dec 23 '11 at 09:36
  • @SethCarnegie: Like Krom said, it would be better if you wrote what you want to do and what the problem is than just write what you have now. – Jens Mühlenhoff Dec 23 '11 at 09:39
  • @JensMühlenhoff I thought I did that already. I am reading a file and checking if it is binary. If it is not, I need to continue on performing operations on the file contents as a string, and if it is binary, just quit. – Seth Carnegie Dec 23 '11 at 09:42
  • @SethCarnegie Ok, but you didn't say what kind of operations you do on the contents and if it has to be string operations. Like I said you could use a `TFileStream` and do checking and operations on the stream. – Jens Mühlenhoff Dec 23 '11 at 09:48
  • @JensMühlenhoff "operations on the file contents as a string" yes, it must be string operations because the files are supposed to be text files at the point that they are assumed not to be binary files. I am doing search and replace operations – Seth Carnegie Dec 23 '11 at 09:50
  • When you do replacing you have to allocate some space anyway, so the "avoid making a copy" argument becomes difficult. – Jens Mühlenhoff Dec 23 '11 at 09:56
  • @JensMühlenhoff no, the lengths of the replacements do not differ from what they are replacing – Seth Carnegie Dec 23 '11 at 09:59
  • Ok, now we're getting somewhere. You need a read-write access to the data that rules out the `TStream` solutions. – Jens Mühlenhoff Dec 23 '11 at 10:06
  • You can do searching directly on TBytes and you can do replacing directly on TBytes as well, no need to ever cast the data to a string. You don't even have to cast it to something else. It becomes more complicated when the data is UTF encoded or something like that of course. – Jens Mühlenhoff Dec 23 '11 at 10:10
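
For the case discussed in these comments, where every replacement has the same length as the text it replaces, searching and replacing directly on the TBytes could look something like the sketch below. `ReplaceInPlace` is a hypothetical helper written for this example, it uses a naive byte-by-byte scan, and it assumes a single-byte encoding such as ANSI for the sample data.

```delphi
program InPlaceReplace;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Overwrites every occurrence of Find in Data with Repl and returns the
// number of replacements. Find and Repl must have the same non-zero length,
// so Data never has to grow, shrink, or be copied.
function ReplaceInPlace(var Data: TBytes; const Find, Repl: TBytes): Integer;
var
  I, N: Integer;
begin
  Result := 0;
  N := Length(Find);
  if (N = 0) or (N <> Length(Repl)) then
    Exit;
  I := 0;
  while I <= Length(Data) - N do
  begin
    if CompareMem(@Data[I], @Find[0], N) then
    begin
      Move(Repl[0], Data[I], N); // overwrite the match in place
      Inc(Result);
      Inc(I, N);                 // continue after the replaced span
    end
    else
      Inc(I);
  end;
end;

var
  Text, OldWord, NewWord: string;
  Data: TBytes;
  Count: Integer;
begin
  Text := 'foo bar foo';
  OldWord := 'foo';
  NewWord := 'baz';
  Data := BytesOf(Text);
  Count := ReplaceInPlace(Data, BytesOf(OldWord), BytesOf(NewWord));
  Writeln(Count);          // prints 2
  Writeln(StringOf(Data)); // prints baz bar baz
end.
```

The `StringOf` call at the end is only there to display the result; the replacement itself never converts the data to a string.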
1

If you think that StringOf is just an in-place typecast, you are wrong.
StringOf treats its argument as an array of characters in the default system ANSI code page and converts it to UTF-16. You will certainly find a lot of zero bytes in the resulting string (the upper bytes of the WideChars).
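
A small experiment along these lines can make the copy visible. It is a sketch that assumes a Unicode version of Delphi (2009 or later), where `Char` is two bytes wide; the variable names are made up for the example.

```delphi
program StringOfCopy;

{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  Source, Converted: string;
  Raw: TBytes;
begin
  Source := 'abc';
  Raw := BytesOf(Source);      // 3 bytes in the default ANSI code page
  Converted := StringOf(Raw);  // converted into a separate UTF-16 buffer

  Writeln(Length(Raw));                       // 3 bytes in the array
  Writeln(Length(Converted) * SizeOf(Char));  // 6 bytes in the string

  // Changing the byte array afterwards does not affect the string,
  // which shows that the two do not share memory.
  Raw[0] := Ord('x');
  Writeln(Converted);                         // still abc
end.
```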

kludg