
The overall goal is to use a checksum of part of each file to find duplicate movie and MP3 files. I have to hash only a part of each file and generate the MD5 from that, because the whole file can be up to 25 GB in some cases; if I find candidate duplicates I will then compute a complete MD5, to avoid any mistake of deleting the wrong file. I have no problem generating the MD5 from a stream (that will be done with Indy components), so for the first step I have to copy the first 1 MB of a file.

So I made this function,

but the memory stream is empty for all checks!

function splitFile(FileName: string): TMemoryStream;
var
  fs: TFileStream;
  ms: TMemoryStream;
begin
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  ms := TMemoryStream.Create;
  fs.Position := 0;
  ms.CopyFrom(fs, 1048576);
  Result := ms;
end;

How can I fix this, or where is my problem?

Update 1 (dirty test):

This code raises a 'stream read error'; also, Memo2 shows some text but Memo3 is empty!

function splitFile(FileName: string): TMemoryStream;
var
  fs: TFileStream;
  ms: TMemoryStream;
begin
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  ms := TMemoryStream.Create;
  fs.Position := 0;
  Form1.Memo2.Lines.LoadFromStream(fs);
  ms.CopyFrom(fs, 1048576);
  ms.Position := 0;
  Form1.Memo3.Lines.LoadFromStream(ms);
  Result := ms;
end;

The complete code:

function splitFile(FileName: string): TMemoryStream;
var
  fs: TFileStream;
  ms: TMemoryStream;
  i, BytesToRead: Integer;
begin
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  ms := TMemoryStream.Create;
  fs.Position := 0;
  BytesToRead := Min(fs.Size - fs.Position, 1024 * 1024);
  ms.CopyFrom(fs, BytesToRead);
  Result := ms;
  // fs.Free;
  // ms.Free;
end;

function streamFile(FileName: string): TFileStream;
var
  fs: TFileStream;
begin
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  Result := fs;
end;

function GetFileMD5(const Stream: TStream): string; overload;
var
  MD5: TIdHashMessageDigest5;
begin
  MD5 := TIdHashMessageDigest5.Create;
  try
    Result := MD5.HashStreamAsHex(Stream);
  finally
    MD5.Free;
  end;
end;

function getMd5HashString(value: string): string;
var
    hashMessageDigest5 : TIdHashMessageDigest5;
begin
    hashMessageDigest5 := nil;
    try
        hashMessageDigest5 := TIdHashMessageDigest5.Create;
        Result := IdGlobal.IndyLowerCase ( hashMessageDigest5.HashStringAsHex ( value ) );
    finally
        hashMessageDigest5.Free;
    end;
end;

procedure TForm1.Button1Click(Sender: TObject);
var
  Path, hash: string;
  SR: TSearchRec;
begin
  if od1.Execute then
  begin
    Path := ExtractFileDir(od1.FileName); // Get the path of the selected file
    DirList := TStringList.Create;
    try
      if FindFirst(Path + '\*.*', faArchive, SR) = 0 then
      begin
        repeat
          if (SR.Size > 10240) then
          begin
            hash := GetFileMD5(splitFile(Path + '\' + SR.Name));
          end
          else
          begin
            hash := GetFileMD5(streamFile(Path + '\' + SR.Name));
          end;
          memo1.Lines.Add(hash + ' | ' + SR.Name + ' | ' + IntToStr(SR.Size));
          Application.ProcessMessages;
        until FindNext(SR) <> 0;
        FindClose(SR);
      end;
    finally
      DirList.Free;
    end;
  end;
end;

Output:

D41D8CD98F00B204E9800998ECF8427E | eslahat.docx | 13338
D41D8CD98F00B204E9800998ECF8427E | EXT-3000-Data-Sheet.pdf | 682242
D41D8CD98F00B204E9800998ECF8427E | faktor khate ekhtesasi firoozpoor.pdf | 50091
D41D8CD98F00B204E9800998ECF8427E | FileZilla_3.9.0.5_win32-setup.exe | 6057862
D41D8CD98F00B204E9800998ECF8427E | FileZilla_3.9.0.6_win32-setup.exe | 6126536
11210486C9E54E12DA9DF687792257EA | get_stats_of_all_members_of_mu(1).php | 6227
11210486C9E54E12DA9DF687792257EA | get_stats_of_all_members_of_mu.php | 6227
D41D8CD98F00B204E9800998ECF8427E | GOMAUDIOGLOBALSETUP.EXE | 6855616
D41D8CD98F00B204E9800998ECF8427E | harvester-master(1).zip | 54255
D41D8CD98F00B204E9800998ECF8427E | harvester-master.zip | 54180
peiman F.
  • You're not freeing these memory streams? That's a major, major, major memory leak. – Jerry Dodge Nov 21 '14 at 23:52
  • I cleaned up the code for SO, to avoid misunderstanding... – peiman F. Nov 21 '14 at 23:56
  • Well, I think your code above is even more misunderstood. – Jerry Dodge Nov 22 '14 at 00:00
  • So I wrote what I have to do; I just need to copy the first 1 MB of the file. – peiman F. Nov 22 '14 at 00:01
  • 1024000 is not a MB. 1048576 = 1024**2 is. – David Heffernan Nov 22 '14 at 00:03
  • What is unclear about my question? Is there anybody who can't understand it? – peiman F. Nov 22 '14 at 00:49
  • The reason why Memo2 in your updated code only shows some text seems to be the TMemo itself. It is possible that TMemo stops reading or parsing when it encounters a certain character which probably can't be displayed in it. As for why your Memo3 doesn't contain anything: you need to set fs.Position back to zero before calling ms.CopyFrom, because reading the file stream into Memo2 has left its position at the end of the file. – SilverWarior Nov 22 '14 at 00:52
  • You can see `ms.Position := 0;` before `form1.Memo3.Lines.LoadFromStream(ms);`, thank you. – peiman F. Nov 22 '14 at 00:56
  • If you call ms.SaveToFile after ms.Position := 0, does it contain what you're expecting? – Jason Nov 22 '14 at 01:04
  • As SilverWarior told you, you need to reset fs.Position to 0 AFTER Memo2.Lines.LoadFromStream(fs) and BEFORE ms.CopyFrom(fs, 1048576). – Tom Brunberg Nov 22 '14 at 01:40
  • Based on your updated code and the output it surely seems that there is a problem with your splitFile method. Since I'm not on a development machine now I can't test what is wrong with it. But given that CopyFrom returns the number of copied bytes, I suggest you use that information to see whether the copy operation is successful and how much data is being copied. This way we will rule out possible problems with reading data from files. Once I get to my development machine later today I will test your entire code to see where the problem might reside. – SilverWarior Nov 24 '14 at 08:17
  • Yes, I did some other checks and now I'm sure CopyFrom isn't working. Is the problem related to converting between TFileStream and TMemoryStream? – peiman F. Nov 24 '14 at 13:06
  • Not converting from TFileStream to TMemoryStream, but the TStream parameter not properly accepting the TMemoryStream handle. I have updated my answer with more information and a solution with which you can get rid of the use of TMemoryStreams altogether. – SilverWarior Nov 24 '14 at 19:42
  • @SilverWarior It's working like a charm, thanks so much for your help :) I have a question: you said that in installer files the first part of all files may look alike, so what do you think about cutting a part from the end of the file, or randomly from the middle; for example, for files larger than 10 MB, cutting 20% of the file starting 30% after the beginning? – peiman F. Nov 25 '14 at 09:47
  • @peimanF. That is hard to say, as it actually differs from file to file. Some files (installers, self-extracting archives) can have the same data at the beginning, some other files might have the same data at the end (not so common), and some archive files where the content wasn't compressed but only stored could have the same data in the middle as another file, because in such archives the data from different archived files is simply stitched together. – SilverWarior Nov 25 '14 at 13:54

2 Answers


Here is a procedure that I quickly wrote for you which allows you to read part of a file (a chunk) into a memory stream.

The reason why I made this a procedure and not a function is so that it is possible to reuse the same memory stream for different chunks. This way you avoid all those memory allocations/deallocations and also reduce the chance of introducing a memory leak.

In order to be able to do so you need to pass the memory stream handle to the procedure as a var parameter.

I also added two more parameters: one for specifying the chunk size (the amount of data that you want to read from the file) and one for the chunk number.

I also added some rudimentary safeguards to tell you when you try to read a chunk that is beyond the end of the file, and the ability to automatically reduce the size of the last chunk, since not all file sizes are multiples of your chunk size (in your case, not all files are exactly X megabytes in size where X is a whole number).

procedure readFileChunk(FileName: string; var MS: TMemoryStream; ChunkNo: Integer; ChunkSize: Int64);
var
  fs: TFileStream;
begin
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    if ChunkSize * (ChunkNo - 1) <= fs.Size then
    begin
      fs.Position := ChunkSize * (ChunkNo - 1);
      if fs.Position + ChunkSize <= fs.Size then
        MS.CopyFrom(fs, ChunkSize)
      else
        MS.CopyFrom(fs, fs.Size - fs.Position);
    end
    else
      MessageBox(Form2.WindowHandle, 'File does not have so many chunks', 'WARNING!', MB_OK);
  finally
    fs.Free; // release the file stream even if CopyFrom raises an exception
  end;
end;

You use this procedure by calling:

readFileChunk(FileName,MemoryStream,ChunkNumber,ChunkSize);

Make sure you have already created the memory stream before calling this procedure.
Also, if you want to reuse the same memory stream multiple times, don't forget to set its position to 0 before calling this procedure; otherwise new data will be added to the end of the stream, which keeps increasing the memory stream's size.
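
For example, a minimal usage sketch (HashFirstChunks is a hypothetical wrapper name; it assumes readFileChunk as defined above and a 1 MB chunk size):

uses
  Classes;

procedure HashFirstChunks(const FileName: string);
var
  MS: TMemoryStream;
  ChunkNo: Integer;
begin
  MS := TMemoryStream.Create;       // create the stream once...
  try
    for ChunkNo := 1 to 3 do        // ...and reuse it for several chunks
    begin
      MS.Clear;                     // reset Position and Size before reusing, as noted above
      readFileChunk(FileName, MS, ChunkNo, 1024 * 1024);
      MS.Position := 0;             // rewind again before reading/hashing the chunk
      // e.g. pass MS to an MD5 routine here
    end;
  finally
    MS.Free;                        // one create, one free: no leak
  end;
end;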

UPDATE:

After doing some trials I found out that the problem resides in your GetFileMD5 method.

I can't explain exactly why this is happening, but if you pass a TMemoryStream to the TStream parameter, the parameter simply doesn't accept it, so the MD5 hashing algorithm treats it as an empty handle.
When I changed the parameter type to TMemoryStream the code worked, but then you could no longer pass a TFileStream to the GetFileMD5 method, so it broke the hash generation from entire files that worked before.

SOLUTION:

So after doing some more digging I have GREAT news for you.

You don't even need to use TMemoryStreams. The HashStreamAsHex function accepts two optional parameters which allow you to define the starting point of your data and the size of the data block from which you want to generate the MD5 hash string. And this also works with TFileStream.

So in order to generate an MD5 hash string from just a small part of your file, call this:

MD5.HashStreamAsHex(Stream,StartPosition,DataSize);

StartPosition specifies the initial offset into the stream for the hashing operation. When StartPosition contains a positive non-zero value, the stream position is moved to the specified offset prior to calculating the hash value. When StartPosition contains the value -1, the current position of the stream is used as the initial offset into the specified stream.

DataSize indicates the number of bytes from the stream to include in the hashing operation. When DataSize contains a negative value (<0), the bytes remaining from the current stream position are used for the hashing operation. Otherwise, the number of bytes in DataSize is used. If DataSize is larger than the size of the stream, the smaller of the two values is used for the operation.

In your case, to get the MD5 hash of the first megabyte you would call:

MD5.HashStreamAsHex(Stream,0,1024*1024);

Now I believe you can modify the rest of your code to get this working as you want. If not, tell me where it stopped and I will help you.
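
For reference, here is a minimal sketch of how that could fit together with the file-opening code from the question (PartialFileMD5 is a hypothetical helper name; TIdHashMessageDigest5 lives in Indy's IdHashMessageDigest unit):

uses
  Classes, SysUtils, IdHashMessageDigest;

// Hash only the first 1 MB of a file directly from a TFileStream,
// no TMemoryStream needed.
function PartialFileMD5(const FileName: string): string;
var
  FS: TFileStream;
  MD5: TIdHashMessageDigest5;
begin
  FS := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    MD5 := TIdHashMessageDigest5.Create;
    try
      // 0 = start at the beginning; 1024*1024 = hash at most the first 1 MB
      // (HashStreamAsHex uses the smaller of this and the stream size).
      Result := MD5.HashStreamAsHex(FS, 0, 1024 * 1024);
    finally
      MD5.Free;
    end;
  finally
    FS.Free;
  end;
end;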

SilverWarior
  • Would be even better if the `TFileStream` was already instantiated as well :-) – Jerry Dodge Nov 22 '14 at 02:10
  • @JerryDodge Yes, that would reduce the number of times you open and close the file. – SilverWarior Nov 22 '14 at 03:04
  • Other problems with the code are that it uses a `var` parameter for the memory stream and indeed that it forces the caller to use a specific type of stream. – David Heffernan Nov 22 '14 at 07:52
  • @DavidHeffernan The reason why I used TMemoryStream here is because the OP expressed a desire to read part of a file into a memory stream, because that is where the hashing algorithm he uses expects the data to be. – SilverWarior Nov 22 '14 at 12:26
  • Use TStream and so make the procedure more flexible. And why did you use var? Do you understand the consequences? And what is the point of the chunks? Compare with the single call to CopyFrom. – David Heffernan Nov 22 '14 at 12:31
  • As for the chunks: it is true that the OP only expressed a desire to read the first 1 MB of data into a memory stream. But since he is using this for some hashing algorithm, I assumed that later he might want to read the rest of the file in 1 MB chunks, so I incorporated that functionality right into the method. I also added the ability to pass the chunk size as a parameter instead of hardcoding it into the method. This would come in useful if he decides to use a different hashing algorithm which expects a different-sized input data block. – SilverWarior Nov 22 '14 at 12:34
  • The chunks serve no purpose at all and obscure what's really going on. – David Heffernan Nov 22 '14 at 12:49
  • @DavidHeffernan Maybe, maybe not. It depends on how his hashing algorithm is implemented. It would be best if the hashing algorithm itself could read the data directly from the file, but since we don't know which algorithm he uses, we have no idea of its capabilities and its actual demands. – SilverWarior Nov 22 '14 at 13:06
  • It was mentioned in a comment on another, now deleted, answer. – SilverWarior Nov 22 '14 at 14:11
  • @SilverWarior Your guess is actually right: I have to use this method for an MD5 file algorithm, but not on the whole file; I only use the first 2-3 MB of each file and use the checksum to compare the files. I'm working on the code but I don't know why some different files have the same checksum! – peiman F. Nov 22 '14 at 16:59
  • @peiman There should be no need to load it all into memory; a good hasher will operate on a stream. However, what matters here is the question you asked, which is not related to hashing. – David Heffernan Nov 22 '14 at 18:55
  • @peimanF. Calculating a hash from just part of a file is a bad idea. Why? Because it is possible to have multiple files whose first part is the same while the files as a whole are completely different. For instance, take a look at installers or self-extracting archives. Both of these are comprised of two parts: the first part is the program that extracts the files from the installer or self-extracting executable, and the second part is where the compressed data is stored. So if you calculate a hash from just the first part, you will get the same hash string even though the files are completely different. – SilverWarior Nov 23 '14 at 05:13
  • Anyway, if you are interested in creating MD5 hashes of files in particular, you should just use the hashing capability included in the Indy component pack, as it supports creating an MD5 hash string from entire files and doesn't require you to load particular parts into memory. All you need to do is provide it with a file stream handle. You can read about how to use it here: http://delphi.about.com/od/objectpascalide/a/delphi-md5-hash.htm – SilverWarior Nov 23 '14 at 05:19
  • @DavidHeffernan If you think it's not related, it's because you don't have information about my idea; I also don't understand the -1. Just read my next comment to SilverWarior. – peiman F. Nov 23 '14 at 09:18
  • @SilverWarior I use a part of the file to get a checksum to find duplicated movie and MP3 files. For this I take a part of the file and generate the MD5, because the whole file size is up to 25 GB in some cases. If I find duplicates then I will do a complete MD5 to avoid any mistake and wrong file deletion. Now you can see my roadmap is sound :) – peiman F. Nov 23 '14 at 09:23
  • @SilverWarior Yes, I use the Indy hash-stream functions for that side, but the problem is the captured stream; it seems it is empty and not copied properly! – peiman F. Nov 23 '14 at 09:24
  • @peimanF. OK, I can understand why you are reluctant to calculate the MD5 checksum of a whole 25 GB file, as it would take quite some time. But I don't understand why you also want to calculate checksums from parts of MP3 files. So far the biggest MP3 file I have seen was about 250 MB, and it was a live recording of a concert which lasted almost three hours. – SilverWarior Nov 23 '14 at 13:10
  • @peimanF. Now I seriously hope you are doing some preliminary check to see whether two files could even be the same before you start calculating MD5 checksums. For instance, if you compare the sizes of the two files and they are not the same, then the files are not the same. And since the MD5 hashing algorithm isn't perfect, it can actually generate the same hash string for two different files: not very likely when the two files are the same size, but much more likely when they are of different sizes. – SilverWarior Nov 23 '14 at 13:12
  • And now to your problem directly. You claim that you still have problems with your captured stream being empty. Does this still happen with the code I posted, and does it happen for all files? – SilverWarior Nov 23 '14 at 13:15
  • @SilverWarior Not for all files, but some outputs have the same checksum for different files, and the file types aren't even the same: for example a 12 MB image, an MP3 file and a zero-size file have the same checksum, which can only mean one thing: they are all empty. I'm doing some more checks and will update the topic; also, due to the -1s on this thread I will move my question to the embarcadero.com forum. – peiman F. Nov 23 '14 at 17:08
  • @peimanF. Please update the question with your current code. Make sure to include both the method you use to read part of the file into memory as well as the code that passes that data to the MD5 checksum generator. Also, I strongly recommend you read about how the MD5 checksum algorithm works. This will give you a better understanding of how different files can still have the same checksum. – SilverWarior Nov 23 '14 at 19:05
  • @SilverWarior The code and output have been added. I know about the hash algorithms, I have worked with them in the past; in this project the problem occurs before the hash algorithm. Also, to avoid any mistake, I will change the selected file part to come from the middle or the end of the file!... – peiman F. Nov 23 '14 at 23:18

I'm assuming that your code does not raise an exception. If it did you surely would have mentioned that. I also assume that the file is large enough for your attempted read.

Your code does copy. If the call to CopyFrom does not raise an exception then the memory stream contains the first 1024000 bytes of the file.

However, after the call to CopyFrom, the memory stream's pointer is at the end of the stream so if you read from it you will not be able to read anything. Perhaps you need to move the stream pointer to the beginning:

ms.Position := 0;

And then read from the memory stream.

1MB = 1024*1024, FWIW.


Update

Probably my assumptions above were incorrect. It seems likely that your code raises an exception because you attempt to read beyond the end of the file.

What you really seem to want to do is read as much of the first part of the file as possible. That's a two-liner:

BytesToRead := Min(Source.Size-Source.Position, 1024*1024);
Dest.CopyFrom(Source, BytesToRead);
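
For reference, a minimal sketch of how the two-liner might slot into the splitFile function from the question (Min comes from the Math unit; the caller owns, and must free, the returned TMemoryStream):

uses
  Classes, SysUtils, Math;

function splitFile(const FileName: string): TMemoryStream;
var
  fs: TFileStream;
  BytesToRead: Int64;
begin
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    Result := TMemoryStream.Create;
    try
      // Read at most 1 MB, or the whole file if it is smaller.
      BytesToRead := Min(fs.Size, Int64(1024 * 1024));
      Result.CopyFrom(fs, BytesToRead);
      Result.Position := 0; // rewind so the caller can read/hash from the start
    except
      Result.Free;
      raise;
    end;
  finally
    fs.Free; // the file stream is no longer needed once the copy is done
  end;
end;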
David Heffernan