
I need to read the last line in some very large text files (to get the timestamp from the data). TStringList would be a simple approach, but it returns an out-of-memory error. I'm trying to use Seek and BlockRead, but the characters in the buffer are all nonsense. Is this something to do with Unicode?

    Function TForm1.ReadLastLine2(FileName: String): String;
    var
      FileHandle: File;
      s,line: string;
      ok: 0..1;
      Buf: array[1..8] of Char;
      k: longword;
      i,ReadCount: integer;
    begin
      AssignFile (FileHandle,FileName);
      Reset (FileHandle);           // or for binary files: Reset (FileHandle,1);
      ok := 0;
      k := FileSize (FileHandle);
      Seek (FileHandle, k-1);
      s := '';
      while ok<>1 do begin
        BlockRead (FileHandle, buf, SizeOf(Buf)-1, ReadCount);  //BlockRead ( var FileHandle : File; var Buffer; RecordCount : Integer {; var RecordsRead : Integer} ) ;
        if ord (buf[1]) <>13 then         //Arg to integer
          s := s + buf[1]
        else
          ok := ok + 1;
        k := k-1;
        seek (FileHandle,k);
      end;
      CloseFile (FileHandle);

      // Reverse the order in the line read
      setlength (line,length(s));
      for i:=1 to length(s) do
        line[length(s) - i+1 ] := s[i];
      Result := Line;
    end;

Based on www.delphipages.com/forum/showthread.php?t=102965

The test file is a simple CSV I created in Excel (this is not the 100MB file I ultimately need to read).

    a,b,c,d,e,f,g,h,i,j,blank
    A,B,C,D,E,F,G,H,I,J,blank
    1,2,3,4,5,6,7,8,9,0,blank
    Mary,had,a,little,lamb,His,fleece,was,white,as,snow
    And,everywhere,that,Mary,went,The,lamb,was,sure,to,go
Hamish_Fernsby
    Is this something to do with Unicode? Maybe. Hard to know what encoding your file is. Likely the issue is that you are reading 8 bit text into a UTF-16 buffer. Your code is quite a mess though. Using legacy I/O is not great. `SizeOf(Buf)-1` is bogus. That -1 is just wrong. What if you encounter something other than 8 bit text, admittedly probably not a valid concern for CSV. Some revision of encodings is needed methinks. – David Heffernan Apr 13 '16 at 11:39
    The key is to use `TFileStream` and then `.Seek(-BufSize, soFromEnd)`; `.Read(Buf, BufSize)`, and locate LF in the buffer. – kobik Apr 13 '16 at 12:18
    Reading into that small a buffer is a sure way to make your program super-slow. You should read 4KB at least, better perhaps 8 or 16 KB (I don't recall what the AMD64 virtual page size is). – Arioch 'The Apr 13 '16 at 15:30
  • I decided to follow kobik's advice and learn to use TFileStream. When I have something working I'll upload it here for future searchers. – Hamish_Fernsby Apr 13 '16 at 15:45
  • @SolarBrian there is a question of what a "large file" is. DOS/Windows limited file sizes to 2GB (`High(LongInt)`) and early WinNT to 4GB (`High(Cardinal)`). But today really long files go beyond 4GB, and 32 bits is no longer enough. With `TFileStream` in modern Delphi the most important properties are `Int64` - but not in the old Pascal file API you tried to use there: `BlockRead`, `Seek` - AFAIR they are still 32-bit only. That being said, there are more methods to access files. I suggest you google memory-mapped files and David Heffernan's fast file reader – Arioch 'The Apr 14 '16 at 12:26
  • Other guys make their own implementations of fast readers too. I did not use those approaches for two reasons: 1: you only need to read the file tail, so classes for fast reading of the entire file are not the tool here; 2: you say you work with huge files that overflow the memory of your process, and that makes the MMF approach shaky as well. https://msdn.microsoft.com/en-us/library/ms810613.aspx – Arioch 'The Apr 14 '16 at 12:30
  • However, instead of my jockeying with ever-growing buffers copied back and forth, you could perhaps just MAP the very tail of your file into memory and scan that instead of a buffer. There are MMF classes for Delphi on torry.net and on Stack Overflow and everywhere. You can also use the Windows API directly. My code does not need extra non-standard libraries, and by using dynamic arrays instead of stack variables and pointers it is, I hope, somewhat safe against access violations. That is what I *hope*. The MMF approach would make you use pointers, but would instantly reduce my complex what-if logic – Arioch 'The Apr 14 '16 at 12:33
  • Spend some of your time and google for cons and pros of MMF approach (there are both) then try to reimplement my code using MMF with varying Windows offsets and sizes. You would instantly remove the whole "small file" special case (array-oriented `ReadLastLine` - would be removed), for example, you would only have to have pointers-oriented `FindLastLine` and MMF-oriented `ReadLastLine`. MMF would hopefully be a handy tool for some situations... – Arioch 'The Apr 14 '16 at 12:36

4 Answers

5

You really have to read the file in LARGE chunks from the tail to the head. Since it is so large it does not fit in memory, reading it line by line from start to end would be very slow - and with ReadLn, twice as slow.

You also have to be ready for the last line either ending with an EOL or not.

Personally I would also account for three possible EOL sequences:

  • CR/LF aka #13#10=^M^J - DOS/Windows style
  • CR without LF - just #13=^M - Classic MacOS file
  • LF without CR - just #10=^J - UNIX style, including Mac OS X (MacOS version 10 and later)

If you are sure your CSV files would only ever be generated by native Windows programs, it would be safe to assume the full CR/LF is used. But if they can come from Java programs, non-Windows platforms or mobile apps, I would be less sure. Of course, a pure CR without LF would be the least probable case of them all.

uses System.IOUtils, System.Math, System.Classes;

type FileChar = AnsiChar; FileString = AnsiString; // for non-Unicode files
// type FileChar = WideChar; FileString = UnicodeString;// for UTF16 and UCS-2 files
const FileCharSize = SizeOf(FileChar);
// somewhere later in the code add: Assert(FileCharSize = SizeOf(FileString[1]));

function ReadLastLine(const FileName: String): FileString; overload; forward;

const PageSize = 4*1024; 
// the minimal read atom of most modern HDD and the memory allocation atom of Win32
// since the chances your file would have lines longer than 4Kb are very small - I would not increase it to several atoms.

function ReadLastLine(const Lines: TStringDynArray): FileString; overload;
var i: integer;
begin
  Result := '';
  i := High(Lines);
  if i < Low(Lines) then exit; // empty array - empty file

  Result := Lines[i];
  if Result > '' then exit; // we got the line

  Dec(i); // skip the empty ghost line, in case last line was CRLF-terminated
  if i < Low(Lines) then exit; // that ghost was the only line in the empty file
  Result := Lines[i];
end;

// scan for EOLs in not-yet-scanned part
function FindLastLine(buffer: TArray<FileChar>; const OldRead : Integer; 
     const LastChunk: Boolean; out Line: FileString): boolean;
var i, tailCRLF: integer; c: FileChar;
begin
  Result := False;
  if Length(Buffer) = 0 then exit;

  i := High(Buffer);    
  tailCRLF := 0; // test for trailing CR/LF
  if Buffer[i] = ^J then begin // LF - single, or after CR
     Dec(i);
     Inc(tailCRLF);
  end;
  if (i >= Low(Buffer)) and (Buffer[i] = ^M) then begin // CR, alone or before LF
     Inc(tailCRLF);
  end;

  i := High(Buffer) - Max(OldRead, tailCRLF);
  if i - Low(Buffer) < 0 then exit; // no new data to read - results would be like before

  if OldRead > 0 then Inc(i); // the CR/LF pair could be sliced between new and previous buffer - so need to start a bit earlier

  for i := i downto Low(Buffer) do begin
      c := Buffer[i];
      if (c=^J) or (c=^M) then begin // found EOL
         SetString( Line, @Buffer[i+1], High(Buffer) - tailCRLF - i);
         exit(True); 
      end;
  end;  

  // we did not find non-terminating EOL in the buffer (except maybe trailing),
  // now we should ask for more file content, if there is still left any
  // or take the entire file (without trailing EOL if any)

  if LastChunk then begin
     SetString( Line, @Buffer[ Low(Buffer) ], Length(Buffer) - tailCRLF);
     Result := true;
  end;
end;


function ReadLastLine(const FileName: String): FileString; overload;
var Buffer, tmp: TArray<FileChar>; 
    // dynamic arrays - eases memory management and protect from stack corruption
    FS: TFileStream; FSize, NewPos: Int64; 
    OldRead, NewLen : Integer; EndOfFile: boolean;
begin
  Result := '';
  FS := TFile.OpenRead(FileName);
  try
    FSize := FS.Size;
    if FSize <= PageSize then begin // small file, we can be lazy!
       FreeAndNil(FS);  // free the handle and avoid double-free in finally
       Result := ReadLastLine( TFile.ReadAllLines( FileName, TEncoding.ANSI )); 
          // or TEncoding.UTF16
          // warning - TFile is not share-aware, if the file is being written to by another app
       exit;
    end;

    SetLength( Buffer, PageSize div FileCharSize);
    OldRead := 0;
    repeat
      NewPos := FSize - Length(Buffer)*FileCharSize;
      EndOfFile := NewPos <= 0;
      if NewPos < 0 then NewPos := 0; 
      FS.Position := NewPos;

      FS.ReadBuffer( Buffer[Low(Buffer)], (Length(Buffer) - OldRead)*FileCharSize);

      if FindLastLine(Buffer, OldRead, EndOfFile, Result) then 
         exit; // done !

      tmp := Buffer; Buffer := nil; // flip-flop: preparing to broaden our mouth

      OldRead := Length(tmp); // need not to re-scan the tail again and again when expanding our scanning range
      NewLen := Min( 2*Length(tmp), FSize div FileCharSize );

      SetLength(Buffer, NewLen); // this may trigger EOutOfMemory...
      Move( tmp[Low(tmp)], Buffer[High(Buffer)-OldRead+1], OldRead*FileCharSize);
      tmp := nil; // free old buffer
    until EndOfFile;
  finally
    FS.Free;
  end;
end;

PS. Note one extra special case - if you use Unicode chars (the two-byte ones) and get an odd-length file (3 bytes, 5 bytes, etc.) - you would never be able to scan the leading single byte (half a widechar). Maybe you should add an extra guard there, like Assert( 0 = FS.Size mod FileCharSize )

PPS. As a rule of thumb, you had better keep those functions out of the form class - because WHY mix them? In general you should separate concerns into small blocks. Reading a file has nothing to do with user interaction - so it is better offloaded into an extra UNIT. Then you would be able to use the functions from that unit in one form or in 10 forms, in the main thread or in a multi-threaded application. Like LEGO parts - they give you flexibility by being small and separate.

PPPS. Another approach here would be using memory-mapped files. Google for MMF implementations for Delphi and articles about benefits and problems with the MMF approach. Personally I think rewriting the code above to use MMF would greatly simplify it, removing several "special cases" and the troublesome memory-copying flip-flop. OTOH it would demand you to be very strict with pointer arithmetic.

Arioch 'The
  • A small but important oversight: In `FindLastLine()` you want to set `i` before usage. – Tom Brunberg Apr 14 '16 at 10:33
  • @TomBrunberg absolutely true! Initially that part was set before the for-loop; then I realized it does not make sense there after `OldRead` got applied, so I copy-pasted it above - but forgot to add the initializer. My bad. Thanks for alerting me. – Arioch 'The Apr 14 '16 at 12:13
  • @SolarBrian - you really have to give it some testing though; I was drafting it in Notepad, so I did no real tests. Generate a few test files - with the last line ended by CRLF and with the last line ending directly at the end of the file, with the last line less than 4 KB, between 4 and 8 KB, above 8 KB; make a file consisting of 1 single long line of varying lengths (2 KB, 6 KB, 10 KB...), or of several empty lines... - test the functions on all the possible corner cases. Because I did not - I just gave you a general skeleton out of my head. If it works in the simplest case - that is not enough warranty! – Arioch 'The Apr 14 '16 at 12:19
1

Your Char type is two bytes, so that buffer is 16 bytes. Then with BlockRead you read SizeOf(buffer)-1 bytes into it, and check whether the first 2-byte char is equal to #13.

The SizeOf(buffer)-1 is dodgy (where does that -1 come from?), and the rest is valid, but only if your input file is UTF-16.

Also, you read 8 (or 16) characters each time, but compare only one and then do a Seek again. That is not very logical either.

If your encoding is not UTF-16, I suggest you change the type of a buffer element to AnsiChar and remove the -1.

Marco van de Voort
  • Ah, I get it now, CSV is by definition NOT unicode, changing the buffer filetype to ansichar and removing the -1 works. I think it's nearly working now, the only thing I don't get now is why trying to read 's' in s := s + buf[1] causes an access violation error, it seems that the 's' variable becomes inaccessible after the blockread command. – Hamish_Fernsby Apr 13 '16 at 12:16
  • This is really weird, the location where the access violation error occurs depends on the size of Buf: array[1..8] of Char; the larger the array the further down the code before it throws the error. Some kind of heap corruption? – Hamish_Fernsby Apr 13 '16 at 12:20
  • "CSV is by definition NOT unicode" That may be literally true, but there's nothing to stop an application writing a CSV file using Unicode characters. – MartynA Apr 13 '16 at 17:00
0

Just thought of a new solution.

Again, there could be better ones, but this one is the best I have thought of.

function GetLastLine(textFilePath: string): string;
var
  list: tstringlist;
begin
  list := tstringlist.Create;
  try
    list.LoadFromFile(textFilePath);
    result := list[list.Count-1];
  finally
     list.free;
  end;
end;
  • But if the current line isn't the last one in the file, SeekEoln will go to the end of that line but not the end of the file? – MartynA Apr 13 '16 at 11:22
  • `Add IOUtils to the uses clause.` - why? You use old Wirth's Pascal file functions, not the DotNet clones. The types and functions you use are declared in the `System` unit, not in `IOUtils` – Arioch 'The Apr 13 '16 at 15:26
  • Sorry, I should have mentioned that I already tried ReadLn from the start of the file but it was too slow for my large files – Hamish_Fernsby Apr 13 '16 at 15:47
  • @SolarBrian you should not have needed to mention it - it is pretty obvious, because your file is at the very least 1GB in size. And ReadLn is very slow anyway - just like BlockRead with an 8-byte "buffer" :-) – Arioch 'The Apr 13 '16 at 15:47
0

In response to kobik's suggestion, I figured out how to do it with TFileStream. It works OK with the simple test file, though there may be some further tweaks needed when I use it on a variety of CSV files. Also, I make no claims that this is the most efficient method.

    procedure TForm1.Button6Click(Sender: TObject);
    Var
      StreamSize, ApproxNumRows : Integer;
      TempStr : String;
    begin
      if OpenDialog1.Execute then begin
        TempStr := ReadLastLineOfTextFile(OpenDialog1.FileName,StreamSize, ApproxNumRows);
    //    TempStr := ReadFileStream('c:\temp\CSVTestFile.csv');
        ShowMessage ('approximately '+ IntToStr(ApproxNumRows)+' Rows');
        ListBox1.Items.Add(TempStr);
      end;
    end;

      Function TForm1.ReadLastLineOfTextFile(const FileName: String; var StreamSize, ApproxNumRows : Integer): String;
        const
          MAXLINELENGTH = 256;
        var
          Stream: TFileStream;
          BlockSize,CharCount : integer;
          Hash13Found : Boolean;
          Buffer : array [0..MAXLINELENGTH] of AnsiChar;
        begin
          Hash13Found := False;
          Result :='';
          Stream      := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
          StreamSize := Stream.size;

          if StreamSize < MAXLINELENGTH then
            BlockSize := StreamSize
          Else
            BlockSize := MAXLINELENGTH;

        //  for CharCount := 0 to Length(Buffer)-1 do begin
        //    Buffer[CharCount] := #0;                         // zeroing the buffer can aid diagnostics
        //  end;

          CharCount := 0;
          Repeat
            Stream.Seek(-(CharCount+3), 2);         //+3 misses out the #0,#10,#13 at the end of the file
            Stream.Read( Buffer[CharCount], 1);
            Result := String(Buffer[CharCount]) + result;
            if Buffer[CharCount] =#13 then
              Hash13Found := True;
            Inc(CharCount);
          Until Hash13Found OR (CharCount = BlockSize);

          ShowMessage(Result);
          ApproxNumRows := Round(StreamSize / CharCount);
        end;
Hamish_Fernsby
    `Stream.Read( Buffer[CharCount], 1);` - that is not just slow, that is a typical novice error. How many bytes did that function read? It may read any number of bytes between 0 and the Count parameter, so you should check how many bytes were actually read. You set Count=1, so the function can read 0 or 1 byte - but I fail to see where you check for that. – Arioch 'The Apr 14 '16 at 10:29