
I'm dealing with large text files (bigger than 100MB). I need the total number of lines as fast as possible. I'm currently using the code below (update: added try-finally):

var
  SR: TStreamReader;
  totallines: int64;
  str: string;
begin
  SR:=TStreamReader.Create(myfilename, TEncoding.UTF8);
  try
    totallines:=0;
    while not SR.EndOfStream do
    begin
      str:=SR.ReadLine;
      inc(totallines);
    end;
  finally
    SR.Free;
  end;
end;

Is there any faster way to get totallines?

Xel Naga
  • You MUST use a `try..finally` block to protect your object. But to answer your question: probably the fastest way is to read it as a binary file and then iterate over its bytes and count the number of CRLF sequences you find. Your code above is slow because you not only count the lines, but also extract them as strings. – Andreas Rejbrand Dec 14 '20 at 10:06
  • Bigger than 100MB isn't really saying a lot. How you approach this will depend heavily on *just how much bigger* than 100MB you're talking about. How you manage a 1GB file is much different than a 100GB file. – J... Dec 14 '20 at 11:09
  • Let's say 10GB is the maximum. The funny thing is that Task Manager shows heavy CPU usage but low disk usage while the code above is executing. – Xel Naga Dec 14 '20 at 11:16
  • Yes, because you are allocating memory to copy every single line of the file into a string while you count. – J... Dec 14 '20 at 11:31
  • re counting CR/LF: Is it safe to assume each of these occurrences is a line end? Or can they be part of some Unicode sequence? (I don't know, that's why I'm asking.) If the latter, he actually needs to decode the data to strings in order to count the lines. – dummzeuch Dec 14 '20 at 13:27
  • @dummzeuch: The OP is using UTF-8, so if you find a byte 10, you know it is a LF. Similarly for CR. See the table at https://en.wikipedia.org/wiki/UTF-8. – Andreas Rejbrand Dec 14 '20 at 13:31
  • @dummzeuch Yes, it can be part of "some Unicode sequence" - UTF-16 and UTF-32. But not UTF-8. – AmigoJack Dec 14 '20 at 18:48
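
To make the point in the last two comments concrete: in UTF-8, every byte of a multi-byte sequence (lead bytes and continuation bytes alike) has its high bit set, so a byte with value 10 (LF) or 13 (CR) can only ever be a genuine line-break character. A minimal sketch of this property (hypothetical helper names, not from the discussion above):

function IsUtf8MultiByte(B: Byte): Boolean;
begin
  { Lead bytes of multi-byte sequences are 11xxxxxx and continuation
    bytes are 10xxxxxx - both have the high bit set. }
  Result := (B and $80) <> 0;
end;

function IsLineBreakByte(B: Byte): Boolean;
begin
  { Bytes below $80 are always standalone ASCII characters in UTF-8,
    so 10 (LF) and 13 (CR) can be counted without decoding. }
  Result := (B = 10) or (B = 13);
end;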

2 Answers

Program LineCount;

{$APPTYPE CONSOLE}
{$WEAKLINKRTTI ON}
{$RTTI EXPLICIT METHODS([]) PROPERTIES([]) FIELDS([])}
{$SetPEFlags 1}

{ Compile with XE8 or above... }

USES
  SysUtils,
  BufferedFileStream; { not in the RTL; provides TReadOnlyCachedFileStream }

VAR
  LineCnt: Int64;
  Ch: AnsiChar; { a single byte; Char is 2 bytes wide in Unicode Delphi }
  BFS: TReadOnlyCachedFileStream;

function Komma(const S: string; const C: Char = ','): string;
{ About 4 times faster than Comma... }
var
  I: Integer; // loops through separator position
begin
  Result := S;
  I := Length(S) - 2;
  while I > 1 do
  begin
    Insert(C, Result, I);
    I := I - 3;
  end;
end; {Komma}

BEGIN
  writeln('LineCount - Copyright (C) 2020 by Walter L. Chester.');
  writeln('Counts lines in the given textfile.');
  if ParamCount <> 1 then
    begin
      writeln('USAGE:  LineCount <filename>');
      writeln;
      writeln('No file size limit!  Counts lines: takes 4 minutes on a 16GB file.');
      Halt;
    end;
  if not FileExists(ParamStr(1)) then
    begin
      writeln('File not found!');
      halt;
    end;
  writeln('Counting lines in file...');
  BFS := TReadOnlyCachedFileStream.Create(ParamStr(1), fmOpenRead);
  try
    LineCnt := 0;
    while BFS.Read(Ch, 1) = 1 do
      begin
        if Ch = #13 then
          begin
            Inc(LineCnt);
            { Print a progress dot once per million lines. }
            if (LineCnt mod 1000000) = 0 then
              write('.');
          end;
      end;
    writeln;
    writeln('Total Lines: ' + Komma(LineCnt.ToString));
  finally
    BFS.Free;
  end;
END.
Walterc
  • Thank you very much! Interestingly Int64 didn't work, so I changed it to Integer. Also, I found out Delphi has a built-in TBufferedFileStream class which works slightly better. Anyway, your answer is correct and opened my horizons. – Xel Naga Dec 16 '20 at 19:22
  • This is the code for the entire program. Seems to work just fine and is quite fast. – Walterc Dec 19 '20 at 15:47
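
For reference, a minimal sketch of the variant mentioned in the comment above, using the RTL's TBufferedFileStream (available in System.Classes since Delphi 10.1 Berlin); the function name and the 1 MB buffer size are illustrative choices, not from the answer:

uses
  System.SysUtils, System.Classes;

function CountLines(const AFileName: string): Int64;
var
  FS: TBufferedFileStream;
  B: Byte;
begin
  Result := 0;
  { TBufferedFileStream buffers reads internally, so reading one byte
    at a time stays reasonably fast. The 1 MB buffer is arbitrary. }
  FS := TBufferedFileStream.Create(AFileName,
    fmOpenRead or fmShareDenyWrite, 1024 * 1024);
  try
    while FS.Read(B, 1) = 1 do
      if B = 13 then { count CR characters, as in the answer above }
        Inc(Result);
  finally
    FS.Free;
  end;
end;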

The answer is simply: no. Your algorithm is already the fastest possible, but the implementation isn't. You must read the whole file and count the lines, at least as long as the lines are not of a fixed size.

How you read the file has a major impact on overall performance. Read the file block by block into a binary buffer (an array of bytes) that is as large as practical, count the line endings in that buffer, and then reuse the same buffer for the next block (see the sketch after the comments below).

fpiette
  • The answer is in fact yes. It can be quicker than this code because stream reader is slow. – David Heffernan Dec 14 '20 at 13:24
  • I would simply allocate a fixed buffer, or memory-map the file, and scan the raw file data in chunks counting the LF bytes, and not even bother with decoding the UTF-8. – Remy Lebeau Dec 14 '20 at 18:32
  • @RemyLebeau actually LF is by far not enough, see [Unicode line terminators](https://en.wikipedia.org/wiki/Newline#Unicode). Even without UTF-8 context I'd also expect the [old Mac newline of CR alone](https://en.wikipedia.org/wiki/Newline#Representation). – AmigoJack Dec 14 '20 at 18:55
@AmigoJack the OP is using `TStreamReader.ReadLine()` which supports only bare-CR, bare-LF, and CRLF line breaks. Not that hard to handle all three when scanning for line breaks manually. It is very rare that the other Unicode line breaks are used in real-world data nowadays. But if you needed to handle them all, it really isn't that hard. – Remy Lebeau Dec 14 '20 at 19:04
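
A minimal sketch of the block-by-block approach this answer describes, combined with counting bare LF bytes as suggested in the comments (the helper name and the 1 MB block size are illustrative assumptions, not from the answer):

uses
  System.SysUtils, System.Classes;

function CountLinesBuffered(const AFileName: string): Int64;
const
  BlockSize = 1024 * 1024; { 1 MB per block - an arbitrary choice }
var
  FS: TFileStream;
  Buf: TBytes;
  BytesRead, I: Integer;
begin
  Result := 0;
  SetLength(Buf, BlockSize);
  FS := TFileStream.Create(AFileName, fmOpenRead or fmShareDenyWrite);
  try
    repeat
      { Read the next block into the same reusable buffer. }
      BytesRead := FS.Read(Buf[0], BlockSize);
      for I := 0 to BytesRead - 1 do
        if Buf[I] = 10 then { LF - safe to count without decoding UTF-8 }
          Inc(Result);
    until BytesRead = 0;
  finally
    FS.Free;
  end;
end;

Note that a file whose last line is not terminated by a line break will count one line fewer than TStreamReader.ReadLine reports; add one if the final byte of the file is not an LF.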