-4

Looking at the sample implementation of wc.c: when counting lines, it loops through the file one character at a time, accumulating the number of '\n' characters seen:

#define COUNT(c)       \
      ccount++;        \
      if ((c) == '\n') \
        lcount++;
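For context, a counting loop built around such a macro might look like this minimal sketch (not the actual wc.c source; `count_stream` is a hypothetical name):

```c
#include <stdio.h>

/* Read one character at a time, counting every character and
   every '\n', just as the COUNT macro above does. */
static void count_stream(FILE *fp, long *ccount, long *lcount) {
    int c;
    *ccount = 0;
    *lcount = 0;
    while ((c = fgetc(fp)) != EOF) {
        (*ccount)++;
        if (c == '\n')
            (*lcount)++;
    }
}
```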
  • Is there a way to just seek the file for '\n' and keep jumping to the newline characters and do a count?

  • Would seeking for '\n' be the same as just reading characters one at a time until we see '\n' and count it?

463035818_is_not_an_ai
alvas
    you already asked this, and the answer is no, otherwise wc would do it – Neil Butterworth Nov 04 '22 at 10:48
    How would "seek" know where the '\n' are if not looking for them ? – Fareanor Nov 04 '22 at 10:50
  • How can you possibly jump from one '\n' to another '\n' without knowing how much to jump? In other words without knowing what is in between. – Jason Nov 04 '22 at 10:50
    What makes you think that `\n` is special? Would you ask the same question if the task was to count occurences of the letter `a` ? – 463035818_is_not_an_ai Nov 04 '22 at 11:00
  • Split the question up because the other was "not good" since it mix the `wc` question and the seek question, so I deleted that and created this question. – alvas Nov 04 '22 at 11:06
    Unfortunately file is not represented as some multidimensional structure and `\n` is just another character. All the algorithms (known to me) counting the number of occurrences of element in an array have linear complexity. E.g. https://en.cppreference.com/w/cpp/algorithm/count – pptaszni Nov 04 '22 at 12:43
    In the end it is always a tradeof between memory usage and speed. Assuming you are on a CPU with avx512. You could map the whole file in memory and then divide into as many memory segments as you have cores. Make sure the divisions align at 512bits. Then spin up a thread for each core (and give it a thread affinity to a specific core, to utilize caching optimally, (MIMD). And then vectorize the search for '\n' on each thread so can use avx512 to check 64bytes in parallel (SIMD). And then you probably still have to profile to optimize. – Pepijn Kramer Nov 09 '22 at 14:47
    Anyway any algorithm would still be O(n) – Pepijn Kramer Nov 09 '22 at 15:02
  • In the title, you search for a `"string"` (containing only a `\n`) but in the question it's the actual character `'\n'`. When searching for a single character you'll actually have to go through each and every character and look at it. If you search for strings (with length > 1) it can be done smarter. – Ted Lyngmo Nov 09 '22 at 16:35

4 Answers

5

Well, only one character value is '\n'; every other one is not, so you still have to look at each of them. A branchless algorithm is likely to be faster, though.
Have you tried std::count?

#include <string>
#include <algorithm>

int main() {
  const auto s = std::string("Hello, World!\nfoo\nbar\nbaz");
  const auto lines_in_s = std::count(s.cbegin(), s.cend(), '\n');
  return static_cast<int>(lines_in_s); // exit status = number of newlines (3 here)
}

Compiler Explorer

Or with a file:

#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main() {
    if (std::ifstream is("filename.txt"); is) {
        const auto lines_in_file =
            std::count(std::istreambuf_iterator<char>(is),
                       std::istreambuf_iterator<char>{}, '\n');

        std::cout << lines_in_file << '\n';
    }
}

Compiler Explorer

Ted Lyngmo
viraltaco_
  • I agree with the idiomatic solution (except it can probably be done quicker with a memory mapped file and using execution policies) - but what do you mean by "branch-less"? `std::count` isn't branch-less afaik. – Ted Lyngmo Nov 11 '22 at 16:04
  • `std::count` doesn't have to be branchless. I don't believe it has to branch, though. An example of a branchless algorithm is simply `lcount += int(c == '\n');`. Note that this is still O(n) in time. The idea is not to reduce the time complexity of the algorithm but to remove the possibility of a branch miss on some CPU architectures (notably x86), where a branch miss *could* drastically impact throughput. It's impossible to tell whether this would improve performance without running benchmarks, however. See: [branch predictor](https://en.wikipedia.org/wiki/Branch_predictor) – viraltaco_ Nov 18 '22 at 10:38
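The branchless form described in the comment above can be sketched like this (the function name is illustrative):

```c
#include <stddef.h>

/* Branchless counting: the comparison yields 0 or 1, which is added
   unconditionally, so the loop body contains no conditional jump. */
static size_t count_newlines_branchless(const char *buf, size_t len) {
    size_t lcount = 0;
    for (size_t i = 0; i < len; ++i)
        lcount += (buf[i] == '\n');
    return lcount;
}
```

Still O(n), as the comment notes; whether it beats the branching version depends on the compiler (which may auto-vectorize either form) and has to be benchmarked.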
3

The only way you could skip looking at every character would be if you had domain knowledge about the string you're currently looking at:

If you knew that you're handling a text with continuous paragraphs of at least 50 words or so, you could, after each '\n', advance by 100 or 200 chars, thus saving some time. You'd need to test and refine that jump length, of course, but then you wouldn't need to check every single char.

For a general-purpose counting function, you're stuck with looking at every single char.
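One safe way to make that jump heuristic concrete, assuming every line is known to contain at least `min_len` characters before its '\n' (the function name and parameter are illustrative):

```c
#include <stddef.h>
#include <string.h>

/* After finding a '\n', the next one cannot occur for at least
   min_len more characters, so that span can be skipped entirely.
   Undercounts if the minimum-line-length assumption is violated. */
static size_t count_lines_skipping(const char *buf, size_t len, size_t min_len) {
    size_t count = 0, i = 0;
    while (i < len) {
        const char *p = memchr(buf + i, '\n', len - i);
        if (p == NULL)
            break;
        count++;
        i = (size_t)(p - buf) + 1 + min_len; /* resume past the shortest possible next line */
    }
    return count;
}
```

With `min_len` of 0 this degenerates into an ordinary scan, so the saving is proportional to how long the shortest line is.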

Christian Severin
0

Q: Is there a faster way to count the number of lines in a file than reading one character at a time?
A: The quick answer is no, but one can parallelize the counting, which might shorten the runtime; the program would still have to run through every byte once. Such a program may be I/O-bound, so how useful parallelization is depends on the hardware involved.
Q: Is there a way to skip from one newline character to the next without having to read through all the bytes in between?
A: The quick answer is no. But if one had a really large text file, for example, one could build an 'index' file of offsets. One would still have to make one pass over the file to generate the index, but once it was made, one could find the nth line by reading the nth offset in the index and then seeking to it. The index would have to be maintained or regenerated every time the file changed. With fixed-width offsets, one can compute the position of the nth entry with simple arithmetic, read the offset stored there, and then seek to the corresponding position in the file. A line count can be obtained while generating the index; after that, the count can be derived quickly from the size of the index file if it is ever needed again.
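A minimal in-memory sketch of such an index (a real version would write fixed-width offsets to a separate index file; the names are illustrative and error handling is mostly omitted):

```c
#include <stdio.h>
#include <stdlib.h>

/* One pass over the file records the byte offset at which each line
   starts. If the file ends with '\n', the final entry points at EOF. */
static long *build_line_index(FILE *fp, size_t *nlines) {
    size_t cap = 16, n = 0;
    long *idx = malloc(cap * sizeof *idx);
    long pos = 0;
    int c;
    if (idx == NULL)
        return NULL;
    idx[n++] = 0; /* the first line starts at offset 0 */
    while ((c = fgetc(fp)) != EOF) {
        pos++;
        if (c == '\n') {
            if (n == cap) {
                cap *= 2;
                idx = realloc(idx, cap * sizeof *idx); /* realloc failure unhandled in this sketch */
            }
            idx[n++] = pos; /* the next line starts right after the '\n' */
        }
    }
    *nlines = n;
    return idx;
}

/* With the index built, jumping to line `lineno` is a single seek. */
static int seek_to_line(FILE *fp, const long *idx, size_t nlines, size_t lineno) {
    if (lineno >= nlines)
        return -1;
    return fseek(fp, idx[lineno], SEEK_SET);
}
```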

It should probably be mentioned that the number of lines in a text file might not match the number of '\n' bytes because of multi-byte character encodings. To count lines in general, one needs to scan the file character by character rather than byte by byte, and for that one needs to know which character encoding is in use.

Simon Goater
-1

You can use the strchr function to "jump" to the next '\n' in a string, and it will be faster on some platforms, because strchr is usually implemented in assembly language and uses processor instructions that can scan memory faster, where such instructions are available. Something like this:

#include <string.h>

unsigned int count_newlines(const char *str) {
   unsigned result = 0;
   const char *s = str;
   while ((s = strchr(s, '\n')) != NULL) {
      ++result; // found one '\n'
      ++s; // and start searching again from the next character
   }
   return result;
} 
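One caveat: strchr stops at the first '\0', so this only works on NUL-terminated text. For a buffer of known length (such as a block read from a file), memchr is the analogous call and is typically optimized the same way. A sketch (the function name is illustrative):

```c
#include <stddef.h>
#include <string.h>

/* memchr-based variant: takes an explicit length, so it also works
   on buffers that contain embedded '\0' bytes. */
static size_t count_newlines_mem(const char *buf, size_t len) {
    size_t result = 0;
    const char *end = buf + len;
    const char *p;
    while (buf < end && (p = memchr(buf, '\n', (size_t)(end - buf))) != NULL) {
        ++result;
        buf = p + 1; /* continue just past the newline we found */
    }
    return result;
}
```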
Konstantin Svintsov
  • _"it will be faster on some platforms"_ -- how could it possibly be faster on any platform? Do you have any references to back up that claim? I don't see how such a thing could be possible at all – Human-Compiler Nov 15 '22 at 22:34
  • strchr, memchr, etc. are written in assembly language and optimized for the machine architecture. It will still be O(n), of course. – Konstantin Svintsov Nov 16 '22 at 05:23