
For a C assignment, I am supposed to break up the words in a large text file and process them one by one. A word is any contiguous sequence of alphabetic characters. Since this will be the bottleneck of my program, I want to make this process as fast as possible.

My idea is to scan words from the file into a string buffer using the scanf family's `%[A-Za-z]` format specifier. If the buffer fills up, I check whether there are more letters in the file (based on where the file pointer is). If there are, I grow the buffer and keep copying letters into it until I hit a non-letter.

My question is whether to use fscanf directly on the file, or to read the whole file into a string and use sscanf. Is one faster than the other, or is there a better alternative to my idea?
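
A rough sketch of this idea with a fixed-size buffer (the file name and `process_word()` are placeholders, and the buffer-growing part is omitted, so over-long words simply get split):

```c
#include <stdio.h>

/* Placeholder for whatever per-word processing the assignment requires. */
static void process_word(const char *word) { (void)word; }

int main(void) {
    FILE *fp = fopen("input.txt", "r");   /* hypothetical input file name */
    if (!fp) { perror("fopen"); return 1; }

    char word[64];                        /* fixed size; the growing logic is omitted */
    for (;;) {
        int r = fscanf(fp, "%63[A-Za-z]", word);   /* read one run of letters */
        if (r == 1)
            process_word(word);           /* words longer than 63 letters get split */
        else if (r == EOF)
            break;
        if (fscanf(fp, "%*[^A-Za-z]") == EOF)      /* discard the following non-letters */
            break;
    }
    fclose(fp);
    return 0;
}
```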

1729
  • Is the data in the file in a fixed format? – ameyCU Oct 21 '15 at 04:05
  • @ameyCU It is just a text file. I don't know what fixed format means. – 1729 Oct 21 '15 at 04:07
  • Most of the time will be lost in reading data from the disk, not in processing the information (if you really want, fgets() or fgetc() are faster, but that is not where your bottleneck is). Take into consideration the possibility of using a ramdisk, if it is usable. – Heto Oct 21 '15 at 04:10
  • @Heto If I use fgets or fgetc, I will need to process each character (check whether it is a letter). I am assuming the fscanf or sscanf implementation would be better at giving me strings of letters. – 1729 Oct 21 '15 at 04:14
  • Assume away…but until you've done the measurements to prove it, you're guessing. It is unlikely that there will be a significant difference in the performance using `fgets()` and then splitting the line, or using `getc()` (there might be a measurable difference if you insist on using `fgetc()`). But the disk I/O time will still dominate and it is unlikely you'll be able to spot the difference. – Jonathan Leffler Oct 21 '15 at 04:18
  • Stack Overflow is not here to do your homework for you. Post the code you have so far and how its output differs from what you are expecting. We can examine your code and make suggestions on possible speed-ups. – user3629249 Oct 21 '15 at 04:20
  • Yes, if you use something you have to pay for it. But as I said, the bottleneck is in reading the file from the disk; that is where you spend >90% of your time. As for using fscanf or sscanf: it doesn't really matter. Personally, I read the whole file into memory, but not to gain performance; this way I can make it clear in the program: "here I read data" and "here I process data". – Heto Oct 21 '15 at 04:21
  • `fixed format` means each line in the file contains the same fields (usually each field starts in a known column; the contents/width of each field could vary). If the starting column is not fixed for each field, then it could be a fixed series of fields separated by a delimiter, like the /etc/passwd file in Linux. – user3629249 Oct 21 '15 at 04:29
  • Please post your code; otherwise, we are just guessing, and guessing (or opinion) is not what Stack Overflow is for. – user3629249 Oct 21 '15 at 04:31
  • In the time it took to write this question, you could have written it both ways and timed it. And then you'd know for sure. – Caleb Oct 21 '15 at 04:41
  • I've found that on a Linux system with a magnetic hard drive, the fastest way to get at the data in a file is to `mmap()` it. – Throw Away Account Oct 21 '15 at 04:44
  • Can you share your conclusions with us? What did you write? What performance did you measure? – chqrlie Oct 22 '15 at 09:22
  • I used fgetc to get one character at a time. Then I checked whether the character is a letter and added it to a buffer if it is. – 1729 Oct 25 '15 at 17:12
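
A minimal sketch of that character-at-a-time approach (the buffer size and `process_word()` are placeholders, and very long words are simply truncated here):

```c
#include <ctype.h>
#include <stdio.h>

void process_word(const char *word);    /* assumed to exist elsewhere */

void scan_words(FILE *fp) {
    char buf[256];                      /* placeholder size; real code may grow it */
    size_t len = 0;
    int c;
    while ((c = fgetc(fp)) != EOF) {
        if (isalpha((unsigned char)c)) {
            if (len + 1 < sizeof buf)   /* leave room for the terminator */
                buf[len++] = (char)c;
        } else if (len > 0) {           /* a non-letter ends the current word */
            buf[len] = '\0';
            process_word(buf);
            len = 0;
        }
    }
    if (len > 0) {                      /* flush a word that ends at end of file */
        buf[len] = '\0';
        process_word(buf);
    }
}
```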

2 Answers


Your question is almost off-topic because it calls for opinion-based answers.

The only way to know how fast one method will be compared to another is to try both and measure performance of the resulting executables on real data.

With today's computing power available in regular PCs, it will take a very large file to measure actual performance differences.

So go ahead and implement your ideas. You seem to have a good understanding of potential performance bottlenecks; turn these ideas into actual C code. Providing two different but correct programs for this problem, along with a performance analysis, should get you an A+. As an employer, I value such an approach in a test.
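
For example, a minimal timing harness along these lines could drive the comparison (using POSIX clock_gettime(); the two count_words_* functions and the file name are stand-ins for your two real implementations):

```c
#include <stdio.h>
#include <time.h>

/* Stand-ins for the two real implementations; replace the bodies with the
 * actual fscanf-based and sscanf-based word scanners. */
static void count_words_fscanf(const char *path) { (void)path; }
static void count_words_sscanf(const char *path) { (void)path; }

static double seconds(struct timespec a, struct timespec b) {
    return (double)(b.tv_sec - a.tv_sec) + (double)(b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(int argc, char **argv) {
    const char *path = argc > 1 ? argv[1] : "big.txt";  /* placeholder file name */
    struct timespec t0, t1, t2;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    count_words_fscanf(path);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    count_words_sscanf(path);
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("fscanf variant: %.3f s\n", seconds(t0, t1));
    printf("sscanf variant: %.3f s\n", seconds(t1, t2));
    return 0;
}
```

Run it on the same large input several times and compare runs, keeping in mind that after the first run the file is likely to be sitting in the OS file cache (see the PS below).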

PS: IMHO most of the time will be spent getting the data from the file system. If the file is larger than available memory, that should be your bottleneck. If the file fits in the operating system's file system cache, subsequent benchmark runs should give you much better performance than the first...

If you are allowed to write system-specific code, try using mmap and simple for loops with explicit tests via lookup tables over the mmapped char array.
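
A sketch of that system-specific approach (POSIX mmap, assuming an ASCII-compatible character set; `process_word()` is a placeholder for the actual per-word processing):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void process_word(const char *start, size_t len);   /* assumed to exist elsewhere */

int scan_file(const char *path) {
    /* Lookup table: is_letter[c] is nonzero for 'A'-'Z' and 'a'-'z' (ASCII assumed). */
    static unsigned char is_letter[256];
    for (int c = 'A'; c <= 'Z'; c++) is_letter[c] = 1;
    for (int c = 'a'; c <= 'z'; c++) is_letter[c] = 1;

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }    /* nothing to scan */

    const char *data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); close(fd); return -1; }

    size_t i = 0, n = (size_t)st.st_size;
    while (i < n) {
        while (i < n && !is_letter[(unsigned char)data[i]]) i++;  /* skip non-letters */
        size_t start = i;
        while (i < n && is_letter[(unsigned char)data[i]]) i++;   /* span one word */
        if (i > start)
            process_word(data + start, i - start);   /* not NUL-terminated */
    }

    munmap((void *)data, (size_t)st.st_size);
    close(fd);
    return 0;
}
```

Since the mapping is read-only, process_word() here receives a pointer and a length rather than a NUL-terminated string.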

chqrlie
  • Actually, history shows, some years on, that it takes a mere 10MiB of JSON to show that some C library implementations of `sscanf()` are quadratic with respect to the length of the string, and others are not. https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/ https://news.ycombinator.com/item?id=26297612 https://github.com/biojppm/rapidyaml/issues/40 – JdeBP Mar 01 '21 at 14:22
  • @JdeBP: amazing find! Using `sscanf()` iteratively on this long string is a bit lame, but for `sscanf()` to call `strlen()` to figure the length of the string because the common routine shared with `fscanf()` needs some buffer size is inefficient and risky. I am not even sure if the `sscanf()` input string must be null terminated if enough bytes are available to perform all required conversions. Such behavior is definitely not expected for `strtol()`. Your workaround is not fully correct as the string could be modified between successive calls to `strlen()`, changing the actual length. – chqrlie Mar 01 '21 at 20:10
  • @JdeBP: to be precise, `sscanf()` itself does not have quadratic time complexity, but unnecessary linear time complexity with respect to the string length, which produces quadratic times when used repeatedly to parse the long JSON string. – chqrlie Mar 01 '21 at 20:14
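
For context, the pattern that triggers this behaviour looks roughly like the following sketch: each `sscanf()` call may scan the whole remaining string (for example via an internal `strlen()` in some C libraries), so a loop of individually cheap calls becomes quadratic in the input size. The integer-parsing format and the `parse_all()` name here are just for illustration:

```c
#include <stdio.h>

/* Parse a long buffer of whitespace-separated integers by calling sscanf()
 * repeatedly on the remainder of the string. If the C library scans the whole
 * tail on every call (e.g. an internal strlen()), the total cost grows
 * quadratically with the buffer length, even though each call looks cheap. */
void parse_all(const char *buf) {
    size_t off = 0;
    int value, used;
    while (sscanf(buf + off, "%d %n", &value, &used) == 1) {
        off += (size_t)used;    /* advance past what was just consumed */
        /* ... use value ... */
    }
}
```

The usual fix is to advance through the string yourself, for example with strtol() and its end-pointer argument, so that each byte is scanned only once.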

As Heto points out in the comments, the main bottleneck here is probably going to be reading the file from disk, not whichever scanf function variant you decide to use.

If you really want to speed up your application, you should try to build a pipeline. As you're describing the application now, you'd basically be working in two phases: reading the file into a buffer, and parsing words from the buffer.

Here's what the activity might look like if you decide to read the whole file into a string, and then use sscanf on the string:

reading: ████████████████
parsing:                 ████████████████

You get something a little different if you use fscanf directly on the file, since you're constantly switching between reading and parsing:

reading: █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █
parsing:  █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █

In both cases, you end up taking about the same amount of time.

However, if you can do your file I/O asynchronously, then you can overlap the time spent waiting for data from the disk with the time spent computing. Ideally, you'd end up with something like this:

reading: ████████████████
parsing:  ████████████████

My diagrams might not be that accurate (we already pointed out that parsing should take much less time than the I/O, so the two bars really shouldn't be the same length), but you should get the general idea. If you can set up a pipeline where data is read in asynchronously from the processing, then you can get a big speedup by overlapping the communication (reads from disk) and computation (parsing).

You could achieve an asynchronous pipeline like this using POSIX asynchronous I/O (aio), or just by doing a simple producer/consumer setup with two threads (where one reads from the file and the other does the parsing).
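
Here is a minimal sketch of the two-thread producer/consumer variant using POSIX threads (compile with -pthread). The chunk size, input file name, and `parse_chunk()` are placeholders, there is only a single hand-off slot, and a real version would also have to handle words that straddle chunk boundaries:

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define CHUNK 65536                     /* placeholder chunk size */

/* One-slot hand-off between the reader thread and the parser thread. */
static struct {
    pthread_mutex_t lock;
    pthread_cond_t  ready;              /* signalled when a chunk is available */
    pthread_cond_t  empty;              /* signalled when the slot is free again */
    char   data[CHUNK];
    size_t len;                         /* bytes in data; 0 means the slot is free */
    int    done;                        /* reader reached end of file */
} slot = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER,
           PTHREAD_COND_INITIALIZER, {0}, 0, 0 };

static void parse_chunk(const char *p, size_t n) { (void)p; (void)n; }  /* placeholder */

static void *reader(void *arg) {        /* producer: file -> slot */
    FILE *fp = arg;
    char buf[CHUNK];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        pthread_mutex_lock(&slot.lock);
        while (slot.len != 0)           /* wait until the parser took the last chunk */
            pthread_cond_wait(&slot.empty, &slot.lock);
        memcpy(slot.data, buf, n);
        slot.len = n;
        pthread_cond_signal(&slot.ready);
        pthread_mutex_unlock(&slot.lock);
    }
    pthread_mutex_lock(&slot.lock);
    slot.done = 1;
    pthread_cond_signal(&slot.ready);
    pthread_mutex_unlock(&slot.lock);
    return NULL;
}

static void *parser(void *arg) {        /* consumer: slot -> parse_chunk() */
    (void)arg;
    char local[CHUNK];
    for (;;) {
        pthread_mutex_lock(&slot.lock);
        while (slot.len == 0 && !slot.done)
            pthread_cond_wait(&slot.ready, &slot.lock);
        if (slot.len == 0 && slot.done) {   /* no more data is coming */
            pthread_mutex_unlock(&slot.lock);
            return NULL;
        }
        size_t n = slot.len;
        memcpy(local, slot.data, n);    /* take the chunk out of the slot */
        slot.len = 0;
        pthread_cond_signal(&slot.empty);
        pthread_mutex_unlock(&slot.lock);
        parse_chunk(local, n);          /* parse while the reader fetches the next chunk */
    }
}

int main(void) {
    FILE *fp = fopen("input.txt", "r"); /* hypothetical input file name */
    if (!fp) { perror("fopen"); return 1; }
    pthread_t r, p;
    pthread_create(&r, NULL, reader, fp);
    pthread_create(&p, NULL, parser, NULL);
    pthread_join(r, NULL);
    pthread_join(p, NULL);
    fclose(fp);
    return 0;
}
```

With a single slot the reader and parser still alternate to some extent; adding a second buffer or a small queue lets them overlap more fully, which is the picture in the last diagram above.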


Honestly though, unless you're processing massive text files, you're probably barely even going to be able to measure the difference in speed among any of the possible approaches you might choose...

This pipelining approach is more applicable when you're doing something more compute intensive (not just scanning characters), and your communication delay is higher (like when the data is coming over the network instead of from a local disk). However, it would still be a good exercise to explore the different options. After all, the assignment is contrived anyway—the point is to learn something useful that you might be able to use in a real project sometime later, right?


On a separate note, using any of the scanf functions will probably be slower than just looping over your buffers yourself to extract runs of [A-Za-z] characters. This is because, with any of the scanf functions, the code first has to parse your format string to figure out what you're looking for, and then actually parse the input. Sometimes compilers can do smart things (like how gcc usually changes a printf with no format specifiers into a puts), but I don't think there are optimizations like that for scanf and friends, especially if you're using something special like %[A-Za-z] instead of a standard format specifier like %d.

DaoWen
  • The whole async idea would be relevant if the reading/parsing were interleaved as you suggest it is. It isn't, on modern operating systems. The OS will read ahead for you. – MSalters Oct 21 '15 at 08:35
  • @MSalters - Could you elaborate? I think the *aio* suggestion might be a bit off (apparently it isn't all that helpful when your disk reads/writes are buffered, which I think might be what you're getting at)—but I would think the threading suggestion could potentially speed things up. If one thread is reading a 1GB file from disk into string buffers, and another thread is processing the string buffers, then (assuming multiple hardware threads) I would think that would be faster. – DaoWen Oct 21 '15 at 13:10
  • That would be slower, in fact. A modern OS will spot the read pattern and read bytes from disk before you actually request them; these end up in the file cache. As a result, your read operation completes immediately. A separate read thread would just incur thread synchronization overhead. – MSalters Oct 21 '15 at 14:46