
I am using C to parse a large flat file and output relevant lines into an output file. The output file should be around 70,000 lines.

If I open the file in gedit, it displays exactly as expected, with the correct number of lines and line lengths.

However, running wc -l <file> returns 13,156. So does grep -c "" <file>.

tail <file> returns the last 10 lines that I see in gedit. head <file> returns the first 10 lines. But tail -n +8000 <file> | head -n 1, which should return the 8,000th line, returns the text that I see on line 34,804 in gedit.

I'd expect these results if I was missing newline characters in the file. But gedit doesn't seem to have a problem with it. Additionally, wc -L <file>, which displays the maximum line length, returns 142 bytes, as expected. The size of the file is a little over 9,000,000 bytes, as also expected.

If wc -L <file> = 142, and wc -c <file> = 9046609, then how can wc -l <file> = 13156?

Does anyone know what I did wrong when writing to this file?

Fred Olsen

1 Answer


It's probably some odd combination of return ('\r') and linefeed ('\n') characters.

Assuming you have the GNU Coreutils version of "tr", you can use these commands to count the number of each character in the file:

tr -d -c '\n' < FILE | wc -c

tr -d -c '\r' < FILE | wc -c

For a normal Unix-style text file, the second command should print 0. For a Windows-style text file, both should print the same number.
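If "tr" isn't handy, or you'd rather run the same check from inside the C program, the two counts can be computed directly. This is only a minimal sketch: `struct eol_counts`, `count_eol_bytes`, and `count_eol_stream` are hypothetical names made up for illustration, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper type: tallies of the two line-ending bytes. */
struct eol_counts {
    size_t lf; /* number of '\n' bytes */
    size_t cr; /* number of '\r' bytes */
};

/* Count '\n' and '\r' bytes in a memory buffer of the given length. */
struct eol_counts count_eol_bytes(const char *buf, size_t len)
{
    struct eol_counts c = {0, 0};

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\n')
            c.lf++;
        else if (buf[i] == '\r')
            c.cr++;
    }
    return c;
}

/* Count '\n' and '\r' bytes in an open stream, reading it in chunks.
 * The stream should be opened in binary mode ("rb") so that no
 * line-ending translation happens before the bytes are counted. */
struct eol_counts count_eol_stream(FILE *fp)
{
    struct eol_counts total = {0, 0};
    char chunk[4096];
    size_t n;

    while ((n = fread(chunk, 1, sizeof chunk, fp)) > 0) {
        struct eol_counts c = count_eol_bytes(chunk, n);
        total.lf += c.lf;
        total.cr += c.cr;
    }
    return total;
}
```

To use it, open the file with fopen(path, "rb") and pass the stream to count_eol_stream. The lf count should match what wc -l reports; a cr count well above the lf count would suggest lines that end in a bare '\r'.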

The "file" command will also probably tell you something useful.

Keith Thompson
  • Yes, the issue turned out to be with the LF and CR symbols. The lines of the source files ended with CR LF. However, I made an error with my read() calls and was truncating certain lines after the CR. Therefore, not all the new lines were being read in the output. Gedit was smart enough to display it properly, and apparently the max line length option in wc stops after the CR. But there were only 13,156 LFs in the file. – Fred Olsen Jul 23 '11 at 20:33
  • @Keith: FWIW, some Mac files have '\r' as line separators. – Rudy Velthuis Jul 23 '11 at 20:47
  • @Rudy: Yes, you're right. As I recall, MacOS prior to MacOS X used '\r'; MacOS X is Unix-based, so it uses '\n'. A number of other text file formats are possible, including fixed-width records, but you're unlikely to run into something like that unless you're using an old mainframe. – Keith Thompson Jul 23 '11 at 21:01
  • @Keith: the fixed size records variety sounds antique, indeed. I've never seen it in 30 years. – Rudy Velthuis Jul 23 '11 at 21:09
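
To illustrate the CR LF problem described in the first comment above: one way to avoid losing line feeds when copying lines is to strip whatever line ending was read and always write exactly one '\n'. The sketch below assumes each line has already been read into a NUL-terminated buffer (for example with fgets). `chomp_eol` and `write_line_lf` are hypothetical names, and this is not the asker's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: remove one trailing line ending ("\r\n", "\n",
 * or "\r") from a NUL-terminated line. Returns the new length. */
size_t chomp_eol(char *line)
{
    assert(line != NULL);

    size_t len = strlen(line);
    if (len > 0 && line[len - 1] == '\n')
        len--;
    if (len > 0 && line[len - 1] == '\r')
        len--;
    line[len] = '\0';
    return len;
}

/* Hypothetical helper: write the line followed by a single '\n',
 * regardless of which line ending it arrived with.
 * Returns 0 on success, -1 on a write error. */
int write_line_lf(FILE *out, char *line)
{
    chomp_eol(line);
    if (fputs(line, out) == EOF)
        return -1;
    if (fputc('\n', out) == EOF)
        return -1;
    return 0;
}
```

Written this way, every output line ends in exactly one LF, so wc -l and gedit should agree on the line count.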