106

I would like to print the number of characters in each line of a text file using a Unix command. I know it is simple with PowerShell:

gc abc.txt | % {$_.length}

but I need a Unix command.

jaypal singh
vikas368

5 Answers

196

Use Awk.

awk '{ print length }' abc.txt
Cory Klein
Fred Foo
  • This is several orders of magnitude faster than applying wc -c to each line! – aerijman Jul 17 '18 at 06:00
  • @aerijman for this type of problems the number of process creations is typically what makes the most performance difference. – MarcH Dec 04 '18 at 21:39
  • If a line in the file contains emojis this will not produce the expected length. – user5507535 Apr 08 '19 at 08:48
  • @user5507535, it depends on which “length” you actually expect. There are many possible definitions for Unicode (mawk uses bytes, didn't check gawk). – Jan Hudec Apr 08 '19 at 13:49
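
To see what the comments above are getting at, here is a small check (a sketch, assuming GNU awk and a UTF-8 locale; mawk counts bytes regardless of locale):

$ printf 'héllo\n' > utf8.txt               # 'é' is two bytes in UTF-8
$ awk '{ print length }' utf8.txt           # gawk counts characters in a UTF-8 locale
5
$ LC_ALL=C awk '{ print length }' utf8.txt  # forcing the C locale counts bytes
6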
19
while IFS= read -r line; do echo ${#line}; done < abc.txt

It is POSIX, so it should work everywhere.

Edit: Added -r as suggested by William.

Edit: Beware of Unicode handling. Bash and zsh, with a correctly set locale, will show the number of code points, but dash will show bytes, so you have to check what your shell does. And then there are many other possible definitions of length in Unicode anyway, so it depends on what you actually want.

Edit: Prefix with IFS= to avoid losing leading and trailing spaces.

Jan Hudec
  • +1, but...this will fail if the input contains '\'. Use read -r – William Pursell Jan 09 '12 at 13:27
  • If a line in the file contains emojis this will not produce the expected length. – user5507535 Apr 08 '19 at 08:48
  • @user5507535, actually, it depends on what “length” you expect. There are many possible definitions for Unicode (but in this case, different shells will actually do different thing). – Jan Hudec Apr 08 '19 at 13:46
  • Always set `IFS=` on the `read` command when wanting to read in arbitrary data. So `IFS= read -r`. `read` uses the `IFS` to do word splitting, and even though all the split words then get pasted back together into the one available variable (`line`), there is no guarantee that they get pasted back together with all the original separator characters they had or just one potentially different ones. For example, with the default IFS, the line `foo bar` could become `foo bar`, losing 7 spaces. (Like how Stack Overflow lost the adjacent spaces in that example string in this comment). – mtraceur Dec 30 '19 at 07:11
  • @mtraceur, the documentation explicitly says that “remaining words and their intervening delimiters are assigned to the last name,” so they do get pasted back together with the original separator. That, however, does not take care of the *leading* and *trailing* delimiters, which are indeed lost. So you are right, `IFS` should be set, but the problem when it isn't is more subtle. – Jan Hudec Dec 30 '19 at 12:43
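
A quick way to see the code point vs. byte difference mentioned above (a sketch, assuming a UTF-8 locale and that both bash and dash are installed; dash is not multibyte-aware, so ${#line} gives bytes there):

$ printf 'héllo\n' > utf8.txt
$ bash -c 'IFS= read -r line; echo "${#line}"' < utf8.txt    # code points
5
$ dash -c 'IFS= read -r line; echo "${#line}"' < utf8.txt    # bytes
6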
4

Here is an example using xargs:

$ xargs -d '\n' -I% sh -c 'echo % | wc -c' < file
kenorb
  • This "echo %" doesn't handle unsafe characters that need quoting from the shell. Additionally "xargs" is going to be splitting your file by spaces and newlines, not just newlines as the original poster requested. – bovine Mar 06 '15 at 23:15
4

I've tried the other answers listed above, but they are very far from decent solutions when dealing with large files, especially once a single line occupies more than about a quarter of the available RAM.

Both bash and awk slurp the entire line, even though for this problem it's not needed. Bash will error out once a line is too long, even if you have enough memory.

I've implemented an extremely simple, fairly unoptimized Python script that, when tested with large files (~4 GB per line), doesn't slurp and is by far a better solution than those given.

If this is time-critical production code, you can rewrite the ideas in C or better optimize the read call (instead of reading only a single byte at a time), after testing that this is indeed a bottleneck.

The code assumes the line terminator is a linefeed character, which is a good assumption on Unix, but YMMV on macOS/Windows. Make sure the file ends with a linefeed so the last line's character count isn't overlooked.

from sys import stdin, exit

# Read one byte at a time so that no whole line is ever held in memory.
counter = 0
while True:
    byte = stdin.buffer.read(1)
    counter += 1
    if not byte:           # EOF
        exit()
    if byte == b'\x0a':    # linefeed: report the line length without the newline
        print(counter - 1)
        counter = 0
Samuel Liew
  • The question was for a "text" file. I don't think 4GB per line fits any reasonable definition of a text file. – MarcH Nov 27 '18 at 06:13
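
To run the script above (a usage sketch; the file name linelen.py is just an assumed name, and Python 3 is required for stdin.buffer):

$ python3 linelen.py < abc.txt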
1

Try this:

while read line
do
    echo -e "$line" | wc -m
done < abc.txt
Rahul
  • You meant `echo -n "$line" | wc -m`, didn't you? It's useless use of commands; shell can count characters in a variable. Plus `echo -e` is totally incompatible and works in half of the shells, while starting with some escape sequence works in some others and nothing in the rest. – Jan Hudec Jan 09 '12 at 13:46
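
A quick illustration of the issues the comment points out: the echo adds a trailing newline that wc -m counts, while ${#line} gives the raw length without starting any extra process (a sketch):

$ printf 'hello\n' > demo.txt
$ while read line; do echo -e "$line" | wc -m; done < demo.txt
6
$ while IFS= read -r line; do echo "${#line}"; done < demo.txt
5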