Linux: counting spaces and other characters in file

Question

Problem:

I need to match an exact format for a mailing machine software program. It expects a certain format. I can count the number of new lines, carriage returns, tabs ...etc. using tools like

cat -vte

and

od -c

and

wc -l ( or wc -c )

However, I'd like to know the exact number of leading and trailing spaces between characters and sections of text. Tabs as well.

Question:

How would you go about analyzing and then matching a template exactly using common unix tools + perl or python? One-liners preferred. Also, what's your advice for matching a DOS-encoded file? Would you translate it to *NIX first and then analyze, or leave it as is?

UPDATE

Using this to see individual spaces [ assumes no '%' chars in file ]:

sed 's/ /%/g' filename.000

Plan to build a script that analyzes each line's tab and space content.

Using @shiplu's solution with a nod to the anti-cat crowd:

while read l;do echo $l;echo $((`echo $l |  wc -c` - `echo $l | tr -d ' ' | wc -c`));done<filename.000

Still needs some tweaks for Windows but it's well on its way.

SAMPLE TEXT

Key for reading:

newlines marked with \n

Carriage returns marked with \r

Unknown space/tab characters marked with [:space:] ( need counts on those )

\r\n
\n
[:space:]Institution Anon LLC\r\n
[:space:]123 Blankety St\r\n
[:space:]Greater Abyss, AK  99999\r\n
\n
\n
[:space:]                                10/27/2011\r\n
[:space:]Requested materials are available for pickup:\r\n
[:space:]e__\r[:space:]                     D_ \r[:space:]   _O\r\n
[:space:]Bathtime for BonZo[:space:]       45454545454545[:space:]  10/27/2011\r\n
[:space:]Bathtime for BonZo[:space:]       45454545454545[:space:]  10/27/2011\r\n
\n
\n
\n
\n
\n
\n
[:space:]                             Pantz McManliss\r\n
[:space:]                             Gibberish Ave\r\n
[:space:]                             Northern Mirkwood, ME  99999\r\n
( untold variable amounts of \n chars go here )
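
Because the sample mixes \r\n, bare \n, and bare \r, a first step could be to tally each kind of terminator without modifying the file. The following is a minimal Python sketch, not part of the original post; the function name tally_line_endings and the abbreviated sample bytes are made up for illustration.

```python
import re
from collections import Counter


def tally_line_endings(data: bytes) -> Counter:
    """Count CRLF pairs, bare LF, and bare CR in raw bytes."""
    # Alternation order matters: \r\n is tried first, so a CRLF pair is
    # counted once instead of as one bare CR plus one bare LF.
    return Counter(m.group() for m in re.finditer(rb"\r\n|\r|\n", data))


# Abbreviated, invented stand-in for the sample text in the question.
sample = b"\r\n\n Institution Anon LLC\r\n e__\r   D_ \r   _O\r\n\n"
counts = tally_line_endings(sample)
print("CRLF:   ", counts[b"\r\n"])
print("bare LF:", counts[b"\n"])
print("bare CR:", counts[b"\r"])
```

For a real file, the bytes could be read with open("filename.000", "rb").read(). Binary mode matters here, because Python's text mode would translate the line endings before they can be counted.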

UPDATE 2

Using IFS with read gives similar results to the ruby posted by someone below.

while IFS='' read -r line
 do 
     printf "%s\n" "$line" | sed 's/ /%/g' | grep -o '%' | wc -w
 done < filename.000
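
The same "make it visible, then count" idea can be sketched in Python. This is only an illustration: show_whitespace is a hypothetical helper, and like the sed version it assumes the file contains no '%' characters.

```python
import io


def show_whitespace(line: bytes) -> str:
    """Render one raw line with spaces, tabs, CR and LF made visible."""
    # Assumption carried over from the sed approach: '%' never occurs in the data.
    marks = {0x20: "%", 0x09: "\\t", 0x0D: "\\r", 0x0A: "\\n"}
    # Iterating over bytes yields integers, so the dict is keyed by byte value.
    return "".join(marks.get(b, chr(b)) for b in line)


# Invented sample standing in for filename.000.
sample = io.BytesIO(b" 123 Blankety St\r\n\n \tx \r\n")
for raw in sample:                      # splits on LF only; CR stays in the line
    visible = show_whitespace(raw)
    print(visible, "spaces:", visible.count("%"))
```

Reading in binary mode keeps every CR in place, so the DOS endings show up as \r\n in the output instead of being silently translated.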

Yes. I think I'm on the right track. Make the space and tab characters visible, unique ...then count. I think I see someone below with the same idea. — Bubnoff, Dec 30 '11 at 20:27
The number of tabs and spaces added or separate? Do you have 4-5 demolines + output? — user unknown, Dec 30 '11 at 20:43
In other words, do you want to count all whitespace (without the line endings -\n or \r\n-- is that what you mean by DOS encoding?), per line, and write this number per line? — inger, Dec 30 '11 at 22:44
Correct. I'm not showing any tab characters, but numerous spaces. The ruby achieves this as does the newest while version above. Not sure why the ruby and bash yield slightly different results, but it's nearly time to start testing with a printer. Thanks everyone! — Bubnoff, Dec 30 '11 at 22:49
@inger Because of all the newline characters I'm not sure what format this is in. Some lines with DOS endings "\r" some with "\n" and others with both. See my key-code for the sample text above. Currently, both the Ruby and the newest bash loop seem to work ...so not sure what to make of that. — Bubnoff, Dec 30 '11 at 22:52
Thanks for the clarification. It may be useful to see the expected output for your sample. See my reply below. — inger, Dec 30 '11 at 23:25
Either echo the line then put the number of spaces below, or, just the number of spaces is fine. Your solutions work great ...output = good. I just wonder why all three solutions produce slightly different results. Ruby and bash produce the same results on some lines but not others ...same with perl. Weird. — Bubnoff, Dec 31 '11 at 00:13

ikegami · Answer 1 · 2011-12-31T00:48:03.300

5

perl -nlE'say 0+( () = /\s/g );'

Unlike the currently accepted answer, this doesn't split the input into fields, discarding the result. It also doesn't needlessly create an array just to count the number of values in a list.

Idioms used:

0+( ... ) imposes scalar context like scalar( ... ), but it's clearer because it tells the reader a number is expected.
List assignment in scalar context returns the number of elements returned by its RHS, so 0+( () = /.../g ) gives the number of times () = /.../g matched.
-l, when used with -n, will cause the input to be "chomped", so this removes line feeds from the count.

If you're just interested in spaces (U+0020) and tabs (U+0009), the following is faster and simpler:

perl -nE'say tr/ \t//;'

In both cases, you can pass the input via STDIN or via a file named by an argument.
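
For comparison, a rough Python counterpart of the tr/ \t// approach might look like the sketch below. It is not from the answer itself; count_blanks and blanks_per_line are hypothetical names, and the sample bytes are invented.

```python
import io


def count_blanks(line: bytes) -> int:
    """Number of space (0x20) and tab (0x09) bytes in one line."""
    return line.count(b" ") + line.count(b"\t")


def blanks_per_line(stream) -> list:
    """One count per line; CR and LF bytes are never counted."""
    # Iterating a binary stream splits on LF only and keeps the terminator,
    # which is harmless here because only spaces and tabs are counted.
    return [count_blanks(line) for line in stream]


sample = io.BytesIO(b" Institution Anon LLC\r\n\n \tx\r\n")
for n in blanks_per_line(sample):
    print(n)
```

To run it against a file, the stream could be opened with open(path, "rb"), which avoids any newline translation.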

edited Dec 31 '11 at 00:48

answered Dec 31 '11 at 00:42

ikegami


Nice! Works great. Still wonder why the results differ by one or two depending on solution used. Within the scope of this project, it doesn't matter much, but still ...a curious side note. – Bubnoff Dec 31 '11 at 00:51
@Bubnoff, I suspect some include CR and/or LF in the count. Mine will include CR. Why do your lines have CR in unix? Start by fixing your files using `dos2unix`. – ikegami Dec 31 '11 at 00:53
That sounds like a reasonable explanation. The printer expects DOS ( or whatever it's in ) and I don't want to do any surgery to the file if at all possible. I don't want to count the CR or the LF. So actually, the bash loop may be the most accurate since it traps only the spaces? There are no tabs in this file that I can find, I think it's just spaces. – Bubnoff Dec 31 '11 at 01:00
@Bubnoff, Replacing `\s` with `[ \t]` will do the trick, as well as using the `tr/ \t//` solution, since those specifically look for spaces and tabs, not whitespace in general. – ikegami Dec 31 '11 at 01:03
perl -nE'say tr/ \t//;' produces the same results as the bash loop. – Bubnoff Dec 31 '11 at 01:06
re: \s vs [ \t]. Facepalm. D'oh ...I knew that. Thanks! – Bubnoff Dec 31 '11 at 01:14

score 4 · Answer 2 · answered Dec 30 '11 at 19:59


Regular expressions in Perl or Python would be the way to go here.

Yes, it may take an initial time investment to learn "perl, schmerl, zwerl" but once you've gained experience with an extremely powerful tool like Regular Expressions, it can save you an enormous amount of time down the road.

[image: XKCD comic]

answered Dec 30 '11 at 19:59

Jonathon Reinhart



I know regular expressions. Looking for a few one-liners. Should've clarified that at beginning. Thanks for the XKCD though. – Bubnoff Dec 30 '11 at 20:22

score 2 · Answer 3 · answered Dec 30 '11 at 20:06


counting blanks:

sed 's/[^ ]//g' FILE | tr -d "\n" | wc -c

This counts blanks before, after, and between text. Do you want to count newlines, tabs, etc. in the same pass and sum them up, or as a separate step?

answered Dec 30 '11 at 20:06

user unknown


This is very close ...where I was heading [ see update ]. But needs to count spaces or tabs per line. Your tr bit is an interesting solution. – Bubnoff Dec 30 '11 at 20:30

score 2 · Accepted Answer · answered Dec 31 '11 at 01:04


perl -nwE 'print; for my $s (/([\t ]+)/g) { say "Count: ", length $s }' input.txt

This will count individual groups of tab or space, instead of counting all the whitespace in the entire line. For example:

    foo        bar

Will print

    foo        bar
Count: 4
Count: 8

You may wish to skip single spaces (spaces between words). I.e. don't count the spaces in Bathtime for BonZo. If so, replace + with {2,} or whatever minimum you think is appropriate.
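
A Python sketch of the same per-run idea follows, with the minimum run length as a parameter. It is an illustration rather than a translation of the one-liner: blank_runs is a hypothetical helper, and reporting the start column of each run is an addition not present in the Perl version.

```python
import re


def blank_runs(line: str, min_len: int = 1) -> list:
    """Return (start_column, length) for each run of spaces/tabs of at least min_len."""
    pattern = re.compile(r"[ \t]{%d,}" % min_len)
    return [(m.start(), len(m.group())) for m in pattern.finditer(line)]


print(blank_runs("    foo        bar"))

# Built from pieces so the run lengths are explicit; min_len=2 skips the
# single spaces between words.
title_line = "Bathtime for BonZo" + " " * 7 + "45454545454545" + "  10/27/2011"
print(blank_runs(title_line, min_len=2))
```

Columns are zero-based here, so a run reported as (18, 7) starts right after the 18th character.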

answered Dec 31 '11 at 01:04

TLP


Rock unt Roll ( umlauts all around ). I need to do some testing, but I think this may be the best one yet! – Bubnoff Dec 31 '11 at 02:28
@Bubnoff If I knew exactly what you were trying to do, I could probably provide a better answer. But I suppose you can tailor this to suit your needs. – TLP Dec 31 '11 at 03:20
We are moving to a new system that produces a different format than my sample text above. I need to get the new system to match the above format -- I'm analyzing the expected format in hopes of matching it in the new system. I may have to start a new question as to the best way to do this now that the analysis is nearly complete. – Bubnoff Dec 31 '11 at 17:25

Shiplu Mokaddim · Answer 5 · 2011-12-30T22:29:00.017

1

If you want to count the number of spaces in pm.txt, this command will do,

 cat pm.txt | while read l; 
 do echo $((`echo $l |  wc -c` - `echo $l | tr -d ' ' | wc -c`));
 done;

If you want to count the number of spaces, \r, \n, \t use this,

cat pm.txt | while read l;
do echo $((`echo $l |  wc -c` - `echo $l | tr -d ' \r\n\t' | wc -c`));
done;

read will strip any leading whitespace. If you don't want that, there is a nasty way. First split your file so that each file contains only one line, using

`split -l 1 -d pm.txt`.

After that there will be a bunch of x* files. Now loop through them.

for x in x*; do echo $((`cat $x |  wc -c` - `cat $x | tr -d ' \r\n\t' | wc -c`)); done;

Remove those files afterwards with rm x*;
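
If splitting the file feels too heavy, the stripping problem can also be sidestepped by reading raw bytes. Below is a minimal Python sketch under that assumption; edge_blanks is a hypothetical helper that reports the leading and trailing space/tab counts the question asks about, and the sample bytes are invented.

```python
import io


def edge_blanks(line: bytes) -> tuple:
    """Return (leading, trailing) counts of spaces/tabs, ignoring the line ending."""
    body = line.rstrip(b"\r\n")                    # drop CR/LF only, keep blanks
    leading = len(body) - len(body.lstrip(b" \t"))
    trailing = len(body) - len(body.rstrip(b" \t"))
    # Note: a line made only of blanks reports the same number for both.
    return leading, trailing


sample = io.BytesIO(b"   Pantz McManliss  \r\n\nGibberish Ave\r\n")
for raw in sample:
    print(edge_blanks(raw))
```

Nothing is stripped before counting, so a line with 50 leading spaces reports 50 rather than being collapsed by the shell.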

edited Dec 30 '11 at 22:29

answered Dec 30 '11 at 20:06

Shiplu Mokaddim


I can hear the 'anti-using-cat' crowd wailing and gnashing their teeth, but I think this is on the right track. I need a per line analysis however. – Bubnoff Dec 30 '11 at 20:34
Brilliant ...I was getting to loops in my experiments but you beat me. This is it. I just need to play with translating this from dos to unix in that middle part and I'm done. Thanks man! – Bubnoff Dec 30 '11 at 20:49
The while loop strips the leading spaces somehow. Only showing 3 spaces in lines that contain as many as 50. – Bubnoff Dec 30 '11 at 21:53
If I do a line by line analysis without a loop they show. eg., grep "regex" -m 1 bkct.000 | sed 's/ /%/g' | grep -o '%' | wc -w – Bubnoff Dec 30 '11 at 21:56
"# while IFS='' read -r line" fixes this. Found here http://stackoverflow.com/questions/1648055/preserving-leading-white-space-while-readingwriting-a-file-line-by-line-in-bash – Bubnoff Dec 30 '11 at 22:37

inger · Answer 6 · 2012-01-01T01:43:49.563

1

In case Ruby counts (it does count :)

ruby -lne 'puts scan(/\s/).size'

and now some Perl (slightly less intuitive IMHO):

perl -lne 'print scalar(@{[/(\s)/g]})'
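
One thing worth knowing about \s on files with DOS endings: it also matches a carriage return, so a count based on \s can come out higher than a count of spaces and tabs alone. The Python sketch below illustrates this under the assumption that only the trailing LF is chomped, as Perl's -l does; whitespace_counts is a hypothetical helper.

```python
import re


def whitespace_counts(line: str) -> dict:
    """Compare a \\s-based count with a space/tab-only count for one line."""
    # Remove only the trailing LF; any CR stays in the line.
    body = line[:-1] if line.endswith("\n") else line
    return {
        "any_whitespace": len(re.findall(r"\s", body)),   # includes \r
        "space_or_tab": len(re.findall(r"[ \t]", body)),  # spaces and tabs only
    }


print(whitespace_counts(" 123 Blankety St\r\n"))
print(whitespace_counts(" 123 Blankety St\n"))
```

On the CRLF line the two numbers differ by one, while on the LF-only line they agree.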

edited Jan 01 '12 at 01:43

answered Dec 30 '11 at 22:11

inger


Concise and nice. There are other constraints and I don't know Ruby at all, but I'll upvote for the concise fulfillment to one of the requirements. I'll have to check out Ruby. Thanks! – Bubnoff Dec 30 '11 at 22:34
Thanks. Added some Perl.. sorry for missing some requirements.. let me read again. – inger Dec 30 '11 at 22:37
Awesome! This works as well. Here's a puzzler though. All three solutions on this page ( bash, ruby, perl ) differ in their results by one or two. Perl counts one less space than bash on some lines but more spaces on other lines. This is going to be a fairly manual process I think. – Bubnoff Dec 30 '11 at 22:58
well, I must admit I'm not sure I understand your requirements exactly - others may not either.. So, to clarify: you just need 1 number per line: the number of whitespace chars, ignoring line endings. Both of the above ruby/perl 1-liners seem to satisfy that. Are there any other nuances/requirements/constraints? Running the bash code against your sample seems to dump the lines too, with leading whitespace stripped(is that a requirement?). The ruby&perl here give me the same results against the sample in the question. What differences do you see there? – inger Dec 30 '11 at 23:15
If you look at the last bash sample, it preserves the space and gives similar results. The ruby and perl work just as well, but all three give slightly different results on different lines. It's as if all three translate spaces slightly different. All fit the bill well enough to complete the project ...I was just wondering why the slightly different results. – Bubnoff Dec 31 '11 at 00:10
Yeah, I used \s which means 'any whitespace', while the bash 3 liner seems to look for spaces only? – inger Dec 31 '11 at 10:12

score -1 · Answer 7 · answered Dec 30 '11 at 19:59


If you ask me, I'd write a simple C program to do the counting and formatting all in one go. But that's just me. By the time I got finished fiddle-farting around with perl, schmerl, zwerl I'd have wasted half a day.

answered Dec 30 '11 at 19:59

Pete Wilson

