Changing every non-letter character to \n in a file using Unix utilities

Question

I was watching a tutorial about using Unix utilities. The presenter was on a Mac and I have a Windows laptop, so I downloaded the GnuWin32 package. Then came a part where I wanted to replace every non-letter character in a file with a newline ("\n").

The command line in the tutorial was:

tr -sc 'A-Za-z' '\n' < filename.txt | less

It worked for him, but when I tried it, it put a single quote (') after every character:

'S'h'a'k'e's'p'e'a'r'e'T'H'E'T'E'M'P'E'S'T'f'r'o'm'O'n'l'i'n'e'L'i'b'r'a'r'y'o'f'L'i'b'e'r't'y'h't't'p'o'l'l'l'i'b'e'r't'y'f'u'n'd'o'r'g'

Then I tried:

tr -sc "A-Za-z" "\n" < filename.txt | less

It added a newline after each character:

n
e
L
i
b
r
a

I tried removing the complement option and adding ^ as in a regex:

tr "[^A-Za-z]" "\n" < filename.txt | less

The result was that every letter was replaced with a newline.

The question is: do the command-line options of the Unix utilities in GnuWin32 differ from those on other systems? And does putting the set between single quotes, like 'A-Z', differ from using double quotes, like "A-Z"? If so, what is the best way to replace every non-letter character with a newline, other than the failed attempts above?

The source of the text I was trying it on

@shellter Thank you :) Actually I'm learning, and I could have searched for an alternative, but I'm interested in making it work with the `tr` command. Thank you again. — Hady Elsahar, Mar 08 '12 at 19:49
If you have a single quote `'` after each letter, then obviously you're going to get a newline after every letter, as you're replacing every non-letter with `\n`. — anubhava, Mar 08 '12 at 19:51
@anubhava You can check the text I'm using at the end of the question; I don't have a comma after each letter. — Hady Elsahar, Mar 08 '12 at 19:54
@HadyElsahar: what shell/command interpreter are you using? Is that windows' `cmd` (which has different escaping rules to Unix shells)? — ninjalj, Mar 08 '12 at 19:57
I am not talking about comma. Your command `tr -sc 'A-Za-z' '\n'` is fine since in the provided input you have single quote `'` after each letter. If you use `echo "Shakes'peare" | tr -sc 'A-Za-z' '\n'` you will get only 1 new line between strings `Shakes` and `peare`. — anubhava, Mar 08 '12 at 19:58
I'm quite sure it's `cmd` escape rules, as I get the apostrophes if I do: `tr -sc A-Za-z "'\n'"`. About character classes, it's not a POSIX requirement to support them. Indeed, supporting them would be incompatible with POSIX. Can you try `tr -sc A-Za-z \n` or `tr -sc A-Za-z \\n`? If that fails, the best I can think of is `tr -s "[:blank:][:digit:][:cntrl:][:punct:]" "\n"`. — ninjalj, Mar 08 '12 at 20:17
@HadyElsahar: in that case, try `tr -sc A-Za-z \\n < filename.txt |less`, or possibly `tr -sc A-Za-z \n < filename.txt |less ` — ninjalj, Mar 08 '12 at 20:39
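
The quoting point raised in these comments can be checked in a POSIX shell by printing the argument instead of handing it to `tr`. This is only a minimal sketch: `show_arg` is a hypothetical helper, not something from the thread, and it covers sh/bash only. The assumption, in line with ninjalj's guess, is that Windows `cmd` does not treat single quotes as quoting characters, so there `'\n'` would reach `tr` with the apostrophes still attached.

```shell
# Hypothetical helper: print the argument exactly as a program would receive it.
show_arg() {
    printf '<%s>\n' "$1"
}

show_arg '\n'   # single quotes: backslash and n arrive unchanged
show_arg "\n"   # double quotes: backslash kept, because n is not special here
show_arg \\n    # unquoted, escaped backslash: also backslash and n
show_arg \n     # unquoted: the shell drops the backslash, leaving only n
```

In a POSIX shell the first three spellings all hand `tr` the two characters `\` and `n`, which `tr` itself turns into a newline; only the last one loses the backslash.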

score 1 · Answer 1 · answered Mar 09 '12 at 18:35

I tested your examples with my tr (tr --version reports GNU coreutils 8.5) and found that:

1) using single or double quotes makes no difference
2) there appears to be no way to negate characters by using ^

When you write [^A-Za-z], the brackets and the caret are treated as literal characters:

echo "abc abd [hh] d^o 1976" | tr '[^A-Za-z]' '.'

or with double quotes

echo "abc abd [hh] d^o 1976" | tr "[^A-Za-z]" '.'

produces the following output

... ... .... ... 1976

This shows that the alphabetic characters, the caret, and the square brackets were all treated as members of the set and replaced.

This leads us to the conclusion that to split by non-alphabetic chars you have to use -c with a range 'A-Za-z', exactly as you did in the first example.
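
As a minimal sketch of that conclusion, assuming a POSIX shell and GNU tr (the sample sentence and the `split_words` name are made up for illustration):

```shell
# Hypothetical wrapper around the command from the question.
# -c takes the complement of the set A-Za-z (that is, every non-letter),
# -s squeezes each run of resulting newlines into a single one.
split_words() {
    tr -sc 'A-Za-z' '\n'
}

printf 'THE TEMPEST, from Online-Library 1976\n' | split_words
```

This prints THE, TEMPEST, from, Online and Library on separate lines. The digits and punctuation disappear because they fall in the complemented set and are turned into newlines.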

score 0 · Answer 2 · answered Mar 08 '12 at 19:48

Hm..

$ tr -sc '[A-Za-z]' "\n" < getCokeInfo_viaFinger_cmu.awk
bin
gawk
f
BEGIN
wisc
edu
finger

....

Note that I used char-class ( [A-Za-z] ). Maybe your tr requires that too.

I hope this helps.

shellter

It's not working for me: it puts a newline after each character. Another noticeable thing is that it removes the non-letter characters. – Hady Elsahar Mar 08 '12 at 19:53
Do you actually need the character class? _c-c_ is from POSIX, and should work everywhere (at least for the POSIX locale, other locales have undefined behaviour, for obvious reasons). – ninjalj Mar 08 '12 at 19:55
I got the same error message as O.P. when I left them out. Good luck to all! – shellter Mar 08 '12 at 19:56
@HadyElsahar: sorry about the previous message, I missed that you **are** the O.P. Your message says 'replace any non letter character in a file with a newline "\n"', so to me that doesn't mean keep the non-text chars. If you want to keep them, then `tr` **is definitely not** the tool for this; you'll need to use `sed`, `awk` or ...? Good luck. – shellter Mar 08 '12 at 20:06
@HadyElsahar: does your `tr` respond to `tr --version`? I'm using `tr (GNU coreutils) 5.97`. Good luck. – shellter Mar 08 '12 at 20:08
@HadyElsahar : and finally, what is the output you get for `set | grep 'LC.*'` ? – shellter Mar 08 '12 at 20:18
@shellter, the tr utility does not require square brackets for character ranges; a simple `A-Za-z` is enough. The square brackets are interpreted literally. `echo "abc abd [hh] do" | tr -sc '[A-Za-z]' '\n'` is equivalent to `echo "abc abd [hh] do" | tr -sc 'A-Za-z[]' '\n'` and produces `abc`, `abd`, `[hh]`, `do`, one per line, with the brackets kept. – Nik O'Lai Mar 09 '12 at 18:20
@NikO'Lai: I am aware of this concept, but have you ever used `tr` on an old-line Unix platform that was not POSIX conformant? And can you explain why the OP's version doesn't work? What is the difference? Possibly my use of "\n" vs '\n'? As the OP has changed what he wants for output but couldn't be bothered to edit his question, at this point I don't care. Good luck to all. – shellter Mar 09 '12 at 22:18
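
The point about literal brackets can be checked directly. A small sketch, assuming GNU `tr` as discussed in this thread (older System V style implementations may treat brackets differently); the sample input is made up:

```shell
# With brackets in the set, [ and ] count as ordinary set members,
# so under -c they are NOT replaced and stay in the output.
printf 'abc [hh] do\n' | tr -sc '[A-Za-z]' '\n'

# Without brackets, [ and ] are non-letters and become newlines.
printf 'abc [hh] do\n' | tr -sc 'A-Za-z' '\n'
```

The first command keeps `[hh]` intact on its own line, while the second prints just `hh`.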

score 0 · Answer 3 · answered Mar 10 '12 at 19:53

cat file.txt | sed -re 's/[^a-zA-Z]/\n/g'

;)
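
For comparison with the `tr -s` behaviour from the question, here is a hedged variant of this one-liner, assuming GNU sed (where `\n` in the replacement means a newline and `-E` enables extended regular expressions). The `+` collapses each run of non-letters into a single newline; the sample input is made up.

```shell
# Replace each run of non-letters within a line by one newline (GNU sed assumed).
printf 'abc, abd 1976 x\n' | sed -E 's/[^a-zA-Z]+/\n/g'
```

Unlike `tr`, sed works line by line and leaves the existing line ends alone, so a line that ends in punctuation still produces an empty line in the output.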

danechkin
