I have a plain text file.

> Input: इंजेक्शन इंटरनॅशनल इंटिग्रेटेड इंटिरिअर इंडस्ट्री

All words are separated by one or more spaces. I want to collect all the unique characters from the text file. I'm looking for a Unix command; the order of the resulting characters is not important.

> Expected result: इं जे क्श न ट र नॅ श ल इ्रे टे ड टि रिअ र ड स्ट्री

With the command Klaus has provided:

cat <file>|sed -e 's/\(.\)/\1\n/g'|sort -u|tr -d '\n'

The result comes out as:

ं अ इ क ग ज ट ड न र ल श सिीॅे्

I don't want to separate horizontal or vertical conjuncts or dependent vowels from their base characters.

I just want to separate the complete characters in a word from each other.

Can we achieve this with UNIX commands?

"base character" + "dependent vowel" = "complete character"

 -  क + ा = का
 -  क + ि = कि

Klaus's command works for English text only; it doesn't work with Indic languages such as Hindi.

Input: hi1 hello-2 how!3 "are4 ?you5

result: h i e l o w a r y u 1 2 3 4 5 - ! "

Note: You have to install Indic support in your OS, and download the Mangal font from http://hindi-fonts.com/fonts/Mangal

user1

1 Answer

Try this:

cat <file>|sed -e 's/\(.\)/\1\n/g'|sort -u|tr -d '\n'

Or simplified (stolen from fedorqui's comment, thanks! I had never seen & in the replacement part before. Good to learn something new!):

sed 's/./&\n/g' <file> | sort -u | tr -d '\n'
Klaus
  • I like this approach. I would suggest some improvements: get rid of `cat | sed`, because `sed '...' file` does the same. Also, this alone would do, with no need to capture a group: `sed 's/./&\n/g' file | sort -u | tr '\n' ' '` – fedorqui Aug 12 '14 at 10:56
  • Thanks! Never seen `&` in the replacement part before :-) – Klaus Aug 12 '14 at 11:14
  • Klaus, fedorqui: Thanks. The command works fine for English text. I have Unicode text: "इंजेक्शन इंटरनॅशनल इंटिग्रेटेड इंटिरिअर इंडस्ट्री" and I am expecting the result: इं जे क्श न ट र नॅ श ल इ्रे टे ड टि रिअ र ड स्ट्री – user1 Aug 12 '14 at 11:14
  • Sorry, my local desktop installation is not able to handle the characters you provided, so I couldn't reproduce the results for you. It seems that the charset is not complete. – Klaus Aug 12 '14 at 12:04