
I'm trying to generate a word frequency list from a .txt file. I do not want certain printable ASCII characters, or any of the extended ASCII characters, to contribute to the list. Here is my generalized code:

cat file.txt | tr -d '[:punct:]' | tr -d '[:digit:]' | tr -d '\33-\64\91-\96\123-\255' | tr ' ' '\n' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn > Freq.list

I had originally tried the segment tr -d '[:special:]', but received the error: tr: invalid character class special

I also need runs of adjacent symbols to be deleted, such as: «•

Lastly, is there a way to delete quotation marks attached to a word, so that "word or 'word is counted as word? I've tried tr -d "\"" and tr -d '\33-\64' for that, but neither seems to work.

Here is an example of the file.txt:
£, is the specific heat per unit volume, «•„ and cr,, are respectively the thermal and electrical conductivity of the normal region"

I want the output to be:
3 the
2 and
1 volume
1 unit
1 thermal
1 specific
1 respectively
1 region
1 per
1 of
1 normal
1 is
1 heat
1 electrical
1 conductivity
1 are

SL3_88

1 Answer


Given this file:

$ cat file
My hovercraft is full of eels
Min luftpudebåd er fyldt med ål
Mon aéroglisseur est plein d'anguilles
โฮเวอร์คราฟท์ของผมเต็มไปด้วยปลาไหล
Iyéčhiŋkiŋyaŋka čha kiŋyáŋ mitȟáwa kiŋ hoká ožúla!

You can remove all non-ASCII characters with iconv -ct ascii:

$ iconv -ct ascii < file 
My hovercraft is full of eels
Min luftpudebd er fyldt med l
Mon aroglisseur est plein d'anguilles

Iyhikiyaka ha kiy mitwa ki hok ola!

Or transliterate them into unaccented ASCII, where an equivalent exists, with iconv -t ascii//translit:

$ iconv -t ascii//translit < file
My hovercraft is full of eels
Min luftpudebad er fyldt med al
Mon aeroglisseur est plein d'anguilles
??????????????????????????????????
Iyechi?ki?ya?ka cha ki?ya? mithawa ki? hoka ozula!
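
To get from there to the frequency list in the question, the iconv step can be placed in front of a shorter tr pipeline. The following is only a sketch under a few assumptions: the input is UTF-8, only the letters a-z should count as word characters, and word_freq is a hypothetical helper name that does not appear in the original post. Note that backslash escapes in tr are octal, so '\33-\64' covers ESC through the digit 4 rather than decimal 33-64; complementing the set of letters avoids listing the unwanted characters at all.

```shell
# word_freq: hypothetical helper name (not from the original post).
# Reads text on stdin and prints "count word" lines, most frequent first.
# Assumes the input is UTF-8 and that only a-z should form words.
word_freq() {
    # -c discards characters with no ASCII equivalent; "|| true" keeps the
    # pipeline usable under "set -o pipefail", since iconv may report a
    # non-zero status after dropping input.
    { iconv -c -f UTF-8 -t ASCII || true; } |
        # Fold to lower case so "The" and "the" are counted together.
        LC_ALL=C tr 'A-Z' 'a-z' |
        # Turn every run of non-letters (punctuation, digits, quotes,
        # spaces) into a single newline: one word per line.
        LC_ALL=C tr -cs 'a-z' '\n' |
        # Drop the empty line left behind when the input starts with a non-letter.
        grep -v '^$' |
        sort | uniq -c | sort -rn
}
```

It could be used as word_freq < file.txt > Freq.list. Because every non-letter is a separator here, hyphenated words and contractions are split (d'anguilles becomes d and anguilles), and stray fragments such as cr in the sample line are still counted as words.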
that other guy
  • OP also wants to delete punctuation and digits. – Thomas Dickey Apr 08 '15 at 23:09
  • My .txt files are tens of thousands of lines long, and there are a lot of random letter pairings and hyphenated words that may become casualties of the code. But running iconv -ct ascii < file and then my code seems to take care of rogue quotation marks, hyphens, and non-ASCII symbols. – SL3_88 Apr 09 '15 at 09:34