I'm trying to generate a word-frequency list from a .txt file. I do not want certain printable ASCII characters, or any of the extended ASCII characters, to contribute to the list. Here is my generalized code:
cat file.txt | tr -d '[:punct:]' | tr -d '[:digit:]' | tr -d '\33-\64\91-\96\123-\255' | tr ' ' '\n' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn > Freq.list
Also, I had originally tried the segment: tr -d '[:special:]'
but received the error: tr: invalid character class special
It is also important that runs of adjacent symbols are deleted, such as: «•
Lastly, is there a way to delete quotation marks (single or double) attached to a word, so that "word or 'word counts toward word? I've tried tr -d "\""
and tr -d '\33-\64'
for that, but neither seems to work.
Here is an example of the file.txt:
£, is the specific heat per unit volume, «•„ and cr,,
are respectively the thermal and electrical conductivity of the normal region"
I want the output to be:
3 the
2 and
1 volume
1 unit
1 thermal
1 specific
1 respectively
1 region
1 per
1 of
1 normal
1 is
1 heat
1 electrical
1 conductivity
1 are
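
For reference, here is a minimal sketch of one possible way to get close to that output. It is not a definitive fix and rests on a few assumptions: only runs of the ASCII letters A-Z and a-z count as words, the file is UTF-8 (so symbols like « and • are multibyte), and `word_freq` is a hypothetical helper name used only for illustration. Instead of listing the characters to delete, it keeps letters and turns every run of anything else into a single newline, which handles adjacent symbols and attached quotation marks in one step. Note also that tr reads `\NNN` escapes as octal, not decimal, so `\33-\64` covers ESC through the digit 4 rather than `!` through `@`.

```shell
# word_freq: hypothetical helper; reads text on stdin, prints "count word" lines.
word_freq() {
    # LC_ALL=C makes tr work byte by byte, so every byte of a multibyte
    # character such as « or • is treated as a non-letter.
    LC_ALL=C tr -cs 'A-Za-z' '\n' |   # replace each run of non-letters with one newline
    LC_ALL=C tr 'A-Z' 'a-z' |         # fold everything to lowercase
    sed '/^$/d' |                     # drop the empty line left by leading non-letters
    sort | uniq -c | sort -rn         # count duplicates, highest count first
}
```

It could be run as `word_freq < file.txt > Freq.list`. With the sample above, this sketch would also count `cr` once, because it is made of letters; filtering out fragments like that would need an extra rule (for example a minimum word length or a dictionary check).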