I'm trying to create a dictionary of words from a collection of files. Is there a simple way to print all the words in a file, one per line?
- Do you need a certain programming language? – mkmurray Jul 14 '09 at 05:29
- @mkmurray, "shell", "scripting" and "unix" usually mean any of the tools available on standard UNIX boxes – awk, grep, sed, perl, cut and so on. – paxdiablo Jul 14 '09 at 05:34
5 Answers
You could use grep:

- `-E '\w+'` searches for words
- `-o` only prints the portion of the line that matches
```
% cat temp
Some examples use "The quick brown fox jumped over the lazy dog,"
rather than "Lorem ipsum dolor sit amet, consectetur adipiscing elit"
for example text.

# if you don't care whether words repeat
% grep -o -E '\w+' temp
Some
examples
use
The
quick
brown
fox
jumped
over
the
lazy
dog
rather
than
Lorem
ipsum
dolor
sit
amet
consectetur
adipiscing
elit
for
example
text
```
If you want to only print each word once, disregarding case, you can use sort:

- `-u` only prints each word once
- `-f` tells `sort` to ignore case when comparing words
```
# if you only want each word once
% grep -o -E '\w+' temp | sort -u -f
adipiscing
amet
brown
consectetur
dog
dolor
elit
example
examples
for
fox
ipsum
jumped
lazy
Lorem
over
quick
rather
sit
Some
text
than
The
use
```
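Since the question mentions a collection of files, note that grep accepts multiple filenames; a minimal sketch (the file names here are hypothetical):

```
# -h suppresses the "filename:" prefix grep adds when given multiple files
% grep -h -o -E '\w+' chapter1.txt chapter2.txt | sort -u -f
```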

- thanks! I've been googling for an hour for this. Strangely, using "... from a document" instead of "... from a text file" brought me this question as the first match – davka Jan 14 '13 at 13:48
- you can use `grep -o -E '\w+' testfile.txt | sort -u -f | tee 5.txt` to write the output to a file – sixsixsix May 02 '17 at 12:03
- jack yang: or just use a normal shell redirect `grep ... | sort -u -f > 5.txt` – rampion May 02 '17 at 12:04
A good start is to simply use `sed` to replace all spaces with newlines, strip out the empty lines (again with `sed`), then `sort` with the `-u` (uniquify) flag to remove duplicates, as in this example:
```
$ echo "the quick brown dog and fox jumped
over the lazy dog" | sed 's/ /\n/g' | sed '/^$/d' | sort -u
and
brown
dog
fox
jumped
lazy
over
quick
the
```
Then you can start worrying about punctuation and the like, as sketched below.
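For example, one way to strip punctuation first (my sketch, not part of the original answer) is to delete it with `tr` before splitting:

```
$ echo "the quick, brown dog; and the fox!" | tr -d '[:punct:]' | sed 's/ /\n/g' | sed '/^$/d' | sort -u
and
brown
dog
fox
quick
the
```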

Assuming words are separated by whitespace:

```
awk '{for(i=1;i<=NF;i++)print $i}' file
```

or

```
tr ' ' "\n" < file
```

If you want uniqueness:

```
awk '{for(i=1;i<=NF;i++)_[$i]++}END{for(i in _) print i}' file
```

```
tr ' ' "\n" < file | sort -u
```
With some punctuation removed:

```
awk '{
  # strip a few punctuation characters
  gsub(/["*^&()#@$,?~]/,"")
  # record each word as an array key, so duplicates collapse
  for(i=1;i<=NF;i++){ _[$i] }
}
END{ for(o in _){ print o } }' file
```

- uniqueness without waiting: `awk '{for(i=1;i<=NF;i++) if(!_[$i]) { print $i; _[$i]=1} }' file` – rampion Mar 17 '23 at 19:59
Ken Church's "Unix(TM) for Poets" (PDF) describes exactly this type of application - extracting words out of text files, sorting and counting them, etc.
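The canonical counting pipeline from that tutorial looks roughly like this (a sketch from memory; `file` stands in for your input): `tr` squeezes every run of non-letters into a single newline, `sort` groups identical words, and `uniq -c` counts each group:

```
tr -sc '[:alpha:]' '\n' < file | sort | uniq -c | sort -rn
```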
The tr command can do this...

```
tr '[:blank:]' '\n' < test.txt
```

This asks the tr program to replace blanks (spaces and tabs) with newlines. Note that the character class should be quoted, or the shell may try to expand it as a glob. The output goes to stdout, but it could be redirected to another file, result.txt:

```
tr '[:blank:]' '\n' < test.txt > result.txt
```
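One caveat (my note, not from the answer): consecutive blanks each produce their own newline, leaving empty lines in the output; tr's `-s` flag squeezes those repeats into one:

```
tr -s '[:blank:]' '\n' < test.txt > result.txt
```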
