I'm trying to create a dictionary of words from a collection of files. Is there a simple way to print all the words in a file, one per line?
- Do you need a certain programming language? – mkmurray Jul 14 '09 at 05:29
- @mkmurray, "shell", "scripting" and "unix" usually mean any of the tools available on standard UNIX boxes – awk, grep, sed, perl, cut and so on. – paxdiablo Jul 14 '09 at 05:34
5 Answers
You could use grep:

- `-E '\w+'` searches for words
- `-o` only prints the portion of the line that matches
```
% cat temp
Some examples use "The quick brown fox jumped over the lazy dog,"
rather than "Lorem ipsum dolor sit amet, consectetur adipiscing elit"
for example text.

# if you don't care whether words repeat
% grep -o -E '\w+' temp
Some
examples
use
The
quick
brown
fox
jumped
over
the
lazy
dog
rather
than
Lorem
ipsum
dolor
sit
amet
consectetur
adipiscing
elit
for
example
text
```
If you want to only print each word once, disregarding case, you can use sort:

- `-u` only prints each word once
- `-f` tells `sort` to ignore case when comparing words
```
# if you only want each word once
% grep -o -E '\w+' temp | sort -u -f
adipiscing
amet
brown
consectetur
dog
dolor
elit
example
examples
for
fox
ipsum
jumped
lazy
Lorem
over
quick
rather
sit
Some
text
than
The
use
```
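Since the question mentions a collection of files, note that grep accepts multiple filenames; a minimal sketch (the file names here are hypothetical):

```
# -h suppresses the "filename:" prefix grep adds when given multiple files
% grep -h -o -E '\w+' chapter1.txt chapter2.txt | sort -u -f
```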

- thanks! I've been googling for an hour for this. Strangely, using "... from a document" instead of "... from a text file" brought me this question as the first match – davka Jan 14 '13 at 13:48
- you can use `grep -o -E '\w+' testfile.txt | sort -u -f | tee 5.txt` to write the output to a file – sixsixsix May 02 '17 at 12:03
- jack yang: or just use a normal shell redirect `grep ... | sort -u -f > 5.txt` – rampion May 02 '17 at 12:04
A good start is to simply use `sed` to replace all spaces with newlines, strip out the empty lines (again with `sed`), then `sort` with the `-u` (uniquify) flag to remove duplicates, as in this example:
```
$ echo "the quick brown dog and fox jumped
over the lazy dog" | sed 's/ /\n/g' | sed '/^$/d' | sort -u
and
brown
dog
fox
jumped
lazy
over
quick
the
```
Then you can start worrying about punctuation and the like, as sketched below.
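For example, one way to strip punctuation first (my sketch, not part of the original answer) is to delete it with `tr` before splitting:

```
$ echo "the quick, brown dog; and the fox!" | tr -d '[:punct:]' | sed 's/ /\n/g' | sed '/^$/d' | sort -u
and
brown
dog
fox
quick
the
```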

Assuming words are separated by whitespace:

```
awk '{for(i=1;i<=NF;i++)print $i}' file
```

or

```
tr ' ' "\n" < file
```

If you want uniqueness:

```
awk '{for(i=1;i<=NF;i++)_[$i]++}END{for(i in _) print i}' file
```

```
tr ' ' "\n" < file | sort -u
```
With some punctuation removed:

```
awk '{
  # strip a few punctuation characters
  gsub(/["*^&()#@$,?~]/,"")
  # record each word as an array key, so duplicates collapse
  for(i=1;i<=NF;i++){ _[$i] }
}
END{ for(o in _){ print o } }' file
```

- uniqueness without waiting: `awk '{for(i=1;i<=NF;i++) if(!_[$i]) { print $i; _[$i]=1} }' file` – rampion Mar 17 '23 at 19:59
Ken Church's "Unix(TM) for Poets" (PDF) describes exactly this type of application - extracting words out of text files, sorting and counting them, etc.
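The canonical counting pipeline from that tutorial looks roughly like this (a sketch from memory; `file` stands in for your input): `tr` squeezes every run of non-letters into a single newline, `sort` groups identical words, and `uniq -c` counts each group:

```
tr -sc '[:alpha:]' '\n' < file | sort | uniq -c | sort -rn
```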
The tr command can do this...

```
tr '[:blank:]' '\n' < test.txt
```

This asks the tr program to replace blanks (spaces and tabs) with newlines. Note that the character class should be quoted, or the shell may try to expand it as a glob. The output goes to stdout, but it could be redirected to another file, result.txt:

```
tr '[:blank:]' '\n' < test.txt > result.txt
```
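One caveat (my note, not from the answer): consecutive blanks each produce their own newline, leaving empty lines in the output; tr's `-s` flag squeezes those repeats into one:

```
tr -s '[:blank:]' '\n' < test.txt > result.txt
```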
