
I'm doing some quick tests for a naive boolean information retrieval system, and I would like to use awk, grep, egrep, sed or something similar, together with pipes, to split a text file into words and save them to another file with one word per line. For example, my file contains:

Hola mundo, hablo español y no sé si escribí bien la
pregunta, ojalá me puedan entender y ayudar
Adiós.

The output file should contain:

Hola
mundo
hablo
español
...

Thanks!

HoldOffHunger
jaundavid

11 Answers


Using tr:

tr -s '[[:punct:][:space:]]' '\n' < file
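For example, on an ASCII-only sample (classic tr is byte-oriented, so multibyte letters such as ñ may not be classified correctly in every locale):

```shell
# -s squeezes each run of punctuation/whitespace into a single newline
printf 'Hola mundo, hablo bien.\n' | tr -s '[[:punct:][:space:]]' '\n'
# prints: Hola / mundo / hablo / bien, one word per line
```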
Guru
  • Simple and clean. Nice solution. – jsageryd Mar 19 '13 at 14:43
  • +1 as I think this is probably closest to what the poster wants, but he did say that `O'Hara` and `X-ray` and some other combinations that include `[:punct:]` characters should be considered as one word, which this solution would not do. He'd probably also want the output piped to `sort` so he just gets each word once in the output, but now I'm guessing. – Ed Morton Mar 19 '13 at 16:33
  • Perhaps expand `[:punct:]` and remove `-` and `'`, making: `tr -s '[*!"#\$%&\(\)\+,\\\.\/:;<=>\?@\[\\\\]^_\`\{|\}~][:space:]]' '\n' < file`; optionally, as Ed Morton also suggests, sort and maybe add frequency: `tr -s '[*!"#\$%&\(\)\+,\\\.\/:;<=>\?@\[\\\\]^_\`\{|\}~][:space:]]' '\n' < file | sort | uniq -c | sort -nr`. A bit tangled but perhaps good. Also think about character case. Proper tokenizing can be tricky :) – jsageryd Mar 20 '13 at 10:08
  • You can save the result to a file using: `tr -s '[[:punct:][:space:]]' '\n' < file > temp && mv temp file` , supposing that filename is `file` – Bagghi Daku Jul 01 '20 at 11:00

The simplest tool is fmt:

fmt -1 <your-file

fmt is designed to break lines to fit a specified width, and with -1 it breaks after every word. See man fmt for documentation. Inspired by http://everythingsysadmin.com/2012/09/unorthodoxunix.html
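For instance, with GNU fmt (note that, unlike the tr approach, punctuation stays glued to its word):

```shell
# width 1 forces a line break after every word; punctuation is preserved
printf 'Hola mundo, hablo bien.\n' | fmt -1
# prints: Hola / mundo, / hablo / bien., one word per line
```

So you may still want to strip punctuation in a later pipeline stage.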

geekQ

Using sed:

$ sed -e 's/[[:punct:]]*//g;s/[[:space:]]\+/\n/g' < inputfile

Basically this deletes all punctuation and replaces each run of whitespace with a newline. It assumes your flavor of sed understands \n in the replacement. Some do not, in which case you can use a literal newline instead (i.e. by embedding it inside the quotes).
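For example, with GNU sed (which supports both \+ and \n here):

```shell
# delete punctuation, then turn each whitespace run into a newline
printf 'Hola mundo, hablo bien.\n' | sed -e 's/[[:punct:]]*//g;s/[[:space:]]\+/\n/g'
# prints: Hola / mundo / hablo / bien, one word per line
```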

FatalError

grep -o prints only the parts of each matching line that match the pattern:

grep -o '[[:alpha:]]*' file
umi
  • Can you explain more, please? I don't understand the pattern. Thank you. – jaundavid Mar 19 '13 at 14:23
  • It's one of the standard named character classes that `grep` understands. This one, `[:alpha:]`, for example, means "all alphabetic characters", just like `[A-Za-z]` except it is aware of the current locale. Also, it is `[:alpha:]`, not `:alpha:`; the brackets are part of the named class. – umi Mar 19 '13 at 14:30
  • `*` means "zero or more repetitions". You probably don't want to include words with zero characters :-). A BRE for 1-or-more would be `[[:alpha:]][[:alpha:]]*`, while an ERE would be `[[:alpha:]]+`. – Ed Morton Mar 19 '13 at 14:42
  • This only matches the first word per line in the input file. Not a solution. Also, while 'word' is not defined, perhaps it would be a good thing to assume that a word can contain other characters than those in the alphabet, such as digits, apostrophes...? – jsageryd Mar 19 '13 at 14:48
  • `grep` with the `-o` option will just omit empty matches, so it's completely legal. Still, in other utilities/languages it could be significant; thanks for the correction. – umi Mar 19 '13 at 14:50
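Combining the comments above, a one-or-more ERE variant avoids any question about empty matches:

```shell
# -o prints each match on its own line; -E enables ERE so + means "one or more"
printf 'Hola mundo, hablo bien.\n' | grep -oE '[[:alpha:]]+'
# prints: Hola / mundo / hablo / bien, one word per line
```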

Using perl:

perl -ne 'print join("\n", split)' < file

jsageryd
  • No punctuation handling :/ – Gilles Quénot Mar 19 '13 at 14:21
  • Nothing about special treatment of punctuation was requested. One definition of 'word' is anything separated by a space character. Different languages have different punctuation. Sometimes punctuation is **important information** to retain when tokenizing. Hence, simple implementation which is easy to extend, if needed. – jsageryd Mar 19 '13 at 14:33
cat input.txt | tr -d ",." | tr " \t" "\n" | grep -e "^$" -v

tr -d ",." deletes , and .

tr " \t" "\n" changes spaces and tabs to newlines

grep -e "^$" -v deletes empty lines (in case of two or more spaces)
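Putting the pipeline together on a sample (note that only `,` and `.` are deleted; any other punctuation would survive):

```shell
# delete commas/periods, map spaces and tabs to newlines, drop empty lines
printf 'Hola mundo,  hablo bien.\n' | tr -d ',.' | tr ' \t' '\n' | grep -v -e '^$'
# prints: Hola / mundo / hablo / bien, one word per line
```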

Martin Geisler
kyticka
  • I'm using Ubuntu; is there tr in Ubuntu? What package should I install? – jaundavid Mar 19 '13 at 14:35
  • I'm using Debian stable and cat, tr and grep are there by default; it is the same with Ubuntu imho. tr is part of the "coreutils" package in both Debian and Ubuntu. – kyticka Mar 19 '13 at 14:40
  • @jaundavid You picked a solution which will consider "stop!" and "stop?" as 2 different "words". I doubt that is what you want, and there are MANY other issues with this solution. If you can just tell us in words what distinguishes "word"s from "word-separators" in your mind, then we can probably give you a solution. – Ed Morton Mar 19 '13 at 16:27

This awk line may work too:

awk 'BEGIN{FS="[[:punct:] ]*";OFS="\n"}{$1=$1}1' inputfile
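A sketch of the same idea using `+` instead of `*`, since a field separator that can match the empty string is not well defined across awk implementations:

```shell
# FS splits records on runs of punctuation/spaces; assigning $1=$1 rebuilds
# the record with OFS (a newline) between fields; the trailing 1 prints it
printf 'Hola mundo, hablo bien\n' | awk 'BEGIN{FS="[[:punct:] ]+"; OFS="\n"} {$1=$1} 1'
# prints: Hola / mundo / hablo / bien, one word per line
```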
Imagination

Based on your responses so far, I THINK what you are probably looking for is to treat words as sequences of characters separated by spaces, commas, sentence-ending characters (i.e. "." "!" or "?" in English) and other characters that you would NOT normally find in combination with alphanumeric characters (e.g. "<" and ";" but not ' - # $ %). Now, "." is a sentence-ending character, but you said that $27.00 should be considered a "word", so "." needs to be treated differently depending on context. I think the same is probably true for "-" and maybe some other characters.

So you need a solution that will convert this:

I have $27.00. We're 20% under-budget, right? This is #2 - mail me at "foo@bar.com".

into this:

I
have
$27.00
We're
20%
under-budget
right
This
is
#2
mail
me
at 
foo@bar.com

Is that correct?

Try this using GNU awk so we can set RS to more than one character:

$ cat file
I have $27.00. We're 20% under-budget, right? This is #2 - mail me at "foo@bar.com".

$ gawk -v RS="[[:space:]?!]+" '{gsub(/^[^[:alnum:]$#]+|[^[:alnum:]%]+$/,"")} $0!=""' file
I
have
$27.00
We're
20%
under-budget
right
This
is
#2
mail
me
at
foo@bar.com

Try to come up with some other test cases to see if this always does what you want.

Ed Morton
  • Yes Ed Morton, I had not thought about these cases. It is important for me to solve this problem now, and I have no ideas for rules that could work. – jaundavid Mar 19 '13 at 20:26
  • Heh. Covered a lot of cases there. But there are probably a zillion more... not to mention differences between languages. But a good solution demands a good understanding of the requirements. Question needs to be more detailed for someone to give a good solution. At this stage I'd recommend having a look at what libraries are available for natural language parsing. Perhaps there is a good tokenizer out there that already covers many of the common pitfalls. Have a look at Ruby, Python, Perl maybe. – jsageryd Mar 20 '13 at 10:22
  • Agreed. You can't do this job robustly with a quick script, as so much in natural language depends on context, so the best the OP can hope for is a solution that's "good enough" for their needs. – Ed Morton Mar 20 '13 at 11:35

A very simple first option would be:

sed 's,\(\w*\),\1\n,g' file

Beware: it handles neither apostrophes nor punctuation.

jpmuc

Using perl:

perl -pe 's/(?:\p{Punct}|\s+)+/\n/g' file

Output

Hola
mundo
hablo
español
y
no
sé
si
escribí
bien
la
pregunta
ojalá
me
puedan
entender
y
ayudar
Adiós
Gilles Quénot

perl -ne 'print join("\n", split)'

Sorry @jsageryd, that one-liner does not give the correct answer, as it joins the last word on each line with the first word of the next line.

This one is better, but it generates a blank line for each blank line in the source; pipe it through sed '/^$/d' to fix that:

perl -ne '{ print join("\n",split(/[[:^word:]]+/)),"\n"; }'

Fred Gannett