I want to find the number of 8-letter words that do not contain the letter "e" in a number of text files (*.txt). In the process I ran into two issues: my lack of understanding of quantifiers, and how to exclude characters.

I'm quite new to the Unix terminal, but this is what I have tried:

cat *.txt | grep -Eo "\w+" | grep -i ".*[^e].*"

I need to include the cat command because otherwise the names of the text files end up in the pipe. The second command puts every word on its own line, and it works. The last command was meant to keep only the words that do not have the letter "e" in them, but it doesn't seem to work. (I read ".*" as zero or more of any character, followed by a character that is not an "e", followed by another ".*" for zero or more of any character.)

cat *.txt | grep -Eo "\w+" | grep -wi "[a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z]"

This command works to find the words that contain 8 characters, but it is quite inefficient, because I have to repeat "[a-z]" 8 times. I thought it could also be written as "[a-z]{8}", but that doesn't seem to work.

cat *.txt | grep -Eo "\w+" | grep -wi "[a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z]" | grep -i ".*[^e].*"

So finally, this would be my best guess; however, the third command is inefficient and the last one doesn't work.

doelie247
  • 1
    Thanks for sharing your efforts in your question, keep it up. Could you please also share samples of input and expected output in your question for more clarity of question. – RavinderSingh13 Nov 20 '20 at 09:42
  • 1
    `[a-z]` - so exclude `e`, like `[a-df-z]`. `"[a-z]{8}", but that doesn't seem to work.` I am always confused between basic and extended regex. In plain grep do `[a-z]\{8\}`, in `grep -E` then `{8}` would work. – KamilCuk Nov 20 '20 at 09:47
  • 2
    @doelie247: Instead of using `wc` a `regex` tag would be more appropriate as you are not looking for `wc` but a regular expression. Please edit your question accordingly. – anubhava Nov 20 '20 at 10:07
  • 1
    **You could write some program** in C (see [n1570](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf) then [syscalls(2)](https://man7.org/linux/man-pages/man2/syscalls.2.html) and [regex(7)](https://man7.org/linux/man-pages/man7/regex.7.html)...), or in [C++](https://en.cppreference.com/w/cpp) -compiled by [GCC](http://gcc.gnu.org/)- in [Python](http://python.org/), in [Ocaml](http://ocaml.org/) **finding them**. If you have *many* files, it could be more appropriate – Basile Starynkevitch Nov 20 '20 at 10:24
  • 2
    You might also be interested by [glob(7)](https://man7.org/linux/man-pages/man7/glob.7.html), [readdir(3)](https://man7.org/linux/man-pages/man3/readdir.3.html) and [nftw(3)](https://man7.org/linux/man-pages/man3/nftw.3.html). Read also [*Advanced Linux Programming*](https://mentorembedded.github.io/advancedlinuxprogramming/) – Basile Starynkevitch Nov 20 '20 at 10:31

3 Answers

You may use this grep:

grep -hEiwo '[a-df-z]{8}' *.txt

Here:

  • [a-df-z]{8}: Matches exactly 8 letters, none of which is e
  • -h: Don't print filename in output
  • -i: Case-insensitive search
  • -o: Print matches only
  • -w: Match complete words
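
Since the question asks for the number of such words, the matches can be counted by piping into wc -l, because -o prints each match on its own line. A minimal sketch, assuming GNU or BSD grep; the temporary directory and the sample sentences are made up for illustration:

```shell
# Hypothetical sample data in a throwaway directory
tmpdir=$(mktemp -d)
printf 'Mountain sunlight Elephant\n' > "$tmpdir/a.txt"
printf 'notebook, daylight! backyard\n' > "$tmpdir/b.txt"

# -o prints one match per line, so wc -l gives the number of words;
# tr strips the padding that some wc implementations add
count=$(grep -hEiwo '[a-df-z]{8}' "$tmpdir"/*.txt | wc -l | tr -d ' ')
echo "$count"

rm -rf "$tmpdir"
```

Here "Elephant" and "notebook" are skipped because they contain an e, while the punctuation after "notebook," and "daylight!" does not get in the way of -w.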
anubhava

If you are OK with GNU awk, and assuming you want to print only the exact 8-letter words (a line may contain several matches), you could try the following:

awk -v IGNORECASE="1" '{for(i=1;i<=NF;i++){if($i~/^[a-df-z]{8}$/){print $i}}}' *.txt

Or, without the use of IGNORECASE, you could try:

awk '{for(i=1;i<=NF;i++){if(tolower($i)~/^[a-df-z]{8}$/){print $i}}}' *.txt

NOTE: This assumes you want only fields that consist of exactly 8 letters. An 8-letter word followed by a punctuation mark will be excluded.
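
If words next to punctuation should count as well, one option is to trim non-letters from both ends of each field before testing it. The following is only a sketch with a made-up sample sentence. It checks length() instead of using {8}, so that it should not depend on GNU awk:

```shell
# Hypothetical sample line; in practice pass *.txt to awk instead
result=$(printf 'A daylight, walk in the backyard. Elephant!\n' |
awk '{
  for (i = 1; i <= NF; i++) {
    w = $i
    # strip leading and trailing non-letters such as "," "." or "!"
    gsub(/^[^A-Za-z]+|[^A-Za-z]+$/, "", w)
    # keep words of exactly 8 letters that contain no e (any case)
    if (length(w) == 8 && tolower(w) ~ /^[a-df-z]+$/) print w
  }
}')
echo "$result"
```

For the sample line this should print daylight and backyard, one per line; "Elephant!" is trimmed to "Elephant" and then rejected because of the e.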

RavinderSingh13

Here is a crazy thought with GNU awk:

awk 'BEGIN{FPAT="\\<\\w{8}\\>"}{c+=NF}END{print c}' file

Or if you want to make it work only on a select set of characters:

awk 'BEGIN{FPAT="\\<[a-df-z]{8}\\>"}{c+=NF}END{print c}' file

This defines a field as a run of 8 characters (\w as a word constituent, or [a-df-z] as a selected set) enclosed by word boundaries (\< and \>). This is done with FPAT (note the Gory details about escaping). NF is then the number of such words on each line, and the END block prints the total.

Sometimes words contain diacritics, which a fixed set such as [a-df-z] would miss. In that case it might be better to match any 8 word characters first and then skip the words that contain an e:

awk 'BEGIN{FPAT="\\<\\w{8}\\>"}{for(i=1;i<=NF;++i) if($i !~ /[eE]/) c++}END{print c}' file
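
The same two-step idea (pick out the 8-character words, then drop those containing an e) can also be sketched with grep alone. The second grep uses -v to invert the match, which is the usual way to say "lines that do not contain e", and -c to count the remaining lines. The sample line is made up, and \w is assumed to be supported by the installed grep, as it is in GNU grep:

```shell
# Hypothetical sample line; in practice use: grep -hEow '\w{8}' *.txt | ...
count=$(printf 'daylight elephant Backyard EVERYONE sunlight2\n' |
  grep -Eow '\w{8}' |   # one 8-character word per line
  grep -vic 'e')        # count the lines without e or E
echo "$count"
```

Note that \w also matches digits and underscores, so this counts 8-character "words" in that wider sense.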
kvantour