
I'm trying to find words in a file that start with the letters air and end with the letters ne, and print the words that match into a new file called "excluded". I'm very new to the command line, so I'm a bit lost. I've read the manual and cannot find a solution.

I was thinking something along the lines of

grep "air" | "ne" textfile.txt

but obviously it's not working out.

Edit: I think I can use the ^ and $ operators to find letters at the beginning and end of a word, but I'm unsure how to make it a single command so that I can simply write the output to a new file.

Barmar
Vno
  • `grep -P '\bair.*?ne\b'`, basically (the lazy `.*?` needs a Perl-compatible mode such as GNU grep's `-P`). "[word boundary]air[random chars]ne[word boundary]". Your sample grep is wrong: `|` isn't inside your pattern, so it's a shell pipe. You're running grep with no file, it's waiting for input, and its output would be piped to a (presumably existing) command called `ne` – Marc B Sep 09 '16 at 19:11
  • `^` and `$` are for the beginning and end of a line, not a word. Is your file just one word per line? – Barmar Sep 09 '16 at 19:17
  • yes, one word per line – Vno Sep 09 '16 at 19:21

2 Answers

In order to print the words into a new file, you'll want to use the ">" operator to send the output of grep into a file, so the command would be:

grep '^air.*ne$' textfile.txt > excluded.txt

or, if you prefer to use pipes, something along the lines of:

cat textfile.txt | grep '^air.*ne$' > excluded.txt

would also work. Of course, this assumes that you're in the folder containing textfile.txt.

For test data

airkinglyne\nairlamne\nhelloworld\nairfatne

the output is:

airkinglyne\nairlamne\nairfatne
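
The test above can be reproduced end to end with a short script. This is only a sketch: the temporary directory and the use of `printf` to create the sample file are assumptions made for illustration, not part of the original answer.

```shell
# Work in a throwaway directory (hypothetical location, for illustration only)
tmpdir=$(mktemp -d)

# Build the sample file, one word per line
printf '%s\n' airkinglyne airlamne helloworld airfatne > "$tmpdir/textfile.txt"

# Keep only lines that start with "air" and end with "ne";
# the quotes stop the shell from interpreting * and $
grep '^air.*ne$' "$tmpdir/textfile.txt" > "$tmpdir/excluded.txt"

# Show what was written
cat "$tmpdir/excluded.txt"
```

Note that `>` overwrites `excluded.txt` on every run, while `>>` would append to it.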

  • This will find multiple words on the same line, if the first word begins with `air` and the second word ends with `ne`. – Barmar Sep 09 '16 at 19:18
  • It isn't finding anything. I tried the same command without even printing the words into the new file. Nothing returns – Vno Sep 09 '16 at 19:29
  • since you only have one word per line, try grep ^air.*ne$ textfile.txt > excluded.txt – Anthony C. Nguyen Sep 09 '16 at 19:32
  • Still nothing is returning. I do not understand why grep is not returning any words. There are plenty of words that have this prefix and suffix, and the file is just a million lines with one word per line. – Vno Sep 09 '16 at 19:36
  • okay, try one more update, i just did another edit grep ^air.*ne$ textfile.txt > excluded.txt – Anthony C. Nguyen Sep 09 '16 at 19:37
  • One way to debug this is to use https://regex101.com/: copy and paste part of your text file into the "test string" section, then type your regular expression into the "regular expression" field at the top and see if it works there. This lets you isolate linux/grep issues from regular expression issues – Anthony C. Nguyen Sep 09 '16 at 19:39
  • What is the '.' and the '*' doing in there? I wouldn't have thought to add those. – Vno Sep 09 '16 at 19:44
  • The '.' and '*' together mean "any number of characters, including none". The '.' matches any single character (whitespace included) and the '*' repeats it – Anthony C. Nguyen Sep 09 '16 at 19:45
  • Ah ok. Still getting nothing though. grep is not returning anything back to me even when I tell it to not write to a new file. – Vno Sep 09 '16 at 19:48
  • the total expression reads: '^' must be starting at the beginning of the line, 'a' must be the letter "a", 'i' must be the letter "i", 'r' must be the letter "r", '.' any single character, '*' repeat the previous rule 0 or more times, 'n' must be the letter "n", 'e' must be the letter "e", '$' must be the end of the line – Anthony C. Nguyen Sep 09 '16 at 19:48
  • did you try the website i suggested you test at? – Anthony C. Nguyen Sep 09 '16 at 19:49
  • also you said there's one word per line, is there anything like a comma or a tab or something at the end of each line? or is it just a word then a newline character immediately after? that would affect the search – Anthony C. Nguyen Sep 09 '16 at 19:51
  • the entire file is just words. no special characters or anything. all lowercase letters – Vno Sep 09 '16 at 19:57
  • *words separated by a new line, so you have 1 word per line – Vno Sep 09 '16 at 19:57
  • and just to be sure, you're replacing "textfile.txt" with the name of the file you're trying to search right? – Anthony C. Nguyen Sep 09 '16 at 20:00
  • Put the pattern in quotes, since it contains characters that have special meaning to the shell. – Barmar Sep 09 '16 at 20:25
  • I did that as well. no return. I must be missing something big here. like something obvious – Vno Sep 09 '16 at 20:34
  • try this, using cat and pipes in case there's some issue with grep not being able to read the textfile: cat textfile.txt | grep ^air.*ne$ > excluded.txt – Anthony C. Nguyen Sep 09 '16 at 20:39
  • I got no return from that – Vno Sep 09 '16 at 20:44
  • hmm okay. If you just do cat textfile.txt, does the entire million lines spit out? – Anthony C. Nguyen Sep 09 '16 at 20:44
  • Do you think I should be using one of the options of grep? such as -d, -f, etc – Vno Sep 09 '16 at 20:45
  • Yes when I do cat textfile.txt, I get about 500 lines of words – Vno Sep 09 '16 at 20:45
  • no, you shouldn't need any more options. I tried to make a sample of my own and the command works for me – Anthony C. Nguyen Sep 09 '16 at 20:49
  • the most likely thing is that the contents of the file textfile.txt are somehow different than what you're describing – Anthony C. Nguyen Sep 09 '16 at 20:50
  • could you give me the exact syntax including quotes and all? – Vno Sep 09 '16 at 20:50
  • I did not type quotes when i did it – Anthony C. Nguyen Sep 09 '16 at 20:52
  • I did just update my original solution to have quotes though. – Anthony C. Nguyen Sep 09 '16 at 20:53
  • still absolutely nothing. I'm getting pretty frustrated. No idea what i'm doing wrong – Vno Sep 09 '16 at 20:56
  • try making a copy of the file first with cat textfile.txt > textfile2.txt, then grepping on the copy – Anthony C. Nguyen Sep 09 '16 at 21:07
  • I got it worked out. I ended up doing: grep -n -e '\' textfile.txt >> excluded.txt – Vno Sep 09 '16 at 21:12
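
The thread never establishes why the `$`-anchored pattern matched nothing. One common cause, offered here purely as an assumption, is a file with Windows (CRLF) line endings: an invisible carriage return sits between "ne" and the end of each line, so `ne$` can never match. A minimal sketch with a hypothetical sample file:

```shell
tmpdir=$(mktemp -d)

# Hypothetical file saved with CRLF line endings
printf 'airplane\r\nhelloworld\r\nairline\r\n' > "$tmpdir/crlf.txt"

# Inspect the first line byte by byte; a trailing \r shows up before \n
head -n 1 "$tmpdir/crlf.txt" | od -c

# Count matching lines; grep exits non-zero on no match, hence "|| true"
before=$(grep -c '^air.*ne$' "$tmpdir/crlf.txt" || true)

# Strip the carriage returns and try again
tr -d '\r' < "$tmpdir/crlf.txt" > "$tmpdir/unix.txt"
after=$(grep -c '^air.*ne$' "$tmpdir/unix.txt" || true)

echo "matches before: $before, after: $after"
```

If the `od -c` output shows no `\r`, the cause lies elsewhere, for example trailing spaces or a different file than the one being searched.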
grep -o '\bair[^[:space:]]*ne\b' textfile | sort | uniq > excluded

From the man page, the -o flag "Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line."

The pattern is composed as follows: match a word edge (\b), then the string 'air', then zero or more characters that are not whitespace, then the string 'ne', then the other word edge.

Then we sort, because uniq only removes adjacent duplicate lines (sort -u would do both steps at once).

The idea is that a word is a word edge followed by multiple non space characters followed by another word edge.

This is not perfect because it matches characters that are usually not parts of words like "airfoo_ne", "air.barne", etc, but you can improve it once you get the idea.
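
One possible refinement, sketched here under the assumption that a "word" means letters only, is to replace `[^[:space:]]` with `[[:alpha:]]`. The sample file contents below are hypothetical, and `\b` and `-o` are extensions (supported by GNU grep) rather than POSIX features.

```shell
tmpdir=$(mktemp -d)

# Hypothetical input with several words per line and some non-word cases
printf '%s\n' \
  'the airplane and the airline' \
  'airfoo_ne air.barne' \
  'another airplane' > "$tmpdir/textfile"

# Letters only between "air" and "ne"; sort -u sorts and drops duplicates
grep -o '\bair[[:alpha:]]*ne\b' "$tmpdir/textfile" | sort -u > "$tmpdir/excluded"

cat "$tmpdir/excluded"
# airline
# airplane
```

With this character class, "airfoo_ne" and "air.barne" are no longer reported, and the duplicate "airplane" appears once.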

yarl