1

I am using grep to extract the lines from file1 that match the strings in file2. The strings in file2 contain both letters and numbers, e.g.:

MSTRG.18691.1
MSTRG.18801.1

I used sed to add word boundaries around every string in file2:

file2
\<MSTRG.18691.1\>
\<MSTRG.18801.1\>

and ran grep -f file2 file1

but the output also contains

MSTRG.18691.1.2
MSTRG.18801.1.3

I want only the lines that match exactly:

MSTRG.18691.1
MSTRG.18801.1

and not:

MSTRG.18691.1.2
MSTRG.18801.1.3

A few lines from my file1:
t_name gene_name FPKM TPM
MSTRG.25.1 . 0 0
rna71519 . 93.398872 194.727926057583
gene34024 ND1 2971.72876 6195.77694943117
MSTRG.28.1 . 0 0
MSTRG.28.2 . 0 0
rna71520 . 33.235409 69.2927240732149

2 Answers

1

Updating the answer

Use the anchors ^ and $ to match the start and the end of a line. To match exactly MSTRG.18691.1, put ^ at the beginning and $ at the end of the pattern and remove the word boundaries. In addition, . has a special meaning in a regex (it matches any single character), so to match a literal . it must be escaped with a backslash: \.

Example pattern:

^MSTRG\.18691\.1$
^MSTRG\.18801\.1$
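To see why the word-boundary patterns let extra IDs through, here is a minimal sketch on made-up sample data. It assumes a grep that supports \< and \> (GNU grep does); the temporary file and the third sample ID are hypothetical. A dot is not a word character, so \> is satisfied right before the trailing ".2", and an unescaped . in the pattern also matches a digit.

```shell
# Scratch file with one wanted ID and two near misses (hypothetical sample data).
sample=$(mktemp)
printf 'MSTRG.18691.1\nMSTRG.18691.1.2\nMSTRG.1869131\n' > "$sample"

# Word boundaries with unescaped dots: count the lines that match.
boundary_count=$(grep -c '\<MSTRG.18691.1\>' "$sample")

# Escaped dots with line anchors: count the lines that match.
anchored_count=$(grep -c '^MSTRG\.18691\.1$' "$sample")

echo "word boundaries: $boundary_count, anchored: $anchored_count"
rm -f "$sample"
```

The word-boundary pattern matches all three lines, while the anchored pattern matches only the line that consists of the ID and nothing else.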

file1

MSTRG.18691.1
MSTRG.1311.1
MSTRG.18801.2
MSTRG.18801.3
MSTRG.18801.1.2
MSTRG.18801.1.1
MSTRG.18801.1
PrefixMSTRG.18801.1

Just create a normal file named file1 and paste the above content into it.

file2 (pattern file)

^MSTRG\.18801\.1$

Just create a normal file named file2 and paste the above content into it.

Run the following command from the command line:

grep -i --color -f file2 file1

Result:

MSTRG.18801.1

Sed to add changes to the pattern file

Here is a sed command that escapes every . and adds ^ at the beginning and $ at the end of each line of the pattern file you already have (start from the plain IDs, without the word boundaries).

sed -Ee 's/\./\\./g' -e 's/^/\^/g' -e 's/$/\$/g' file2 > file2_updated

-E enables extended regular expressions (BSD sed and recent GNU sed); on an older GNU sed you may need to use -r instead.

The updated patterns are saved to file2_updated. Use the new pattern file with grep like this:

grep -i -f file2_updated file1
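If file1 is tab-delimited with the ID in the first column, as in the question, a pattern ending in $ will not match because the line continues after the ID. One possible variation, sketched here under that assumption, is to keep the ^ anchor and require whitespace immediately after the ID instead of the end of the line. The file name file2_col1 and the sample rows are made up for illustration.

```shell
# Work in a scratch directory so no existing file is overwritten.
tmp=$(mktemp -d)

# Hypothetical tab-delimited file1 in the question's layout (ID in column 1).
printf 'MSTRG.28.1\t.\t0\t0\n'   >  "$tmp/file1"
printf 'MSTRG.28.1.2\t.\t1\t2\n' >> "$tmp/file1"
printf 'MSTRG.281.1\t.\t3\t4\n'  >> "$tmp/file1"
printf 'MSTRG.28.10\t.\t5\t6\n'  >> "$tmp/file1"

# Plain ID list, one per line, without anchors or word boundaries.
printf 'MSTRG.28.1\n' > "$tmp/file2"

# Escape each dot, anchor at the line start, and require whitespace after the ID.
sed -e 's/\./\\./g' -e 's/^/^/' -e 's/$/[[:space:]]/' "$tmp/file2" > "$tmp/file2_col1"

# Only rows whose first column is exactly one of the IDs should remain.
col1_matches=$(grep -f "$tmp/file2_col1" "$tmp/file1")
printf '%s\n' "$col1_matches"
```

With this sample only the MSTRG.28.1 row is printed. A line that contains nothing but the ID would not match this pattern, since there is no whitespace after it.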
minhazur
  • Also there are strings like `MSTRG.13.1` in file 2 but grep returns `MSTRG.1311.1` which is not in file2. – Kousalya Devi Jun 11 '19 at 09:53
  • Please, check the updated answer. A dot (.) has special meaning to regex, we need to escape that to not match any single character. – minhazur Jun 11 '19 at 10:11
  • I used `sed 's/^/^/' file 2 | sed 's/$/$/' | sed 's/./\\./'` to make this `^MSTRG\.18691\.1$` pattern. But the whole file is repalced with `\.`. How to add back slash before the dot? – Kousalya Devi Jun 11 '19 at 10:50
  • why you need sed? you can use plain grep to match words or line. Still, if you need to use sed to do some intermediate stream conversion, you can do that then you can grep to match patterns. You only need backslash in pattern file. – minhazur Jun 11 '19 at 10:59
  • I am at a very beginner level in linux. I don't know how to create the pattern file `^MSTRG\.18801\.1$` to use with `grep -f`. I know to use `sed` to add `^` at start and `$` at end of the line. I dont know how to add back slash before dot in pattern file (i.e. file2) – Kousalya Devi Jun 11 '19 at 11:24
  • I created file2 with pattern `^MSTRG\.18801\.1$` it works, with and without `-i --color`. Thank you. But my original `file2` has around 5500 lines. How can I edit that file such that it has `^` as the start; `$` as the end and backslash before dot `.` ? – Kousalya Devi Jun 11 '19 at 15:59
  • Great that you have made it this far. :) I will post the sed command to replace those, for the time being, you can use sublime or some other text editor to find and replace those. I used -i to ignore case and --color to show matching color. – minhazur Jun 11 '19 at 16:05
  • `sed` worked perfectly, thank you. But `grep` outputs nothing. The `updated_file2` works well on the custom created file1, but not on my original file1. My original file1 is a tab-delimited file with 3 columns and I am applying `grep` to the first column. – Kousalya Devi Jun 11 '19 at 18:47
  • Can you please add part of the actual file1? – minhazur Jun 12 '19 at 02:52
  • t_name gene_name FPKM TPM gene34025 ND2 3560.135498 7422.5500 MSTRG.22.3 . 831.061035 1732.6846 MSTRG.22.4 . 0 0 MSTRG.28.3 . 435.921539 908.8557 This is a file with 4 columns. Many entries in the `2nd column` are a `dot (.)` – Kousalya Devi Jun 12 '19 at 04:40
  • Can you please add this in you question as sample to keep formatting? – minhazur Jun 12 '19 at 04:45
  • I replaced `.` in the 2nd column as `UNKNOWN`. Modified the pattern file as `MSTRG\.18801\.1` (removed `^` at start and `$` at end ). Now it works fine. Is the correct way? Thankyou. I have updated the qstn with file1 – Kousalya Devi Jun 12 '19 at 05:11
  • That's correct if you want to match any string from any part of a line that matches `MSTRG\.18801\.1`. If a line is like this `MSTRG.18801.1.1 hello world!`, that pattern will match `MSTRG.18801.1` and show the whole line as output. You can accept my answer if it was helpful. Thanks :) – minhazur Jun 12 '19 at 05:47
  • But I need exact match of **`MSTRG.18801.1` only, not `MSTRG.18801.1.1`**. Will editing the pattern to `\` give only `MSTRG.18801.1` as output? – Kousalya Devi Jun 12 '19 at 05:58
  • What you can do is use `ggrep` that supports print only matching. Here is the ggrep command for that `ggrep -o -f file2_updated file1` – minhazur Jun 12 '19 at 06:20
0

The flag you're looking for is -F. From man grep:

-F, --fixed-strings

Interpret PATTERN as a list of fixed strings (instead of regular expressions), separated by newlines, any of which is to be matched.

You can use this quite comfortably in conjunction with -f:

grep -Ff file2 file1

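To be clear, -F makes grep treat every line of file2 as a literal string rather than a regular expression, so . no longer matches any character. It is still a substring match, though: MSTRG.11443.1 in file2 will also select a line containing MSTRG.11443.10.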
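A small sketch of that behaviour on made-up data, followed by one possible alternative that compares the first column exactly. The awk approach is not part of the original answer; it assumes file1 is tab-delimited with the ID in column 1 and that file2 holds plain IDs, one per line.

```shell
tmp=$(mktemp -d)

# Hypothetical sample: one exact ID and two longer IDs that contain it.
printf 'MSTRG.11443.1\t.\t0\t0\n'   >  "$tmp/file1"
printf 'MSTRG.11443.10\t.\t1\t2\n'  >> "$tmp/file1"
printf 'MSTRG.11443.1.3\t.\t3\t4\n' >> "$tmp/file1"
printf 'MSTRG.11443.1\n' > "$tmp/file2"

# -F turns off regex metacharacters, but the ID is still searched as a substring.
fixed_count=$(grep -c -Ff "$tmp/file2" "$tmp/file1")
echo "lines selected by grep -F: $fixed_count"

# Load the IDs from file2 into an array, then keep rows of file1 whose
# first field is exactly one of those IDs.
exact=$(awk -F'\t' 'NR==FNR { ids[$1]; next } $1 in ids' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$exact"
```

Here grep -F selects all three rows, while the awk comparison keeps only the row whose first column equals MSTRG.11443.1. Because awk compares whole fields as strings, no escaping or anchoring of the IDs is needed.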

vintnes
  • @KousalyaDevi What? – vintnes Jun 11 '19 at 10:25
  • I used `grep -Ff file2 file1` It works for `MSTRG.13.1` (there is no MSTRG.1311.1 in output), but `file2` has `MSTRG.11443.1` but output has `MSTRG.11443.10` and `MSTRG.11443.13`. I used this with file without word boundaries – Kousalya Devi Jun 11 '19 at 10:26
  • @KousalyaDevi It won't work if you still have word boundaries added to file2 – vintnes Jun 11 '19 at 10:29
  • `file1` is a tab delimited file with 3 columns. In which, some entries in second column are `.` I am using `grep` on first column. – Kousalya Devi Jun 11 '19 at 10:33