Extracting lines from a file based on prefix

Question

I need to extract all lines from file2 that begin with an id # prefix contained in file1.

File 1 is single column like:

File 2 is multi-column like:

1 [tab] 108 [tab] Anarchist [tab] 103985
...
324 [tab] 309 [tab] Melodies [tab] 230498

What's the quickest, easiest way to extract just these lines from File2?

quanta · Accepted Answer · 2011-10-23T06:44:51.480

1

$ while IFS= read -r p; do awk -F'\t' -v id="$p" '$1 == id' file2; done < file1

or:

$ awk -F'\t' 'FNR==NR { a[$0]; next } $1 in a' file1 file2

FNR: the number of records read so far from the file currently being processed
NR: the total number of records read so far across all input files
FNR==NR: true only while awk is reading the first file (file1)
a[$0]: creates an array element indexed by the whole line $0 (an id from file1)
$1 in a: true when the first field of the current file2 line exists as an index in the array a, in which case the line is printed
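To see the two-pass lookup in action, here is a minimal sketch on throwaway data. The rows for ids 1 and 324 come from the question; the row for id 2 and the temporary paths are made up for illustration.

```shell
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 1 324 > "$tmp/file1"
printf '1\t108\tAnarchist\t103985\n2\t200\tExample\t111\n324\t309\tMelodies\t230498\n' > "$tmp/file2"

# Pass 1 (file1): FNR==NR holds, so each id becomes a key of ids[].
# Pass 2 (file2): print the line when its first tab-separated field is a key.
matches=$(awk -F'\t' 'FNR==NR { ids[$0]; next } $1 in ids' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$matches"

rm -rf "$tmp"
```

One caveat with this idiom: if file1 is empty, FNR==NR is also true while reading file2, so nothing is printed and every file2 line is stored as a key instead.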

edited Oct 23 '11 at 06:44

answered Oct 22 '11 at 23:19

quanta


I tried this first, but received this error: awk: program limit exceeded: maximum number of fields size=32767 FILENAME="file2" FNR=10788 NR=10790 – Poe Oct 23 '11 at 00:18
How many columns in `file2`? – quanta Oct 23 '11 at 00:37
file2 is a 28G file, 11 columns – Poe Oct 23 '11 at 03:18
Missing the FS declaration: `awk -F '\t' '...' file1 file2` – glenn jackman Oct 23 '11 at 03:59

@Poe, with that much data, you should be using a real database. – glenn jackman Oct 23 '11 at 04:02
@jackman, thanks for the fix and yes I'm extracting only the lines I need for the database table. I don't want the entire 28G in the database. – Poe Oct 23 '11 at 04:55

score 1 · Answer 2 · edited Oct 23 '11 at 00:35


Bash code to do this:

for i in $(cat file1); do grep -E "^$i[[:space:]]" file2; done

edited Oct 23 '11 at 00:35

quanta


answered Oct 22 '11 at 23:39

n8whnp


I'm more familiar with grep, so this worked. I just went with grep -m 1 -P '$i\t' – Poe Oct 23 '11 at 00:30

score 1 · Answer 3 · answered Oct 23 '11 at 00:45


This is probably the fastest:

grep -f <( sed 's/.*/^&\t/' file1) file2

The answers using for and while loops are going to be very slow.

The awk answer by quanta should work. I don't know why it wouldn't unless your line endings are non-Unix or file1 is very big.
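On the line-ending point: if file1 has DOS (CRLF) endings, every id carries a trailing carriage return and never equals a field from file2. A minimal sketch of stripping it inside awk, on made-up sample files (the CRLF file1 here is an assumption for illustration, not something the asker confirmed):

```shell
tmp=$(mktemp -d)
printf '1\r\n324\r\n' > "$tmp/file1"   # hypothetical file1 with CRLF line endings
printf '1\t108\tAnarchist\t103985\n324\t309\tMelodies\t230498\n' > "$tmp/file2"

# Without stripping, the keys are "1\r" and "324\r", so nothing matches.
before=$(awk -F'\t' 'FNR==NR { ids[$0]; next } $1 in ids' "$tmp/file1" "$tmp/file2")

# Remove a trailing carriage return from each id before storing it.
after=$(awk -F'\t' 'FNR==NR { sub(/\r$/, ""); ids[$0]; next } $1 in ids' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$after"

rm -rf "$tmp"
```

Running `tr -d '\r' < file1` beforehand should have the same effect if you would rather clean the file once.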

answered Oct 23 '11 at 00:45

Dennis Williamson


the other method is definitely slow, but the file is very big. – Poe Oct 23 '11 at 03:19

score 0 · Answer 4 · answered Oct 23 '11 at 00:20


1) We can use grep's alternation (OR) operator. For example:

$> grep -P "^(324|399|408|135236|321590).*" file2
324 [tab] 309 [tab] Melodies [tab] 230498

So the question is: how do we pass this list of ids to grep?

2) We can echo file1 on a single line, substitute the delimiters with |, and then wrap the result in parentheses.

$> echo `cat file1` | sed -r -e 's/([0-9])\ ([0-9])/\1|\2/g'
324|399|408|135236|321590

So, finally, we have a variant without for or while loops.

grep -P "^($( echo `cat file1` | sed -r -e 's/([0-9])\ ([0-9])/\1|\2/g' )).*" file2
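A possible variation on the same idea, as a sketch rather than a drop-in replacement: `paste -s` can join the ids directly, and anchoring the group with a trailing tab keeps id 324 from also matching a line that starts with 3240. The ids are taken from the example above; the 3240 row and the scratch paths are hypothetical.

```shell
tmp=$(mktemp -d)
tab=$(printf '\t')
printf '%s\n' 324 399 408 > "$tmp/file1"
printf '324\t309\tMelodies\t230498\n3240\t1\tHypothetical\t1\n' > "$tmp/file2"

# Join all ids from file1 into one alternation: 324|399|408
joined=$(paste -s -d '|' "$tmp/file1")

# Require a tab right after the id so that only whole first fields match.
hits=$(grep -E "^(${joined})${tab}" "$tmp/file2")
printf '%s\n' "$hits"

rm -rf "$tmp"
```

Because the whole pattern travels on the command line, a very long file1 could run into the system's argument-length limit; in that case `grep -f` with a pattern file is probably the safer route.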

answered Oct 23 '11 at 00:20

ДМИТРИЙ МАЛИКОВ


`echo ``cat file1`` | sed ...` -- really? How about `sed -r -e '...' file1` – glenn jackman Oct 23 '11 at 03:58
@glennjackman: "echo file1 in single line" – Dennis Williamson Oct 24 '11 at 01:20

score 0 · Answer 5 · answered Oct 23 '11 at 07:45

The join command from GNU coreutils serves just this purpose, but it is picky about its input.

$ sort file1 > sorted1
$ sort -t $'\t' -k1,1 file2 > sorted2
$ join -t $'\t' sorted1 sorted2 | sort -n

The join command requires its input files to be sorted lexicographically on the join field, not numerically. Hence all the sorting of the inputs, and the final numeric sort of the output.

The -t option sets the field separator for both input and output. To make it a tab, use -t $'\t' in Bash, or type a literal tab inside the quotes by pressing Ctrl-V Tab at the Bash prompt.
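Here is a small end-to-end sketch of the join approach on throwaway data. The rows for ids 1 and 324 come from the question; the id 2 row is invented, and `LC_ALL=C` is my assumption to make sort and join agree on one collation order.

```shell
tmp=$(mktemp -d)
tab=$(printf '\t')
printf '%s\n' 324 1 > "$tmp/file1"
printf '1\t108\tAnarchist\t103985\n2\t200\tHypothetical\t111\n324\t309\tMelodies\t230498\n' > "$tmp/file2"

# Sort both inputs on the join field using the same (byte-wise) collation.
LC_ALL=C sort "$tmp/file1" > "$tmp/sorted1"
LC_ALL=C sort -t "$tab" -k1,1 "$tmp/file2" > "$tmp/sorted2"

# Join on the first field of each file, then restore numeric order.
joined=$(LC_ALL=C join -t "$tab" "$tmp/sorted1" "$tmp/sorted2" | sort -n)
printf '%s\n' "$joined"

rm -rf "$tmp"
```

For a 28G file2 the sort step is the expensive part, so this is mostly attractive when file2 is already sorted on its first column.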
