1

I need to extract all lines from file2 whose first field is an ID listed in file1.

File 1 is a single column, like:

324
399
408
135236
321590

File 2 is multi-column, like:

1 [tab] 108 [tab] Anarchist [tab] 103985
...
324 [tab] 309 [tab] Melodies [tab] 230498

What's the quickest, easiest way to extract just these lines from File 2?

HopelessN00b
Poe

5 Answers

1
$ while read -r p; do awk -v id="$p" '$1 == id' file2; done < file1

or:

$ awk -F'\t' 'FNR==NR { a[$0]; next } $1 in a' file1 file2
  • FNR: the number of records read so far from the file currently being processed
  • NR: the total number of input records read so far, across all files
  • FNR==NR: true only while awk is reading the first file (file1)
  • a[$0]: creates an array element indexed by $0 (the whole line from file1)
  • $1 in a: for each line of file2, checks whether its first field exists as an index in the array a; matching lines are printed
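
As a quick check of the two-file idiom, here is a minimal sketch on sample data modelled on the question. The scratch directory, the extra "3245" decoy row, and the variable name `result` are illustrative assumptions, not part of the original answer.

```shell
# Build small sample inputs in a scratch directory (hypothetical data).
tmp=$(mktemp -d)
printf '%s\n' 324 399 408 > "$tmp/file1"
printf '%s\t%s\t%s\t%s\n' \
    1 108 Anarchist 103985 \
    324 309 Melodies 230498 \
    3245 7 Decoy 42 > "$tmp/file2"

# While reading file1 (FNR==NR): store every id as an array key.
# While reading file2: print lines whose first tab-separated field is a stored key.
result=$(awk -F'\t' 'FNR==NR { a[$0]; next } $1 in a' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Because the lookup is an exact match on the first field, the row starting with 3245 is not selected even though 324 is in file1.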
quanta
1

Bash code to do this:

for i in $(cat file1); do grep -E "^$i[[:space:]]" file2; done
quanta
n8whnp
I'm more familiar with grep, so this worked; I just went with grep -m 1 -P "$i\t" – Poe Oct 23 '11 at 00:30
1

This is probably the fastest:

grep -f <( sed 's/.*/^&\t/' file1) file2

The answers using for and while loops are going to be very slow.

The awk answer by quanta should work. I don't know why it wouldn't unless your line endings are non-Unix or file1 is very big.
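
If the awk approach prints nothing, stray carriage returns are one likely culprit. The sketch below assumes file1 was saved with Windows (CRLF) line endings and strips them with tr before matching; the sample data and the variable names `broken` and `fixed` are made up for illustration.

```shell
tmp=$(mktemp -d)
# Hypothetical file1 with CRLF line endings.
printf '%s\r\n' 324 399 > "$tmp/file1"
printf '%s\t%s\n' 1 Anarchist 324 Melodies > "$tmp/file2"

# Without cleanup each key carries a trailing carriage return,
# so it never equals the first field of any line in file2.
broken=$(awk -F'\t' 'FNR==NR { a[$0]; next } $1 in a' "$tmp/file1" "$tmp/file2")

# Strip carriage returns first; "-" makes awk read the cleaned ids from stdin.
fixed=$(tr -d '\r' < "$tmp/file1" |
    awk -F'\t' 'FNR==NR { a[$0]; next } $1 in a' - "$tmp/file2")

printf 'before: [%s]\nafter:  [%s]\n' "$broken" "$fixed"
rm -rf "$tmp"
```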

Dennis Williamson
0

1) We can use grep's alternation (OR logic). For example:

$> grep -P "^(324|399|408|135236|321590)\t" file2
324 [tab] 309 [tab] Melodies [tab] 230498

So the question is: how do we build this pattern for grep from file1?

2) We can echo file1 onto a single line and replace the separators with |, then wrap the result in parentheses.

$> echo `cat file1` | sed -e 's/ /|/g'
324|399|408|135236|321590

So, finally, we have a variant without for or while loops:

grep -P "^($( echo `cat file1` | sed -e 's/ /|/g' ))\t" file2
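
For comparison, here is a hedged sketch of the same alternation idea that avoids -P. It swaps in paste to join the ids and uses grep -E with a literal tab; the sample files and the variable names `pattern`, `tab` and `matches` are assumptions for illustration only.

```shell
tmp=$(mktemp -d)
printf '%s\n' 324 399 408 > "$tmp/file1"
printf '%s\t%s\n' 1 Anarchist 324 Melodies 3245 Decoy > "$tmp/file2"

# Join the ids with "|" to form an alternation: 324|399|408
pattern=$(paste -sd'|' "$tmp/file1")

# Anchor at line start and require a tab right after the id,
# so that 324 does not also match 3245.
tab=$(printf '\t')
matches=$(grep -E "^(${pattern})${tab}" "$tmp/file2")
printf '%s\n' "$matches"
rm -rf "$tmp"
```

With a very long file1 the pattern ends up on the command line, so the grep -f and awk approaches are likely to scale better.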
0

The join command from GNU coreutils serves just this purpose, but it is picky about its input.

$ sort file1 > sorted1
$ sort file2 > sorted2
$ join -t$'\t' sorted1 sorted2 | sort -n

The join command requires its input files to be sorted lexicographically, not numerically; hence the sorting of both inputs and of the output.

The -t option sets the field separator that join uses for both input and output. To make it a tab in Bash, write -t$'\t', or type a literal tab inside the quotes with Ctrl-V Tab.
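
A small end-to-end sketch of the join approach, under a few assumptions: the sample data is invented, LC_ALL=C is added so that sort and join agree on collation, and the tab is produced with printf so the snippet does not depend on Bash-specific quoting.

```shell
tmp=$(mktemp -d)
printf '%s\n' 324 399 1 > "$tmp/file1"
printf '%s\t%s\n' 1 Anarchist 1000 Decoy 324 Melodies > "$tmp/file2"
tab=$(printf '\t')

# join needs both inputs sorted on the join field in the same
# lexicographic order; LC_ALL=C pins that order for sort and join.
LC_ALL=C sort "$tmp/file1" > "$tmp/sorted1"
LC_ALL=C sort -t "$tab" -k1,1 "$tmp/file2" > "$tmp/sorted2"

# -t sets the field separator to a tab; the final sort -n restores numeric order.
joined=$(LC_ALL=C join -t "$tab" "$tmp/sorted1" "$tmp/sorted2" | sort -n)
printf '%s\n' "$joined"
rm -rf "$tmp"
```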

200_success