As requested by @Masi, I tried to work out a solution using sed.
My first attempt uses two passes; the first transforms file1 into a sed script that is used in the second pass to filter file2.
sed 's/\([^ \t]*\).*/\/^\1\t\/p;t/' file1 > sed1
sed -nf sed1 file2 > out2
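To see what the first pass generates, here is a small sketch on made-up sample data (a date, a tab, then a value; the real files may look different). It assumes GNU sed, since \t in the pattern and replacement is a GNU extension, and it is best run in an empty scratch directory because it overwrites file1 and file2.

```shell
# Hypothetical sample data: date <tab> value
printf '01.01.2015\ta\n03.01.2015\tb\n' > file1
printf '01.01.2015\tx\n02.01.2015\ty\n03.01.2015\tz\n' > file2

# Pass 1: every line of file1 becomes one sed command, /^DATE<tab>/p;t
sed 's/\([^ \t]*\).*/\/^\1\t\/p;t/' file1 > sed1
cat sed1

# Pass 2: run the generated script on file2; -n prints only what p prints
sed -nf sed1 file2 > out2
cat out2
```

With this input, sed1 holds one pattern per line of file1, and out2 ends up with the 01.01.2015 and 03.01.2015 lines from file2. The dots in the dates act as regex wildcards in the generated patterns, which should be harmless for well-formed dates.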
With big input files, this is s-l-o-w; for each line from file2, sed has to try a number of patterns equal to the number of lines in file1. I haven't done any profiling, but I wouldn't be surprised if the time complexity is quadratic.
My second attempt merges and sorts the two files, then scans through all lines in search of pairs. The scan runs in linear time (the sort itself takes O(n log n)), so this is a lot faster. Please note that this solution ruins the original order of the lines; alphabetical sorting doesn't work too well with this date notation. Supplying files with a different date format (y-m-d) would be the easiest way to fix that.
sed 's/^[^ \t]\+/&@1/' file1 > marked1
sed 's/^[^ \t]\+/&@2/' file2 > marked2
sort marked1 marked2 > sorted
sed '$d;N;/^\([^ \t]\+\)@1.*\n\1@2/{s/\(.*\)\n\(.*\)/\2\n\1/;P};D' sorted > filtered
sed 's/^\([^ \t]\+\)@2/\1/' filtered > out2
Explanation:
- In the first command, s/^[^ \t]\+/&@1/ appends @1 to every date. This makes it possible to merge the files, keep equal dates together when sorting, and still be able to tell lines from different files apart.
- The second command does the same for file2; obviously with its own marker @2.
- The sort command merges the two files, grouping equal dates together.
- The third sed command returns all lines from file2 that have a date that also occurs in file1.
- The fourth sed command removes the @2 marker from the output.
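The steps above can be tried end to end with a small sketch. The sample data is made up for illustration, LC_ALL=C is my own addition to make the sort order predictable, and GNU sed is assumed (\t and \+ are GNU extensions). Run it in an empty scratch directory, since it overwrites the files it names.

```shell
export LC_ALL=C   # assumption: byte-wise sorting, so @1 sorts right before @2

# Hypothetical sample data: date <tab> value
printf '01.01.2015\ta\n03.01.2015\tb\n' > file1
printf '01.01.2015\tx\n02.01.2015\ty\n03.01.2015\tz\n' > file2

# Mark each date with the file it came from
sed 's/^[^ \t]\+/&@1/' file1 > marked1
sed 's/^[^ \t]\+/&@2/' file2 > marked2

# Merge; equal dates end up next to each other, file1 line first
sort marked1 marked2 > sorted
cat sorted

# Keep file2 lines that directly follow a file1 line with the same date
sed '$d;N;/^\([^ \t]\+\)@1.*\n\1@2/{s/\(.*\)\n\(.*\)/\2\n\1/;P};D' sorted > filtered

# Strip the marker again
sed 's/^\([^ \t]\+\)@2/\1/' filtered > out2
cat out2
```

Here sorted contains five lines, with each @1 line sitting immediately before the @2 line of the same date, and out2 contains the 01.01.2015 and 03.01.2015 lines from file2.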
The third sed command in detail:
- $d deletes the last line when it is alone in the pattern space; this suppresses inappropriate printing of the last line
- N reads and appends another line of input to the line already present in the pattern space
- /^\([^ \t]\+\)@1.*\n\1@2/ matches two lines originating from different files but with the same date
- { starts a command group
- s/\(.*\)\n\(.*\)/\2\n\1/ swaps the two lines in the pattern space
- P prints the first line in the pattern space
- } ends the command group
- D deletes the first line from the pattern space and restarts the cycle with the remaining line, without reading new input
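The same command can also be written as a commented script file, which may be easier to follow. This is only a sketch: pairs.sed is a made-up file name, the input is hypothetical sample data in which one date occurs twice in file2, and GNU sed is assumed.

```shell
# Hypothetical merged-and-sorted input; 01.01.2015 occurs twice in file2
printf '01.01.2015@1\ta\n01.01.2015@2\tx\n01.01.2015@2\ty\n02.01.2015@2\tz\n' > sorted

# pairs.sed (made-up name) holds the same commands, one per line
cat > pairs.sed <<'EOF'
# On the last input line, delete it instead of letting N print it.
$d
# Append the next input line to the pattern space.
N
# Is the first line from file1 and the second from file2, with the same date?
/^\([^ \t]\+\)@1.*\n\1@2/{
  # Swap the two lines so the file2 line comes first.
  s/\(.*\)\n\(.*\)/\2\n\1/
  # Print only that first (file2) line.
  P
}
# Drop the first line and restart with what is left, without reading input.
D
EOF

sed -f pairs.sed sorted > filtered
cat filtered
```

Because the swap leaves the file1 line in the pattern space after D, it can pair up with the next file2 line of the same date as well, so both 01.01.2015 lines from file2 are printed here, while the 02.01.2015 line is dropped.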
The bad news is that even the second approach is slower than the awk approach by @John1024. Sed was never designed to be a merge tool. Neither was awk, but awk has the advantage of being able to store an entire file in a dictionary, which makes @John1024's solution blazingly fast. The downside of a dictionary is memory consumption. On huge input files, my solution should have the advantage, since sed never holds more than two lines in the pattern space.
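For comparison, a dictionary-based filter in awk typically looks like the sketch below. This is the common idiom, not necessarily @John1024's exact code; the sample data and the output name out2_awk are made up for illustration.

```shell
# Hypothetical sample data: date <tab> value
printf '01.01.2015\ta\n03.01.2015\tb\n' > file1
printf '01.01.2015\tx\n02.01.2015\ty\n03.01.2015\tz\n' > file2

# While reading file1 (NR==FNR), remember every date as an array key.
# For file2, print the lines whose first field is one of those keys.
awk 'NR==FNR { seen[$1]; next } $1 in seen' file1 file2 > out2_awk
cat out2_awk
```

All dates of file1 are held in memory at once, which is where both the speed and the memory consumption come from. Unlike the sort-based sed approach, this keeps the lines of file2 in their original order.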