Unix: find all lines having timestamps in both time series?

Question

I have time-series data where I would like to find all lines matching each another but values can be different (match until the first tab)! You can see the vimdiff below where I would like to get rid of days that occur only on the other time series.

I am looking for the simplest unix tool to do this!

enter image description here

Timeserie here and here.

Simple example

Input

Left file                            Right File    
------------------------             ------------------------
10-Apr-00     00:00    0     ||      10-Apr-00     00:00     7
20-Apr 00     00:00    7     ||      21-Apr-00     00:00     3

Output

Left file                           Right File    
------------------------            ------------------------
10-Apr-00     00:00    0    ||      10-Apr-00     00:00     7

Basically it is this easy: transpose both data sesies, find unique entries in the first line, transpose back, remove the unique dates from both -- done! What is the easiest tool for this? — hhh, Nov 17 '14 at 02:21
I'm confused. You say your input is `10-Apr-00 00:00 0 10-Apr-00 00:00 7` but your screenshots show two different files — Adam Smith, Nov 17 '14 at 02:22
@AdamSmith file1: `10-Apr-00 00:00 0` and file2 `10-Apr-00 00:00 7`, now clear? I added there a separator || to make it more clear. — hhh, Nov 17 '14 at 02:29
@shellter excluding because 20-Apr is not matching with 21-Apr — hhh, Nov 17 '14 at 03:41

John1024 · Answer 1 · 2014-11-17T07:42:57.600

Let's consider these sample input files:

$ cat file1
10-Apr-00       00:00   0
20-Apr-00       00:00   7
$ cat file2
10-Apr-00       00:00   7
21-Apr-00       00:00   3

To merge together those lines with the same date:

$ awk 'NR==FNR{a[$1]=$0;next;} {if ($1 in a) print a[$1]"\t||\t"$0;}' file1 file2
10-Apr-00       00:00   0       ||      10-Apr-00       00:00   7

Explanation

NR==FNR{a[$1]=$0;next;}

NR is the number of lines read so far and FNR is the number of lines read so far from the current file. So, when NR==FNR, we are still reading the first file. If so, save this whole line, $0, in array a under the key of the first field, $1, which is the date. Then, skip the rest of the commands and jump to the next line.
if ($1 in a) print a[$1]"\t||\t"$0

If we get here, then we are reading the second file, file2. If the first field on this line, $1 is a date that we already saw in file1, in other words, if $1 in a, then print this line out together with the corresponding line from file1. The two lines are separated by tab-||-tab.

Alternative Output

If you just want to select lines from file2 whose dates are also in file1, then the code can be simplified:

$ awk 'NR==FNR{a[$1]++;next;} {if ($1 in a) print;}' file1 file2
10-Apr-00       00:00   7

Or, still simpler:

$ awk 'NR==FNR{a[$1]++;next;} ($1 in a)' file1 file2
10-Apr-00       00:00   7

score 2 · Answer 2 · answered Nov 07 '15 at 20:11

There is the relatively unknown unix command join. It can join sorted files on a key column.

To use it in your context, we follow this strategy (left.txt and right.txt are your files):

add line numbers (to put everything in the original sequence in the last step)
```
nl left.txt > left_with_lns.txt
nl right.txt > right_with_lns.txt
```

sort both files on the date column

sort left_with_lns.txt -k 2 > sl.txt
sort right_with_lns.txt -k 2 > sr.txt

join the files using the date column (all times are 0:00) (this would merge all columns of both files with correponding key, but we provide a output template to write the columns from the first file somewhere and the columns from the second file somewhere else (but only those line with a matching key will end in the result fl.txt and fr.txt)
```
join -j 2 -t $'\t' -o 1.1 1.2 1.3 1.4 sl.txt sr.txt > fl.txt
join -j 2 -t $'\t' -o 2.1 2.2 2.3 2.4 sl.txt sr.txt > fr.txt
```

sort boths results on the linenumber column and output the other columns

sort -n fl |cut -f 2- > left_filtered.txt
sort -n fr.txt | cut -f 2- > right_filtered.txt

Tools used: cut, join, nl, sort.

Can you estimate the efficiency of your commands. I think your approach is like Merge in one step. — Léo Léopold Hertz 준영, Nov 07 '15 at 20:21
Wow! For huge files, this may well be the most efficient solution possible; provided the files are pre-sorted (log files typically are) and formatted `yyyy-mm-dd...` (meaning you can eliminate all the sort steps; a potential bottleneck for huge files). — Ruud Helderman, Nov 07 '15 at 20:31
@Masi Sorry I could find information on the performance on large files and I have not looked into the source. Since join works with presorted files, I suppose that join is quite efficient. That means all depends on the sorting of the input files. — Lars Fischer, Nov 07 '15 at 20:35

Ruud Helderman · Answer 3 · 2015-11-07T14:11:57.603

As requested by @Masi, I tried to work out a solution using sed.

My first attempt uses two passes; the first transforms file1 into a sed script that is used in the second pass to filter file2.

sed 's/\([^ \t]*\).*/\/^\1\t\/p;t/' file1 > sed1
sed -nf sed1 file2 > out2

With big input files, this is s-l-o-w; for each line from file2, sed has to process an amount of patterns that equals the number of lines in file1. I haven't done any profiling, but I wouldn't be surprised if the time complexity is quadratic.

My second attempt merges and sorts the two files, then scans through all lines in search of pairs. This runs in linear time and consequently is a lot faster. Please note that this solution will ruin the original order of the file; alphabetical sorting doesn't work too well with this date notation. Supplying files with a different date format (y-m-d) would be the easiest way to fix that.

sed 's/^[^ \t]\+/&@1/' file1 > marked1
sed 's/^[^ \t]\+/&@2/' file2 > marked2

sort marked1 marked2 > sorted

sed '$d;N;/^\([^ \t]\+\)@1.*\n\1@2/{s/\(.*\)\n\(.*\)/\2\n\1/;P};D' sorted > filtered
sed 's/^\([^ \t]\+\)@2/\1/' filtered > out2

Explanation:

In the first command, s/^[^ \t]\+/&@1/ appends @1 to every date. This makes it possible to merge the files, keep equal dates together when sorting, and still be able to tell lines from different files apart.
The second command does the same for file2; obviously with its own marker @2.
The sort command merges the two files, grouping equal dates together.
The third sed command returns all lines from file2 that have a date that also occurs in file1.
The fourth sed command removes the @2 marker from the output.

The third sed command in detail:

$d suppresses inappropriate printing of the last line
N reads and appends another line of input to the line already present in the pattern space
/^$[^ \t]\+$@1.*\n\1@2/ matches two lines originating from different files but with the same date
{ starts a command group
s/$.*$\n$.*$/\2\n\1/ swaps the two lines in the pattern space
P prints the first line in the pattern space
} ends the command group
D deletes the first line from the pattern space

The bad news is, even the second approach is slower than the awk approach made by @John1024. Sed was never designed to be a merge tool. Neither was awk, but awk has the advantage of being able to store an entire file in a dictionary, making @John1024's solution blazingly fast. The downside of a dictionary is memory consumption. On huge input files, my solution should have the advantage.

Excellent answer! Is there any other tool to do the merging better? My interest is particularly for huge input files. — Léo Léopold Hertz 준영, Nov 07 '15 at 17:52
There is the join command, which merges columns from different files using a common key column, see my answer for an example in the situation of the question. join could be better suited when it is ok to get one file with the values merged from both input files. In my answer I use two consecutive joins to reconstruct the input files. I have not tested the commands on huge files. — Lars Fischer, Nov 07 '15 at 20:22
@Masi Many alternatives exist, ranging from importing everything in a database, to rolling your own solution in Perl or C. It depends on two things. (1) The input files. Are both files huge, or is one of them small enough to hold in memory? Are the files already sorted (e.g. log files)? How is each line formatted? Is the same data also available in a database, or through some log analytics tool? (2) The required output. Are you interested in the full line, or just the common part (i.e. the date)? And are your requirements subject to change? — Ruud Helderman, Nov 07 '15 at 20:35

Unix: find all lines having timestamps in both time series?

3 Answers3

Explanation

Alternative Output