
Two questions concerning the uniq command, please help.

First question

Say I have two files:

$ cat 1.dat
0.1 1.23
0.2 1.45
0.3 1.67

$ cat 2.dat
0.3 1.67
0.4 1.78
0.5 1.89

Using `cat 1.dat 2.dat | sort -n | uniq > 3.dat`, I can merge the two files into one. The result is:

0.1 1.23
0.2 1.45
0.3 1.67
0.4 1.78
0.5 1.89

But if 1.dat contains a number in scientific notation,

$ cat 1.dat
1e-1 1.23
0.2 1.45
0.3 1.67

the result would be:

0.2 1.45
0.3 1.67
0.4 1.78
0.5 1.89
1e-1 1.23

which is not what I want. How can I make uniq understand that 1e-1 is a number, not a string?

Second question

Same as above, but this time the first row of the second file 2.dat is slightly different (0.3 1.67 becomes 0.3 1.57):

$ cat 2.dat
0.3 1.57
0.4 1.78
0.5 1.89

Then the result would be:

0.1 1.23
0.2 1.45
0.3 1.67
0.3 1.57
0.4 1.78
0.5 1.89

My question is this: how can I make uniq detect repetition based only on the first column, keeping the value from the first file, so that the result is still:

0.1 1.23
0.2 1.45
0.3 1.67
0.4 1.78
0.5 1.89

Thanks

A more complex test case:

$ cat 1.dat
1e-6 -1.23
0.2 -1.45
110.7 1.55
0.3 1.67e-3
Daniel

3 Answers


The first part only:

cat 1.dat 2.dat | sort -g -u

1e-1 1.23
0.2 1.45
0.3 1.67
0.4 1.78
0.5 1.89

man sort

  -g, --general-numeric-sort
          compare according to general numerical value

  -u, --unique
          with -c, check for strict ordering; without -c, output only the first of an equal run
sotapme
  • FYI, I found this works too: `cat 1.dat 2.dat | sort -g -u | awk '{ printf "%.6f %s\n", $1, $2 }'`, changing to `cat 2.dat 1.dat | sort -g -u | awk '{ printf "%.6f %s\n", $1, $2 }'` will keep 2.dat's value. So thanks a lot! – Daniel Feb 14 '13 at 23:30
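For the second question, GNU sort alone may also be enough if uniqueness is keyed on the first field only. A sketch, assuming GNU coreutils sort, where -u outputs only the first line of each equal run and disables the last-resort whole-line comparison, so with -s the line from whichever file is listed first survives:

```shell
# Merge, comparing only the first column numerically (-g -k1,1); for
# duplicate keys, -u keeps the first line of the equal run, and -s
# preserves input order among equal keys, so 1.dat's line wins.
sort -g -s -u -k1,1 1.dat 2.dat
```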

One awk (GNU awk) one-liner solves both of your problems:

  awk '{a[$1*1];b[$1*1]=$0}END{asorti(a);for(i=1;i<=length(a);i++)print b[a[i]];}' file2 file1

Test with data. Note that I made file1 unsorted and used 1.57 in file2, as you wanted:

kent$  head *
==> file1 <==
0.3 1.67
0.2 1.45
1e-1 1.23

==> file2 <==
0.3 1.57
0.4 1.78
0.5 1.89

kent$  awk '{a[$1*1];b[$1*1]=$0}END{asorti(a);for(i=1;i<=length(a);i++)print b[a[i]];}' file2 file1
1e-1 1.23
0.2 1.45
0.3 1.67
0.4 1.78
0.5 1.89

edit

To display 0.1 instead of 1e-1:

kent$  awk '{a[$1*1];b[$1*1]=$2}END{asorti(a);for(i=1;i<=length(a);i++)print a[i],b[a[i]];}' file2 file1
0.1 1.23
0.2 1.45
0.3 1.67
0.4 1.78
0.5 1.89

edit 2

For the precision: awk's default output format (OFMT) is %.6g, which you could change. But if you want to display a different precision per line, we need a bit of a trick:

(I added 1e-9 in file1)

kent$  awk '{id=sprintf("%.9f",$1*1);sub(/0*$/,"",id);a[id];b[id]=$2}END{asorti(a);for(i=1;i<=length(a);i++)print a[i],b[a[i]];}'  file2 file1 
0.000000001 1.23
0.2 1.45
0.3 1.67
0.4 1.78
0.5 1.89

If you want to display the same precision for all lines:

kent$  awk '{id=sprintf("%.9f",$1*1);a[id];b[id]=$2}END{asorti(a);for(i=1;i<=length(a);i++)print a[i],b[a[i]];}'  file2 file1 
0.000000001 1.23
0.200000000 1.45
0.300000000 1.67
0.400000000 1.78
0.500000000 1.89
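If the same precision everywhere is acceptable, simply changing OFMT globally is enough; a minimal sketch:

```shell
# OFMT controls how print renders non-integer numbers (default %.6g),
# so setting it once converts scientific notation on output.
echo '1e-9 1.23' | awk 'BEGIN { OFMT = "%.9f" } { print $1 + 0, $2 }'
# prints: 0.000000001 1.23
```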
Kent
  • One more question, could also awk be able to convert `1e-1` to `0.1` in the output? Thanks a lot! Or I can start a new thread. – Daniel Feb 14 '13 at 22:24
  • @Daniel - yes, check this amazing answer http://stackoverflow.com/a/11378022/297323 (that's why I used python) – Fredrik Pihl Feb 14 '13 at 22:26
  • I have to say that this `one line` work-out is very useful for me, as I have numerous files to merge, and also I have already written a shell scripts, so I appreciate the python solution here (I guess python would be more powerful than awk??), but I will use this awk solution. – Daniel Feb 14 '13 at 22:32
  • @Daniel yes awk certainly can. check my edit in answer. btw, I will be happy if you accept my answer. :D – Kent Feb 14 '13 at 22:42
  • @Kent Wait, why change `1e-1` to `1e-9` and your solution doesn't work any longer? And isn't there a way to output %.7f format for all the first column? – Daniel Feb 14 '13 at 22:47
  • Okay, I use this finally `awk '{a[$1*10];b[$1*10]=$2}END{asorti(a);for(i=1;i<=length(a);i++)print a[i]/10,b[a[i]];}' 1.dat 2.dat` – Daniel Feb 14 '13 at 23:04
  • FYI, I found this works too: `cat 1.dat 2.dat | sort -g -u | awk '{ printf "%.6f %s\n", $1, $2 }'` – Daniel Feb 14 '13 at 23:29

To change the scientific notation to decimal, I resorted to Python:

#!/usr/bin/env python3

import sys
import glob

# expand any glob patterns given on the command line
infiles = []
for a in sys.argv[1:]:
    infiles.extend(glob.glob(a))

for f in infiles:
    with open(f) as fd:
        for line in fd:
            # float() accepts scientific notation such as 1e-1
            x, y = (float(v) for v in line.split())
            print(x, y)

output:

$ ./sn.py 1.dat 2.dat
0.1 1.23
0.2 1.45
0.3 1.67
0.3 1.67
0.4 1.78
0.5 1.89
Fredrik Pihl
  • nice one, however you didn't sort. if the input files are not sorted, the output would be different, right? also the `uniq` part is not done either.. – Kent Feb 14 '13 at 22:49
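Regarding that comment, one possible way to bolt the missing sort and first-column dedupe onto the script's output (a sketch, assuming GNU sort for -g):

```shell
# The script prints files in argument order, so awk's first-seen filter
# keeps 1.dat's line when a first-column value repeats; sort -g then
# orders the result numerically.
./sn.py 1.dat 2.dat | awk '!seen[$1]++' | sort -g
```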