
I have to revive an old question, with a modification for long files.

I have the age of two stars in two files (File1 and File2). The age of the stars is in column $1, and the rest of the columns up to $13 contain information that I need to print at the end.

I am trying to find an age at which the two stars have the same age, or the closest age. Since the files are large (~25000 lines) I don't want to search the whole array for every line, for speed reasons. Also, the two files can differ a lot in their number of lines (by ~10000 lines in some cases).

I am not sure if this is the best way to solve the problem, but for lack of a better one, this is my idea. (If you have a faster and more efficient method, please suggest it.)

All the values have 12 decimal digits of precision, and for now I am only concerned with the first column (where the age is).

And I think I need several loops.

Let's use this value from file 1:

2.326062371284e+05

First, the routine should search File2 for all the lines whose age matches

2.3260e+05

(This loop will probably search the whole array, but if there is a way to stop the search as soon as it reaches 2.3261e+05, it will save some time.)

If it finds just one, then the output should be that value.

Usually it will find several lines, maybe even up to 1000. If this is the case, it should search again, this time against

2.32606e+05

among the lines found before. (It is a nested loop, I think.) Then the number of matches will decrease to ~200.

At that point, the routine should search for the smallest difference, within a certain tolerance X, between

2.326062371284e+05

and all these 200 lines.

This way, given these files

File1

1.833800650355e+05 col2f1 col3f1 col4f1
1.959443501406e+05 col2f1 col3f1 col4f1
2.085086352458e+05 col2f1 col3f1 col4f1
2.210729203510e+05 col2f1 col3f1 col4f1
2.326062371284e+05 col2f1 col3f1 col4f1
2.441395539059e+05 col2f1 col3f1 col4f1
2.556728706833e+05 col2f1 col3f1 col4f1

File2

2.210729203510e+05 col2f2 col3f2 col4f2
2.354895663228e+05 col2f2 col3f2 col4f2
2.499062122946e+05 col2f2 col3f2 col4f2
2.643228582664e+05 col2f2 col3f2 col4f2
2.787395042382e+05 col2f2 col3f2 col4f2
2.921130362004e+05 col2f2 col3f2 col4f2
3.054865681626e+05 col2f2 col3f2 col4f2

Output File3 (with tolerance 3000)

2.210729203510e+05 2.210729203510e+05 col2f1 col2f2 col4f1 col3f2
2.326062371284e+05 2.354895663228e+05 col2f1 col2f2 col4f1 col3f2

Important condition:

The output shouldn't contain repeated lines (star 1 can't have, at a fixed age, several different ages for star 2; only the closest one).

How would you solve this?

super thanks!

PS: I've completely changed the question, since it was shown to me that my reasoning had some errors. Thanks!

Nikko
  • Is there a max tolerance? (i.e., if we find a row whose difference is under X, do we stop there even if it's not exactly the closest one?) – Tensibai Feb 08 '16 at 13:35
  • I think there should be a tolerance, but I am not sure how to define it. It should not find an answer if the difference to the closest value is too big. (What is big? 10 yr. The time column is in years.) – Nikko Feb 08 '16 at 13:43
  • When you say `the nearest 100 rows (up and down)` - does that mean 100 rows before plus 100 after, or 50 before plus 50 after? If the current line is 10 from the start does that mean the 10 lines before and 50 (or 100) after or the 10 lines before and 90 after or something else? Basically, edit your question to explain precisely what you mean by that statement, with examples. Also, to make it easy for us to test against, edit your question to reduce the size of the files to 10 or less lines and show THE expected output given that sample input if your window is, say, 4 lines instead of 100. – Ed Morton Feb 08 '16 at 13:46
  • If the `NR` column isn't present in your input, don't include it in the example. We can count. Ditto for the first 2 lines. Make your files TESTABLE - we don't want to have to edit your files and guess at what rows and/or columns we need to remove to create your real input format - just post your real input and output format. – Ed Morton Feb 08 '16 at 14:07
  • Thanks, I just want to make it as easy as possible for me to adapt specifically to my files, but if you think the original input is better I can post it. – Nikko Feb 08 '16 at 14:11
  • But I cannot make it fully testable without adding more lines, and the position also matters even though I don't have it in my files. That is why it is better this way, if you agree. – Nikko Feb 08 '16 at 14:22
  • Certainly don't add anything that is not present in your real files. After that, reduce the example you post to the smallest possible testable example that is truly representative of your real data and a solution for which you are confident you can enhance as necessary to apply to your real files. If it's just field 1 vs field 27 and/or 10 lines vs 1000 lines then that should be trivial for you to deal with. With what you currently have posted we'd have to write useless code to skip the first 2 lines and the first column - a complete waste of time that obfuscates the solution. – Ed Morton Feb 08 '16 at 14:27
  • @Tensibai, is it clearer now? – Nikko Feb 08 '16 at 16:58
  • @Nikko yes, sorry, I've been busy the whole day and did not have time to tackle it – Tensibai Feb 08 '16 at 17:00
  • No need to apologize =) I appreciate that all of you take time to help others. By the way, I am modifying the question completely, since due to the questions and doubts I've found some problems with my reasoning. – Nikko Feb 08 '16 at 18:01
  • @EdMorton I've updated the question; maybe now it is clearer why I need some conditions. Thanks! – Nikko Feb 08 '16 at 19:36
  • It's usually better to ask a new question as a new question. – choroba Feb 08 '16 at 20:34
  • @choroba you are right. Sorry. :/ – Nikko Feb 08 '16 at 20:39

2 Answers


Not an awk solution; there comes a time when other tools are great too, so here is an answer using R.

New answer with different data, this time not reading from a file but baking an example:

library(data.table) # install.packages('data.table') if it's not present
# Sample data for the code; use fread to read from a file and setnames to name the columns accordingly
set.seed(123)
data <- data.table(age=runif(20)*1e6,name=sample(state.name,20),sat=sample(mtcars$cyl,20),dens=sample(DNase$density,20))
data2 <- data.table(age=runif(10)*1e6,name=sample(state.name,10),sat=sample(mtcars$cyl,10),dens=sample(DNase$density,10))

setkey(data,'age') # Set the key for joining to the age column
setkey(data2,'age') # Set the key for joining to the age column

# get the result
result=data[ # To get the whole datas from file 1 and file 2 at end
         data2[ 
           data, # Search for each star of list 1
           .SD, # return columns of file 2
           roll='nearest',by=.EACHI, # Join on each line (left join) and find nearest value
          .SDcols=c('age','name','dens')]
       ][!duplicated(age) & abs(i.age - age) < 1e3,.SD,.SDcols=c('age','i.age','name','i.name','dens','i.dens') ] # filter duplicates in first file and on difference
# Write the result to a file (change the separator as you wish):
write.table(format(result,digits=15,scientific=TRUE),"c:/test.txt",sep=" ")
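As asked in the comments, here is a minimal sketch of reading the real files instead of the baked sample. The file names, the absence of a header and the column labels are assumptions; adjust them to your data:

library(data.table)
# fread autodetects the separator; without a header the columns arrive as V1, V2, ...
data  <- fread("File1")
data2 <- fread("File2")
# name at least the age column (add the other labels you need, e.g. col2f1, col3f1, ...)
setnames(data,  "V1", "age")
setnames(data2, "V1", "age")
# key both tables on age so the rolling join above can use a binary search
setkey(data, 'age')
setkey(data2, 'age')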

Code for the original, single-column example:

# A nice package to have, install.packages('data.table') if it's not present
library(data.table)
# Read the data (the text can be file names)
stars1 <- fread("1.833800650355e+05
1.959443501406e+05
2.085086352458e+05
2.210729203510e+05
2.326062371284e+05
2.441395539059e+05
2.556728706833e+05")

stars2 <- fread("2.210729203510e+05
2.354895663228e+05
2.499062122946e+05
2.643228582664e+05
2.787395042382e+05
2.921130362004e+05
3.054865681626e+05")

# Name the columns (not needed if the file has a header)
colnames(stars1) <- "age"
colnames(stars2) <- "age"

# Key the data tables (for a fast join with binary search later)
setkey(stars1,'age')
setkey(stars2,'age')

# Get the result (more details below on what is happening here :))
result=stars2[ stars1, age, roll="nearest", by=.EACHI]

# Rename the columns so we can filter the whole result
setnames(result,make.unique(names(result)))

# Final filter on difference
result[abs(age.1 - age) < 3e3]

So the interesting part is the first 'join' on the two lists of star ages, searching for each age in stars1 the nearest age in stars2.

This gives (after column renaming):

> result
        age    age.1
1: 183380.1 221072.9
2: 195944.4 221072.9
3: 208508.6 221072.9
4: 221072.9 221072.9
5: 232606.2 235489.6
6: 244139.6 249906.2
7: 255672.9 249906.2

Now that we have the nearest match for each, filter those that are close enough (absolute difference below 3000 here):

> result[abs(age.1 - age) < 3e3]
        age    age.1
1: 221072.9 221072.9
2: 232606.2 235489.6
Tensibai
  • Thanks @Tensibai, installing R and trying to run the code. – Nikko Feb 09 '16 at 10:39
  • Take note of the comments, especially the first one about installing the package :) I'm unsure of what you want to do in the end, but I have the feeling R could be an interesting tool for you if you wish to graph the stars according to their relative distance or to do some statistical analysis on the data – Tensibai Feb 09 '16 at 10:41
  • I reckon I wanted to learn R this year... but sadly I haven't found the time. I've just installed it, and tried to install the package; I get this: Warning message: package ‘data.table’ is not available (for R version 3.2.3). And guess what version I have (3.2.3). – Nikko Feb 09 '16 at 10:53
  • I'm on Ubuntu 14.04. Fixed: I've changed the mirror and it works. The code runs and works great with the example. Now I'll try with the real files. – Nikko Feb 09 '16 at 11:01
  • Can you please modify it to save the result into a file? @Tensibai – Nikko Feb 09 '16 at 11:10
  • I think it is perfect! thanks to everyone! ( and so fast!) – Nikko Feb 09 '16 at 11:27
  • As Ed Morton predicted, it would have been better to state the exact problem. How can I also print specific columns of file1 and file2 in the result? @Tensibai – Nikko Feb 09 '16 at 11:48
  • I meant: the thing is, file1 and file2 have about 20 columns. The first one is time and the others are information about the star. I want to print columns 2, 5, 8, 11 of file1 and file2 for each row in the result, for example. Enjoy lunch! – Nikko Feb 09 '16 at 11:57
  • So it is not as simple as I thought. The thing is, the whole point of getting this closest age is to compare variables of the stars at that age; that information is in the other columns, and for some calculations I need columns from both files to compute another variable... – Nikko Feb 09 '16 at 13:17
  • And also, as I wrote in the question, how can I get rid of the repeated lines? As soon as I set the tolerance > 0 it finds (probably obviously) more than one match; how can I say that it should then choose the smallest difference among those lines? Although this has less priority, I could work with tolerance = 0, although then the number of lines decreases almost by half. – Nikko Feb 09 '16 at 13:24
  • @Nikko updated the answer with another example and multiple columns; the syntax is a little different but I hope I commented enough on what does what. – Tensibai Feb 09 '16 at 13:53
  • Thank you very much! Super nice answer, sorry for all the changes. I think this routine also works. I am trying to assign the names to the columns as you suggest; I am learning how, since I have no idea about R. Thanks! – Nikko Feb 09 '16 at 14:25
  • OK, for the last hour I couldn't find out how. Can you please tell me how to do this: "use fread to read from file and setnames to name the columns accordingly"? I've tried what I found searching for how to do that, but without success. I'm sure it is just one line. =/ – Nikko Feb 09 '16 at 15:15
  • @Nikko See the first code (under the break bar) `data <-fread('file1'); setnames(data,c('age','col2f1','col3f1',...))` – Tensibai Feb 09 '16 at 15:18
  • OK, perfect, it works. I've tried to obtain the results in scientific notation, but it is not working. I've used "options( digits = 15 )" and "result2 = format(result, scientific=TRUE);", which worked for me in the one-column example. But when I do this, everything is written in one column. – Nikko Feb 09 '16 at 15:35
  • Both ways give the same output, just one column in test.txt. Why is that? – Nikko Feb 09 '16 at 15:48
  • Ahh perfect, I thought it was my fault. if I change .SDcols=1,4 it will print just those two columns right? – Nikko Feb 09 '16 at 16:05
  • @Nikko I've updated with your desired output from my example, names can be replaced by index position. Hope this gives a proper insight to extend it :) – Tensibai Feb 09 '16 at 16:29
  • @Nikko side note, I think we should clean up comments above, they are useless now IMO. – Tensibai Feb 09 '16 at 16:30
  • Excellent, everything is working as I wanted. thanks for the patience!. – Nikko Feb 09 '16 at 18:51
  • @Nikko you're welcome. I hope you learned enough to write a good question at first try next time ;) (We're all happy to help people showing efforts to learn by trying and accepting criticism to improve ;)) – Tensibai Feb 09 '16 at 19:14
  • Oh yes I've learned a lot! thanks! (I need to keep learning I guess because I have a lot of questions, probably I'll make new ones =) ) thanks again. – Nikko Feb 09 '16 at 19:17
  • I'm getting this: Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 106284 rows; more than 80108 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice. Calls: [ -> [.data.table -> vecseq Execution halted – Nikko Mar 09 '16 at 16:12
  • How can I fix it? (I already tried with allow.cartesian=TRUE, but nothing happens =/) – Nikko Mar 09 '16 at 16:12
  • @Nikko oops, sorry. Well, this sounds like a whole new question; getting why it hangs and reading code in comments is awful. Can you post a new question and put the link here, please? – Tensibai Mar 14 '16 at 13:35

Perl to the rescue. This should be very fast, as it does a binary search in the given range.

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

use List::Util qw{ max min };
use constant { SIZE      => 100,
               TOLERANCE => 3000,
           };


my @times2;
open my $F2, '<', 'file2' or die $!;
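# slurp all ages from file2 (assumed to be sorted by age) into memory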
while (<$F2>) {
    chomp;
    push @times2, $_;
}

my $num = 0;
open my $F1, '<', 'file1' or die $!;
while (my $time = <$F1>) {
    chomp $time;
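    # restrict the search to a window of SIZE lines around the current line number in file2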

    my $from = max(0, $num - SIZE);
    my $to   = min($#times2, $num + SIZE);
    my $between;
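    # binary search inside the window for the position in file2 closest to $time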
    while (1) {
        $between = int(($from + $to) / 2);

        if ($time < $times2[$between] && $to != $between) {
            $to = $between;

        } elsif ($time > $times2[$between] && $from != $between) {
            $from = $between;

        } else {
            last
        }
    }
    $num++;
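    # the candidates have narrowed down to $times2[$from] and $times2[$to]; keep the closer one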
    if ($from != $to) {
        my $f = $time - $times2[$from];
        my $t = $times2[$to] - $time;
        $between = ($f > $t) ? $to : $from;
    }
    say "$time $times2[$between]" if TOLERANCE >= abs $times2[$between] - $time;
}
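For reference: assuming file1 and file2 contain only the age column, running the script on the question's sample data prints exactly the two pairs that fall within the 3000-year tolerance:

2.210729203510e+05 2.210729203510e+05
2.326062371284e+05 2.354895663228e+05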
choroba
  • I've compiled it but I get this: Missing right curly or square bracket at closest.pl line 37, at end of line; syntax error at closest.pl line 37, at EOF; Execution of closest.pl aborted due to compilation errors. After I added the right curly at line 37 I get this: Global symbol "$nr" requires explicit package name at closest.pl line 38. Global symbol "$time" requires explicit package name at closest.pl line 38. Global symbol "$between" requires explicit package name at closest.pl line 38. Execution of closest.pl aborted due to compilation errors. – Nikko Feb 08 '16 at 14:39
  • It works for me. What Perl version do you have? `perl -v`. Have you noticed there's a scrollbar on the right? – choroba Feb 08 '16 at 15:04
  • without or with the right curly ? version: (v5.18.2) – Nikko Feb 08 '16 at 15:06
  • @Nikko: As posted here. Nothing is missing (you probably didn't copy the closing right curly bracket, and added it manually to a wrong position). – choroba Feb 08 '16 at 15:07
  • I've updated the question. In principle it is working, but I am expecting just two lines of output. Can you fix it? – Nikko Feb 08 '16 at 15:12
  • @Nikko: I don't understand the question now. Why are there only two lines in the output? – choroba Feb 08 '16 at 15:40
  • I think just adding a tolerance to your code should make it work. Just two lines, because they are the only ones that fulfil the tolerance condition. – Nikko Feb 08 '16 at 15:45
  • Thanks! It is working with the example, but when I check with a larger file it doesn't display the correct output. I think the lines with the same value (so the difference is 0) are not being displayed, except for the first one. – Nikko Feb 08 '16 at 16:46
  • And when I try with the real file it repeatedly says this: Use of uninitialized value in numeric gt (>) at closest.pl line 33, <$F1> line 21618. – Nikko Feb 08 '16 at 16:56
  • @Nikko: Can you show an input line that produces the warning? – choroba Feb 08 '16 at 16:58
  • http://www.filedropper.com/file1_2 and http://www.filedropper.com/file2_1 are the original files. I don't know perl too much. But I am sure with those files you will get something. – Nikko Feb 08 '16 at 17:01
  • I'm not getting any files from the links. – choroba Feb 08 '16 at 17:06
  • Weird I've tested them, another host: http://www.filehosting.org/file/details/541924/bothfiles.zip – Nikko Feb 08 '16 at 17:11
  • I'm not giving my email address to anyone. – choroba Feb 08 '16 at 17:16
  • http://s000.tinyupload.com/index.php?file_id=07762346186856365156 no email needed. – Nikko Feb 08 '16 at 17:18
  • The two files don't have the same number of lines. The difference is about 9000, which is more than the "SIZE". Setting SIZE to 8643 is enough. – choroba Feb 08 '16 at 17:24
  • exactly! and the files are like this (I should've said that in the example.) – Nikko Feb 08 '16 at 17:26
  • @Nikko: It doesn't make any sense. What line from file2 should I use to match line number 29320 from file1? – choroba Feb 08 '16 at 17:27
  • I know what the problem is. I wasn't aware of the big difference in the number of lines for the example... In the end it is simple: it should be compared with the last lines from file2... (What matters more is the time than the NR; the closer the better.) – Nikko Feb 08 '16 at 17:32
  • I think your routine is almost perfect, but for example it has repeated values that shouldn't exist. – Nikko Feb 08 '16 at 17:34
  • @Nikko: Where is that stated in the question? – choroba Feb 08 '16 at 17:35
  • It is not. I am updating it the best I can now. – Nikko Feb 08 '16 at 17:36
  • Maybe just adding `$num = $between;` before the `say` line helps? – choroba Feb 08 '16 at 17:43
  • I think it didn't. I am updating the question without talking in general terms now; maybe it is clearer why I need some conditions. – Nikko Feb 08 '16 at 17:48
  • I think it is better, instead of searching using NR, to do the following: for example, if the value I am searching for is 4.071420520520e+06, search – Nikko Feb 08 '16 at 18:01
  • It is basically a new question, but thanks to you I found a problem in my logic. I am fixing it now. I think it is better to do a double loop. For example, if I am searching for the closest to 4.071420520520e+06, first it has to search for the closest values to 4.0714, but check whether there are too many of them. Let's say the routine found 1000 lines that match that. – Nikko Feb 08 '16 at 18:08
  • Then search again, but against 4.07142, and if the number of lines found is less than some value (let's say 100), then search among those lines for the closest age under some tolerance (3000 years); if the tolerance is satisfied, it should again choose the closest among them. – Nikko Feb 08 '16 at 18:09
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/102929/discussion-between-nikko-and-choroba). – Nikko Feb 08 '16 at 19:07