I'm writing this to avoid an O(n!) time complexity, but I only have pseudocode right now because there are some things I'm unsure about implementing.

This is the format of the file that I want to pass into this script. The data is sorted by the third column -- the start position.

93 Blue19 1 82
87 Green9 1 7912
76 Blue7 2 20690
65 Red4 2 170
...
...
256 Orange50 17515 66740
166 Teal68 72503 123150
228 Green89 72510 114530

Explanation of the code:

I want to create an array of arrays to find when two pieces of information have overlapping lengths.

Columns 3 and 4 of the input file are start and stop positions on a single track line. If any row(x) has a position in column 3 that is shorter than the position in column 4 in any row(y) then this means that x starts before y ends and there is some overlap.

I want to find every row that overlaps with any row without having to compare every row to every row. Because they are sorted, I simply add a string to an inner array of the array which represents one row. If the new row being looked at does not overlap with one of the rows already in the array then (because the array is sorted by the third column) no further row will be able to overlap with the row in the array and it can be removed.

This is the idea I have so far:

#!/usr/bin/perl -w

use strict;

my @array

while (<>) {

    my thisLoop = ($id, $name, $begin, $end) = split;
    my @innerArray = split; # make an inner array with the current line, to 
                            # have strings that will be printed after it

    push @array(@innerArray)

    for ( @array ) { # loop through the outer array being made to see if there 
                     # are overlaps with the current item

        if ( $begin > $innerArray[3]) # if there are no overlaps then print 
                                      # this inner array and remove it
                                      # (because it is sorted and everything
                                      # else cannot overlap because it is 
                                      # larger)
            # print @array[4-]
            # remove this item from the array
        else
            # add to array this string
            "$id overlap with innerArray[0] \t innerArray[0]: $innerArray[2], $innerArray[3] "\t" $id :  $begin, $end         
            # otherwise because there is overlap add a statement to the inner
            # array explaining the overlap
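
A minimal runnable sketch of the idea above, assuming the input is already sorted by start position and that ranges which merely touch count as overlapping. The helper name `find_overlaps` and the in-line sample rows are only illustrative. It keeps a list of rows that can still overlap a later row and drops a row as soon as a new row starts after it ends. Note that it groups the report by the later row rather than the earlier one, and that the worst case (every row overlaps every other) is still quadratic.

```perl
use strict;
use warnings;

# Hypothetical helper: takes rows of [ id, name, begin, end ] sorted by begin
# and returns one report line per overlapping pair.
sub find_overlaps {
    my @rows = @_;
    my ( @active, @report );

    for my $row (@rows) {
        my ( $id, undef, $begin, $end ) = @$row;

        # Drop rows that ended before this one begins. The input is sorted
        # by begin, so no later row can overlap them either.
        @active = grep { $_->[3] >= $begin } @active;

        # Every row still in the active list overlaps the current row.
        push @report, sprintf "%d overlap with %d\t%d: %d %d\t%d: %d %d",
            $id, $_->[0], @{$_}[ 0, 2, 3 ], $id, $begin, $end
            for @active;

        push @active, $row;
    }
    return @report;
}

# Sample rows from the question (assumed already sorted by begin).
my @rows = map { [ split ' ', $_ ] } (
    '93 Blue19 1 82',
    '87 Green9 1 7912',
    '76 Blue7 2 20690',
    '65 Red4 2 170',
    '256 Orange50 17515 66740',
    '166 Teal68 72503 123150',
    '228 Green89 72510 114530',
);

print "$_\n" for find_overlaps(@rows);
```

On the sample rows this reports eight overlapping pairs, starting with 87 against 93 and ending with 228 against 166.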

The code should produce something like

87 overlap with 93     93: 1 82      87: 1 7912
76 overlap with 93     93: 1 82      76: 2 20690
65 overlap with 93     93: 1 82      65: 2 170
76 overlap with 87     87: 1 7912    76: 2 20690
65 overlap with 87     87: 1 7912    65: 2 170
65 overlap with 76     76: 2 20690   65: 2 170
256 overlap with 76    76: 2 20690   256: 17515 66740
228 overlap with 166   166: 72503 123150   228: 72510 114530

This was tricky to explain so ask me if you have any questions

Borodin
Sam
  • Related: [Quickest way to determine range overlap in Perl](http://stackoverflow.com/q/4677465/176646), which also asks about finding the overlap of sets of ranges. In this case, I think you could just compare your set to itself. – ThisSuitIsBlackNot May 18 '16 at 19:58
  • @ThisSuitIsBlackNot Are there built-in Java interval search trees that I would just be able to access and search my whole set with? – Sam May 18 '16 at 20:10
  • *"If any row(x) has a position in column 3 that is shorter than the position in column 4 in any row(y) then this means that x starts before y ends and there is some overlap"* Are you sure about this? If row **x** starts at 10 and stops at 20 and row **y** starts at 30 and stops at 40, then 10 is less than 40 but there is no overlap. You may be correct if your data is sorted in some way, but what you say isn't generally true. – Borodin May 18 '16 at 20:14
  • What do you mean by *"a single track line"*? – Borodin May 18 '16 at 20:14
  • @Borodin I'm just trying to describe that if, say, the start and stop positions represent a piece of tape or something, then all rows (pieces of information) would be filled in on the same piece of tape. It's just a way I'm trying to explain it so it can be visualized. – Sam May 18 '16 at 20:26
  • @B.Monster: So you mean something like a Perl string? – Borodin May 18 '16 at 20:28
  • @ThisSuitIsBlackNot: Nice source. Thank you – Borodin May 18 '16 at 20:30
  • @B.Monster: How many records are you dealing with? How big is your file? – Borodin May 18 '16 at 20:31

2 Answers

This produces exactly the output that you asked for given your sample data as input. It runs in well under one millisecond.

Do you have other constraints that you haven't explained? Making your code run faster should never be an end in itself. There is nothing inherently wrong with an O(n!) time complexity: it is the execution time that you must consider, and if your code is fast enough then your job is done.

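use strict;
use warnings 'all';

# Read every non-blank row into an array of [ id, name, start, stop ]
my @data = map [ split ], grep /\S/, <DATA>;

for my $i1 ( 0 .. $#data ) {

    my $v1 = $data[$i1];

    # Compare each row only with itself and the rows that follow it
    for my $i2 ( $i1 .. $#data ) {

        my $v2 = $data[$i2];

        # Skip comparing a row with itself
        next if $v1 == $v2;

        # Two ranges overlap unless one ends before the other starts
        unless ( $v1->[3] < $v2->[2] or $v1->[2] > $v2->[3] ) {
            my $statement = sprintf "%d overlap with %d", $v2->[0], $v1->[0];
            printf "%-22s %d: %d %-7d %d: %d %-7d\n", $statement, @{$v1}[0, 2, 3], @{$v2}[0, 2, 3];
        }
    }
}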

__DATA__
93 Blue19 1 82
87 Green9 1 7912
76 Blue7 2 20690
65 Red4 2 170
256 Orange50 17515 66740
166 Teal68 72503 123150
228 Green89 72510 114530

output

87 overlap with 93     93: 1 82      87: 1 7912   
76 overlap with 93     93: 1 82      76: 2 20690  
65 overlap with 93     93: 1 82      65: 2 170    
76 overlap with 87     87: 1 7912    76: 2 20690  
65 overlap with 87     87: 1 7912    65: 2 170    
65 overlap with 76     76: 2 20690   65: 2 170    
256 overlap with 76    76: 2 20690   256: 17515 66740  
228 overlap with 166   166: 72503 123150  228: 72510 114530 
Borodin
  • The file is tens of thousands of lines so I don't think something O(n!) would finish – Sam May 18 '16 at 22:21
  • I'm not sure how you work out that it is ***O(n!)***. I would have thought it was ***O(n\*n)***. And it is very wrong to reject a simple solution out of hand without even testing it just because you *"don't think [it] would finish"* – Borodin May 19 '16 at 08:24
I am using the posted input and output files as a guide on what is required.

A note on complexity. In principle, each line has to be compared to all following lines. The number of operations actually carried out depends on the data. Since it is stated that the data is sorted by the start position, the inner loop can stop as soon as the overlapping stops: once a following line starts after the current line ends, every later line does too. A comment on the complexity estimate is at the end.

This compares each line to the ones following it. For that, all lines are first read into an array. If the data set is very large this should be changed to read line by line, with the procedure turned around to compare the currently read line to all previous ones. This is a very basic approach. It may well be better to build auxiliary data structures first, possibly making use of suitable libraries.

use warnings;
use strict;

my $file = 'data_overlap.txt';
my @lines = do { 
    open my $fh, '<', $file or die "Can't open $file -- $!";
    <$fh>;
};

# For each element compare all following ones, but cut out 
# as soon as there's no overlap since data is sorted
for my $i (0..$#lines) 
{  
    my @ref_fields = split '\s+', $lines[$i];
    for my $j ($i+1..$#lines) 
    {   
        my @curr_fields = split '\s+', $lines[$j]; 
        if ( $ref_fields[-1] > $curr_fields[-2] ) { 
            print "$curr_fields[0] overlap with $ref_fields[0]\t" .
                "$ref_fields[0]: $ref_fields[-2] $ref_fields[-1]\t" .
                "$curr_fields[0]: $curr_fields[-2] $curr_fields[-1]\n";
        }   
        else { print "\tNo overlap, move on.\n"; last }
    }   
}

With the input in file 'data_overlap.txt' this prints

87 overlap with 93      93: 1 82        87: 1 7912
76 overlap with 93      93: 1 82        76: 2 20690
65 overlap with 93      93: 1 82        65: 2 170
        No overlap, move on.
76 overlap with 87      87: 1 7912      76: 2 20690
65 overlap with 87      87: 1 7912      65: 2 170
        No overlap, move on.
65 overlap with 76      76: 2 20690     65: 2 170
256 overlap with 76     76: 2 20690     256: 17515 66740
        No overlap, move on.
        No overlap, move on.
        No overlap, move on.
228 overlap with 166    166: 72503 123150       228: 72510 114530

A comment on complexity

Worst case: Each element has to be compared to every other (they all overlap). This means that for each element we need up to N-1 comparisons, and we have N elements. Even when each element is compared only to the ones following it, that is N(N-1)/2 comparisons, so this is O(N^2) complexity. This complexity is not good for operations that are used often and on potentially large data sets, like what libraries do. But it is not necessarily bad for a particular problem -- the data set still needs to be quite large for that to result in prohibitively long runtimes.

Best case: Each element is compared only once (no overlap at all). This implies about N comparisons, thus O(N) complexity.

Average: Let us assume that each element overlaps with a "few" next ones, let us say 3 (three). This means that there would be about 3N comparisons. This is still O(N) complexity. This holds as long as the number of comparisons per element does not depend on the length of the list (but is constant), which is a very reasonable typical scenario here. This is good.
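
To make these estimates concrete, here is a small sketch that counts the comparisons made by the early-exit scan on two artificial inputs. The helper name `count_comparisons` and both generated data sets are assumptions made up for illustration; each interval is a `[begin, end]` pair and the list is sorted by `begin`.

```perl
use strict;
use warnings;

# Hypothetical helper: run the early-exit pairwise scan over [begin, end]
# pairs sorted by begin and return the number of comparisons performed.
sub count_comparisons {
    my @iv    = @_;
    my $count = 0;
    for my $i ( 0 .. $#iv ) {
        for my $j ( $i + 1 .. $#iv ) {
            $count++;
            # Stop at the first following interval that does not overlap
            last unless $iv[$i][1] > $iv[$j][0];
        }
    }
    return $count;
}

my $n = 1000;

# Worst case: every interval overlaps every other one
my @all_overlap = map { [ $_, $n + $_ ] } 1 .. $n;

# Best case: no two intervals overlap
my @disjoint = map { [ 10 * $_, 10 * $_ + 5 ] } 1 .. $n;

printf "all overlap: %d comparisons\n", count_comparisons(@all_overlap);
printf "disjoint:    %d comparisons\n", count_comparisons(@disjoint);
```

With N = 1000 the first count is 499500, which is N(N-1)/2, and the second is 999, which is N-1.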

Thanks to ikegami for bringing this up in the comment, along with the estimate.

Remember that the importance of the computational complexity of a technique depends on its use.

Community
zdim
  • The OP's data is sorted by the *start* column. It wasn't very clear in the original post, and I hope I've improved it – Borodin May 18 '16 at 21:05
  • @Borodin Oh ... then this 'answer' may need to go altogether ... reviewing. Thank you! – zdim May 18 '16 at 21:07
  • Maybe not. I think we're both ignoring the request to avoid a specific complexity, and that's correct as that should only be a goal in experimental software. The OP hasn't explained any real constraints, and as far as we know his file has only seven or so lines of data – Borodin May 18 '16 at 21:10
  • @Borodin Right. Looking at it some more -- I don't think they can get the complexity target. If the data is not sorted everything has to be compared. I don't see how it is possible to tell when to cut it out. As for the length, given what the input looks like I think that it is reasonable to guess that it is feasible to pull it all into an array first. – zdim May 18 '16 at 21:13
  • Analysis: Worst case: O(N^2). Average case: O(N), since the runs of overlaps are probably usually short. – ikegami May 19 '16 at 05:31
  • @ikegami Thank you! I wasn't sure that I could assess the average case like that. (I thought that OP's _avoid a O(n!)_ was talking merely about not running each vs each, and in that not getting the complexity right.) I am going to add a note on complexity, and I'd like to quote this if that is OK with you. – zdim May 19 '16 at 05:54