2

Preamble (you can skip this, this is just my justification)

I have created an application which uses sqlite as its database backend, and the schema works (and performs) very well during general application use.

Now I am attempting to build a reporting system for it, and I have built an excel xll which creates query tables from an unnamed DSN. Because of this, I have to do all of my reporting in sql only (I.e. I can do nothing programatically). This works very well for everything except for one query...

/// SKIP HERE....

My database contains a list of Features which have an id, distance and an indicator for whether the feature is a marker or not. The id is not necessarily in the same order as the distance (a feature with id 10 might have distance 100 and the feature with id 11 might have distance 90).

So the item basically looks like:

Feature { int id, int distance, bool is_marker }

What I am trying to do is find the next and previous feature that are also markers.

/// Edit

My first attempt used:

select 
*          /* I want all the data from this feature */
(select MAX(f2.distance) - f1.distance 
    from feature as f2
    where f2.is_marker && f2.distance < f1.distance) /* and the distance to the previous marker */
from feature as f2

second attempt (this one works, it just takes WAAAY too long for 100,000 features, somewhere around 9 days...):

select
*,          /* I want all the data from this feature */
(select f1.distance - MAX(f2.distance)
    from feature as f2
    where f2.distance AND f2.distance< f1.distance) /* and the distance to the previous marker */
from feature as f1

This query does return what I want, and performs adequately for small databases, but I have to support MUCH larger databases as well.

(Some of the databases have less than 1000 features, but the one I am working on right now has >90,000 features. The query for 1000 features takes <1s, but the one for 90,000 features takes 20 hours. This is because it does not grow linearly resulting in an 80x decrease in performance: 20*60*60/(90,000/1000) = 8000)

The backend database uses sqlite, and I am using the sqliteodbc connector to hookup excel.

If I was to do this in code, I would do this:

var features = featureRepository.GetAll();
var featuresWithMarkerDistance = new List<FeatureWithMarkerDistance>();
var previousMarker = null;
for(var index = 0; index < features.Length; index++) {
    var currentFeature = features[index];
    featuresWithMarkerDistance.Add(
        new FeaturesWithMarkerDistance(currentFeature, 
            feature.distance - previousMarker.distance));
    if(feature.is_marker) {
        previousMarker = feature;
    }
}

// FeatureWithMarkerDistance { int id, int distance, bool is_marker, int marker_distance }

// EDITS:

Here is a concrete example:

(The underlying table)
feature_id is_marker distance
1          false     100
2          false     90
3          false     101
4          true      50
5          false     5
6          true      85
7          false     150
8          false     75

(There is an index on distance)

The results I want:

feature_id is_marker distance distance_to_closest_previous_marker
1          false     100      15
2          false     90       5
3          false     101      16
4          true      50       null
5          false     5        null
6          true      85       35
7          false     150      65
8          false     75       25

So if I was getting the previous marker for feature_id 1, the distance for feature_id 1 is 100, and the closest marker is feature_id 6 at distance 85. To get the distance to closest previous marker, I take (100 - 85) = 15. I need to get this value for every single feature to include in a report. (This must be done in a single sql query because I'm using an odbc connector with excel). The above query does fetch what I want, it just performs very badly because in the where clause it has to search over the entire database for every feature.

What I would like to do is this: (unless there is a more performant way)

   select *          
    /* I want all the data from this feature */ 
    /* previous  = */ (select MAX(f2.distance) - f1.distance 
        from feature as f2
        where f2.is_marker && f2.distance >= previous && f2.distance < f1.distance) 
    /* and the distance to the previous marker */
    from feature as f2

So the basic theory is that I would store the previous marker value and only look at that value and beyond when I'm looking for the next marker.

Sorry about the confusion originally (I forgot to put the MAX() in originally)

bdkoepke
  • 23
  • 4

3 Answers3

0

Don't know SQLite, but does something like this work (I looked up the syntax, found LEFT JOIN and EXISTS, but not NOT EXISTS)?

select f2.*, f2.distance - f1.distance
from feature f2
left join feature f1 on f1.is_marker
                    and f2.distance > f1.distance
                    and not exists(select 1 from feature f1b
                                   where f1b.is_marker
                                     and f2.distance > f1b.distance
                                     and f1.distance < f1b.distance)
where f2.is_marker

I have no clue about performance, but expect that an index on (is_marker, distance) can be advantageous (you have to test whether or not it is useful to include is_marker in the index or not, in addition to this depending on SQLite, it probably also depends on the percentage of columns having is_marker = true).

Set
  • 391
  • 1
  • 2
  • First off, thanks for posting. I tried this query and it gets the distance between the markers. I will edit my question to try to improve it. – bdkoepke Jul 13 '12 at 22:54
  • Throwing a combined index on is_marker and distance makes it possible to run the query (it takes 52 seconds), but I'm thinking I can do better. This would be about 10 minutes for 1 million features. I have to run this query ten times per project, so that would be 1.4 hours, still a bit too long... – bdkoepke Jul 14 '12 at 03:28
0

The examples really helped. Here you go.

SELECT F2.feature_id, F2.is_marker, F2.distance, 
       F2.distance - (SELECT F1.distance FROM features F1
                      WHERE F1.is_marker<>0 
                        AND F1.distance<F2.distance
                      ORDER BY F1.distance DESC
                      LIMIT 1) AS "distance_to_closest_previous_marker"
FROM features F2
Holger Brandt
  • 4,324
  • 1
  • 20
  • 35
  • Thanks for posting as well. I tried this query, and this is what I want to do; however, this query performs about the same as the query I originally posted. The problem is that the query execution time is non-linear. If I run this query with limit 100, 1000, and 10,000 the query times are 15ms, 188ms, 32838ms. If it was linear time growth, I would expect the 10,000ms query to take 1.5 seconds. 33 seconds isn't too terrible, but I need to perform this query on 100,000 features. With 1 13x growth rate (188/15 = ~12, 32838/15 = ~12) = 188*12*10*32838/1000/3600 = 206 hours for 100,000 features. – bdkoepke Jul 14 '12 at 02:59
  • I'm not sure if it is even possible to do this query faster. Doing this in code takes <10 seconds (with 100,000 features). Some of the project files have >1 million features, which would be only 1.5 minutes in code, but using this query I would be dead before it completed :) – bdkoepke Jul 14 '12 at 03:02
0

I used SQLite3 shell and I tried your query adapted

SELECT *, 
       (SELECT MIN(feature.distance-distance) FROM feature AS f
               WHERE is_marker AND distance<feature.distance) 
       FROM feature;

and it performed fairly well for 5000 records. Maybe your weakest point is sqliteobdc? If indeed it stills slow, and assuming you have few true is_marker, you can create a table with just distances from feature where is_marker is true:

CREATE TEMP TABLE markers_distance (distance);
CREATE UNIQUE INDEX markers_idx ON markers_distance (distance);
INSERT OR IGNORE INTO markers_distance 
       SELECT distance FROM feature WHERE is_marker;

and now your query against markers_distance should be much faster:

SELECT *, 
       (SELECT MIN(feature.distance-distance) FROM markers_distance
               WHERE distance<feature.distance) 
       FROM feature;
LS_ᴅᴇᴠ
  • 10,823
  • 1
  • 23
  • 46