1

Environment:

  • Java 8; specifically, Oracle JDK 1.8u25;
  • h2 as a SQL backend;
  • jooq for querying.

The table/database I'm querying from:

private static final String H2_URI_PREFIX = "jdbc:h2:";
private static final String H2_URI_POSTFIX
    = ";LOG=0;LOCK_MODE=0;UNDO_LOG=0;CACHE_SIZE=131072";
private static final String H2_USERNAME = "sa";
private static final String H2_PASSWORD = "";

private static final List<String> H2_DDL = Arrays.asList(
   "create table matchers ("
        + " id integer not null,"
        + " class_name varchar(255) not null,"
        + " matcher_type varchar(30) not null,"
        + " name varchar(1024) not null"
        + ");",
    "create table nodes ("
        + " id integer not null,"
        + " parent_id integer not null,"
        + " level integer not null,"
        + " success integer not null,"
        + " matcher_id integer not null,"
        + " start_index integer not null,"
        + " end_index integer not null,"
        + " time long not null"
        + ");",
    "alter table matchers add primary key(id);",
    "alter table nodes add primary key(id);",
    "alter table nodes add foreign key (matcher_id)"
        + " references matchers(id)"
);

// ...

private void doDdl(final DSLContext jooq)
{
    H2_DDL.forEach(jooq::execute);

    jooq.createIndex("nodes_parent_id").on(NODES, NODES.PARENT_ID)
        .execute();
    jooq.createIndex("nodes_start_index").on(NODES, NODES.START_INDEX)
        .execute();
    jooq.createIndex("nodes_end_index").on(NODES, NODES.END_INDEX)
        .execute();
}

Even though I show the full DDL code here (note that NODES and MATCHERS are generated by jooq's code generation package), only the nodes/NODES table is of interest.

One row in the nodes table represents a match event; of interest here are the start_index, end_index and level columns. It is guaranteed that start_index is less than or equal to end_index; as to the level column, it is the depth in the matcher tree, and depths start at 0; that is, for some matcher c in matcher path /a/b/c, c's level will be 2.

Now, the result I want to obtain is the following:

Given a line range (10, 25 or 50), return a map where keys are the line number and values are the maximum depth of the parse tree for this line; should only be considered matchers which are currently active for this line

A line is materialized by an interval [start, end) (start inclusive, end exclusive). A matcher is considered active for a given line if both of the following statements are true:

  • its start index is strictly less than the line's end index; and
  • its end index is greater than, or equal to, the line's start index.

Now, how I solve this query:

  • I create a virtual column consisting of a series of SQL case statements, one per line, checking whether a matcher is active for a given line; this column is called line;
  • I do select line, max(level) and group by line, with the added condition that the end index should be greater than or equal to the first line's start index, and the start index should be strictly less than the last line's end index.

The code:

@Override
public Map<Integer, Integer> getDepthMap(final int startLine,
    final int wantedLines)
    throws GrappaDebuggerException
{
    loadInputBuffer();

    final List<IndexRange> ranges
        = IntStream.range(startLine, startLine + wantedLines)
        .mapToObj(inputBuffer::getLineRange)
        .collect(Collectors.toList());

    final int startIndex = ranges.get(0).start;
    final int endIndex = ranges.get(ranges.size() - 1).end;
    final Condition indexCondition = NODES.START_INDEX.lt(endIndex)
        .and(NODES.END_INDEX.ge(startIndex));

    final Field<Integer> lineField = getLineField(startLine, ranges);

    final Map<Integer, Integer> ret = new HashMap<>();

    jooq.select(lineField, DSL.max(NODES.LEVEL))
        .from(NODES)
        .where(indexCondition)
        .groupBy(lineField)
        .forEach(r -> ret.put(r.value1(), r.value2() + 1));

    IntStream.range(startLine, startLine + wantedLines)
        .forEach(line -> ret.putIfAbsent(line, 0));

    return ret;
}

private Field<Integer> getLineField(final int startLine,
    final List<IndexRange> ranges)
{
    CaseConditionStep<Integer> step = DSL.decode()
        .when(activeThisRange(ranges.get(0)), startLine);

    final int size = ranges.size();

    for (int i = 1; i < size; i++)
        step = step.when(activeThisRange(ranges.get(i)), startLine + i);

    return step.as("line");
}

private static Condition activeThisRange(final IndexRange range)
{
    return NODES.START_INDEX.lt(range.end)
        .and(NODES.END_INDEX.ge(range.start));
}

This request takes approximately 15 seconds on the busiest parts of a 22 million table entry if I query for 25 lines (that is, lines n to n + 24 for some n), but is there any way to improve it?

Note that changing the SQL engine is NOT an option; this is a GUI app for which databases are "forgettable"; and I don't want to require that a "full fledged" RDBMS engine be installed!

fge
  • 119,121
  • 33
  • 254
  • 329

1 Answers1

1

I don't know H2 well enough, but since your predicate will always hit both START_INDEX and END_INDEX, it would be better to create an index on both columns:

 jooq.createIndex("better_index")
     .on(NODES, NODES.START_INDEX, NODES.END_INDEX)
     .execute();

The reason for this is that the SQL engine will need to hit the disk and scan indexes only once, as all the relevant information for the predicate is already contained in the index. This will greatly reduce your IO.

Again, not sure if H2 has this covered (pun intended), but if you also add NODES.LEVEL to the index, you'd have a so-called covering index, i.e. an index that contains all the data required for this particular query, removing the need to hit the disk again (for the MAX() function):

 jooq.createIndex("covering_index")
     .on(NODES, 
         NODES.START_INDEX, 
         NODES.END_INDEX
         NODES.LEVEL)
     .execute();

Here's also a very interesting question about range queries on PostgreSQL.

Community
  • 1
  • 1
Lukas Eder
  • 211,314
  • 129
  • 689
  • 1,509
  • Hmm, why would the disk be hit twice because of `max()`? Also, well, I am somewhat reluctant to "carpet indexing"... – fge Feb 11 '15 at 09:40
  • Twice: Once for loading the index values, and once for loading the `LEVEL` attribute from the table. [Here's another good article explaining the idea of covering indexes](http://use-the-index-luke.com/sql/clustering/index-only-scan-covering-index). Again, I don't know if this all works in H2. It would certainly work in Oracle / PostgreSQL / SQL Server, etc. What do you mean by "carpet indexing"? – Lukas Eder Feb 11 '15 at 09:58
  • "carpet indexing" is a moniker I stole from [this book](https://duckduckgo.com/?q=book+the+art+of+SQL&ia=products) and which means to, basically, add indices on everything "just because"... Also, well, the more the indices the worse the insertion performance. Although I insert ony once, 22 million rows still take some time ;) – fge Feb 11 '15 at 10:07
  • Anyway, I'll try and see with a covering index; I also don't expect this tree to be hit very often (after all you can only click a limited number of times per second and other tabs contain much more interesting statistics :p). I really can't wait for the day however when I can use jooq for all DDL as well :p – fge Feb 11 '15 at 10:10
  • 1
    Uhwell, the index on at least both `start_index` and `end_index` is in fact very useful, I filter on both in more than one query! Accepted. And thanks for jooq! – fge Feb 11 '15 at 11:55
  • Sure, you're welcome! [And thanks for this video](https://www.youtube.com/watch?v=ayWlXzvC9s8)! I already anticipated more questions from you :) – Lukas Eder Feb 11 '15 at 12:46
  • Uh, I didn't expect that you would find that video and honestly it sucks a little :p Anyway, unrelated, but will there be the possibility to build a schema via a builder pattern? For instance I see that in the generated code for tables, `this` is escaped during object construction when creating the fields... – fge Feb 11 '15 at 14:35
  • @fge: Very hard to say what you really mean... :) I suggest a new question with details... – Lukas Eder Feb 11 '15 at 18:09
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/70733/discussion-between-fge-and-lukas-eder). – fge Feb 11 '15 at 18:17