
I'm thinking about developing a system to perform highly parallel queries on nested (but tree-like) data. The potential users are data analysts (physicists, specifically), not programmers. For the user interface, I want to use a well-known query language to avoid proliferating new languages.

Most of the data would be structured like this (imagine the following schema for billions of event structures):

event: struct
  |
  +--- timestamp: bigint
  +--- missing energy: float
  +--- tracks: array of struct
  |      |
  |      +--- momentum: float
  |      +--- theta angle: float
  |      +--- hits: array of struct
  |             |
  |             +--- detector id: int
  |             +--- charge: float
  |             +--- time: float
  |             +--- ...
  +--- showers: array of struct
         |
         +--- ...

The database would be read-only, and most of the queries would be things like:

  • momentum of the track with the most hits with theta between -2.4 and 2.4
  • average charge of all hits with time in 0-10 ps on all tracks with momentum greater than 10 GeV/c
  • weighted average theta of the two tracks with highest momentum

et cetera. What these queries have in common is that they all resolve to one scalar per event, though they delve into the arrays of structs to do it. They perform "reduce"-like operations (generally fold in Scala, aggregate in Spark, DAF in SQL) across filtered, transformed subsets of those arrays. I could write them in Scala like this:

// missing check for when zero tracks passed filter!
{event => event.tracks                      // get list of tracks
               .filter(abs(_.theta) < 2.4)  // in theta range
               .maxBy(_.hits.size)          // take the one with the most hits
               .momentum                    // return its momentum
}

{event => mean(
            event.tracks                    // get list of tracks
                 .filter(_.momentum > 10)   // in momentum range
                 .flatMap(_.hits)           // explode to hits
                 .filter(_.time < 10)       // in time range
                 .map(_.charge)             // return their charges
              )}                            // ... to the mean function

// again missing check for less than two tracks!
{event => val List(one, two) =              // unpack and assign "one" and "two"
              event.tracks                  // get list of tracks
                   .sortBy(-_.momentum)     // sort by momentum, descending
                   .take(2)                 // take the top two
          // now compute the weighted mean of structs "one" and "two"
          (one.theta*one.momentum + two.theta*two.momentum) /
              (one.momentum + two.momentum)
}
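For reference, the missing checks flagged in the comments above could be handled with `Option`; a minimal sketch, assuming hypothetical case classes mirroring the schema (none of these names come from an existing API):

```scala
// Hypothetical case classes mirroring the event schema above.
case class Hit(detectorId: Int, charge: Double, time: Double)
case class Track(momentum: Double, theta: Double, hits: List[Hit])
case class Event(timestamp: Long, missingEnergy: Double, tracks: List[Track])

object SafeQueries {
  // Query 1: momentum of the track with the most hits, |theta| < 2.4.
  // headOption yields None when no track passes the filter.
  def query1(event: Event): Option[Double] =
    event.tracks
         .filter(t => math.abs(t.theta) < 2.4)
         .sortBy(t => -t.hits.size)
         .headOption
         .map(_.momentum)

  // Query 3: weighted mean theta of the two highest-momentum tracks.
  // The pattern match yields None when there are fewer than two tracks.
  def query3(event: Event): Option[Double] =
    event.tracks.sortBy(t => -t.momentum).take(2) match {
      case List(one, two) =>
        Some((one.theta * one.momentum + two.theta * two.momentum) /
             (one.momentum + two.momentum))
      case _ => None
    }
}
```

This is still more ceremony than I'd like users to write, which is part of why I'm looking for a declarative language where the degenerate cases are handled by the semantics rather than by the query author.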

Why not just use Scala? My program is implemented in C and will run on GPUs. Whatever Scala I bring to it would be a reimplemented subset, which is to say an invented language. (The same could be said for Haskell, Javascript, or any other language that makes heavy use of functions as arguments.)

Also, these queries ought to be declarative. If I implement too much of a general purpose programming language, details like the order of function calls might become relevant.

Why not just use SQL? Is it possible to write queries like the above easily, such that they're readable by anyone other than the author? Queries like the above are the norm, not complex extremes.

SQL supports nested arrays of structs, but all the examples I can find of using that substructure are horrendously complicated. One has to explode the table of events into a table of tracks (or double-explode to get hits), and some complex accounting would be needed to unexplode and get back to one scalar per event.

I suppose I could use SQL with new functions like MAXIMAL(collection, function) that return a struct from an array, similar to track[12] but using the user-provided function as an objective function for maximizing, minimizing, finding the top/bottom N, etc. I don't think SQL supports passing functions as arguments, and if I write an SQL dialect that does, it would be non-standard.
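In any language with first-class functions, that hypothetical MAXIMAL is just a small higher-order function; a sketch in Scala (the names `maximal` and `topN` are mine, not standard SQL or any library):

```scala
object Maximal {
  // Return the element maximizing the user-provided objective,
  // or None if the collection is empty.
  def maximal[A](collection: Seq[A], objective: A => Double): Option[A] =
    if (collection.isEmpty) None
    else Some(collection.maxBy(objective))

  // Top-N variant: the n elements with the highest objective values.
  def topN[A](collection: Seq[A], objective: A => Double, n: Int): Seq[A] =
    collection.sortBy(a => -objective(a)).take(n)
}
```

The SQL difficulty is exactly that `objective` here is a function argument, which standard SQL has no syntax for.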

Is there a widely used dialect of SQL that supports passing functions as arguments?

Or is there another declarative language I should consider?

Jim Pivarski
  • Your nested structs are just additional tables. You have a principal `event` table with a unique identifier. Then a `track` table that has a foreign key to the unique identifier in `event`. This allows a relationship where ***one*** `event` row is associated with ***zero to many*** `track` row(s). The same applies for `event`:`showers` and `track`:`hit`, etc, etc. The SQL then generally becomes a case of joining two tables, then aggregating, joining that result to another table, and aggregating again, etc, etc. – MatBailie Aug 08 '16 at 14:41
  • In terms of `functions as arguments`, that's not going to be "normal" in any dialect of SQL. Some do have their own CLR and allow you to do some magical things, but even if you made it work it wouldn't be anything a standard SQL developer would recognise *(relevant to you in terms of support)*. But MS SQL Server does have `APPLY` that allows you to encapsulate functions in a different way that may be relevant to you. – MatBailie Aug 08 '16 at 14:46
  • Would it be easy to write/easy to read if every query is a join + aggregation like that? If you can show what the SQL queries would look like (e.g. my three examples) and it's not horrendous, that's the sort of answer I'm looking for. (Sorry about the subjectivity of "horrendous," but I think you know why this is my criterion.) – Jim Pivarski Aug 08 '16 at 14:46
  • If you add comments *(pseudo code, language agnostic, idiot (me) friendly)* to your Scala examples, then I'll *(try to)* translate into SQL. – MatBailie Aug 08 '16 at 14:48
  • Sorry; I picked Scala just to be concrete and concise. In-place functions like these are also concise in Haskell and Julia, and they're common in R and Javascript, though you have to write the whole word "function" every time. (In Python, you have to write the whole word "lambda".) I think these sorts of queries lend themselves to chains of functors, which is why I was asking about functions as arguments. – Jim Pivarski Aug 08 '16 at 15:00
  • Indeed, SQL has a different focus and isn't exactly concise. But over the decades it has adapted to be much better at analytical queries. I've translated the first query, but I have work to do; I'll give the next one a crack after my meeting. – MatBailie Aug 08 '16 at 15:06
  • See my comment below it; this is an example of what I'd consider complicated. Maybe we should wait to see if someone can suggest a different declarative language. There must be others that have this sort of query as its focus... (Thanks for helping, though!) – Jim Pivarski Aug 08 '16 at 15:10
  • Why again have you eliminated Haskell or other functional languages? If running on GPU *and* concise code *and* not inventing something new are you priorities? A functional language would seem to be perfectly suited... – MatBailie Aug 08 '16 at 15:40
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/120440/discussion-between-jim-pivarski-and-matbailie). – Jim Pivarski Aug 08 '16 at 15:41
  • Have you checked out https://neo4j.com/docs/developer-manual/current/cypher/ as a syntax that supports trees kinda like SQL? – fncomp Aug 31 '16 at 19:58
  • @fncomp Thank you very much for pointing this out! I read the spec and while I think it'll be very useful for graph queries, where you're looking for novel _structures_, in my case the structure is fixed and I don't see an easy way to map my queries into that. But thanks anyway, that's the sort of thing I'm looking for; I just need to find (or make) the right one. – Jim Pivarski Sep 01 '16 at 11:49
  • Cool, maybe at least inspire something. See you at Strangeloop :-) – fncomp Sep 01 '16 at 21:31

4 Answers


JSONiq was designed exactly for querying trees, even and especially when data is highly nested and heterogeneous. It is 95% based on a W3C standard.

Rumble is an open-source implementation of JSONiq that works on collections of billions of records. It uses Spark under the hood, but in a way that is completely transparent to (and hidden from) the user.

The three queries look like this. With Rumble, they can seamlessly run on a laptop on a small amount of data, but also in parallel on potentially billions of objects on a cluster, as long as the underlying parser is streamlined for it. The declarative code is the same.

Query 1:

(
  for $track in root-file("events.root", "Events").tracks
  where abs($track.theta) lt 2.4
  order by size($track.hits) descending
  return $track
)[1].momentum

Query 2:

avg(root-file("events.root", "Events").tracks[$$.momentum gt 10].hits[][$$.time lt 10].charge)

Query 3:

let $tracks := (
    for $track in root-file("events.root", "Events").tracks
    order by $track.momentum
    return $track
  )[position() le 2]
return
  sum(
    for $t in $tracks
    return $t.theta * $t.momentum
  ) div sum($tracks.momentum)
Ghislain Fourny

Even if you've got purely tree-like data structures, you might want to look at a graph database. In particular, Neo4j supports a declarative query language known as Cypher:

https://neo4j.com/developer/cypher-query-language/

Titan might also be interesting at the scale you are talking about; it supports Gremlin from the Apache TinkerPop project, which is multi-platform (but not declarative):

http://tinkerpop.apache.org/docs/3.0.1-incubating/#preface

PDH
  • The only reason I'm worried about overkill (a graph representation contains trees and a whole lot more) is because you end up having to address irrelevant concepts in each query. There's an advantage to the DSL being somewhat streamlined. I've been looking at Cypher and Gremlin and thinking about how these queries would look: it seems like you'd have to specifically say that events contain tracks whenever you want to get a particular track from an event. But events _always_ contain tracks... – Jim Pivarski Sep 09 '16 at 19:28

I posted this in a comment earlier, but moving it here.

I'm with others on the use of a graph database. I'm not familiar with Neo4j queries, but I expect them to be capable. Similarly, SPARQL would work well for this kind of thing.

For the first query, a SPARQL query might look like:

PREFIX : <http://yournamespace.com/accelerator/>

SELECT ?momentum (MAX(?hitcount) AS ?maxhits)
WHERE {
    SELECT ?momentum (COUNT(?hits) AS ?hitcount)
    WHERE {
        ?track :momentum ?momentum .
        ?track :theta ?theta .
        FILTER (?theta > -2.4 && ?theta < 2.4)
        ?track :hits ?hits
    }
    GROUP BY ?track ?momentum
}
GROUP BY ?momentum

Identifiers have the : prefix on them because they need to be encoded as URIs. But that's an internal detail for moving to RDF (the data format for SPARQL databases).

The above query is doing sub-queries because you're looking to aggregate (on the count), and then aggregate again (with the max of the count). But you can see that it's all handled in an SQL-like way, and does not require post-processing.

PaulaG
  • I want to delve into this a bit deeper, but so far, this is looking like the best option. RDF and SPARQL satisfy my criteria for "standard": they're defined by the W3C. This query syntax doesn't look too onerous, and is pretty close to the intention. XML is a closer match to the tree-like data than a general graph, even though we'd never represent our data in XML for performance reasons. – Jim Pivarski Sep 09 '16 at 19:23
  • You should note that RDF has nothing to do with XML. The original release of RDF was shortly after XML was standardized, and the W3C felt that using their own "universal serialization format" was a good demonstration that they believed in it. This led to the initial serialization format for RDF being RDF/XML. Unfortunately, this had **lots** of problems. Several people in the XML community thought it was subverting XML, lots of people thought RDF was tied to XML, and it was a horrible format. The recommendation today is to use [Terse RDF Triple Language: Turtle](https://www.w3.org/TR/turtle/) – PaulaG Sep 10 '16 at 18:03
  • I have another commitment during the Strangeloop unsessions, but please feel free to track me down to talk about it. – PaulaG Sep 10 '16 at 18:06

Scala Example 1...

// missing check for when zero tracks passed filter!
{event => event.tracks                      // get list of tracks
               .filter(abs(_.theta) < 2.4)  // in theta range
               .maxBy(_.hits.size)          // take the one with the most hits
               .momentum                    // return its momentum
}

Potential SQL....

WITH
   hit_stats
AS
(
   SELECT
       hit.track_id,
       COUNT(*)    as hit_count
   FROM
       hit
   GROUP BY
       hit.track_id
),
   track_sorted
AS
(
    SELECT
       track.*,
       ROW_NUMBER() OVER (PARTITION BY track.event_id
                              ORDER BY hit_stats.hit_count DESC
                         )
                            track_ordinal
    FROM
       track
    INNER JOIN
       hit_stats
           ON  hit_stats.track_id = track.id
    WHERE
           track.theta > -2.4
       AND track.theta <  2.4
)
SELECT
    *
FROM
    event
INNER JOIN
    track_sorted
        ON  track_sorted.event_id = event.id
WHERE
    track_sorted.track_ordinal = 1

Or using APPLY from MS SQL Server

SELECT
   event.*,
   track.momentum
FROM
   event
OUTER APPLY
(
    SELECT TOP 1
        track.*,
        stat.hit_count
    FROM
        track
    OUTER APPLY
    (
        SELECT
            COUNT(*) hit_count
        FROM
            hit
        WHERE
            track_id = track.id
    )
        stat
    WHERE
            track.event_id = event.id
        AND track.theta    > -2.4
        AND track.theta    <  2.4
    ORDER BY
       stat.hit_count DESC
)
   track

That's very nested, which I find harder to read and maintain than the CTE version, but it will likely end up with a very similar execution plan.

Oracle and other dialects have other ways of implementing something similar to what MS SQL Server accomplishes with APPLY.

MatBailie
  • Okay, before you keep going with this, note that I was trying to find a way around the complex accounting of trying to get back to one scalar per event. I mean, you're creating new tables (`hit_stats`) and doing a lot of work with `ids` to put things back together again. – Jim Pivarski Aug 08 '16 at 15:07
  • @JimPivarski - SQL is, as you've mentioned, declarative. The expression *could* be solved by creating tables, but in this case you're creating an expression *(this notation being called Common Table Expressions)*. These are then expanded, macro-like, in the query in which they are used, and the RDBMS then generates an execution plan to solve the declared problem with the lowest possible cost. Basically I'm just saying that this is only being used as an aid to express the problem space, not an expression of the solution. No new tables get created. – MatBailie Aug 08 '16 at 15:10
  • The way this solution will likely plan out is: every track has a scalar operation where the number of hits is calculated by counting entries in an index. The tracks are then sorted and all but the highest-ranked track discarded. Finally, that track is joined to its parent event. *(Provided that the number of hits is not going to change, the count of hits could be cached in the track table itself.)* – MatBailie Aug 08 '16 at 15:12
  • Oh, I didn't mean to suggest that the SQL engine is doing too much work under the hood (depends on the implementation, I'm sure). I'm speaking purely from the point of view of how it looks to the user. – Jim Pivarski Aug 08 '16 at 15:13
  • In MS SQL Server this can be expressed more "functionally". But would likely be the same execution plan. Give me two mins. – MatBailie Aug 08 '16 at 15:14
  • The number of tracks, showers, and hits are all variable. That's why my Scala examples are also lacking sufficient error-handling, which should be more concise than Scala in the kind of language I'm looking for. – Jim Pivarski Aug 08 '16 at 15:16
  • Variable in that they change each time you run the query? In which case my queries still work, just can't short cut things with cached values. – MatBailie Aug 08 '16 at 15:19
  • Sorry--- the number of tracks varies from one event to the next and the number of hits varies from one track to the next. The dataset as a whole is immutable. – Jim Pivarski Aug 08 '16 at 15:27