
I'm new to PipelineDB and have yet to even experience it at runtime (installation pending ...). But I'm reading over the documentation and I'm totally intrigued.

Apparently, PipelineDB is able to take a set-based query and mechanically transform it into an incremental representation that efficiently processes streams of deltas, with storage bounded by the size of the continuous view's output rather than by the raw input.
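For concreteness, here is the shape of what I mean (a minimal sketch using the `CREATE STREAM` / `CREATE CONTINUOUS VIEW` syntax from the docs; the stream and column names are invented):

```sql
-- A stream of raw events; rows are discarded once every continuous view
-- that reads them has consumed them.
CREATE STREAM readings (sensor_id integer, value double precision);

-- An ordinary set-based aggregate query that PipelineDB maintains
-- incrementally: only the per-sensor aggregate rows are ever stored.
CREATE CONTINUOUS VIEW sensor_stats AS
  SELECT sensor_id, count(*) AS n, avg(value) AS avg_value
  FROM readings
  GROUP BY sensor_id;
```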

Is it also supported to run the defining query as an ordinary set-based query to prime a continuous view? It seems to me that upon creation of a continuous view, the initial data would be computed this traditional way. Also, since continuous views can be truncated, can they then be repopulated (from still-available source tables) without tearing down their dependent objects to allow a drop/create?
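Concretely, I'm imagining something like the following hypothetical workflow (truncation is documented; the repopulation step is exactly what I'm asking about):

```sql
-- Documented: discard a continuous view's contents.
TRUNCATE CONTINUOUS VIEW sensor_stats;

-- Hypothetical: rerun the defining query set-based against still-available
-- sources to repopulate the view in place. Invented syntax; does anything
-- like this exist?
-- REFRESH CONTINUOUS VIEW sensor_stats;
```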

It seems to me that this feature would be critical in many practical scenarios. One easy example would be refreshing occasionally to reset the drift from rounding errors in, say, fractional averages.
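As a toy illustration of the rounding behavior I mean (plain Postgres arithmetic, nothing PipelineDB-specific):

```sql
-- Incremental maintenance folds each delta into a running total one at a
-- time. With float4, a small delta added to a large total is absorbed:
SELECT (1e8::float4 + 1::float4) - 1e8::float4 AS drifted;  -- 0, not 1

-- A from-scratch recomputation can avoid this, e.g. at higher precision:
SELECT (1e8::numeric + 1) - 1e8 AS exact;                   -- 1
```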

Another example: suppose a bug were discovered and fixed in PipelineDB itself that had caused errors in the data. After the software is patched, the queries whose source data is still available ought to be rerun.

Continuous views based entirely on event streams, with no permanent storage, could not be rebuilt in that way. I'm not sure what happens when only some of the join sources are ephemeral.

I don't see these topics covered in the docs. Can you explain how these are or aren't a concern?

Thanks!

Jason Kleban

1 Answer


Jeff from PipelineDB here.

The main answer to your question is covered in the introduction section of the PipelineDB technical docs:

"PipelineDB can dramatically reduce the amount of information that needs to be persisted to disk because only the output of continuous queries is stored. Raw data is discarded once it has been read by the continuous queries that need to read it."

While continuous views only store the output of continuous queries, almost everybody who is using PipelineDB is storing their raw data somewhere cheap like S3. PipelineDB is meant to be the realtime analytics layer that powers things like realtime reporting applications and realtime monitoring & alerting systems, used almost always in conjunction with other systems for data infrastructure.

If you're interested in PipelineDB you might also want to check out Stride, the new realtime analytics API product we recently rolled out. The Stride API gives developers the benefit of continuous SQL queries, integrated storage, windowed queries, and realtime webhooks, all via a simple HTTP API and without having to manage any underlying data infrastructure.

If you have any additional technical questions you can always find our open-source users and dev team hanging out in our gitter chat channel.

  • Hi Jeff. Thanks for viewing but I don’t see how this answer is relevant to the question. Is there a part of the question that seems to be based on an incorrect assumption about pipelinedb, and therefore difficult to answer directly? – Jason Kleban Oct 27 '17 at 02:25
  • Hi Jason, apologies for the delayed response here! It seemed like you were asking about the fundamental way continuous views are constructed and maintained based on the streaming data that hits continuous views, which is that as raw data hits CVs, the CV is updated and the raw data is discarded forever. You can "prime" CVs initially by backfilling data once in order to get them to a certain starting point, and CVs do store metadata for various reasons, but because the continuous views are incrementally updated and don't store raw data they are different than regular tables. – DidacticTactic Nov 07 '17 at 18:19
  • Thanks Jeff. Ok, so CVs can be backfilled from their table and other-CV sources, as long as the source data is available and not an ephemeral stream. Is the backfilling operation implemented incrementally too? Or is backfilling done by running the query as a set-based operation, the "traditional" way? – Jason Kleban Nov 07 '17 at 18:50
  • Then, assume that there's some loss of precision in one of our aggregates over time (averaging lots of very small numbers such that rounding errors cause inaccuracies), or maybe some bug is discovered & fixed in PipelineDB that caused inaccuracies. Is it possible to truncate *and re-backfill* continuous views to correct them, without having to drop and recreate our whole dependency tree in dependency order? – Jason Kleban Nov 07 '17 at 20:50
  • The backfilling is done incrementally too. The only thing that ever gets stored in a continuous view is the output of a continuous SQL query. To your second question, some metadata for computing aggregates like averages is stored behind the scenes in CVs, but if you wanted to recompute a continuous view at any point in the future to account for any type of issue you would necessarily need to run the raw data through a new continuous view, because by definition, the raw data that populated the initial continuous view would have been discarded. – DidacticTactic Nov 08 '17 at 21:51
  • Is there a specific use case or concern you're trying to solve for? Perhaps we can be more precise in helping you figure this out if we understand the actual scenario you're talking about? – DidacticTactic Nov 08 '17 at 21:55
  • An incremental backfill algorithm sounds like row-by-agonizing-row (no offense intended!) which surprises me since a continuous view definition is expressed as set-based query. PipelineDB's magic, I think, is its ability to transform the set-based representation into an equivalent incremental algorithm. I was expecting to hear that the query, trivially because of PipelineDB's postgres base, CAN be run set-based for backfilling when, necessarily, sources are materialized as tables or views (or other continuous views). – Jason Kleban Nov 08 '17 at 22:12
  • But as long as incrementally backfilling has good performance characteristics I guess it shouldn't matter to me. So regardless of that, I'm still very curious about re-backfilling existing/in-use continuous views - especially ones which might be dependencies of other continuous views. I assume that the worst case scenario is that to "re"-backfill a Continuous View, one could drop and recreate it. But drop and create is a schema change and might give DBAs heart-attacks, whereas a TRUNCATE and special rebackfill operation is only a data change. – Jason Kleban Nov 08 '17 at 22:28
  • Plus, if our application is built on several dependency-levels of Continuous Views, the Drop and Create script for that could get ugly. Rebuilding data in a single CV (seems like it) would be less drastic. – Jason Kleban Nov 08 '17 at 22:28
  • (I understand that CVs only store the results of the CV query.) Imagine that there had been a bug in PipelineDB's implementation of an aggregate (ex. AVERAGE()) that is now corrected but used by a CV whose output, involving that aggregate, was affected by the bug. Do I have to build the application in CV-dependency-order? Or can I just tell it to re-backfill? – Jason Kleban Nov 08 '17 at 22:28
  • And although I've so far not been able to devise an illustration of it, I believe this is a valid concern: given that the algorithms for PipelineDB's supported aggregate functions are designed to run in constant storage, and given floating-point arithmetic, I believe there are some datasets for which arithmetic errors will propagate through incremental CV updates and drift the CV output away from the "true" arithmetic answer. – Jason Kleban Nov 08 '17 at 22:54
  • Whereas a traditional aggregate query would be re-run from scratch, always giving an answer to the highest available precision for the type and dataset, these incremental aggregations would not have enough information to correct themselves. My interest is in accepting these approximations, but periodically rebackfilling a CV to correct for that drift. Since, as you say, backfilling uses the same incremental algorithms, then I guess any such drift is unavoidable. – Jason Kleban Nov 08 '17 at 22:54
  • If a continuous view became inaccurate due to a bug on our end that we subsequently fixed, with data that was not stored in the metadata tables that help support aggregates, then yes, you would have to drop and recreate the continuous view. It is not possible to periodically re-backfill a CV to correct its contents in place. Backfilling is basically something you do at the start of a CV's life to bring it up to date, in an effort to jumpstart it moving forward. (Both patterns are sketched after this thread.) – DidacticTactic Nov 11 '17 at 00:57
  • Let us know if you end up actually using PipelineDB and we can help with specifics. You can find our dev team on Gitter for specific support questions if and when you get going - https://gitter.im/pipelinedb/pipelinedb – DidacticTactic Nov 11 '17 at 00:58
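Pulling the thread together: a minimal sketch of the two patterns discussed above, assuming raw events are also archived to a regular table (archived_readings is a hypothetical name; PipelineDB provides no re-backfill command):

```sql
-- Prime a new continuous view by replaying archived raw data into the
-- stream it reads from; the CV consumes it incrementally, like live data.
INSERT INTO readings (sensor_id, value)
  SELECT sensor_id, value FROM archived_readings;  -- hypothetical archive

-- To recompute later (e.g. after a bug fix), there is no in-place
-- re-backfill: drop and recreate the CV, then replay the archive again.
DROP CONTINUOUS VIEW sensor_stats;
CREATE CONTINUOUS VIEW sensor_stats AS
  SELECT sensor_id, count(*) AS n, avg(value) AS avg_value
  FROM readings
  GROUP BY sensor_id;
INSERT INTO readings (sensor_id, value)
  SELECT sensor_id, value FROM archived_readings;
```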