0

Like many, I'm no new the NoSQL world. I did a lot of research, but I still lack only one point, which I can't find proper answer for.

Short description of system:

I'm building a system that collects Visitor's data on different websites. Each visit is an Entity in the datastore, with properties like device type, IP, time of visit..etc.

There will be millions of visits in the datastore.

My Question, is how do I serve this data to clients. My Data is setting in the datastore as "Visit" entities.

Now when a customer logs in, I don't want to show them millions of records. I want for example to show them general stats. Like number of visits on mobile device, number of visits from specific country in some time range, and stuff like that.

Now since I'm new to the NoSQL databases, I'm not sure how I should go around showing these stats in the clients' dashboard.

As I know, Datastore has no support for aggregates, or getting count of query results for example.

I looked at BigQuery, but BigQuery works on Datastore "backups", I need to serve data in real time, without needing to do backups manually.

Also I read about counters, and sharding counters, is this the proper approach? have a counter for each client for each property for each tracking group? and show the total numbers this way? Sounds like too much for a simple purpose.

Any input or explanation that can get me in the right direction would be highly appreciated.

Best Regards

Community
  • 1
  • 1
shadyhossin
  • 575
  • 1
  • 6
  • 16
  • you can stream the data into bigquery, Datastore might not be a good fit for doing complex analysis – marcadian Apr 29 '16 at 00:44
  • I guess to sum it. If I only need count aggr function I can use GAE "datastore statistics api". But if I need to do more than that, more aggregate functions and a bit more complicated queries in real time, I guess I need BigQuery (and stream the data to it from Datastore)? – shadyhossin Apr 29 '16 at 00:54
  • well just use bigquery, from your explanation, I don't think you need to put the data in datastore, just bigquery – marcadian Apr 29 '16 at 01:12
  • I don't feel datastore is a good storage mechanism for what you are trying to accomplish. The cloud platform has the google search api which allows for facets but even this mechanism is behind the times. Have you considered using elastic search? https://www.elastic.co/products/elasticsearch – Jesse Apr 29 '16 at 02:12
  • @marcadian I need to load the data into bigQuery from another source like Datastore/Cloud Storage https://cloud.google.com/bigquery/loading-data – shadyhossin Apr 29 '16 at 10:57
  • @shadyhossin If you have the data in another source in the first place, but you can also insert directly to bigquery, no need to put it in different source. It depends on your use case https://cloud.google.com/bigquery/streaming-data-into-bigquery – marcadian Apr 29 '16 at 19:50

2 Answers2

0

As I know, Datastore has no support for aggregates, or getting count of query results for example.

This is not true. You can get a number of entities returned by a query with one line of code. The query itself can be keys-only, which is very fast and basically free.

Andrei Volgin
  • 40,755
  • 6
  • 49
  • 58
  • Is not fast if you have many entities. The more entities you have the slower it gets. You need to also consider the network not only the operation itself. – The user with no hat Oct 19 '16 at 11:11
0

Yes, counters are a good approach to your problem in terms of performance. They do have some downsides though, such as storage size and the fact that each time you would like to introduce a new type of statistic, you would need to create a counter for it.

In addition to your current "Visit" entities, you could opt for storing the aggregated data in Sharded Counters in the Datastore. These counters can be updated in real-time, or via a Task in one of your task queues. It would be fairly straight-forward to create a Task that would create the various counters for the current Visit entities.

Sharding is a way of creating multiple "underlying" entities that, when combined, represent some meaningful data. Sharding is done to ensure that there are no performance issues due to concurrent updates.

From the Google Documentation:

If you had a single entity that was the counter and the update rate was too fast, then you would have contention as the serialized writes would stack up and start to timeout. The way to solve this problem is a little counter-intuitive if you are coming from a relational database; the solution relies on the fact that reads from the App Engine datastore are extremely fast and cheap. The way to reduce the contention is to build a sharded counter – break the counter up into N different counters. When you want to increment the counter, you pick one of the shards at random and increment it. When you want to know the total count, you read all of the counter shards and sum up their individual counts. The more shards you have, the higher the throughput you will have for increments on your counter. This technique works for a lot more than just counters and an important skill to learn is spotting the entities in your application with a lot of writes and then finding good ways to shard them.

I would recommend having a look at the link for further information and some helpful examples.

Brett
  • 101
  • 6
  • 1
    I found this to be the right answer. Even though I decided to go with BigQuery, as I found it more simple and straight forward as I'm coming from relational database, but it's not the best solution, as queries take longer than I would like. For real time data display with this volume (and desired speed) I found this to be the appropriate solution for this case. Thank you. – shadyhossin Jul 13 '16 at 18:59