44

I have a little problem that needs some suggestions:

  • Lets say we have a few hundred data tables with a few dozen million rows each.
  • Data tables are timestamp(key) - value
  • Data tables are written once every second

The latest entry of each table should be quickly obtainable and will most likely be queried the most (sorta like "follow data in real time"). With the lack of 'Last()' or similar, I was thinking of creating another table "LatestValues" where the latest entry of each data table is updated for a faster retrieval. This, however, would add an extra update for each write operation. Also, most of the traffic would be concentrated on this table (good/bad?). Is there a better solution for this or am I missing something?

Also, lets say we want to query for the values in data tables. Since scanning is obviously out of the question, is the only option left to create a secondary index by duplicating the data, effectively doubling the storaging requirements and the amount write operations? Any other solutions?

I'm primarily looking at DynamoDB and Azure Table Storage, but I'm also curious how BigTable handles this.

user1597701
  • 497
  • 1
  • 5
  • 6

3 Answers3

62

I just published an article today with some common "recipes" about DynamoDB. One of them is "Storing article revisions, getting always the latest" I think it might interest you :)

In a nutshell, you can get the latest item using Query(hash_key=..., ScanIndexForward=True, limit=1)

But, this assumes you have a range_key_defined.

With Scan, you have no such parameter as ScanIndexForward=false and anyway, you can not rely on the order as data is spread over partitions and the Scan request is then load balanced.

To achieve you goal with DynamoDB, you may "split" your timestamp this way:

  1. hash_key: date
  2. range_key: time or full timestamp, as you prefer

Then, you can use the 'trick' of Query + Limit=1 + ScanIndexForward=false

jarmod
  • 71,565
  • 16
  • 115
  • 122
yadutaf
  • 6,840
  • 1
  • 37
  • 48
  • Thank you for the answer and your article was an interesting read. I still have one question regarding ScanIndexForward though. The documentation says: `Specifies ascending or descending traversal of the index...` Does ScanIndexForward work like GROUP BY and just reverse the order of query results or does it actually read the range_key in reverse order i.e. how many reads does this require? My concern is that by the end of the day (assuming 1 write/s) there is over 86k entries and constantly going through them again and again when getting the most recent value would be expensive. – user1597701 Oct 10 '12 at 11:21
  • The ``range_key`` is indexed so that it's efficient and, with ``Query``, you pay only for the retrieved results. This said, I don't know how this is implemented internally. – yadutaf Oct 10 '12 at 14:43
  • Thank you, then this sounds like its exactly what i needed. – user1597701 Oct 11 '12 at 07:18
  • 6
    Just a heads up: in my case, I needed `ScanIndexForward=False` instead of `True`. The default behaviour probably changed at some point after the article was written. The docs for the [`query` method](http://boto3.readthedocs.org/en/latest/reference/services/dynamodb.html#DynamoDB.Table.query) read: _If ScanIndexForward is true , DynamoDB returns the results in order, by range key. This is the default behavior. If ScanIndexForward is false, DynamoDB sorts the results in descending order by range key, and then returns the results to the client._ – Aylen Feb 25 '16 at 22:07
  • 1
    As of now it seems that ScanIndexForward was replaced by BackwardSearch with the same meaning.Unfortunately I couln't found any documentation for this change. – alehro Oct 30 '17 at 12:41
  • 2
    Looks like the link you have here has expired. I found what could be the same content on https://blog.yadutaf.fr/2012/10/07/common-dynamodb-questionsmisconceptionsrecipes/ – Jon Peterson Apr 04 '19 at 14:22
  • What if you wanted a list of the most recent record grouped by hash key? In other words, one record per hash key with the most recent timestamped data. – Steve Hiner May 12 '20 at 17:44
-1

In general, you probably just want to reverse the timestamp, so it decreases over time, leaving the newest row on top.

Here's a blog post of mine outlining how to do this with Windows Azure storage: http://blog.smarx.com/posts/using-numbers-as-keys-in-windows-azure.

UPDATE

I use DynamoDB for one project, but in a very simplistic way, so I don't have much experience. That said, http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/QueryAndScan.html suggest to me that you can just specify ScanIndexForward=false and Limit=1 to get the last item.

user94559
  • 59,196
  • 6
  • 103
  • 103
  • Thank you for your answer. Hadn't considered of trying to solve it like that, although I'm a bit hesitant on formatting the timestamp as it is 'part of the data' and as such it will be queried and it should be in a format that an user can understand. With this solution I'd have to reprocess every timestamp to reverse the transformation for every query. – user1597701 Oct 10 '12 at 11:03
  • I would suggest storing another column with the timestamp in its normal representation. – user94559 Oct 10 '12 at 16:00
-6

For folks who found this thread but only care about 1 table:

You can get the latest item from a table in the UI by clicking on the column to sort by those values.

Alec
  • 164
  • 1
  • 5
  • 4
    This only sorts the currently page of results (~100 records), presumably to avoid scanning the whole table. – zhark Jul 07 '20 at 01:47