
I have an application built with AWS AppSync whose primary purpose is to send telemetry data from a mobile application. I am stuck on how to partition and structure the DynamoDB tables for this, because the users of the application belong to different organizations, and within each organization there are admins who are able to view the data specific to their organization.

OrganizationA
-->Admin # View all the telemetry data
---->User # Send the telemetry data from their mobile application

Based on some research from these resources:

  1. Link 1.

  2. Link 2.

The advised approach is to create a separate table per time period, i.e., one table per day of telemetry readings.

Example (not sure what `pk` is in this example):

[example table screenshot]

The way I am planning to separate the users with AWS Cognito is by attaching custom attributes when the user signs up, such as Organization and Role (Admin or User), as per this answer, and then using a Pre Sign-up Lambda trigger (a rough sketch of what I have in mind follows).
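For concreteness, this is a minimal sketch of what I imagine that Pre Sign-up trigger could look like; the attribute names `custom:organization` and `custom:role` are my own assumption, not settled yet:

```python
ALLOWED_ROLES = {"Admin", "User"}

def lambda_handler(event, context):
    """Pre sign-up trigger: validate the custom attributes supplied
    at registration before Cognito creates the user."""
    attributes = event["request"]["userAttributes"]

    organization = attributes.get("custom:organization")
    role = attributes.get("custom:role")

    # Raising an exception here rejects the sign-up attempt.
    if not organization:
        raise Exception("custom:organization is required")
    if role not in ALLOWED_ROLES:
        raise Exception(f"custom:role must be one of {ALLOWED_ROLES}")

    # Optionally auto-confirm users whose attributes pass validation.
    event["response"]["autoConfirmUser"] = True
    return event
```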

How should I achieve this?

yudhiesh

1 Answer


Since users from one organization never need to read another organization's data, and since all of your access patterns will always know the organization id, that attribute should be a factor in partitioning: either at the table level, or at the partition-key level.

Then you have to determine whether you can simply use the organization id as the partition key, or whether you need to partition further -- say, by concatenating the organization id with the hour value of each sample. This will depend on the amount of data you expect each organization to generate in a given day. The tradeoff is more granular partitioning vs. the cost of querying for data.

If organizations generate small amounts of data each day (say, a few events an hour), then just use the organization id as the partition key. Otherwise, partition the data further.
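As a rough sketch of the second case (the per-day table name, the `pk`/`sk` attribute names, and the helper are all illustrative, not a prescribed schema), writing a sample with such a composite partition key might look like this in Python with boto3:

```python
from datetime import datetime, timezone

import boto3

dynamodb = boto3.resource("dynamodb")
# Hypothetical per-day table, e.g. created as "telemetry_2020_12_16".
table = dynamodb.Table("telemetry_2020_12_16")

def put_sample(org_id: str, reading: dict) -> None:
    """Store one telemetry sample, partitioned by organization and hour."""
    now = datetime.now(timezone.utc)
    # Composite partition key: organization id plus hour of day, so one
    # busy organization's data is spread across 24 partitions per table.
    pk = f"{org_id}_{now:%H}"
    table.put_item(
        Item={
            "pk": pk,
            # Sort key: millisecond-precision epoch timestamp.
            "sk": int(now.timestamp() * 1000),
            **reading,
        }
    )
```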

In all of the above, the sort key should probably be the timestamp of the events, either with second or millisecond precision depending on your needs. That way your queries can retrieve ordered time-series data.
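Continuing the sketch above (reusing its `table` handle; names remain illustrative), a time-window query against one of those partitions could look like:

```python
from boto3.dynamodb.conditions import Key

def query_window(org_id: str, hour: str, start_ms: int, end_ms: int) -> list:
    """Fetch ordered samples for one org/hour partition within a window."""
    response = table.query(
        KeyConditionExpression=(
            Key("pk").eq(f"{org_id}_{hour}")
            & Key("sk").between(start_ms, end_ms)
        ),
        ScanIndexForward=True,  # ascending by timestamp
    )
    return response["Items"]
```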

Keep in mind that when you make queries, you may need to execute multiple queries and stitch the results together in your application to fully represent the results, since the requested range may span multiple partitions, or even multiple tables.
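A hedged sketch of that stitching step, assuming the hour-suffixed partition keys and the `query_window` helper from above: fan out one query per hour partition and merge locally. (Note that boto3 resources are not thread-safe, so a production version would create a separate resource per thread.)

```python
from concurrent.futures import ThreadPoolExecutor

def query_day(org_id: str, start_ms: int, end_ms: int) -> list:
    """Query all 24 hour-partitions for one org and merge the results."""
    hours = [f"{h:02d}" for h in range(24)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        # Run the per-partition queries in parallel to reduce latency.
        chunks = list(
            pool.map(lambda h: query_window(org_id, h, start_ms, end_ms), hours)
        )
    # Combine and re-sort, since each partition is only ordered internally.
    items = [item for chunk in chunks for item in chunk]
    return sorted(items, key=lambda item: item["sk"])
```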

Mike Dinescu
  • Thanks for the answer. Yes, the last part about needing to concatenate the results together is another issue, as I will need to show the telemetry data per trip for a user and I am not sure how to do so. – yudhiesh Dec 16 '20 at 02:21
  • how much data are you collecting per user? how many users in an org? these are important details.. if you need queries per user then the partition should maybe be by org and by user: so you might have partition keys like `org1_usrA` to indicate the record is for user "A" in org "1"; or `org1_usrA_05` to indicate it's for org "1", user "A", for hour 05-06 of the day for which the table was created. – Mike Dinescu Dec 16 '20 at 06:45
  • for queries, you will need to stitch together the data from multiple partitions, so you make N queries and then you combine results locally, in your app (you can make some requests in parallel to speed things up too) – Mike Dinescu Dec 16 '20 at 06:46
  • The initial plan is to send data every 10 seconds from each sensor, and each user will have up to 6 sensors, with about 60 users at the start. As for the trips, how would I know when one ends? Could I give each trip its own `trip_id` to ease this part? – yudhiesh Dec 16 '20 at 07:12
  • That's 8,640 samples per sensor per day, so roughly 51,800 samples per user per day with 6 sensors. With 60 users per org, that'd be about 3 million points per org, per day. You probably should partition by user, not by org, in that case. I'm not sure I follow how trips fit in; that detail is not in the original question. Perhaps you can have a separate, non-time-series table that stores details about users, trips, and orgs, and there you can record trip starts and stops. – Mike Dinescu Dec 16 '20 at 18:31
  • Regarding trips: the telemetry data is only collected from point A to point B, and that constitutes a single trip. Based on more research, I am thinking of having the PK as UserID and the SK as the timestamp, then adding a GSI keyed on TripID (roughly the schema sketched below). – yudhiesh Dec 17 '20 at 03:07
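For reference, that last idea might look roughly like the following table definition. All names here (`telemetry`, `UserID`, `Timestamp`, `TripID`, `trip-index`) are illustrative, not a confirmed design:

```python
import boto3

client = boto3.client("dynamodb")

# Per-user time series, with a GSI to pull all samples for a single trip.
client.create_table(
    TableName="telemetry",
    AttributeDefinitions=[
        {"AttributeName": "UserID", "AttributeType": "S"},
        {"AttributeName": "Timestamp", "AttributeType": "N"},
        {"AttributeName": "TripID", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "UserID", "KeyType": "HASH"},
        {"AttributeName": "Timestamp", "KeyType": "RANGE"},
    ],
    GlobalSecondaryIndexes=[
        {
            "IndexName": "trip-index",
            "KeySchema": [
                {"AttributeName": "TripID", "KeyType": "HASH"},
                {"AttributeName": "Timestamp", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    # On-demand billing avoids specifying provisioned throughput.
    BillingMode="PAY_PER_REQUEST",
)
```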