
I created a sample AWS Timestream database with one table:

[Screenshot: sample Timestream database]

Initially this had 400 sample rows, and the following query took ~0.3 seconds on average from the Timestream Console:

SELECT
  make, COUNT(*) AS makeCount
FROM
  "sampleDB"."IoTMulti"
WHERE
  time BETWEEN TIMESTAMP '2021-11-30 02:47:15' AND TIMESTAMP '2022-12-30 22:47:55'
GROUP BY
  make

However, after populating the table with 10,000,000 test records, the same query now takes ~10 seconds on average. This is quite bad.

Is there anything I can do to reduce the query time?

Here is the code I used to create more sample data:

const Promise = require('bluebird');
const { TimestreamWrite } = require('@aws-sdk/client-timestream-write');

const writeClient = new TimestreamWrite({});

// Writes a single multi-measure record. Supplying a Version turns repeated
// writes of the same record into upserts instead of rejected duplicates.
const upsert = async ({ dimensions, metrics, time, version }) => {
  const params = {
    DatabaseName: 'sampleDB',
    TableName: 'IoTMulti',
    Records: [
      {
        Time: time, // epoch milliseconds as a string (the default TimeUnit)
        MeasureValues: metrics
      }
    ],
    CommonAttributes: {
      Dimensions: dimensions,
      MeasureValueType: 'MULTI',
      MeasureName: 'IoTMulti-stats',
      Version: version || 1
    }
  };

  const result = await writeClient.writeRecords(params);

  return result;
};

const makes = [
  'Birchfield',
  'Alpha Sports',
  'Ascort',
  'Austin',
  'Australian Six',
  'Australis',
  'Blade',
  'Buchanan',
  'Buckle',
  'Bush Ranger',
  'Caldwell Vale',
  'Cheetah',
  'Chrysler',
  'Ford',
  'FPV',
  'Giocattolo',
  'Goggomobil',
  'Hartnett',
  'Holden',
  'HSV',
  'Honda',
  'Ilinga',
  'Kaditcha',
  'Leyland',
  'Lloyd-Hartnett',
  'Lonsdale',
  'Mitsubishi',
  'Morris',
  'Nissan',
  'Pellandini',
  'Purvis Eureka',
  'Shrike',
  'Southern Cross',
  'Statesman',
  'Tarrant',
  'Toyota',
  'Volkswagen',
  'Zeta'
];

(async () => {
  const write = async () => {
    const rows = [...Array(2500).keys()].map(() => {
      // Spread the synthetic timestamps across the last 24 hours.
      const minDate = Date.now() - 1000 * 60 * 60 * 24;
      const maxDate = Date.now();

      return {
        dimensions: [
          { Name: 'fleet', Value: 'Test' },
          {
            Name: 'truck_id',
            Value: Math.floor(Math.random() * 4132246625).toString()
          },
          { Name: 'fuel_capacity', Value: '100' },
          { Name: 'load_capacity', Value: '1000' },
          {
            Name: 'make',
            Value: makes[Math.floor(Math.random() * makes.length)]
          },
          { Name: 'model', Value: Math.floor(Math.random() * 10000).toString() }
        ],
        metrics: [
          {
            Name: 'fuel-reading',
            Value: (Math.random() * 100).toString(),
            Type: 'DOUBLE'
          },
          {
            Name: 'load',
            Value: Math.floor(Math.random() * 2000).toString(),
            Type: 'DOUBLE'
          },
          {
            Name: 'location',
            Value: '25.7617° N, 80.1918° W',
            Type: 'VARCHAR'
          },
          {
            Name: 'speed',
            Value: Math.floor(Math.random() * 100).toString(),
            Type: 'DOUBLE'
          }
        ],
        time: Math.floor(
          Math.random() * (maxDate - minDate + 1) + minDate
        ).toString(),
        version: 1
      };
    });

    // Fire all 2,500 single-record writes concurrently. (A single writeRecords
    // call accepts up to 100 records, so batching would cut the request count.)
    await Promise.all(
      rows.map(async (row) => {
        await upsert(row);
      })
    );
  };

  // Run the 2,500-row batches one at a time.
  await Promise.mapSeries([...Array(1000000).keys()], async (index) => {
    console.log(`${index + 1}`);
    await write();
  });
})();
Chad Johnson

1 Answer


Timestream (TS) uses partition keys to optimize queries. When you create a TS table, you can specify which partition key you would like to use. According to the docs, TS data is stored and distributed according to this partitioning scheme.

Partition keys are typically used when you already know what kind of queries will be run against the TS table.

In addition to the partition key you can choose at table creation, TS automatically partitions data based on ingestion time, not on the timestamp column itself. For your demo application, where the data was generated in one batch, this ingestion-time partitioning is therefore useless.

To improve your application's performance, there is only one option left that does not explode your costs by keeping all the data in the memory store:

  1. Create a new dimension to hold a copy of the timestamp in your data. (You should not use measure_name directly, as there is a limit on the number of distinct measure_name values a table can have. Sadly, partition keys can only be of type VARCHAR...)
  2. When creating the TS table, define the partition key to be the new dimension from step 1 (see the SDK sketch after the query below).
  3. Queries of the following form will now be almost instantaneous:
SELECT
  make, COUNT(*) AS makeCount
FROM
  "sampleDB"."IoTMulti"
WHERE
  dimension_name = '1638240435000'
GROUP BY
  make

where 1638240435000 is the millisecond equivalent of '2021-11-30 02:47:15' (UTC).
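
For step 2, here is a minimal sketch using the same @aws-sdk/client-timestream-write client as in the question. As far as I know, the partition key can only be declared when the table is created, not added to an existing table; dimension_name below is the same placeholder name used in the queries here:

const { TimestreamWrite } = require('@aws-sdk/client-timestream-write');

const writeClient = new TimestreamWrite({});

(async () => {
  // Declare the timestamp-copy dimension as the table's partition key.
  await writeClient.createTable({
    DatabaseName: 'sampleDB',
    TableName: 'IoTMulti',
    Schema: {
      CompositePartitionKey: [
        {
          Type: 'DIMENSION',
          Name: 'dimension_name', // placeholder; use your real dimension name
          EnforcementInRecord: 'REQUIRED' // reject records that omit the key
        }
      ]
    }
  });
})();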

Since your use case relies on the BETWEEN clause in the query, you have to make some additional tweaks:

  1. Define a rounding interval. This could, for example, equal the average window size of your BETWEEN queries. (Right now you asked for a 1-year period, which is quite big; let's assume you usually query a single day instead.)
  2. Before writing the data to the TS table, store the rounded timestamp in the dimension_name column instead (see the rounding sketch after the query below).
  3. Your query can now fetch all entries from 2021-11-30:
SELECT
  make, COUNT(*) AS makeCount
FROM
  "sampleDB"."IoTMulti"
WHERE
  dimension_name = '1638230400000'
GROUP BY
  make
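
Here is a minimal sketch of the rounding step, assuming a one-day interval, UTC arithmetic, and the same dimension_name placeholder:

// Round a millisecond timestamp down to the start of its UTC day.
const DAY_MS = 24 * 60 * 60 * 1000;
const roundToDay = (ms) => Math.floor(ms / DAY_MS) * DAY_MS;

// When building a record, write the rounded value as an ordinary dimension:
const timestampDimension = (time) => ({
  Name: 'dimension_name', // placeholder; must match the partition key
  Value: roundToDay(Number(time)).toString()
});

console.log(roundToDay(1638240435000)); // 1638230400000 = 2021-11-30 00:00:00 UTC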

You should however choose the rounding interval wisely: with day-granularity keys, getting the result for a whole week now requires seven equality checks combined with OR (or an IN list). Alternatively, you can use the TS string operators for the date filtering. Another (maybe better) option still exists: use TS cast operations along with the TS date operators to convert the dimension_name timestamps into real dates and filter on those, as sketched below.
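
For example, here is a sketch of that cast-based variant; from_milliseconds() and CAST are part of the Timestream query language, but I have not verified whether a filter wrapped in functions like this can still use the partition key for pruning:

SELECT
  make, COUNT(*) AS makeCount
FROM
  "sampleDB"."IoTMulti"
WHERE
  from_milliseconds(CAST(dimension_name AS BIGINT))
    BETWEEN TIMESTAMP '2021-11-30 00:00:00' AND TIMESTAMP '2021-12-06 23:59:59'
GROUP BY
  make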

PS: it is also possible to store the timestamp value as '2021-11-30 02:47:15' directly in the DB if you prefer.

PPS: I don't know the performance of using the TS operators with partition keys. At least with values like '1638230400000' and equality comparisons, the results come back fast. :)

Lucas Meier