I am trying to load 100 billion multi-dimensional time series datapoints into KairosDB from a CSV file with the following format:

timestamp value_1 value_2 .... value_n

I tried to find a fast bulk-loading method in the official documentation, and here is how I am currently doing the insertion (my codebase is in Python):

import json
from datetime import datetime

import requests
from tqdm import tqdm

# Number of rows per request, so each POST carries roughly 65000 datapoints.
rows_per_batch = 65000 // column

with open(args.file, "r") as f, tqdm(total=int(rows)) as pbar:
    i = 0
    while i < rows:
        data = []
        batch_rows = min(rows_per_batch, rows - i)
        # Read one batch of rows and turn every column into a datapoint.
        for _ in range(batch_rows):
            i += 1
            values = f.readline().rstrip("\n").split(" ")
            # Convert the row's timestamp to epoch milliseconds.
            t = int((get_datetime(values[0])[0] - datetime(1970, 1, 1)).total_seconds() * 1000)
            for j in range(column):
                data.append({
                    "name": "master.data",
                    "datapoints": [[t, values[j + 1]]],
                    "tags": {
                        "dim": "dim" + str(j)
                    }
                })
        # One HTTP request per batch of ~65000 datapoints.
        r = requests.post("http://localhost:8080/api/v1/datapoints", data=json.dumps(data))
        pbar.update(batch_rows)

As the code above shows, I read the dataset CSV file, prepare batches of roughly 65,000 datapoints, and send each batch to KairosDB with requests.post.
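
For context, each request body my loop builds is a list with one entry per dimension, and every entry carries a single [timestamp, value] pair. Illustratively (the timestamp and values below are made up), the data list looks like this before json.dumps:

data = [
    {
        "name": "master.data",
        "datapoints": [[1514764800000, "0.42"]],  # illustrative epoch-ms timestamp and value
        "tags": {"dim": "dim0"}
    },
    {
        "name": "master.data",
        "datapoints": [[1514764800000, "1.7"]],
        "tags": {"dim": "dim1"}
    },
    # ... up to ~65000 such entries per request
]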

However, this method is not efficient. I am trying to load 100 billion datapoints, and it is taking far longer than expected: loading only 3 million rows of 100 columns each has already been running for 29 hours, and the progress bar still shows 991 hours to finish!
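
A rough back-of-the-envelope check on those numbers: 3 million rows × 100 columns ≈ 300 million datapoints in 29 hours, i.e. roughly 2,900 datapoints per second. At that rate, 100 billion datapoints would take on the order of 9,500–10,000 hours, which is more than a year.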

I am certain there is a better way to load the dataset into KairosDB. Any suggestions for faster data loading?
