I am trying to load 100 billion multi-dimensional time series datapoints into KairosDB from a CSV file with the following format:
timestamp value_1 value_2 .... value_n
I looked for a fast loading method in the official documentation, and here is how I am currently doing the insertion (my codebase is in Python):
import json
from datetime import datetime

import requests
from tqdm import tqdm

# args.file, rows, column and get_datetime() are defined elsewhere in my script.
f = open(args.file, "r")

# Insert
i = 0
with tqdm(total=int(rows)) as pbar:
    while i < rows:
        data = []
        batch_size = 65000 // column  # rows per batch, so each POST carries ~65,000 data points
        while i < rows and batch_size > 0:
            batch_size -= 1
            i += 1
            values = f.readline()[:-1].split(" ")
            # Timestamp in milliseconds since the Unix epoch.
            t = (get_datetime(values[0])[0] - datetime(1970, 1, 1)).total_seconds() * 1000
            t = int(t)
            # One metric entry per dimension, tagged with its column index.
            for j in range(column):
                data.append({
                    "name": "master.data",
                    "datapoints": [[t, values[j + 1]]],
                    "tags": {
                        "dim": "dim" + str(j)
                    }
                })
        r = requests.post("http://localhost:8080/api/v1/datapoints", data=json.dumps(data))
        pbar.update(65000 // column)
As the code above shows, my code reads the dataset CSV file, prepares batches of roughly 65,000 data points, and sends each batch with a single requests.post call.
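For reference, the JSON body that each requests.post sends looks roughly like this (the timestamp and values below are made up for illustration; the real batch contains one such object per dimension per row, i.e. about 65,000 of them per request, and the values are sent as strings because they come straight from split):

[
    {"name": "master.data", "datapoints": [[1526987400000, "0.42"]], "tags": {"dim": "dim0"}},
    {"name": "master.data", "datapoints": [[1526987400000, "0.57"]], "tags": {"dim": "dim1"}},
    ...
]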
However, this method is not very efficient. I am trying to load 100 billion data points in total, and this is taking far longer than expected: loading only 3 million rows with 100 columns each has been running for 29 hours and still shows 991 hours remaining!
I am certain there is a better way to load the dataset into KairosDB. Any suggestions for faster data loading?