
Let me know if I posted anything incorrectly here. (Note: KairosDB runs on top of Cassandra and uses Hector.)

I'm using the KairosDB Java client to dump large amounts of sample data into the datastore. I've currently dumped 6 million data points in and am now attempting to delete all of them with the following method:

public static void purgeData(String metricsType, HttpClient c, int num, TimeUnit units){
    try {
        System.out.println("Beginning method");
        // Point the client at the delete endpoint
        c = new HttpClient("http://localhost:8080/api/v1/datapoints/delete");
        QueryBuilder builder = QueryBuilder.getInstance();
        System.out.println("Preparing to delete info");
        // num and units are intentionally not used here; the hard-coded range is meant to cover everything
        builder.setStart(20, TimeUnit.MONTHS).setEnd(1, TimeUnit.SECONDS).addMetric(metricsType);
        System.out.println("Attempted to delete info");
        QueryResponse response = c.query(builder);
        //System.out.println("JSON: " + response.getJson());

    } catch (Exception e) {
        System.out.println("Deleting data points produced an error");
        e.printStackTrace();
    }
}

Note that I'm not using the time-interval parameters here — I'm simply trying to delete all of the data at once.

When I execute this method, no points appear to be deleted. I then tried curling the equivalent JSON query against the delete endpoint and received a HectorException stating "all host pools marked down. Retry burden pushed out to client".

My personal conclusion is that 6 million points is too many to delete at once. I was thinking about deleting them in pieces, but I don't know how to restrict how many rows I delete from the KairosDB Java client side. I know that KairosDB is used in production, so how do people effectively delete large amounts of data with the Java client?

Thanks very much for your time!

2 Answers


You can use cqlsh or cassandra-cli to truncate KairosDB's tables (data_points, row_key_index, string_index). I'm not familiar enough with KairosDB to know whether that's going to cause issues, though.

> truncate {your keyspace}.data_points;

It might take a few seconds to complete.
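
The follow-up comments below mention doing this from Java. For reference, here is a minimal sketch of issuing the same truncates through the DataStax Java driver; the contact point and the "kairosdb" keyspace name are assumptions (the keyspace is whatever is configured in kairosdb.properties), not values from the question:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class TruncateKairosTables {
    public static void main(String[] args) {
        // Connect directly to the Cassandra node backing KairosDB (address is a placeholder)
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Truncate the KairosDB tables; replace "kairosdb" with your actual keyspace
            session.execute("TRUNCATE kairosdb.data_points");
            session.execute("TRUNCATE kairosdb.row_key_index");
            session.execute("TRUNCATE kairosdb.string_index");
        }
    }
}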

Chris Lohfink
  • Thanks for the response! Unfortunately, I need to do this from inside the Java file, as it's going to be dependent on a parameter provided from a different service. Any ideas with the Java client? – user3781090 Jul 24 '15 at 23:35
  • If you are using a CQL driver you can execute the truncate above with session.execute("truncate keyspace.data_points"); there's support for it in the query builder like the one you're using as well – Chris Lohfink Jul 26 '15 at 02:02
  • Thank you so much! I'll look into this. – user3781090 Jul 27 '15 at 15:46

Deleting 6 million data points at once should not be a problem.

This exception is weird; it usually means that Hector could not communicate with Cassandra. Did you check that everything is all right in the KairosDB and Cassandra log files? Are all the coordinators configured in kairosdb.properties alive?

If it's not due to Cassandra, I recommend raising an issue on the KairosDB GitHub for your problem, attaching the JSON of your query and the KairosDB log.

There are two ways of deleting data in KairosDB.

A) If you need to delete all data points for a given metric, you can just use the delete metric API; it calls the same method in the background, so expect the same results. However, it will be much faster, because entire matching rows are deleted from Cassandra instead of individual cells.
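
As a rough illustration of option A (not part of the original answer): KairosDB's REST API exposes a delete-metric endpoint, DELETE /api/v1/metric/{metric_name}, and newer versions of the Java client also expose a deleteMetric(String) method. A minimal sketch using plain java.net, with host, port, and metric name as placeholders:

import java.net.HttpURLConnection;
import java.net.URL;

public class DeleteMetricExample {
    public static void main(String[] args) throws Exception {
        // DELETE /api/v1/metric/{metric_name} removes the metric and all of its data points
        URL url = new URL("http://localhost:8080/api/v1/metric/my_metric");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("DELETE");
        // A 2xx response code indicates the delete was accepted
        System.out.println("Response code: " + conn.getResponseCode());
        conn.disconnect();
    }
}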

B) If you need to delete only some of the data points for one metric, then you are already using the right method (a delete query).

Before going further, I see that you don't define tags in your delete query, so you would delete all data points for all series of this metric during the time interval... Is that what you want to do?
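
As a sketch of option B with tags restricting which series gets deleted (the tag name/value and metric name are made up, and this assumes your version of the Java client exposes delete(QueryBuilder), which posts the query JSON to /api/v1/datapoints/delete):

import org.kairosdb.client.HttpClient;
import org.kairosdb.client.builder.QueryBuilder;
import org.kairosdb.client.builder.TimeUnit;

public class DeleteSomeSeries {
    public static void main(String[] args) throws Exception {
        HttpClient client = new HttpClient("http://localhost:8080");
        QueryBuilder builder = QueryBuilder.getInstance();
        builder.setStart(20, TimeUnit.MONTHS)
               .setEnd(1, TimeUnit.SECONDS)
               .addMetric("my_metric")        // placeholder metric name
               .addTag("host", "server1");    // only the series matching this tag is deleted
        client.delete(builder);               // issues the delete query
    }
}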

Last, to answer your question: we do perform delete operations on large amounts of data (batch reinserts of millions of samples, where we delete all the matching series for the time interval and then reinsert). Our operations work on large numbers of metrics (thousands of them), so the delete query is very large and works pretty well. We have not handled millions of points on the same metric, but unless you really have only one series the results should be the same.

If the millions of samples to delete turn out to be the problem (I doubt it), you can try the following: split your delete query into several time intervals (put the same metric into your delete query several times, but with fractions of the total time interval), so that you reduce the number of samples to delete in one batch.
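
A rough sketch of that splitting idea with the Java client, issuing one smaller delete per time window (the window size, metric name, and the delete(QueryBuilder) call are assumptions used for illustration):

import org.kairosdb.client.HttpClient;
import org.kairosdb.client.builder.QueryBuilder;

import java.util.Date;
import java.util.concurrent.TimeUnit;

public class WindowedDelete {
    public static void main(String[] args) throws Exception {
        HttpClient client = new HttpClient("http://localhost:8080");

        long end = System.currentTimeMillis();
        long start = end - TimeUnit.DAYS.toMillis(600);   // roughly the 20-month range from the question
        long window = TimeUnit.DAYS.toMillis(30);         // delete about one month at a time (arbitrary)

        for (long from = start; from < end; from += window) {
            long to = Math.min(from + window, end);
            QueryBuilder builder = QueryBuilder.getInstance();
            builder.setStart(new Date(from))
                   .setEnd(new Date(to))
                   .addMetric("my_metric");               // placeholder metric name
            client.delete(builder);                       // one smaller delete per window
        }
    }
}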

I hope this helps.

Loic

  • I implemented it the way you mentioned (batching time intervals) and it's working. Thanks! I'm still confused about why deleting so much proves to be such a problem. Do you know if there is an internally accessible timestamp in Cassandra, so I could perform the delete there? – user3781090 Aug 06 '15 at 17:38
  • Usually deleting all the data at once is much faster, since rows are dropped from Cassandra instead of putting a tombstone on millions of individual cells. So I'm also confused, and it's interesting to investigate. I see that you delete data from 20 months ago to 1 second ago; may I ask why you chose those values? Do you know how many series you have under this metric name during this time interval? – Loic Aug 06 '15 at 20:33