
We are trying to write a large number of records (upwards of 5 million at a time) into Cassandra. These are read from tab-delimited files and imported into Cassandra using executeAsync. We have been testing with much smaller datasets (~330k records), which will be more typical of our workload. Until recently, our script was silently stopping its import at around 65k records. Since upgrading the RAM from 2 GB to 4 GB, the number of records imported has roughly doubled, but we are still not successfully importing all of the records.

This is an example of the process we are running at present:

$cluster = \Cassandra::cluster()->withContactPoints('127.0.0.1')->build();
$session = $cluster->connect('example_data');

$statement = $session->prepare("INSERT INTO example_table (example_id, column_1, column_2, column_3, column_4, column_5, column_6) VALUES (uuid(), ?, ?, ?, ?, ?, ?)");
$futures = array();
$data = array();

foreach ($results as $row) {
   $data = array($row['column_1'], $row['column_2'], $row['column_3'], $row['column_4'], $row['column_5'], $row['column_6']);
   $futures = $session->executeAsync($statement, new \Cassandra\ExecutionOptions(array(
       'arguments' => $data
   )));
}

We suspect that this might be down to the heap running out of space:

DEBUG [SlabPoolCleaner] 2017-02-27 17:01:17,105  ColumnFamilyStore.java:1153 - Flushing largest CFS(Keyspace='dev', ColumnFamily='example_data') to free up room. Used total: 0.67/0.00, live: 0.33/0.00, flushing: 0.33/0.00, this: 0.20/0.00
DEBUG [SlabPoolCleaner] 2017-02-27 17:01:17,133  ColumnFamilyStore.java:854 - Enqueuing flush of example_data: 89516255 (33%) on-heap, 0 (0%) off-heap

The table we are inserting this data into is as follows:

CREATE TABLE example_data (
  example_id uuid PRIMARY KEY,
  column_1 int,
  column_2 varchar,
  column_3 int,
  column_4 varchar,
  column_5 int,
  column_6 int
);
CREATE INDEX column_5 ON example_data (column_5);
CREATE INDEX column_6 ON example_data (column_6);

We have attempted to use batch statements, but we believe they are not appropriate here, as they cause the Cassandra process to run at a high level of CPU usage (~85%).

We are using the latest version of DSE/Cassandra available from the repository.

Cassandra 3.0.11.1564 | DSE 5.0.6
deano23

1 Answer

2 GB (and really 4 GB as well) is below the minimum recommended for Cassandra in development or production. Running on that little is possible, but it requires more tweaking, since it's below what the defaults are tuned for; one common adjustment is to cap the heap explicitly (see the sketch after the list below). Even tweaked, you shouldn't expect much performance before it starts having trouble keeping up (the errors you're getting) and you need to add more nodes.

https://docs.datastax.com/en/landing_page/doc/landing_page/planning/planningHardware.html

  • Production: 32 GB to 512 GB; the minimum is 8 GB for Cassandra only and 32 GB for DataStax Enterprise analytics and search nodes.
  • Development in non-loading testing environments: no less than 4 GB.
  • DSE Graph: 2 to 4 GB in addition to your particular combination of DSE Search or DSE Analytics. If you want a large dedicated graph cache, add more RAM.
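
For example, on a small node you would typically cap the heap by hand rather than letting cassandra-env.sh auto-size it. This is only a sketch: the values (a 1 GB heap and 200 MB new generation) are placeholders you would need to tune for your machine, and the file location varies by install type (e.g. resources/cassandra/conf/ in a DSE tarball install).

# cassandra-env.sh
# If MAX_HEAP_SIZE is set explicitly, HEAP_NEWSIZE must be set as well.
MAX_HEAP_SIZE="1G"
HEAP_NEWSIZE="200M"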

Also, you're spamming writes with executeAsync and not applying any backpressure. Eventually you will overrun any system like that. You either need to add some kind of throttling or feedback, or just use synchronous requests.
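
As a rough sketch of the throttling approach (untested; the 1000-request window is an arbitrary number you'd tune), you can keep a bounded list of outstanding futures and block on the oldest one before issuing more:

$futures = array();
$maxInFlight = 1000; // arbitrary window size, tune to what the node can absorb

foreach ($results as $row) {
    $futures[] = $session->executeAsync($statement, new \Cassandra\ExecutionOptions(array(
        'arguments' => array($row['column_1'], $row['column_2'], $row['column_3'],
                             $row['column_4'], $row['column_5'], $row['column_6'])
    )));

    // Once the window is full, wait for the oldest request before sending another,
    // so roughly $maxInFlight writes are in flight at any time.
    if (count($futures) >= $maxInFlight) {
        $oldest = array_shift($futures);
        $oldest->get(); // blocks until that insert finishes (throws on failure)
    }
}

// Drain whatever is still outstanding at the end of the import.
foreach ($futures as $future) {
    $future->get();
}

Plain synchronous execute() in the loop achieves the same effect with a window of 1, at the cost of throughput.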

Chris Lohfink
  • Thanks @Chris, we've got our build working well now. Regarding throttling, is there anything built into the DataStax php driver which can be utilised for this purpose? – deano23 Mar 06 '17 at 22:54
  • no, I am not familiar enough with PHP to know any good ways of handling async methods. Probably put the futures on a list, and if the list is > 1000 or something, pull the first one off and do a `get`, so you have about 1000 in flight at all times. You can then tweak that number based on some performance numbers (it may be low with such small systems). – Chris Lohfink Mar 07 '17 at 14:30
  • Thank you very much Chris, this is very helpful. – deano23 Mar 08 '17 at 14:29