We are trying to write a large number of records (upwards of 5 million at a time) into Cassandra. These are being read from tab delimited files and are being imported into Cassandra using executeAsync. We have been using much smaller datasets (~330k records) which will be more common. Until recently, our script has been silently stopping its import at around 65k records. Since upgrading the RAM from 2Gb to 4Gb the number of records importing have since doubled but we are still not successfully importing all the records.
This is an example of the process we are running at present:
$cluster = \Cassandra::cluster()->withContactPoints('127.0.0.1')->build();
$session = $cluster->connect('example_data');
$statement = $session->prepare("INSERT INTO example_table (example_id, column_1, column_2, column_3, column_4, column_5, column_6) VALUES (uuid(), ?, ?, ?, ?, ?, ?)");
$futures = array();
$data = array();
foreach ($results as $row) {
$data = array($row[‘column_1’], $row[‘column_2’], $row[‘column_3’], $row[‘column_4’], $row[‘column_5’], $row[‘column_6’]);
$futures = $session->executeAsync($statement, new \Cassandra\ExecutionOptions(array(
'arguments' => $data
)));
}
We suspect that this might be down to the heap running out of space:
DEBUG [SlabPoolCleaner] 2017-02-27 17:01:17,105 ColumnFamilyStore.java:1153 - Flushing largest CFS(Keyspace='dev', ColumnFamily='example_data') to free up room. Used total: 0.67/0.00, live: 0.33/0.00, flushing: 0.33/0.00, this: 0.20/0.00
DEBUG [SlabPoolCleaner] 2017-02-27 17:01:17,133 ColumnFamilyStore.java:854 - Enqueuing flush of example_data: 89516255 (33%) on-heap, 0 (0%) off-heap
The table we are inserting this data is as follows:
CREATE TABLE example_data (
example_id uuid PRIMARY KEY,
column_1 int,
column_2 varchar,
column_3 int,
column_4 varchar,
column_5 int,
column_6 int
);
CREATE INDEX column_5 ON example_data (column_5);
CREATE INDEX column_6 ON example_data (column_6);
We have attempted to use the batch method but believe it is not appropriate here as it causes the Cassandra process to run at a high level of CPU usage (~85%).
We are using the latest version of DSE/Cassandra available from the repository.
Cassandra 3.0.11.1564 | DSE 5.0.6