
I need to insert a large xlsx file (close to 1 million rows) into a Cassandra database, and I'm having doubts about how to do it because of memory limitations.

I'm working with batch inserts, but that is proving to be nearly impossible due to the huge memory impact.

    $batch = new Cassandra\BatchStatement(Cassandra::BATCH_UNLOGGED);

    // Prepare the insert once, outside the loop, instead of re-preparing per row
    $prepared = $session->prepare(
        "INSERT INTO teste (ptd_assoc, ref_equip, dates) " .
        "VALUES (?, ?, ?)"
    );

    $count = 0;

    foreach ($workbook->createRowIterator($myWorksheetIndex) as $rowIndex => $values) {

        // Skip the header row
        if ($count > 0) {

            // Column 3 is either an int (no time part) or a DateTime
            $time = is_int($values[3])
                ? $values[2]->format('d-m-Y') . ' 00:00:00'
                : $values[2]->format('d-m-Y') . ' ' . $values[3]->format('H:i:s');
            $date = DateTime::createFromFormat('d-m-Y H:i:s', $time);

            $batch->add($prepared, array(
                'ptd_assoc' => $values[0],
                'ref_equip' => $values[1],
                'dates' => new Cassandra\Timestamp($date->getTimestamp()),
                //  'load' => 3.4454
            ));
        }

        $count++;
    }

    $session->execute($batch);

I have successfully transformed the xlsx into a more readable CSV file. Is it possible to COPY it into the database using the Cassandra\SimpleStatement method?

Andre Garcia
1 Answer


If the data is in well-formatted CSV, you may not have to write a custom importer. Have a look at the cqlsh COPY FROM command (help copy;).
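For example, assuming the CSV column order matches the teste table from the question and the file has a header line (the keyspace name mykeyspace and the file path are placeholders), the import from a cqlsh prompt would look something like:

    cqlsh> COPY mykeyspace.teste (ptd_assoc, ref_equip, dates)
           FROM '/path/to/data.csv' WITH HEADER = TRUE;

COPY reads the file incrementally rather than holding everything in memory, so it avoids the blow-up of a single giant batch.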

Adam Holmberg
  • 'COPY' is a cqlsh (shell) command and not a CQL (protocol) command. If I want to do this with PHP, the only way is with an exec function call to cqlsh. Agree? – Andre Garcia Oct 27 '16 at 14:14
  • I installed the cqlsh client to access my server, but the COPY command gives me this error: `":1:'module' object has no attribute 'parse_options'"` — no idea! – Andre Garcia Oct 27 '16 at 14:53
  • Sounds like https://issues.apache.org/jira/browse/CASSANDRA-12284. If you must use PHP, I recommend writing asynchronous concurrent requests (see the sketch after this thread). Loading all rows in a single batch will cause problems on both the client and server side. – Adam Holmberg Oct 27 '16 at 17:24
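For reference, a minimal sketch of that asynchronous approach with the DataStax PHP driver, assuming the teste table from the question, a $rows iterable already parsed from the CSV (a placeholder), and an arbitrary window of 500 in-flight requests:

    // Prepare once and reuse the statement for every row
    $prepared = $session->prepare(
        "INSERT INTO teste (ptd_assoc, ref_equip, dates) VALUES (?, ?, ?)"
    );

    $futures = array();

    foreach ($rows as $row) {
        // executeAsync() returns a future immediately instead of blocking
        $futures[] = $session->executeAsync($prepared, new Cassandra\ExecutionOptions(array(
            'arguments' => array(
                'ptd_assoc' => $row[0],
                'ref_equip' => $row[1],
                'dates'     => new Cassandra\Timestamp(strtotime($row[2])),
            ),
        )));

        // Bound the number of in-flight requests so client memory stays flat
        if (count($futures) >= 500) {
            foreach ($futures as $future) {
                $future->get(); // block until this write completes
            }
            $futures = array();
        }
    }

    // Drain whatever is still in flight
    foreach ($futures as $future) {
        $future->get();
    }

Unlike the single unlogged batch, only 500 statements ever live in memory at once, and the writes overlap on the wire.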