I have a 32 GB CSV with almost 150 million rows. I planned to use sstableloader to load the data into Cassandra on EC2, and I used the Java code below to generate the SSTable. The problem is that on the server I only get 12k rows, the generated SSTable is only 28 MB, and the process does not throw any error. If I run the same code on another .csv with 10 rows, there are no issues and I get all 10 rows.

if(args.length < 2){
        System.out.println("Something wrong with parameters, heres pattern: <CSV_URL> <Default_Output_Dir>");
        return;
}

CSV_URL = args[0];
DEFAULT_OUTPUT_DIR = args[1];

// magic!
Config.setClientMode(true);

// Create output directory that has keyspace and table name in the path
File outputDir = new File(DEFAULT_OUTPUT_DIR + File.separator + KEYSPACE + File.separator + TABLE);
if (!outputDir.exists() && !outputDir.mkdirs())
{
    throw new RuntimeException("Cannot create output directory: " + outputDir);
}

// Prepare SSTable writer
CQLSSTableWriter.Builder builder = CQLSSTableWriter.builder();
// set output directory
builder.inDirectory(outputDir)
       // set target schema
       .forTable(SCHEMA)
       // set CQL statement to put data
       .using(INSERT_STMT)
       // set partitioner if needed
       // default is Murmur3Partitioner so set if you use different one.
       .withPartitioner(new Murmur3Partitioner());
CQLSSTableWriter writer = builder.build();

try (
    BufferedReader reader = new BufferedReader(new FileReader(CSV_URL));
    CsvListReader csvReader = new CsvListReader(reader, CsvPreference.STANDARD_PREFERENCE)
){
    //csvReader.getHeader(true);

    // Write to SSTable while reading data
    List<String> line;
    while ((line = csvReader.read()) != null)
    {
        writer.addRow(
            Integer.parseInt(line.get(0)),
            ..
            new BigDecimal(line.get(22)),
            new BigDecimal(line.get(23))
        );
    }
}
catch (Exception e)
{
    e.printStackTrace();
}


try
{
    writer.close();
}
catch (IOException ignore) {}
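One thing worth noting about the loop above: the `catch` sits outside the `while`, so if any single row fails to parse, reading stops at that row and everything after it is skipped. A minimal sketch of a way to check this is to catch per row and count rows read versus rows written. Everything here is an illustration under assumptions: `RowSink` is a hypothetical stand-in for `writer.addRow(...)`, the CSV is split naively on commas instead of using `CsvListReader`, and only two columns are converted.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

public class RowCountingLoader {

    /** Hypothetical stand-in for CQLSSTableWriter.addRow(...). */
    interface RowSink {
        void addRow(Object... values) throws Exception;
    }

    /** Counters collected while loading. */
    static final class Stats {
        long read;     // lines read from the CSV
        long written;  // rows handed to the sink without an exception
        final List<String> failures = new ArrayList<>(); // first few error messages
    }

    static Stats load(BufferedReader reader, RowSink sink) throws IOException {
        Stats stats = new Stats();
        String line;
        while ((line = reader.readLine()) != null) {
            stats.read++;
            try {
                // Naive split: assumes no quoted commas (an assumption, not the original parser).
                String[] cols = line.split(",", -1);
                sink.addRow(Integer.parseInt(cols[0].trim()), new BigDecimal(cols[1].trim()));
                stats.written++;
            } catch (Exception e) {
                // Record the failure and keep going instead of ending the whole load.
                if (stats.failures.size() < 10) {
                    stats.failures.add("line " + stats.read + ": " + e);
                }
            }
        }
        return stats;
    }

    public static void main(String[] args) throws IOException {
        String csv = "1,1.5\nx,2.0\n3,abc\n4,4.25\n";
        List<Object[]> rows = new ArrayList<>();
        Stats s = load(new BufferedReader(new StringReader(csv)), values -> rows.add(values));
        System.out.println("read=" + s.read + " written=" + s.written);
        s.failures.forEach(System.out::println);
    }
}
```

If `read` and `written` both come out near 150 million, parsing is probably not where the rows are lost; if `read` stops around 12k, the input or the reader would be the place to look.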

And here's the schema:

CREATE KEYSPACE IF NOT EXISTS ma WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE ma;
CREATE TABLE IF NOT EXISTS cassie (PKWID int,DX varchar,......, QS decimal,PRIMARY KEY (PKWID));

I am using Cassandra 2.2.x and the Java driver to create the SSTable.

Nenad Bozic
Arsalan Saleem
  • Are you sure that `PKWID` is not the same for some rows in the CSV, and that Cassandra is not doing an upsert? – Nenad Bozic Sep 22 '15 at 17:35
  • @NenadBozic I am sure about PKWID; it is different for every row. But I am not sure whether Cassandra is doing an upsert. Can you help me figure that out, please? – Arsalan Saleem Sep 23 '15 at 01:51
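Since `PKWID` is the whole primary key, rows that share a `PKWID` overwrite each other, so one way to settle the upsert question is to count distinct keys in the CSV before loading. A rough sketch, assuming the key is the first column, is an unquoted non-negative `int`, and can be found by splitting on the first comma (`DuplicateKeyCheck` and `countKeys` are made-up names for illustration):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.BitSet;

public class DuplicateKeyCheck {

    /**
     * Returns {total rows, distinct keys}.
     * Assumes the first column is a plain non-negative int with no quoting.
     * A BitSet is used so that ~150 million keys do not need a boxed HashSet.
     */
    static long[] countKeys(BufferedReader reader) throws IOException {
        BitSet seen = new BitSet();
        long total = 0;
        long distinct = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            int comma = line.indexOf(',');
            String first = comma < 0 ? line : line.substring(0, comma);
            int key = Integer.parseInt(first.trim());
            total++;
            if (!seen.get(key)) {
                seen.set(key);
                distinct++;
            }
        }
        return new long[] {total, distinct};
    }

    public static void main(String[] args) throws IOException {
        String csv = "1,a\n2,b\n1,c\n3,d\n";
        long[] r = countKeys(new BufferedReader(new StringReader(csv)));
        // prints: rows=4 distinct=3
        System.out.println("rows=" + r[0] + " distinct=" + r[1]);
    }
}
```

If the distinct count is close to the total row count, upserts would not explain ending up with only 12k rows.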

0 Answers