I have a 32 GB CSV file with almost 150 million rows. I planned to use sstableloader to load the data into Cassandra on EC2, and I generated the SSTables with the Java code below. The problem is that on the server I only get about 12k rows, the generated SSTable is just 28 MB, and the process does not throw any error. Moreover, if I run the same code on another .csv with 10 rows, there are no issues and I get all 10 rows.
    if (args.length < 2) {
        System.out.println("Something wrong with parameters, heres pattern: <CSV_URL> <Default_Output_Dir>");
        return;
    }
    CSV_URL = args[0];
    DEFAULT_OUTPUT_DIR = args[1];
    // magic!
    Config.setClientMode(true);
    // Create output directory that has keyspace and table name in the path
    File outputDir = new File(DEFAULT_OUTPUT_DIR + File.separator + KEYSPACE + File.separator + TABLE);
    if (!outputDir.exists() && !outputDir.mkdirs())
    {
        throw new RuntimeException("Cannot create output directory: " + outputDir);
    }
    // Prepare SSTable writer
    CQLSSTableWriter.Builder builder = CQLSSTableWriter.builder();
    // set output directory
    builder.inDirectory(outputDir)
           // set target schema
           .forTable(SCHEMA)
           // set CQL statement to put data
           .using(INSERT_STMT)
           // set partitioner if needed
           // default is Murmur3Partitioner so set if you use different one.
           .withPartitioner(new Murmur3Partitioner());
    CQLSSTableWriter writer = builder.build();
    try (
        BufferedReader reader = new BufferedReader(new FileReader(CSV_URL));
        CsvListReader csvReader = new CsvListReader(reader, CsvPreference.STANDARD_PREFERENCE)
    ) {
        //csvReader.getHeader(true);
        // Write to SSTable while reading data
        List<String> line;
        while ((line = csvReader.read()) != null)
        {
            writer.addRow(
                Integer.parseInt(line.get(0)),
                ..
                new BigDecimal(line.get(22)),
                new BigDecimal(line.get(23))
            );
        }
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
    try
    {
        writer.close();
    }
    catch (IOException ignore) {}
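To rule out rows being dropped silently, here is a minimal sketch of per-row counting. In the code above a single try/catch wraps the whole loop, so the first exception would end the loop rather than skip one row. The sketch catches per row and reports how many rows were read, written, and failed. `RowCounter`, `RowSink`, and `writeAll` are hypothetical names for illustration only; `RowSink` stands in for the `writer.addRow(...)` call, and the sample rows are made up.

```java
import java.util.Arrays;
import java.util.List;

public class RowCounter {
    /** Hypothetical stand-in for the call to writer.addRow(...). */
    interface RowSink {
        void accept(List<String> fields) throws Exception;
    }

    /** Feeds every row to the sink and returns {rowsRead, rowsWritten, rowsFailed}. */
    static long[] writeAll(Iterable<List<String>> rows, RowSink sink) {
        long read = 0, written = 0, failed = 0;
        for (List<String> fields : rows) {
            read++;
            try {
                sink.accept(fields);
                written++;
            } catch (Exception e) {
                // Report the failing row and keep going instead of ending the loop.
                failed++;
                System.err.println("Row " + read + " failed: " + e);
            }
        }
        return new long[] {read, written, failed};
    }

    public static void main(String[] args) {
        // Made-up sample data: the second row has a non-numeric key.
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("1", "a"),
            Arrays.asList("oops", "b"),
            Arrays.asList("3", "c"));
        long[] counts = writeAll(rows, fields -> Integer.parseInt(fields.get(0)));
        System.out.println("read=" + counts[0] + " written=" + counts[1] + " failed=" + counts[2]);
    }
}
```

If the read count for the real file is close to 150 million and nothing fails, the rows are reaching the writer and the loss is happening somewhere else.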
And here's the schema:
    CREATE KEYSPACE IF NOT EXISTS ma WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
    USE ma;
    CREATE TABLE IF NOT EXISTS cassie (PKWID int, DX varchar, ......, QS decimal, PRIMARY KEY (PKWID));
I'm using Cassandra 2.2.x and the Java driver to create the SSTables.
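One thing worth checking, since `PKWID` is the whole primary key: in Cassandra, an insert with a primary key that already exists overwrites the earlier row, so the table can never hold more rows than there are distinct `PKWID` values. Below is a rough sketch that compares the row count with the distinct key count. It assumes the key is the first CSV column and that no field contains an embedded newline. `KeyCheck` and `countKeys` are hypothetical names, and the input in `main` is made up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

public class KeyCheck {
    /** Returns {rows, distinctKeys}; the key is the text before the first comma. */
    static long[] countKeys(Reader source) throws IOException {
        Set<String> keys = new HashSet<>();
        long rows = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                int comma = line.indexOf(',');
                String key = comma < 0 ? line : line.substring(0, comma);
                keys.add(key.trim());
                rows++;
            }
        }
        return new long[] {rows, keys.size()};
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample: key 1 appears twice, so 3 rows collapse to 2 keys.
        long[] counts = countKeys(new StringReader("1,a\n2,b\n1,c\n"));
        System.out.println("rows=" + counts[0] + " distinctKeys=" + counts[1]);
    }
}
```

For the real file, a `java.io.FileReader` on the CSV path would replace the `StringReader`. A `HashSet` holding up to 150 million strings needs a lot of heap, so running this on a sample of the file, or tracking the int keys in a `java.util.BitSet`, may be more practical.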