I'm trying to find the most effective way to multithread a bulk load of data into multiple tables within a keyspace in Cassandra from a Java program. Here's my Keyspace/Table declaration:

CREATE KEYSPACE IF NOT EXISTS articles WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : '3'};

CREATE TABLE IF NOT EXISTS articles.bigrams (docid text, bigram text, primary key (docid, bigram));
CREATE TABLE IF NOT EXISTS articles.unigrams (docid text, unigram text, primary key (docid, unigram));

And here is the portion of the Java program that is giving me issues. I'm trying to create two instances of CQLSSTableWriter and write to each of them:

package cassandrabulktest.cassandra;

import java.io.IOException;
import java.util.ArrayList;
import org.apache.cassandra.exceptions.InvalidRequestException;
import org.apache.cassandra.io.sstable.CQLSSTableWriter;



public class UnigramLoader {
    private static final String UNIGRAM_SCHEMA = "CREATE TABLE articles.unigrams (" +
                                                      "docid text, " +
                                                      "unigram text, " +
                                                      "PRIMARY KEY (unigram, docid))";

    private static CQLSSTableWriter unigram_writer = CQLSSTableWriter.builder()
                .inDirectory("/tables/articles/unigrams")
                .forTable(UNIGRAM_SCHEMA)
                .using("INSERT INTO articles.unigrams (docid, unigram) VALUES (?, ?)")
                .build();

    private static final String BIGRAM_SCHEMA = "CREATE TABLE articles.bigrams (" +
                                                      "docid text, " +
                                                      "bigram text, " +
                                                      "PRIMARY KEY (bigram, docid))";

    private static CQLSSTableWriter bigram_writer = CQLSSTableWriter.builder()
                .inDirectory("/tables/articles/bigrams")
                .forTable(BIGRAM_SCHEMA)
                .using("INSERT INTO articles.bigrams (docid, bigram) VALUES (?, ?)")
                .build();


    public static void load(String articleId, ArrayList<String> unigrams, ArrayList<String> bigrams) throws IOException, InvalidRequestException {        
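        // Values bind in the order of the INSERT column list: (docid, unigram).
        for (String unigram : unigrams) {
            unigram_writer.addRow(articleId, unigram);
        }

        // Same for bigrams: (docid, bigram).
        for (String bigram : bigrams) {
            bigram_writer.addRow(articleId, bigram);
        }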
    }

    public static void closeWriter() throws IOException {
        unigram_writer.close();
        bigram_writer.close();
    }
}

If it worked, this would start creating the SSTable files in 2 directories. However, I'm getting this error when running:

Exception in thread "Thread-1" java.lang.ExceptionInInitializerError
    at edu.georgetown.cassandrabulktest.runnables.UnigramRunnable.run(UnigramRunnable.java:69)
    at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found 662e2edf-c864-34a4-bca6-f83b25af6f6a; expected 7247b490-b141-11e4-a8f9-8b65543eda40)
    at org.apache.cassandra.config.CFMetaData.reload(CFMetaData.java:1125)
    at org.apache.cassandra.db.Keyspace.initCf(Keyspace.java:337)
    at org.apache.cassandra.io.sstable.CQLSSTableWriter$Builder.forTable(CQLSSTableWriter.java:360)
    at edu.georgetown.cassandrabulktest.cassandra.UnigramLoader.<clinit>(UnigramLoader.java:29)
    ... 2 more
Caused by: org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found 662e2edf-c864-34a4-bca6-f83b25af6f6a; expected 7247b490-b141-11e4-a8f9-8b65543eda40)
    at org.apache.cassandra.config.CFMetaData.validateCompatility(CFMetaData.java:1208)
    at org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:1140)
    at org.apache.cassandra.config.CFMetaData.reload(CFMetaData.java:1121)
    ... 5 more

Is there no way to do this, or is there a different way to accomplish what I want to do? Thanks in advance!

ev0lution37

1 Answer

You might want to try building and using a single writer instance, as there seem to be some race conditions when using multiple writers concurrently.
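
One way to read that suggestion is to keep a single writer and let all threads go through one lock, so that only one thread touches it at a time. The following is only a rough sketch of that pattern: RowSink and SerializedLoader are hypothetical names that do not exist in Cassandra, and RowSink stands in for whatever wraps the real writer. Whether CQLSSTableWriter itself is safe to share across threads depends on the Cassandra version, which is why the sketch does the locking itself.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical stand-in for the single writer (not a Cassandra type). */
interface RowSink {
    void addRow(String docId, String term) throws IOException;
}

/** Funnels rows from many threads into one sink, one caller at a time. */
class SerializedLoader {
    private final RowSink sink;
    private final Object lock = new Object();

    SerializedLoader(RowSink sink) {
        this.sink = sink;
    }

    /** Writes all terms of one document while holding the lock. */
    void load(String docId, List<String> terms) throws IOException {
        synchronized (lock) {
            for (String term : terms) {
                sink.addRow(docId, term);
            }
        }
    }
}
```

In the question's program, the sink would delegate to the writer's addRow, and each worker thread would call load on the same SerializedLoader instead of touching the writer directly. This trades some parallelism for a single, ordered stream of writes.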

Stefan Podkowinski