I had a single-node (DataStax) Cassandra cluster into which I had to insert about 10 GB of data from a file. I wrote a Java program to read the file and store the data as follows:

 import java.io.BufferedReader;
 import java.io.FileReader;
 import java.io.IOException;
 import java.util.Date;
 import com.datastax.driver.core.BoundStatement;
 import com.datastax.driver.core.Cluster;
 import com.datastax.driver.core.PreparedStatement;
 import com.datastax.driver.core.Session;

 public class Xb {

//cluster and session for cassandra connection
private static Cluster cluster;
private static Session session;

//variables for storing file elements
private static String taxid;
private static String geneid;
private static String status;
private static String rna_version;
private static String rna_gi;

private static String protein_version;
private static String protein_gi;
private static String gen_nuc_ver;

private static String gen_nuc_gi;
private static String start_gen_acc;
private static String end_gen_acc;

private static String orientation;
private static String assembly;

     private static String mature_ver;

     private static String mature_gi;

     private static String symbol;

    //Connecting the cassandra node(local host)
    public static Cluster connect(String node){
    return Cluster.builder().addContactPoint(node).build();
   }
    public static void main(String[] args) {
    long lStartTime = new Date().getTime();
    // TODO Auto-generated method stub
    //call connect by passing localhost 
    cluster =connect("localhost");
    session = cluster.connect();
    //session.execute("CREATE KEYSPACE test1 WITH REPLICATION =" +"{'class':'SimpleStrategy','replication_factor':3}");
    //session.createtable('genomics');
    //use test1 : triggers the use of test1 keyspace
    session.execute("USE test1");
    //for counting the lines in the file
    int lineCount=0;

    try
    {
        //Reading the file
        FileReader fr = new FileReader("/home/syedammar/gene2refseq/gene2refseq");
        BufferedReader bf = new BufferedReader(fr);
        String line;
        //iterating over each line in file
        while((line= bf.readLine())!=null){
                lineCount++;
                //splitting the line based on tab spaces
                String[] a =line.split("\\s+");
                System.out.println("Line Count now is ->"+lineCount);
                //System.out.println("This is content"+line+" OVER HERE");
                /*for(int i =0;i<a.length;i++){
                System.out.println(i+"->"+a[i]);
              }*/
                //assigning the values to the corresponding variables
                taxid =a[0];
                geneid=a[1];
                status=a[2];
                rna_version=a[3];
                rna_gi=a[4];
                protein_version=a[5];
                protein_gi=a[6]; 
                gen_nuc_ver=a[7];
                gen_nuc_gi=a[8];
                start_gen_acc=a[9];
                end_gen_acc=a[10];
                orientation=a[11];
                assembly=a[12];
                mature_ver=a[13];
                mature_gi=a[14];
                symbol=a[15];

            //Writing the insert query
            PreparedStatement statement = session.prepare(
            "INSERT INTO test.genomics " +
            "(taxid, " +
            "geneid, " +
            "status, " +
            "rna_version, " +
            "rna_gi, " +
            "protein_version, " +
            "protein_gi, " +
            "gen_nuc_ver, " +
            "gen_nuc_gi, " +
            "start_gen_acc, " +
            "end_gen_acc, " +
            "orientation, " +
            "assembly, " +
            "mature_ver, " +
            "mature_gi," +
            "symbol" + 
            ") VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?);"); 

            //create the bound statement and initialise it with your prepared statement
            BoundStatement boundStatement = new BoundStatement(statement); 

            session.execute( // this is where the query is executed
            boundStatement.bind( // here you are binding the 'boundStatement'
            taxid,geneid,status,rna_version,rna_gi,protein_version,protein_gi,gen_nuc_ver,gen_nuc_gi,start_gen_acc,end_gen_acc,orientation,assembly,mature_ver,mature_gi,symbol));
    }//end of while
} //end of try
    catch(IOException e){
        e.printStackTrace();
    }   
        long lEndTime = new Date().getTime(); 
        long difference = lEndTime - lStartTime;
        long seconds = difference / 1000; //converting milliseconds to total elapsed seconds
        System.out.println("Elapsed seconds: " + seconds);
        System.out.println("No of lines read are :"+ lineCount);
        System.out.println("Records entered into cassandra successfully");

        session.close();
        cluster.close();

    }//end of main
}// end of class

This worked fine; I got the records stored in Cassandra.

Now I have set up a 4-node Cassandra cluster, and I want to do the same task: read the same file and store its contents in the 4-node cluster.

My question is how to do that. To which node do I need to feed this program? How do I approach this?

How would I establish a connection with the 4-node cluster, and what changes will I have to make in the above code? I assume there would be some change in this part:

 public static Cluster connect(String node){
    return Cluster.builder().addContactPoint(node).build();
} 

What would the changes be, and to which node do I feed this program? I am not clear on how it would work. Also, will it take the same amount of time to insert the entire data set into the 4-node cluster as it took for the single node, or will it be faster?

Thanks

Erick Ramirez
Syed Ammar Mustafa

1 Answer

For a good reference program showing how best to load data into Cassandra using the DataStax Java driver, take a look at Brian Hess's Cassandra-loader.

"which node do I need to feed this program"

All Cassandra nodes are equal, and all of them can take writes. The driver, however, takes care of this for you. Just give it a few of your nodes as contact points, and when it establishes the connection it will become aware of which nodes exist. It will also know which nodes own which data and perform the writes accordingly.
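
As a rough sketch of what that change could look like: this assumes the 2.x/3.x DataStax Java driver used in the question, whose `Cluster.Builder` offers `addContactPoints(String...)`. The IP addresses and the `ContactPoints` helper are hypothetical. The plain-Java helper below turns a comma-separated host list into the array you would hand to the builder. The driver call itself appears only in a comment, so the snippet runs without the driver on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

public class ContactPoints {

    // Splits a comma-separated host list such as "10.0.0.1, 10.0.0.2"
    // into trimmed, non-empty host names.
    static String[] parse(String csv) {
        List<String> hosts = new ArrayList<>();
        for (String part : csv.split(",")) {
            String host = part.trim();
            if (!host.isEmpty()) {
                hosts.add(host);
            }
        }
        return hosts.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Hypothetical addresses; a few of the four nodes are enough.
        String[] nodes = parse("10.0.0.1, 10.0.0.2, 10.0.0.3");

        // With the driver on the classpath, connect() from the question
        // could become (assumption, not run here):
        //   return Cluster.builder().addContactPoints(nodes).build();
        System.out.println(String.join(",", nodes));
    }
}
```

The program can run on any machine that can reach those addresses; it does not have to run on a particular Cassandra node.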

"will it take the same amount of time to insert the entire data in 4 node cluster as it took for single node or will it be faster."

Once you take the replication factor (RF) into account, your cluster will scale linearly as you add nodes, so your write throughput increases linearly. For example, if 3 nodes with RF 3 can take X writes, 6 nodes with RF 3 can take ~2X writes.
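
The arithmetic behind that estimate can be sketched as follows. This is a deliberately simplified model, not a benchmark: it assumes every node sustains the same write rate (the hypothetical `perNodeWrites`) and that each write is stored on `min(RF, nodes)` replicas. The class and method names are illustrative only.

```java
public class WriteScaling {

    // Rough client-visible write capacity, assuming identical nodes and
    // that each logical write is stored on min(rf, nodes) replicas.
    static double clusterWrites(int nodes, int rf, double perNodeWrites) {
        int copies = Math.min(rf, nodes);
        return nodes * perNodeWrites / copies;
    }

    public static void main(String[] args) {
        double perNode = 1000.0; // hypothetical writes per second per node

        System.out.println("1 node,  RF 3: " + clusterWrites(1, 3, perNode));
        System.out.println("3 nodes, RF 3: " + clusterWrites(3, 3, perNode));
        System.out.println("6 nodes, RF 3: " + clusterWrites(6, 3, perNode));
    }
}
```

Under this model, 3 nodes at RF 3 offer the same capacity as a single node, and 6 nodes at RF 3 offer twice that.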

phact
  • Thanks!! I also have another question. I have inserted around 17 million columns into the 3-node cluster, and when I query "select count(*) from keyspace.tablename" it gives an operation timed out error, whereas "select count(*) from keyspace.tablename limit 50000" works. Could you please tell me how I can fix this timeout error? I also have a question about counter tables in Cassandra: how do they differ from non-counter tables, and what are their limitations? – Syed Ammar Mustafa Jul 29 '15 at 08:01
  • I hope this is not a query you intend to use in production. Your data is stored across multiple machines, and you won't get good response times on full table scans. If this is just for testing, you can crank up your read timeout. – phact Jul 29 '15 at 12:16
  • Sorry, that was rows, not columns, in the above comment. – Syed Ammar Mustafa Jul 29 '15 at 13:15
  • As I asked previously, "will it take the same amount of time to insert data (around 17 million records) into a 3-4 node cluster as it took on a single node?" I need more clarity on that. It took 4+ hours to load the data into the single node, whereas with a cluster of 3 nodes and RF=3 it is taking more than double the time. Is that expected, or am I doing something wrong? – Syed Ammar Mustafa Jul 29 '15 at 13:21
  • Same hardware? What's your RF? – phact Jul 29 '15 at 13:24
  • Is "CREATE KEYSPACE genomics WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': '3'}" the same as "CREATE KEYSPACE genomics WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': '3'}"? Please answer this. – Syed Ammar Mustafa Jul 29 '15 at 13:25
  • No, the hardware is as follows. Node 1 (primary): 8 GB RAM, 50 GB disk space allotted, 8 cores. Node 2: 16 GB RAM, 50 GB disk space allotted, 8 cores. Node 3: 8 GB RAM, 50 GB disk space allotted, 8 cores. – Syed Ammar Mustafa Jul 29 '15 at 13:35
  • Hey, could you answer these questions, please? – Syed Ammar Mustafa Jul 29 '15 at 14:16
  • RF 3 with 3 nodes will not be faster than RF 1 with 1 node. You essentially need to do 3x the writes due to your replication factor. Are these VMs? C* will perform best when it is given its own machine. If you need to use VMs, make sure you use directly mounted disk volumes. 8 GB of RAM is on the low end, since the recommended JVM heap for C* is 8 GB. – phact Jul 29 '15 at 16:02
  • Even when I ran with one node, the RF was 3. Should I make the 16 GB node the primary node? Right now the 8 GB node is the primary. Will it make a difference? Also, my use case is to insert some million records and test them with different queries, so should I keep the strategy Simple or NetworkTopology? (We don't have any plans for multiple data centres.) What would you suggest the keyspace definition be? – Syed Ammar Mustafa Jul 30 '15 at 07:37
  • Even with RF 3, if you only have 1 node you only have 1 copy of the data. There is no primary node in Cassandra. All nodes have the same functionality. Either strategy is fine if you only have 1 DC. – phact Jul 30 '15 at 12:27
  • Thanks a lot, all the info was of great help to me :). How can I reach out to you if I have further questions, as I will be using DataStax Cassandra and Spark? – Syed Ammar Mustafa Jul 31 '15 at 05:33