
I'm doing a batch insert to the database, and I'm now facing a data type issue when inserting rows as a batch into Cassandra.

Here I created a table named batch_rows; you can see the type of node_id is text:

          ("""
            CREATE TABLE IF NOT EXISTS batch_rows (
                local_pid int,
                node_id text,
                camera_id int,
                geolocation int,
                filepath int,
                PRIMARY KEY (local_pid, node_id)
            )
            """)

After creating the table, I tried to grab 500 rows of data and batch-insert them into the database:

import sys

from cassandra import ConsistencyLevel, OperationTimedOut
from cassandra.query import SimpleStatement

statement = 'BEGIN BATCH '
to_insert = 500
for i in range(to_insert):
    location_generator = self.generate_random_coordinate()

    local_pid = i
    node_id = 'P2'
    camera_id = 1
    geolocation = 4
    filepath = 3

    statement += 'INSERT INTO batch_rows (local_pid, node_id, camera_id, geolocation, filepath) ' \
                 'VALUES (%s, %s, %s, %s, %s) ' % \
                 (local_pid, node_id, camera_id, geolocation, filepath)

statement += 'APPLY BATCH'
statement = SimpleStatement(statement, consistency_level=ConsistencyLevel.QUORUM)

try:
    self.session.execute(statement, timeout=30.0)
except OperationTimedOut:
    ex_type, ex, tb = sys.exc_info()
    print("Batch wide row insertion timed out, this may require additional investigation")
    del tb

And finally it gave me this error:

[Syntax error in CQL query] message="line 1:103 no viable alternative at input ',' (... filepath) VALUES (0, [P2],...)">

So yeah, I'm kind of stuck at the moment; any help would be appreciated!

frankenstein
  • 500 is way too many statements to be batching together. Especially with all of them going to different partitions. – Aaron Mar 28 '23 at 09:36
  • @Aaron that was just a number to test; for the actual production stage it would be around 50 rows per batch. But if I change the data type of node_id to int instead of text, the code runs normally. Do you know why? – frankenstein Mar 28 '23 at 09:39

1 Answer


Your code is invalid and has multiple CQL syntax errors, which is why it is not working.

The BEGIN BATCH ... APPLY BATCH; commands are only valid when executing CQL queries in cqlsh -- they are not designed to be used with the native CQL drivers.

Here is example code showing how to execute a batch operation with the Python driver API, using a SimpleStatement within a BatchStatement:

from cassandra.query import BatchStatement, SimpleStatement

batch = BatchStatement()
# Bind values are passed to add() alongside the statement.
batch.add(SimpleStatement("INSERT INTO users (username, email) VALUES (%s, %s)"), (username, email))
batch.add(SimpleStatement("INSERT INTO users_by_email (email, username) VALUES (%s, %s)"), (email, username))
session.execute(batch)

For details and more examples, see the Python driver API documentation on the BatchStatement class.
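
If the same INSERT is executed many times, the statement can also be prepared once and added to the batch repeatedly with different bind values. Here is a minimal sketch under assumed connection details (the contact point and keyspace name are placeholders):

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

cluster = Cluster(['127.0.0.1'])           # assumed contact point
session = cluster.connect('my_keyspace')   # hypothetical keyspace

# Prepare once; prepared statements use ? placeholders and are
# parsed server-side a single time.
insert_user = session.prepare(
    "INSERT INTO users (username, email) VALUES (?, ?)")

batch = BatchStatement()
batch.add(insert_user, ("alice", "alice@example.com"))
batch.add(insert_user, ("bob", "bob@example.com"))
session.execute(batch)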

Warning

However, it is important to note that what you are trying to implement is a bad idea and is NOT recommended.

Batches are not an optimisation in Cassandra in the way they are in relational databases. Only use batches when necessary, or you risk overloading the coordinator nodes, which can affect the overall performance of your cluster.

When bulk-loading data, the recommended approach is to issue multiple asynchronous inserts to maximise the throughput of the cluster since the requests will be distributed to all nodes in the cluster instead of just one coordinator node.
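
As an illustration of that approach, here is a minimal sketch using execute_async() with a prepared statement against the batch_rows table from the question (the contact point, keyspace, and row values are placeholder assumptions):

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])           # assumed contact point
session = cluster.connect('my_keyspace')   # hypothetical keyspace

insert = session.prepare(
    "INSERT INTO batch_rows (local_pid, node_id, camera_id, geolocation, filepath) "
    "VALUES (?, ?, ?, ?, ?)")

# Issue the inserts without blocking; each call returns a ResponseFuture.
futures = [session.execute_async(insert, (i, 'P2', 1, 4, 3))
           for i in range(500)]

# Block until all requests complete, surfacing any errors.
for future in futures:
    future.result()

In practice you would also cap the number of in-flight requests (for example, with a semaphore) rather than queuing all 500 futures at once.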

I strongly recommend having a look at Batching inserts, updates and deletes in Cassandra, paying particular attention to the sections on when it is good to batch operations and when it is not.

Additionally, I'll reiterate what I said in your other question -- avoid making schema changes programmatically so you don't end up with schema disagreements. Cheers!

Erick Ramirez
  • Hi, thanks again for your feedback; yeah, I will avoid making schema changes. You said batch insertion is not recommended, so if I have a batch of data, how can I insert it into the db? Of course I can't just loop over the items and insert them individually; that would take a lot of time. What is the optimal way to do a batch insert in this case? – frankenstein Mar 28 '23 at 08:45
  • I've updated my answer with the recommended approach for bulk-loading data. Cheers! – Erick Ramirez Mar 28 '23 at 09:34
  • Hi, thanks for your suggestion. One more question: if I change the data type of node_id to int instead of text, the code runs normally. Do you know why? – frankenstein Mar 28 '23 at 09:40