2

So, I am building an application in python that takes data from twitter and then saves it to cassandra. My current problems lies in a script which reads data from kafka and tries to write it to cassandra, as follows:

import threading, logging, time
import multiprocessing
from cassandra.cluster import Cluster

from kafka import KafkaConsumer, KafkaProducer




class Consumer(multiprocessing.Process):
   def __init__(self):
        multiprocessing.Process.__init__(self)
        self.stop_event = multiprocessing.Event()

    def stop(self):
         self.stop_event.set()

    def run(self):
       consumer = KafkaConsumer(bootstrap_servers='localhost:9092',
                                 auto_offset_reset='earliest',
                                 consumer_timeout_ms=1000)
        consumer.subscribe(['twitter'])



    while not self.stop_event.is_set():
        for message in consumer:
            # session.execute(
            #     """
            #     INSERT INTO mensaje_73 (tweet)
            #     VALUES (message)
            #     """
            # )
            print(message)
            cluster = Cluster()
            session = cluster.connect('twitter')
            session.execute(
                    """
                    INSERT INTO mensaje_73 (tweet)
                    VALUES (message)
                    """
                )

            # if self.stop_event.is_set():
            #     break

    consumer.close()


   def main():

    tasks = [
        Consumer()
    ]

    for t in tasks:
        t.start()

    time.sleep(10)

    for task in tasks:
        task.stop()



if __name__ == "__main__":
     logging.basicConfig(
        format='%(asctime)s.%(msecs)s:%(name)s:%(thread)d:% 
   (levelname)s:%(process)d:%(message)s',
        level=logging.INFO
    )
    main()

I have tried to insert test messages into table twitter.mensaje_73 and it has worked perfectly, as here:

import threading, logging, time
import multiprocessing
from cassandra.cluster import Cluster

from kafka import KafkaConsumer, KafkaProducer


cluster = Cluster()
session = cluster.connect('twitter')
session.execute(
    """
    INSERT INTO mensaje_73 (tweet)
    VALUES ('helooo')
    """
)

Any help would be deeply appreciated :)

1 Answers1

2

So the problem here, is that your message variable is being treated as a literal in CQL, which won't work without single quotes. Hence, the error.

To fix this, I would go the route of using a prepared statement, and then bind message to it:

session = cluster.connect('twitter')
preparedTweetInsert = session.prepare(
        """
        INSERT INTO mensaje_73 (tweet)
        VALUES (?)
        """
    )
session.execute(preparedTweetInsert,[message])

Give that a try, and see if it helps.

Also, this seems like a simple data model. But one thing to ask yourself, is how are you going to query this data? This wouldn't work unless tweet was your only PRIMARY KEY. Which also means that the only way you can query an individual tweet, is by the exact text of the message. Something to think about, but partitioning it by day might be a better option as it will distribute well and provide a much better query model.

Aaron
  • 55,518
  • 11
  • 116
  • 132
  • 1
    Hey! Thank you so much. Your answers is very insightful, I was actually just trying to save _anything_ to cassandra, and worry later about querying it properly. As soon as I get back home, I will try your fix and let you know. Thank you really so much!! :) – JOSE MANUEL RAMIREZ LEON Oct 15 '18 at 15:27
  • So! you were absolutely right, however, could you tell me how to insert the data (which comes from Kafka) into cassandra using python? Thank you very much in advance :) – JOSE MANUEL RAMIREZ LEON Oct 15 '18 at 21:30