Pandas and Cassandra: numpy array format incompatibility

Question

I'm using the Python cassandra driver to connect and query our Cassandra cluster.

I want to manipulate my data via Pandas, there is an area in the documentation for the cassandra driver that mentions this exactly: https://datastax.github.io/python-driver/api/cassandra/protocol.html

NumpyProtocolHander: deserializes results directly into NumPy arrays. This facilitates efficient integration with analysis toolkits such as Pandas.

Following the above instructions and doing a SELECT query in Cassandra, I can see the output (via the type() function) as a:

<class 'cassandra.cluster.ResultSet'>

Iterating through the results, this what printing a row comes up as:

{u'reversals_rejected': array([0, 0]), u'revenue': array([ 0, 10]), u'reversals_revenue': array([0, 0]), u'rejected': array([3, 1]), u'impressions_positive': array([3, 3]), u'site_user_id': array([226226, 354608], dtype=int32), u'error': array([0, 0]), u'impressions_negative': array([0, 0]), u'accepted': array([0, 2])}

(I've limited the query results, I'm working with much larger amounts of data - hence wanting to use numpy and pandas).

My knowledge of Pandas is limited, I attempted to run very basic functionalities:

rslt = cassandraSession.execute("SELECT accepted FROM table")

test = rslt[["accepted"]].head(1)

This outputs the following error:

Traceback (most recent call last):
  File "/UserStats.py", line 27, in <module>
    test = rslt[["accepted"]].head(1)
  File "cassandra/cluster.py", line 3380, in cassandra.cluster.ResultSet.__getitem__ (cassandra/cluster.c:63998)
TypeError: list indices must be integers, not list

I understand the error, I just don't know how to "transition" from this supposed numpy array to being able to use Pandas.

Tickon · Accepted Answer · 2016-02-17T14:25:36.403

8

The short answer is:

df = pd.DataFrame(rslt[0])
test = df.head(1)

rslt[0] gives you your data as a Python dict, that can be easily converted to a Pandas dataframe.

For a complete solution:

import pandas as pd
from cassandra.cluster import Cluster
from cassandra.protocol import NumpyProtocolHandler
from cassandra.query import tuple_factory

cluster = Cluster(
    contact_points=['your_ip'],
    )
session = cluster.connect('your_keyspace')
session.row_factory = tuple_factory
session.client_protocol_handler = NumpyProtocolHandler

prepared_stmt = session.prepare ( "SELECT * FROM ... WHERE ...;")
bound_stmt = prepared_stmt.bind([...])
rslt = session.execute(bound_stmt)
df = pd.DataFrame(rslt[0])

Note: The above solution will only get you part of the data if the query is large. So you should do:

df = pd.DataFrame()
for r in rslt:
    df = df.append(r)

edited Feb 17 '16 at 14:25

answered Feb 05 '16 at 11:10

Tickon

1,058
1
16
25

Thanks for your help! Specifying the index for the DataFrame was what I was missing in the end! – joao Feb 05 '16 at 11:15
What goes in the `.bind([...])` brackets? – yeliabsalohcin Jan 11 '18 at 16:33
1

@yeliabsalohcin It's the variables you want to send to the query. So if you had a query like this in your prepared statement: "SELECT * FROM users WHERE user_id=? AND date=?" they would replace the "?". – Tickon Jan 12 '18 at 09:53

Pandas and Cassandra: numpy array format incompatibility

1 Answers1