How to use pandas on spark notebook (data on dashDB) in python

Question

Hello, I'm using IBM Bluemix with an Apache Spark notebook and loading data from dashDB. I'm trying to build a visualization, but only the columns are displayed, not the rows.

def get_file_content(credentials):

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)

    props = {}
    props['user'] = credentials['username']
    props['password'] = credentials['password']

    # fill in table name
    table = credentials['username'] + "." + "BATTLES"

    data_df = sqlContext.read.jdbc(credentials['jdbcurl'], table, properties=props)
    data_df.printSchema()

    return StringIO.StringIO(data_df)

When I use this command:

data_df.take(5)

I get the information of the first 5 rows of data with both columns and rows. But when I do this:

content_string = get_file_content(credentials)
BATTLES_df = pd.read_table(content_string)

I get this error:

ValueError: No columns to parse from file

And then when I try to call .head() or .tail(), only the column names are displayed.

Does anyone see the possible problem here? I have very poor knowledge of Python. Please and thank you.

score 1 · Answer 1 · answered Jun 08 '16 at 00:20

This is the solution that worked for me. I replaced

BATTLES_df = pd.read_table(content_string)

with

BATTLES_df = data_df.toPandas()
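For context, pd.read_table expects delimited text, while data_df is a Spark DataFrame, so wrapping it in StringIO does not give pandas any rows to parse. A minimal sketch of the whole fix follows, assuming the same credentials dictionary as in the question. The helper name load_table_as_pandas and its table_name parameter are made up for illustration and are not part of any library.

```python
def load_table_as_pandas(sql_context, credentials, table_name):
    """Read one dashDB table over JDBC and return it as a pandas DataFrame.

    Hypothetical helper: sql_context is the notebook's SQLContext and
    credentials is the dict with 'username', 'password' and 'jdbcurl'
    used in the question.
    """
    props = {
        "user": credentials["username"],
        "password": credentials["password"],
    }
    # As in the question, the table is qualified with the user's schema name.
    table = credentials["username"] + "." + table_name

    # This returns a Spark DataFrame, not text, so it cannot be fed to pd.read_table.
    spark_df = sql_context.read.jdbc(credentials["jdbcurl"], table, properties=props)

    # toPandas() collects all rows to the driver, so it only suits tables
    # that fit in the notebook's memory.
    return spark_df.toPandas()
```

In the notebook this would be called as BATTLES_df = load_table_as_pandas(sqlContext, credentials, "BATTLES"), after which BATTLES_df.head() should show rows as well as column names.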

Thank you

Saraida

score 0 · Answer 2 · answered Jun 07 '16 at 22:24

export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

Then go to your Spark directory:

cd ~/spark-1.6.1-bin-hadoop2.6/

./bin/pyspark --packages com.datastax.spark:spark-cassandra-connector_scalaversion:spark_version-M1

Then you can write the following code:

import pandas as pd
