You'll want to create a Spark DataFrame (formerly SchemaRDD) using a SQLContext instead of creating a "raw" RDD with the SparkContext. RDDs don't have named columns, which the by operation needs in order to succeed. That is why the InteractiveSymbol did not have a car attribute: it was stripped away in the process of creating the RDD. Executing this in a Jupyter code cell:
import pyspark
from pyspark.sql import SQLContext
sc = pyspark.SparkContext()              # a pyspark shell creates these two for you
sqlContext = SQLContext(sc)
from odo import odo
simple = odo('simple.csv', sqlContext)   # load the CSV into a Spark DataFrame
simple.count()
would produce a pyspark.sql.dataframe.DataFrame object, and the count() call would trigger a Spark job to count the rows:
>>> 5
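If odo gives you trouble, you can build the same DataFrame with plain PySpark instead. This is a minimal sketch, assuming simple.csv is comma-separated with a header row whose first two columns are id and car:
from pyspark.sql import Row
lines = sc.textFile('simple.csv')
header = lines.first()
rows = (lines.filter(lambda line: line != header)      # drop the header row
             .map(lambda line: line.split(','))
             .map(lambda f: Row(id=f[0], car=f[1])))   # attach column names
simple = sqlContext.createDataFrame(rows)
simple.count()
Either way, you end up with a DataFrame whose columns Blaze can address by name.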
At this point, you should be able to compute the group-by as you were trying to before:
import blaze as bz
bz.by(simple.car, count=simple.id.count())
BUT. There is a problem with Blaze, at least for me, as of today, running Blaze 0.9.0 with both Spark 1.6 and Spark 1.4.1. Likely this is not the same problem you had in the first place, but it is preventing me from reaching a working solution. I tried dropping Jupyter and running in a pyspark session directly. To do so yourself, you can omit a few of the lines above, since pyspark automatically creates sc and sqlContext:
$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.4.1
      /_/

Using Python version 2.7.11 (default, Dec 6 2015 18:57:58)
SparkContext available as sc, HiveContext available as sqlContext.
from odo import odo
simple = odo('simple.csv', sqlContext)
import blaze as bz
bz.by(simple.car, count=simple.id.count())
This produces an error. Even just trying to get an interactive view of simple like this also produces an error:
simple
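Until that lands, one possible workaround is to skip Blaze for the aggregation and express the same group-by with the Spark DataFrame API directly. A sketch against the simple DataFrame from above:
from pyspark.sql import functions as F
# group by car and count the id values in each group,
# mirroring bz.by(simple.car, count=simple.id.count())
simple.groupBy('car').agg(F.count('id').alias('count')).show()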
Anyway, there seems to be some activity in the Blaze project on GitHub related to upgrading support for Spark 1.6, so hopefully they'll get this stuff fixed at that point.