You'll want to create a Spark DataFrame (formerly SchemaRDD) using a SQLContext instead of creating a "raw" RDD with the SparkContext. RDDs don't have named columns, which the by operation needs in order to succeed. That is why the InteractiveSymbol did not have a car attribute: it was stripped away in the process of creating the RDD. Executing this in a Jupyter code cell:
import pyspark
from pyspark.sql import SQLContext
sc = pyspark.SparkContext()              # a pyspark shell creates these two for you
sqlContext = SQLContext(sc)
from odo import odo
simple = odo('simple.csv', sqlContext)   # load the CSV into a Spark DataFrame
simple.count()
would produce a pyspark.sql.dataframe.DataFrame object, and the count() call would trigger a Spark job to count the rows:
>>> 5
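If odo gives you trouble, you can build the same DataFrame with plain PySpark instead. This is a minimal sketch, assuming simple.csv is comma-separated with a header row whose first two columns are id and car:
from pyspark.sql import Row
lines = sc.textFile('simple.csv')
header = lines.first()
rows = (lines.filter(lambda line: line != header)      # drop the header row
             .map(lambda line: line.split(','))
             .map(lambda f: Row(id=f[0], car=f[1])))   # attach column names
simple = sqlContext.createDataFrame(rows)
simple.count()
Either way, you end up with a DataFrame whose columns Blaze can address by name.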
At this point, you should be able to compute the group-by as you were trying to before:
import blaze as bz
bz.by(simple.car, count=simple.id.count())
BUT. There is a problem with Blaze, at least for me, as of today, running Blaze 0.9.0 with both Spark 1.6 and Spark 1.4.1. Likely this is not the same problem you had in the first place, but it is preventing me from reaching a working solution. I tried dropping Jupyter and running in a pyspark session directly. To do so yourself, you can omit a few of the lines above, since pyspark automatically creates sc and sqlContext:
$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.4.1
      /_/

Using Python version 2.7.11 (default, Dec 6 2015 18:57:58)
SparkContext available as sc, HiveContext available as sqlContext.
from odo import odo
simple = odo('simple.csv', sqlContext)
import blaze as bz
bz.by(simple.car, count=simple.id.count())
This produces an error. Even just trying to get an interactive view of simple like this also produces an error:
simple
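Until that lands, one possible workaround is to skip Blaze for the aggregation and express the same group-by with the Spark DataFrame API directly. A sketch against the simple DataFrame from above:
from pyspark.sql import functions as F
# group by car and count the id values in each group,
# mirroring bz.by(simple.car, count=simple.id.count())
simple.groupBy('car').agg(F.count('id').alias('count')).show()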
Anyway, there seems to be some activity in the Blaze project on GitHub related to upgrading support for Spark 1.6, so hopefully they'll get this stuff fixed at that point.