
I'm trying to create a view for Spark SQL, but I'm having trouble creating the underlying DataFrame from a local Python list.

So I decided to follow the pyspark.sql documentation verbatim, and it still doesn't work:

testd = [{'name': 'Alice', 'age': 1}]
spark.createDataFrame(testd).collect()

Error trace:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-55-d4321f74b607> in <module>()
      1 testd = [{'name': 'Alice', 'age': 1}]
      2 
----> 3 spark.createDataFrame(testd).collect()

/opt/app/anaconda2/python27/lib/python2.7/site-packages/pyspark/sql/dataframe.pyc in collect(self)
    389         """
    390         with SCCallSiteSync(self._sc) as css:
--> 391             port = self._jdf.collectToPython()
    392         return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
    393 

/opt/app/anaconda2/python27/lib/python2.7/site-packages/py4j/java_gateway.pyc in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/opt/app/anaconda2/python27/lib/python2.7/site-packages/pyspark/sql/utils.pyc in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/opt/app/anaconda2/python27/lib/python2.7/site-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling o896.collectToPython.
....
TypeError: range() integer end argument expected, got list.

Meanwhile, this example from the tutorial:

l = [('Alice', 1)]
spark.createDataFrame(l, ['name', 'age']).collect()

It fails with essentially the same trace, ending in `TypeError: range() integer end argument expected, got list`.

What is going on here?

Here's how I initiate my spark instance:

import os
import sys
from pyspark.sql import SparkSession

os.environ['SPARK_HOME']='/path/to/spark2-client'
os.environ['PY4JPATH']='/path/to/spark2-client/python/lib/py4j-0.10.4-src.zip'
sys.path.insert(0, os.path.join(os.environ['SPARK_HOME'],'python'))
sys.path.insert(1, os.path.join(os.environ['SPARK_HOME'],'python/lib'))
os.environ['HADOOP_CONF_DIR']='/etc/hadoop/conf'
os.environ['MASTER']="yarn"
os.environ['SPARK_MAJOR_VERSION']="2"
spark = (SparkSession
            .builder
            .appName('APPNAME')
            .config("spark.executor.instances","8")
            .config("spark.executor.memory","32g")
            .config("spark.driver.memory","64g")
            .config("spark.driver.maxResultSize","32g")
            .enableHiveSupport()
            .getOrCreate())

All other Spark functions work fine, including Hive queries, DataFrame joins, etc. It's only when I try to create a DataFrame from local memory that it fails.
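
To make the pattern concrete (the Hive table name below is just a placeholder):

spark.sql("SELECT 1 AS x").collect()                               # works: rows originate in the JVM
spark.table("some_db.some_table").limit(5).collect()               # works: placeholder Hive table name
spark.createDataFrame([('Alice', 1)], ['name', 'age']).collect()   # fails with the range() TypeError above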

Thanks for any insights.

Rocky Li
  • That looks like a version mismatch - probably, but not necessarily, related to all the funky path manipulation. I'd start by confirming that you actually use the versions you think you do, both locally (driver) and remotely (executors). For the former you can use the technique I described [here](https://stackoverflow.com/a/53457308/10465355). – 10465355 Mar 15 '19 at 14:37
  • @user10465355 Perhaps, but something must be right for *every other function* to work just fine while this one, `createDataFrame`, fails. – Rocky Li Mar 15 '19 at 14:38
  • No version mismatch fully breaks compatibility; the `DataFrame` API, with its minimal dependency on Python code, has a negligible failure surface. That's at least my best guess, since the error is not reproducible on proper deployments. – 10465355 Mar 15 '19 at 14:41
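
A minimal sketch of the check suggested in the comments, assuming the `spark` session defined in the question is live:

import sys
import pyspark

# Driver side: which pyspark build was actually imported (relevant given
# the sys.path manipulation above) and which interpreter is running it.
print(pyspark.__version__, pyspark.__file__)
print(sys.version)

# Executor side: run a trivial Python task on the workers and collect
# their interpreter versions. This exercises the same Python-on-worker
# path that createDataFrame-from-local-data depends on, so if it fails
# the same way, the mismatch is on the workers.
print(spark.sparkContext.range(1).map(lambda _: sys.version).collect())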

1 Answer


spark.createDataFrame(['Alice',1], ['name', 'age']).collect()

According to the documentation: https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html and https://spark.apache.org/docs/2.3.1/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.unionByName

  • This does not work -- my point is that it throws that error trace even when I try examples from *the documentation*. – Rocky Li Mar 15 '19 at 14:33