
I am querying a CosmosDB collection and am able to print the results. When I try to store the results in a Spark DataFrame, however, it fails.

I referred to this post as an example:

How to read data from Azure's CosmosDB in python

I followed the exact steps from the above link. Additionally, I tried the following:

 df = spark.createDataFrame(dataset)

This throws the following error:

ValueError: Some of types cannot be determined after inferring

ValueError Traceback (most recent call last)
in <module>()
25 print (dataset)
26
---> 27 df = spark.createDataFrame(dataset)
28 df.show()
29

/databricks/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
808 rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
809 else:
--> 810 rdd, schema = self._createFromLocal(map(prepare, data), schema)
811 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
812 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/databricks/spark/python/pyspark/sql/session.py in _createFromLocal(self, data, schema)
440 write temp files.
441 """
--> 442 data, schema = self._wrap_data_schema(data, schema)
443 return self._sc.parallelize(data), schema

But I want to save this as a Spark DataFrame.

Any help would be much appreciated. Thanks!

Did you try to follow the official example? https://docs.databricks.com/spark/latest/data-sources/azure/cosmosdb-connector.html – silent May 01 '19 at 15:01

2 Answers


In order to infer a field's type, PySpark looks at the non-None records in that field. If a field contains only None records, PySpark cannot infer the type and raises that error.
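
For example, this minimal case reproduces the error, since the only value in the field is None:

>>> # every value in the field is None, so there is nothing to infer from
>>> spark.createDataFrame([[None]])
ValueError: Some of types cannot be determined after inferring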

Manually defining a schema resolves the issue:

>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField("foo", StringType(), True)])
>>> df = spark.createDataFrame([[None]], schema=schema)
>>> df.show()
+----+
|foo |
+----+
|null|
+----+

Hope it helps.


I see you were following my previous answer, which uses the old Python SDK for DocumentDB to query Cosmos DB documents and build a PySpark DataFrame. However, you cannot pass the result docs from the client.ReadDocuments method directly as the data parameter to the function SparkSession.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True), because the data types differ, as explained below.

The function createDataFrame requires the data parameter to be an RDD, a list, or a pandas.DataFrame.

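A quick check illustrates the mismatch (a sketch; client and collection_link are placeholder names for the objects set up in the linked answer):

# Inspect what ReadDocuments actually returns.
result = client.ReadDocuments(collection_link)
print(type(result))
# <class 'pydocumentdb.query_iterable.QueryIterable'>
# -- not an RDD, a list, or a pandas.DataFrame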

I then downloaded the source of pydocumentdb-2.3.3.tar.gz from https://pypi.org/project/pydocumentdb/#files and reviewed the files document_client.py and query_iterable.py:

# from document_client.py
def ReadDocuments(self, collection_link, feed_options=None):
    """Reads all documents in a collection.

    :param str collection_link:
        The link to the document collection.
    :param dict feed_options:

    :return:
        Query Iterable of Documents.
    :rtype:
        query_iterable.QueryIterable

    """
    if feed_options is None:
        feed_options = {}

    return self.QueryDocuments(collection_link, None, feed_options)

# from query_iterable.py
class QueryIterable(object):
    """Represents an iterable object of the query results.
    QueryIterable is a wrapper for query execution context.
    """

So to fix your issue, build a pandas.DataFrame first by iterating over the QueryIterable of documents returned by the ReadDocuments method, then create a PySpark DataFrame via spark.createDataFrame(pandas_df), as sketched below.
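
A minimal sketch of that flow (assuming client is an authenticated pydocumentdb DocumentClient and collection_link points at your collection, as in the linked answer; both names are placeholders):

import pandas as pd

# QueryIterable -> list of document dicts
docs = list(client.ReadDocuments(collection_link))

# list of dicts -> pandas.DataFrame (columns come from the document keys)
pandas_df = pd.DataFrame(docs)

# pandas.DataFrame -> Spark DataFrame
df = spark.createDataFrame(pandas_df)
df.show()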
