
I am reading JSON with 500 million records from an API and writing it to blob storage in Azure. I have tried many approaches but keep getting the error below. I am using a PySpark notebook in Azure Synapse.

ValueError                                Traceback (most recent call last)
Cell In [17], line 45
 41             total_results.append((parsed_data))
 42             ##print(total_results)
 43 
 44 ##RDD Spark creation 
 ---> 45 rdd = spark.sparkContext.parallelize(total_results)
 46 df = spark.read.option('multiLine','true').json(rdd)
 48 #Create temporary view on dataframe                                                                                    

  File /opt/spark/python/lib/pyspark.zip/pyspark/context.py:686, in 
SparkContext.parallelize(self, c, numSlices)
683     assert self._jvm is not None
684     return self._jvm.PythonParallelizeServer(self._jsc.sc(), numSlices)
--> 686 jrdd = self._serialize_to_jvm(c, serializer, reader_func, createRDDServer)
687 return RDD(jrdd, self, serializer)

File /opt/spark/python/lib/pyspark.zip/pyspark/context.py:729, in 
SparkContext._serialize_to_jvm(self, data, serializer, reader_func, createRDDServer)
727 try:
728     try:
 --> 729         serializer.dump_stream(data, tempFile)
730     finally:
731         tempFile.close()

File /opt/spark/python/lib/pyspark.zip/pyspark/serializers.py:224, in 
BatchedSerializer.dump_stream(self, iterator, stream)
223 def dump_stream(self, iterator, stream):
--> 224     self.serializer.dump_stream(self._batched(iterator), stream)

File /opt/spark/python/lib/pyspark.zip/pyspark/serializers.py:146, in 
FramedSerializer.dump_stream(self, iterator, stream)
144 def dump_stream(self, iterator, stream):
145     for obj in iterator:
--> 146         self._write_with_length(obj, stream)

File /opt/spark/python/lib/pyspark.zip/pyspark/serializers.py:160, in 
FramedSerializer._write_with_length(self, obj, stream)
158     raise ValueError("serialized value should not be None")
159 if len(serialized) > (1 << 31):
 --> 160     raise ValueError("can not serialize object larger than 2G")
161 write_int(len(serialized), stream)
162 stream.write(serialized)

 ValueError: can not serialize object larger than 2G

My code collects the JSON into a list, creates an RDD from it, and writes the result to disk:

rdd = spark.sparkContext.parallelize(total_results)
df = spark.read.option('multiLine', 'true').json(rdd)

#Create temporary view on dataframe
df.createOrReplaceTempView('filter_view')

#SQL query to filter on DeletedDate value
df_filter = spark.sql("""select * from filter_view where DeletedDate is null""")

df_filter.coalesce(800).write.format("parquet").save(stagingpath, mode="overwrite")
Arun.K
  • What is the schema of your JSON? If you don't specify a schema, reading becomes very non-performant, since Spark has to scan the whole input first to infer one. Maybe it's going wrong there. Try specifying a schema when you're doing `spark.read` (see the sketch after these comments). – Koedlt Feb 14 '23 at 19:23
  • @Koedlt I added a schema but am getting the same error. The notebook runs for around 3 hours and then throws the error. Any idea on how to commit to disk at regular intervals? – Arun.K Feb 15 '23 at 02:58
  • Try increasing the amount of memory available to PySpark by raising the `spark.driver.memory` configuration property. For example, you can set it to a higher value like `8g` or `16g`. – Pratik Lad Feb 16 '23 at 10:26
  • @PratikLad - It's running at 16g. – Arun.K Feb 16 '23 at 17:46
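
A minimal sketch of the schema suggestion from the comments, assuming the records carry a DeletedDate field as the question's filter implies; the other field names are placeholders for whatever the API actually returns:

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Hypothetical schema - replace the fields with the ones the API actually returns
schema = StructType([
    StructField("Id", StringType(), True),
    StructField("Name", StringType(), True),
    StructField("DeletedDate", TimestampType(), True),
])

# Passing an explicit schema avoids a full scan of the data for schema inference
df = spark.read.schema(schema).option("multiLine", "true").json(rdd)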

1 Answer


The ValueError: can not serialize object larger than 2G error comes from PySpark's serializer and occurs when a single object being serialized exceeds the 2 GB size limit.

  • You can compress your data before serializing it to reduce its size. PySpark supports several compression formats, such as gzip and snappy.
  • Try increasing the amount of memory available to PySpark by raising the spark.driver.memory configuration property, for example to a higher value like 8g or 16g.
  • Partition your data into smaller parts first, then serialize and read each part separately (see the sketch below).
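
A minimal sketch of the batching idea from the last bullet, assuming the API is consumed page by page into a list of JSON strings as in the question; fetch_page, total_pages, and batch_size are placeholders, not part of the original code:

batch = []
batch_size = 100_000  # tune so each serialized batch stays well below the 2 GB limit

def flush(records, first_batch):
    # Each call parallelizes only one batch, so no single serialized object hits 2 GB
    rdd = spark.sparkContext.parallelize(records)
    df = spark.read.option("multiLine", "true").json(rdd)
    df_filter = df.filter("DeletedDate is null")
    mode = "overwrite" if first_batch else "append"
    df_filter.write.format("parquet").save(stagingpath, mode=mode)

first = True
for page in range(total_pages):      # placeholder paging loop
    batch.extend(fetch_page(page))   # fetch_page stands in for the actual API call
    if len(batch) >= batch_size:
        flush(batch, first)
        first = False
        batch = []

if batch:                            # write any remaining records
    flush(batch, first)

Writing batch by batch also commits to blob storage at regular intervals, which is what was asked for in the comments above.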

Reference - https://www.mail-archive.com/user@spark.apache.org/msg38489.html

Pratik Lad
  • Answering your comments: 1. Yes, I am compressing to snappy. 2. Increased memory to 16g. 3. I am looping through the API pages and loading the data into a list, then trying to serialize it and write to blob with repartition(200). So the data is already partitioned and serialized, but I still get the same error. – Arun.K Feb 17 '23 at 06:05