Transforming PipelinedRDD to spark dataframe

Question

I am trying to convert a PipelinedRDD into a Spark dataframe, but I am getting the following error:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Creating Spark session and taking a subset of the data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
df = spark.read \
    .options(header=True, delimiter = ";") \
    .csv("decode_this_2.csv")

rdd = df.head(50000) # taking a sample of the data for testing purposes

What the rdd/df looks like:

ID	Metadata
123	jf9834fi2f8924f2
345	oi2ehfd2hf4g3fg2

Using a lambda function to decode the metadata:

import base64

import pandas as pd

rdd2 = spark.sparkContext.parallelize(rdd) # parallelizing data

# create message decoder
message = protobuf_decoder.Metadata()

# decode message
def decode(x):
    try:
        a = x['metadata']
        temp_list = {} # storing data in a dictionary

        # decoding the metadata
        data = base64.b64decode(a)
        message.ParseFromString(data)

        temp_list['id'] = message.id # saving to dictionary

        # Wrap the dictionary in a one-row pandas DF to return into the rdd
        df_new_row = pd.DataFrame([temp_list])
        return df_new_row

    except Exception as e:
        # On failure the function falls through and returns None for this row
        print("except")
        print("e: ", e)

# call spark map to apply the decode function to each element of the rdd
rdd3 = rdd2.map(lambda x: decode(x))

Output of rdd3.collect():

[                                document_id                                    
0  RSS-63c1e560-2c42-4ae4-9864-ee6965424944,                                 document_id
0  RSS-98a57ad3-ab20-4811-a9f6-c05fe455d1d5,                                  document_id
0  ASR-BARC-a431734c436079c1bfebbb5078ddc217,                                  document_id
0  ASR-BARC-a431734c436079c1bfebbb5078ddc217,                                  document_id
0  ASR-BARC-a431734c436079c1bfebbb5078ddc217,                                  document_id
0  ASR-BARC-a431734c436079c1bfebbb5078ddc217,                                  document_id
0  ASR-BARC-a431734c436079c1bfebbb5078ddc217,                                     document_id
0  MTN-a39f74a188b967c939fb8c351311603defbba7b6,                                     document_id
0  MTN-a39f74a188b967c939fb8c351311603defbba7b6,                                     document_id
0  MTN-a39f74a188b967c939fb8c351311603defbba7b6]

^^ This output seems a little odd to me: it looks like every row gets a new index (index 0). I was hoping someone could explain this output to me.
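
One possible reading of this output, sketched under the assumption that decode returns a one-row pandas DataFrame for each record (as in the code above): the collected result is a Python list of separate DataFrames, and each of them numbers its own single row from 0. The IDs below are made up for illustration.

```python
import pandas as pd

# Made-up IDs standing in for the decoded values.
sample_ids = ["RSS-1", "ASR-2", "MTN-3"]

# Mimic what decode() returns per record: a separate one-row DataFrame.
frames = [pd.DataFrame([{"document_id": value}]) for value in sample_ids]

# Every frame carries its own index, and each index starts at 0.
per_frame_index = [list(frame.index) for frame in frames]

# Concatenating with ignore_index=True renumbers the rows in a single frame.
combined = pd.concat(frames, ignore_index=True)
combined_index = list(combined.index)
```

Printing the list `frames` shows the column header and a row labelled 0 once per element, which matches the repeated `document_id` / `0` pattern in the output.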

Now I want to upload the data to Google Cloud Storage as a CSV file, but a PipelinedRDD cannot be exported directly as a CSV file. I want a CSV file in Google Cloud Storage because it will then be easy to load the file into BigQuery.

I have tried the following functions:

rdd3.toDF()
rdd4 = spark.createDataFrame(rdd3, schema = "id")

Unfortunately, I am getting the following error:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
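
A likely cause, offered as an assumption rather than a confirmed diagnosis: each element of rdd3 is a pandas DataFrame, and when Spark inspects the first element to work out a schema it ends up evaluating the truth value of that DataFrame, which pandas refuses to do. A minimal sketch of one way around this returns a plain tuple per record instead. `parse_id` is a hypothetical stand-in for the protobuf step (`message.ParseFromString(data)` followed by `message.id`), `decode_row` and `to_spark_df` are illustrative names, and the column name `Metadata` is taken from the sample table.

```python
import base64


def parse_id(data: bytes) -> str:
    """Hypothetical stand-in for message.ParseFromString(data); message.id."""
    return data.decode("utf-8")


def decode_row(row):
    """Return a one-field tuple for a record, or None if decoding fails."""
    try:
        data = base64.b64decode(row["Metadata"])
        return (parse_id(data),)
    except Exception:
        return None


def to_spark_df(spark, rows_rdd):
    """Build a Spark DataFrame from an RDD of records (needs a live SparkSession)."""
    decoded = rows_rdd.map(decode_row).filter(lambda r: r is not None)
    # Explicit DDL schema, so Spark does not have to infer the column type.
    return spark.createDataFrame(decoded, schema="id string")
```

With a Spark DataFrame in hand, `df.write.csv(path, header=True)` writes the result as CSV part files from the executors, without collecting the rows on the driver.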

I can use the following function, but it does not scale to about 100k records. I think it takes a long time to collect the data (also, holding that much data in memory probably isn't the best idea, especially when there could be more than 100k records):

def rdd_to_df(rdd3):
    # convert rdd to df
    df = rdd3.collect()
    df_df = pd.DataFrame()
    for i in df:
        # print(i)
        df_df = pd.concat([df_df, i])
    return df_df

You need to transform each element of the RDD to a string (comma separated) and then use saveAsTextFile method to save as CSV — Ronak Jain, Jan 09 '23 at 11:12
@RonakJain wouldn't this require me to do a `rdd3.collect()`, which would lead to the scaling problem I am having with my rdd_to_df function? — beeeZeee, Jan 11 '23 at 01:33
I'll update with an answer with both approaches (i.e. saveAsTextFile and toDF) once I can — Ronak Jain, Jan 11 '23 at 02:36
@AlexOtt In order to apply a map function, you need to use the RDD api. — beeeZeee, Feb 22 '23 at 02:32
It should be possible to do that with just DataFrame API, in the worst case, with Pandas UDFs that are much more efficient — Alex Ott, Feb 22 '23 at 08:26
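
The approach suggested in the first comment can be sketched as follows. `saveAsTextFile` is an RDD action that writes each partition from the executors, so it should not need `rdd3.collect()`. The sketch assumes each element of the RDD is a tuple or list of values rather than a pandas DataFrame; `to_csv_line` and `save_as_csv` are hypothetical helper names.

```python
import csv
import io


def to_csv_line(values):
    """Format one record as a single CSV line, quoting fields where needed."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(values)
    # The writer appends a line terminator; saveAsTextFile adds its own newline.
    return buffer.getvalue().rstrip("\r\n")


def save_as_csv(records_rdd, output_path):
    """Write an RDD of tuples as CSV text, one part file per partition."""
    records_rdd.map(to_csv_line).saveAsTextFile(output_path)
```

The output path becomes a directory of part files rather than a single CSV file, and no header line is written.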
