
I am trying to convert a PipelinedRDD into a Spark dataframe, but I am getting the following error:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Creating Spark session and taking a subset of the data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
df = spark.read \
    .options(header=True, delimiter = ";") \
    .csv("decode_this_2.csv")

rdd = df.head(50000) # first 50,000 rows as a list of Row objects (not an RDD), a sample for testing purposes

What the rdd/df looks like:

ID Metadata
123 jf9834fi2f8924f2
345 oi2ehfd2hf4g3fg2

Using a lambda function to decode the metadata:

import base64

import pandas as pd
import protobuf_decoder # my own generated protobuf module

rdd2 = spark.sparkContext.parallelize(rdd) # parallelizing data

# create message decoder
message = protobuf_decoder.Metadata()

# decode message
def decode(x):
    try:
        a = x['metadata']
        temp_list = {} # storing data in a dictionary

        # decoding the metadata
        data = base64.b64decode(a)
        message.ParseFromString(data)
    
        temp_list['id'] = message.id # saving to dictionary
        
        # Wrap the dictionary in a one-row pandas DF to return into the rdd
        df_new_row = pd.DataFrame([temp_list])
        return df_new_row


    except Exception as e:
        print("except")
        print("e: ", e)

# call spark map to apply lambda decode function to the rdd
rdd3 = rdd2.map(decode)

Output of rdd3.collect():

[                                document_id                                    
0  RSS-63c1e560-2c42-4ae4-9864-ee6965424944,                                 document_id
0  RSS-98a57ad3-ab20-4811-a9f6-c05fe455d1d5,                                  document_id
0  ASR-BARC-a431734c436079c1bfebbb5078ddc217,                                  document_id
0  ASR-BARC-a431734c436079c1bfebbb5078ddc217,                                  document_id
0  ASR-BARC-a431734c436079c1bfebbb5078ddc217,                                  document_id
0  ASR-BARC-a431734c436079c1bfebbb5078ddc217,                                  document_id
0  ASR-BARC-a431734c436079c1bfebbb5078ddc217,                                     document_id
0  MTN-a39f74a188b967c939fb8c351311603defbba7b6,                                     document_id
0  MTN-a39f74a188b967c939fb8c351311603defbba7b6,                                     document_id
0  MTN-a39f74a188b967c939fb8c351311603defbba7b6]

^^ This output seems a little odd to me: it looks like every row gets its own new index (index 0). I was hoping someone could explain this output to me.
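A small sketch of what that output appears to be, assuming pandas is installed: each call to decode() returns its own one-row pandas DataFrame, so the collected result is a list of separate frames that each start at index 0. The document ids below are made up for illustration. The same sketch also shows that asking for the truth value of a DataFrame raises the ValueError quoted at the top, which is presumably what happens somewhere inside Spark when it inspects the first element.

```python
import pandas as pd

# Illustrative values; each frame mimics what decode() returns for one record.
frames = [pd.DataFrame([{"document_id": doc}]) for doc in ("doc-a", "doc-b")]

# Every frame is independent and has exactly one row, labelled 0.
indexes = [list(frame.index) for frame in frames]
print(indexes)

# Asking for the truth value of a DataFrame raises a ValueError.
try:
    bool(frames[0])
    message = ""
except ValueError as exc:
    message = str(exc)
print(message)
```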

Now I want to upload the data to Google Cloud Storage as a CSV file, but a PipelinedRDD cannot be exported directly as a CSV file. I want the CSV file in Google Cloud Storage because it is then easy to load it into BigQuery.

I have tried the following functions:

rdd3.toDF()
rdd4 = spark.createDataFrame(rdd3, schema = "id")

Unfortunately, I am getting the following error:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
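One possible way around this, sketched under assumptions rather than tested against the real data: have the decode function return a plain tuple (or dict) per record instead of a pandas DataFrame, since Spark builds DataFrames from plain rows. In the sketch below, FakeMetadata is a hypothetical stand-in for protobuf_decoder.Metadata (it just treats the decoded bytes as the id), decode_row is an illustrative name, and a plain Python list stands in for the RDD so the example runs without Spark.

```python
import base64


class FakeMetadata:
    """Hypothetical stand-in for protobuf_decoder.Metadata (an assumption)."""

    def __init__(self):
        self.id = ""

    def ParseFromString(self, data):
        # The real class parses protobuf bytes; here the payload is just the id.
        self.id = data.decode("utf-8")


def decode_row(row):
    """Return a plain (id,) tuple, or None if the payload cannot be decoded."""
    try:
        message = FakeMetadata()  # one parser per call, so no state is shared
        message.ParseFromString(base64.b64decode(row["Metadata"]))
        return (message.id,)
    except Exception as exc:
        print("could not decode row:", exc)
        return None


# A plain list of dicts stands in for the RDD of Row objects.
rows = [
    {"ID": 123, "Metadata": base64.b64encode(b"RSS-1").decode("ascii")},
    {"ID": 345, "Metadata": None},  # undecodable on purpose
]

# Drop the failed rows so only well-formed tuples remain.
decoded = [r for r in map(decode_row, rows) if r is not None]
print(decoded)
```

With Spark, the same idea would presumably be rdd2.map(decode_row).filter(lambda r: r is not None), followed by spark.createDataFrame(..., schema="id string"). The resulting Spark DataFrame could then be written out with its write.csv method, without collecting anything to the driver.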

I can use the following function, but it does not scale once you are dealing with about 100k records. I think it takes a long time to collect the data (and trying to hold that much data in memory probably isn't the best idea, especially when there could be more than 100k records):

def rdd_to_df(rdd3):
    # convert rdd to df
    df = rdd3.collect()
    df_df = pd.DataFrame()
    for i in df:
        # print(i)
        df_df = pd.concat([df_df, i])
    return df_df 
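
Part of the slowness may come from calling pd.concat inside the loop, which copies the growing frame on every iteration. A minimal sketch of concatenating once instead, assuming pandas is installed and that the collected list holds one-row DataFrames like the ones in the output above (frames_to_df and the sample ids are illustrative names, not from my real code):

```python
import pandas as pd


def frames_to_df(frames):
    """Concatenate a list of pandas DataFrames in a single call."""
    if not frames:
        return pd.DataFrame()
    # ignore_index=True renumbers the rows 0..n-1 instead of repeating index 0.
    return pd.concat(frames, ignore_index=True)


# Illustrative stand-in for rdd3.collect(): one-row frames, each with index 0.
collected = [pd.DataFrame([{"document_id": doc}]) for doc in ("doc-a", "doc-b", "doc-c")]
result = frames_to_df(collected)
print(result)
```

This still collects everything onto the driver, so it only removes the repeated copying; it does not address the memory concern.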
beeeZeee