I am trying to convert a PipelinedRDD into a Spark dataframe, but I am getting the following error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Creating Spark session and taking a subset of the data:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
df = spark.read \
    .options(header=True, delimiter=";") \
    .csv("decode_this_2.csv")
rdd = df.head(50000)  # first 50,000 rows as a list of Row objects (not yet an RDD), used as a test sample
What the rdd/df looks like:
ID | Metadata |
---|---|
123 | jf9834fi2f8924f2 |
345 | oi2ehfd2hf4g3fg2 |
Using a lambda function to decode the metadata:
import base64
import pandas as pd
rdd2 = spark.sparkContext.parallelize(rdd)  # distribute the sampled rows as an RDD
# create message decoder
message = protobuf_decoder.Metadata()
# decode message
def decode(x):
    try:
        a = x['metadata']
        temp_list = {}  # storing data in a dictionary
        # decoding the metadata
        data = base64.b64decode(a)
        message.ParseFromString(data)
        temp_list['id'] = message.id  # saving to dictionary
        # wrap the dictionary in a one-row pandas DF to return into the rdd
        df_new_row = pd.DataFrame([temp_list])
        return df_new_row
    except Exception as e:
        # on failure only the error is printed, so the function returns None
        print("except")
        print("e: ", e)
# call spark map to apply the decode function to each element of the rdd
rdd3 = rdd2.map(lambda x: decode(x))
Output of rdd3.collect():
[ document_id
0 RSS-63c1e560-2c42-4ae4-9864-ee6965424944, document_id
0 RSS-98a57ad3-ab20-4811-a9f6-c05fe455d1d5, document_id
0 ASR-BARC-a431734c436079c1bfebbb5078ddc217, document_id
0 ASR-BARC-a431734c436079c1bfebbb5078ddc217, document_id
0 ASR-BARC-a431734c436079c1bfebbb5078ddc217, document_id
0 ASR-BARC-a431734c436079c1bfebbb5078ddc217, document_id
0 ASR-BARC-a431734c436079c1bfebbb5078ddc217, document_id
0 MTN-a39f74a188b967c939fb8c351311603defbba7b6, document_id
0 MTN-a39f74a188b967c939fb8c351311603defbba7b6, document_id
0 MTN-a39f74a188b967c939fb8c351311603defbba7b6]
This output seems a little odd to me: it looks like every row starts a new index (index 0). I was hoping someone could explain it.
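A likely explanation, as a minimal sketch: decode returns a separate one-row pandas DataFrame for each record, so the collected result is a Python list of independent DataFrames, and each of them has its own index starting at 0. The IDs below are made up for illustration.

```python
import pandas as pd

# Made-up IDs standing in for the decoded values.
doc_ids = ["RSS-1", "ASR-2", "MTN-3"]

# One independent, one-row DataFrame per record, as decode() produces.
frames = [pd.DataFrame([{"document_id": doc_id}]) for doc_id in doc_ids]

# Printing the list repeats the column header and a row labelled 0 per element.
print(frames)

# Every frame carries its own index, which is why each row shows index 0.
indexes = [list(frame.index) for frame in frames]

# Concatenating with ignore_index=True renumbers the rows into one table.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

So the list is not one table with a broken index; it is many small tables, one per input row.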
Now I want to upload the data to Google Cloud Storage as a CSV file, but a PipelinedRDD cannot be exported as CSV directly. I want the CSV in Google Cloud Storage because it is then easy to load the file into BigQuery.
I have tried the following functions:
rdd3.toDF()
rdd4 = spark.createDataFrame(rdd3, schema = "id")
Unfortunately, I am getting the following error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
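One plausible cause, stated as an assumption rather than something confirmed against the Spark source: when building a DataFrame from an RDD, Spark inspects the first element, and if that check evaluates the element in a boolean context, a pandas DataFrame raises exactly this error. The snippet below reproduces the message with pandas alone; the one-row frame is a made-up stand-in for what decode returns.

```python
import pandas as pd

# Stand-in for a single element of rdd3 (a one-row pandas DataFrame).
first = pd.DataFrame([{"id": "abc"}])

message = None
try:
    # Using a DataFrame where a plain True/False is expected is not allowed.
    if not first:
        pass
except ValueError as err:
    message = str(err)

print(message)
```

If this is what happens, the error comes from the element type (pandas DataFrame) rather than from the RDD itself.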
I can use the following function, but it does not scale when dealing with about 100k records. I think it takes a long time to collect the data, and storing that much data in memory probably isn't the best idea, especially when there could be more than 100k records:
def rdd_to_df(rdd3):
    # convert rdd to df
    df = rdd3.collect()
    df_df = pd.DataFrame()
    for i in df:
        # print(i)
        df_df = pd.concat([df_df, i])
    return df_df
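A possible alternative that avoids collect(), sketched under assumptions: have the mapped function return a plain tuple instead of a pandas DataFrame, so that Spark can build the DataFrame itself and write the CSV in parallel. Here parse_id is a hypothetical stand-in for the protobuf parsing step, decode_to_tuple and rdd_to_spark_df are illustrative names, and the Spark part needs a live session to run.

```python
import base64


def decode_to_tuple(row, parse_id):
    """Return a one-field tuple (id,), or None if the row cannot be decoded."""
    try:
        data = base64.b64decode(row["metadata"])
        return (parse_id(data),)
    except Exception:
        # Swallow the failure so bad rows can be filtered out afterwards.
        return None


def rdd_to_spark_df(spark, rdd, parse_id):
    """Map rows to tuples, drop failed rows, and build a Spark DataFrame."""
    decoded = (
        rdd.map(lambda row: decode_to_tuple(row, parse_id))
           .filter(lambda item: item is not None)
    )
    # A DDL-style schema string names the single column and its type.
    return spark.createDataFrame(decoded, schema="id string")


# Usage sketch (the bucket path is a placeholder, and writing to gs://
# assumes the Google Cloud Storage connector is configured for Spark):
# sdf = rdd_to_spark_df(spark, rdd2, parse_id)
# sdf.write.option("header", True).csv("gs://<bucket>/<path>")
```

Returning tuples keeps every element a type Spark can map to a row, and writing from the Spark DataFrame keeps the data on the executors instead of pulling it into driver memory.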