One of the columns in my source data file contains double quotes ("), and when I write this data from a DataFrame to HDFS using PySpark, extra delimiters are added in the output file. I'm not sure what is happening here. My source data has 51 columns, and for rows where the data contains double quotes ("), the number of columns changes to 59. The delimiter used is pipe (|). The files are read from Google Cloud Storage (GCS), and after completing transformations with PySpark on Dataproc, I write the data back to GCS.
The column data containing double quotes looks like this:
"$200 CASHBACK in points on * Points valid for 30 days. First month is awarded 24 hours after purchase
Write statement from the DataFrame df to HDFS:
df.repartition("dt_col").write.partitionBy("dt_col").format('csv').option("quote", "'").option("nullValue", "").option("quoteMode", "NONE").mode("overwrite").options(delimiter="|",codec="org.apache.hadoop.io.compress.GzipCodec").save(hdfs_merge_path)
Source data:
38896111|REGULAR EARN OFFER|Members earn $200 back in points on select qualifying mattresses.|$|S|N|MATTRESS / HOME BIG TIC|2017-11-01|2018-03-31|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|1|-6.43|-11|-3.85|-11|-11|-3.85|-11|-3.85|-6|1|2018-05-01|70815|1fffff0|2018-04-24||"$200 CASHBACK in points on select Sealy Response, Serta Perfect Sleeper and Beautyrest Silver mattresses*||1|S|S_ONLINE|9300|1|-11|2018-04-25
After the write, the output looks like this:
38896111|REGULAR EARN OFFER|Members earn $200 back in points on select qualifying mattresses.|$|S|N|MATTRESS / HOME BIG TIC|2017-11-01|2018-03-31|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|1|-6.43|-11|-3.85|-11|-11|-3.85|-11|-3.85|-6|1|2018-05-01|70815|1fffff0|2018-04-24||'$200 CASHBACK in points on select Sealy Response, Serta Perfect Sleeper and Beautyrest Silver mattresses*||1|S|S_ONLINE|9300|1|-11|2018-04-25'||||||||
Please note: I have to use option("quote", "'") during the write; otherwise, on every run a backslash (\) character is added before the double quote.
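To illustrate the escaping behaviour I mean, here is a minimal sketch (the sample column names and the /tmp output paths are made up just for this example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sample = spark.createDataFrame([("1", '$200 CASHBACK in points on "select" mattresses')], ["id", "offer"])

# With the default writer options, the value containing " gets wrapped in quotes
# and the inner double quote is escaped with a backslash.
sample.write.format("csv").option("delimiter", "|").mode("overwrite").save("/tmp/quote_default")

# With quote set to a single quote, no backslash is added before the double quote.
sample.write.format("csv").option("delimiter", "|").option("quote", "'").mode("overwrite").save("/tmp/quote_single")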
The code flow is like below. I have used just 5 columns for the example here.
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

schema = StructType([StructField("col1", StringType(), True), StructField("col2", StringType(), True),
                     StructField("col3", StringType(), True), StructField("col4", StringType(), True),
                     StructField("col5", StringType(), True)])
schema_df=spark.read.schema(schema).option("delimiter","|").csv("gs://path/to/incremental_file.txt*")
schema_df.createOrReplaceTempView("df")
schema_df2 = spark.read.schema(schema).option("delimiter", "|").csv("gs://path/to/hist-file*.gz")
schema_df2.createOrReplaceTempView("df2")
# Union history + incremental, then keep only the latest row per (col1, col2) based on col4
union_fn = schema_df2.union(schema_df)
w = Window.partitionBy("col1", "col2").orderBy(col("col4").desc())
union_result = union_fn.withColumn("row_num", row_number().over(w)).where(col("row_num") == 1).drop("row_num").drop("col4")
union_result.createOrReplaceTempView("merged_tbl")
schema_merged_tbl=spark.sql("""select col1,col2,col3,col5 as col6 from merged_tbl""")
schema_merged_tbl.repartition("col6").write.partitionBy("col6").format('csv') \
    .option("quote", "'").option("nullValue", "").option("quoteMode", "NONE") \
    .mode("overwrite").options(delimiter="|", codec="org.apache.hadoop.io.compress.GzipCodec") \
    .save(hdfs_merge_path)
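After the write, this is roughly how I check the field count per line in the output (a sketch; it assumes the default col6=*/part-* layout under hdfs_merge_path and that Spark reads the gzipped part files transparently):

# Count the pipe-delimited fields per output line to find the rows with extra delimiters.
out = spark.read.text(hdfs_merge_path + "/*/part-*")
out.selectExpr("size(split(value, '[|]')) as n_cols").groupBy("n_cols").count().show()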