
One of the columns in my source data file contains double quotes ("), and when I write this data from a dataframe to HDFS using PySpark, extra delimiters are added in the output file. I'm not sure what is happening here. My source data has 51 columns, but for rows where the data contains double quotes ("), the number of columns in the output changes to 59. The delimiter used is pipe (|). The files are read from Google Cloud Storage (GCS), and after completing transformations with PySpark on Dataproc, I write the data back to GCS.

Column data with double quotes looks like below.

"$200 CASHBACK in points on *   Points valid for 30 days. First month is awarded 24 hours after purchase 

Write statement from the dataframe (df) to HDFS:

df.repartition("dt_col").write.partitionBy("dt_col").format('csv').option("quote", "'").option("nullValue", "").option("quoteMode", "NONE").mode("overwrite").options(delimiter="|",codec="org.apache.hadoop.io.compress.GzipCodec").save(hdfs_merge_path)

Source data.

38896111|REGULAR EARN OFFER|Members earn $200 back in points on select qualifying mattresses.|$|S|N|MATTRESS / HOME BIG TIC|2017-11-01|2018-03-31|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|1|-6.43|-11|-3.85|-11|-11|-3.85|-11|-3.85|-6|1|2018-05-01|70815|1fffff0|2018-04-24||"$200 CASHBACK in points on select Sealy Response, Serta Perfect Sleeper and Beautyrest Silver mattresses*||1|S|S_ONLINE|9300|1|-11|2018-04-25

After the write, the output looks like this:

38896111|REGULAR EARN OFFER|Members earn $200 back in points on select qualifying mattresses.|$|S|N|MATTRESS / HOME BIG TIC|2017-11-01|2018-03-31|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|1|-6.43|-11|-3.85|-11|-11|-3.85|-11|-3.85|-6|1|2018-05-01|70815|1fffff0|2018-04-24||'$200 CASHBACK in points on select Sealy Response, Serta Perfect Sleeper and Beautyrest Silver mattresses*||1|S|S_ONLINE|9300|1|-11|2018-04-25'||||||||

Please note: I have to use option("quote", "'") during the write; otherwise, with every run a backslash (\) character is added before the double quote.
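For reference, my understanding is that the backslash comes from the writer's default escape character. The sketch below is only an assumption about the Spark 2.x built-in CSV writer's escapeQuotes / escape options, not something taken from my actual job:

# Assumption (Spark 2.x CSV writer): by default a field containing the quote
# character (") is wrapped in quotes and the embedded quote is escaped with \.
# Setting escapeQuotes to false leaves such fields unquoted instead of
# switching the quote character to '.
df.write.format("csv") \
    .option("delimiter", "|") \
    .option("escapeQuotes", "false") \
    .mode("overwrite") \
    .save(hdfs_merge_path)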

The code flow is like below. I have used just 5 columns for the example here.

from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

# Schema shared by the incremental and history files
schema = StructType([StructField("col1", StringType(), True),
                     StructField("col2", StringType(), True),
                     StructField("col3", StringType(), True),
                     StructField("col4", StringType(), True),
                     StructField("col5", StringType(), True)])

# Read the incremental file from GCS
schema_df = spark.read.schema(schema).option("delimiter", "|").csv("gs://path/to/incremental_file.txt*")

schema_df.createOrReplaceTempView("df")

# Read the compressed history files from GCS
schema_df2 = spark.read.schema(schema).option("delimiter", "|").csv("gs://path/to/hist-file*.gz")

schema_df2.createOrReplaceTempView("df2")

# Union history and incremental data
union_fn = schema_df2.union(schema_df)

# Keep only the latest record per (col1, col2), ordered by col4 descending
w = Window.partitionBy("col1", "col2").orderBy(col("col4").desc())

union_result = union_fn.withColumn("row_num", row_number().over(w)).where(col("row_num") == 1).drop("row_num").drop("col4")

union_result.createOrReplaceTempView("merged_tbl")

schema_merged_tbl = spark.sql("""select col1, col2, col3, col5 as col6 from merged_tbl""")

# Write back as pipe-delimited, gzip-compressed CSV, partitioned by col6
schema_merged_tbl.repartition("col6").write.partitionBy("col6").format('csv').option("quote", "'").option("nullValue", "").option("quoteMode", "NONE").mode("overwrite").options(delimiter="|", codec="org.apache.hadoop.io.compress.GzipCodec").save(hdfs_merge_path)
  • Adding the row information as an answer, since the data doesn't fit in the comment section. – vp1008 Apr 29 '18 at 01:55
  • I just edited my question to reflect the source data and the output generated. – vp1008 Apr 29 '18 at 02:06
  • I edited the question to add the code flow. – vp1008 Apr 29 '18 at 15:22
  • The code that I added is a repro of the original code. The code flow is the same, but instead of 51 columns I shared the example with 5 columns. – vp1008 Apr 29 '18 at 16:03
  • I tested it again and the code is working fine. I didn't test on GCS though. – Ramesh Maharjan Apr 30 '18 at 05:41
  • I retried the same code with just the column that has double quotes, and I see that the issue is happening again. Whereas if I use an RDD and then create a dataframe out of it, I do not get the extra delimiters. Something like below: lines1 = sc.textFile("gs://incrmental_file.txt*") part1 = lines1.map(lambda l: l.split("|")) df = part1.map(lambda c: Row(col1=c[0],col2=c[1],col3=c[2],col4=c[3], col5=c[4])) schema_df = spark.createDataFrame(df) schema_df.createOrReplaceTempView("df") – vp1008 Apr 30 '18 at 13:22
  • It didn't get replicated on my side, so I guess I won't be of help on that, and I don't have the GCS filesystem configured. You can update the question and maybe somebody will help you. But for now I guess you have solved it. – Ramesh Maharjan Apr 30 '18 at 14:34

1 Answer


By default, spark-csv configures the CSV quoting character as the double quote ("). When it encounters a double quote that is not preceded by the escape character, which is backslash (\) by default, it assumes it is entering a quoted value that may contain the delimiter character. In your data, that means that as soon as the parser sees the first double quote, it keeps reading, expecting to reach a second double quote that ends the value. That closing quote never appears on the same line, so the parser assumes the record is simply missing values for the remaining fields and writes nulls/blanks in their place.
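To see this parsing behaviour in isolation, here is a minimal, self-contained sketch (not taken from the question; the sample rows and temp-file path are made up) that writes a pipe-delimited file where one row contains an unmatched double quote and reads it back with the default options in a local Spark session:

import os
import tempfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two pipe-delimited rows; the second starts its third field with an unmatched ".
sample = 'a|b|plain text|d|e\na|b|"$200 CASHBACK in points|d|e\n'
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write(sample)

# With the default quote character ("), the parser treats everything after the
# unmatched quote as part of the third field, so the remaining fields come back null.
spark.read.option("delimiter", "|").csv(path).show(truncate=False)

The second row should come back with its trailing columns null, mirroring the misaligned fields seen in the question.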

Many of the spark-csv options end up being set on the underlying apache-commons CSVFormat object that describes how parsing should happen. If, in your input dataset, the double quote character should not act as a quote but should instead be allowed within the field value, you can disable the quoting logic with:

schema_df = spark.read.schema(schema) \
    .option("delimiter", "|") \
    .option("quote", None) \
    .csv("gs://path/to/incremental_file.txt*")
  • I'm currently using Spark version 2.2.1, and when I use option("quote", null) during the write operation, it errors out and does not allow null. I even tried option("quote", "\u0000"), but then it says the delimiter should be a single character. – vp1008 May 02 '18 at 06:00