Parsing unstructured data using PySpark

Question

I am new to Spark. I am trying to parse unstructured data in the format below.

The entire dataset is on a single line.

Each record is delimited by the special character sequence ~$|, and the columns within a record are delimited by tabs.

How can I parse this and convert it into a DataFrame?

Raj India 1000 ~$| John Canada 2000 ~$| Steve USA 3000 ~$| Jason USA 4000

score 1 · Answer 1 · answered Jul 08 '20 at 23:48

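Use the spark.read.text() method and pass your custom record delimiter through the lineSep option.

from pyspark.sql.functions import col, regexp_replace

# lineSep tells the text reader to end each record at '~$|' instead of at a newline;
# regexp_replace then strips any newline left inside a record.
spark.read.option("lineSep", '~$|').text('<filepath>').withColumn("value",regexp_replace(col("value"),'\n','')).show()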
#+------------------+
#|             value|
#+------------------+
#|   Raj India 1000 |
#| John Canada 2000 |
#|   Steve USA 3000 |
#|    Jason USA 4000|
#+------------------+

Once the DataFrame is created, use the split function on the value column to get an array of fields, then pull each field out into its own column with either .getItem (0-based index) or element_at (1-based index).
