
I am migrating a Pig script to PySpark, and since I am new to PySpark I am stuck at data loading.

My Pig script looks like this:

Bag1 = LOAD '/refined/em/em_results/202112/' USING PigStorage('\u1') AS (PAYER_SHORT: chararray ,SUPER_PAYER_SHORT: chararray ,PAID: double ,AMOUNT: double );

I want something similar in PySpark.

So far I have tried this in PySpark:

df = spark.read.format("csv").load("/refined/em/em_results/202112/*")

This reads the text file, but all values end up in a single column instead of being split into separate columns. Some sample values:

|_c0                           |
|AZZCMMETAL2021/1211FGPP7491764|
|AZZCMMETAL2021/1221HEMP7760484|

Output should look like this:

_c0    _c1    _c2      _c3  _c4  _c5  _c6  _c7

AZZCM  METAL  2021/12  11   FGP  P    7    491764

AZZCM  METAL  2021/12  21   HEM  P    7    760484

Any clue how to achieve this? Thanks!!

Neel Sharma

2 Answers


Generally, Spark takes a comma (,) as the default separator; in your case you have to provide a space as the separator.

df = spark.read.csv(file_path, sep=' ')
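As a quick sanity check after loading (here file_path is a placeholder, spark is the active SparkSession, and the space separator only helps if the data really is space-delimited):

df = spark.read.csv(file_path, sep=' ')
df.show(5, truncate=False)  # values should now land in separate columns
df.printSchema()            # every column is a string unless a schema is supplied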
Praveen Kumar

This resolved the issue. Instead of "\u1", I used "\u0001". Please find the answer below.

df = spark.read.option("sep","\u0001").csv("/refined/em/em_results/202112/*")
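If you also want the column names and types from the Pig AS clause instead of the default _c0.._cN, a schema can be passed to the reader. A minimal sketch, assuming the same four fields as the original LOAD statement (spark is the active SparkSession):

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Mirrors the Pig AS clause:
# PAYER_SHORT: chararray, SUPER_PAYER_SHORT: chararray, PAID: double, AMOUNT: double
schema = StructType([
    StructField("PAYER_SHORT", StringType()),
    StructField("SUPER_PAYER_SHORT", StringType()),
    StructField("PAID", DoubleType()),
    StructField("AMOUNT", DoubleType()),
])

df = (
    spark.read
    .option("sep", "\u0001")   # Ctrl-A, the separator PigStorage uses here
    .schema(schema)
    .csv("/refined/em/em_results/202112/*")
)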
Neel Sharma