
We recently received a file to be ingested. The file is in PSV format, but every field is padded with the extra characters $~$ on both the left and the right, so the entire PSV looks like this:

$~$Field1$~$|$~$Field2$~$|$~$Field3$~$

$~$Data1$~$|$~$Data2$~$|$~$Data3$~$

$~$Data4$~$|$~$Data5$~$|$~$Data6$~$

$~$Data7$~$|$~$Data8$~$|$~$Data9$~$

$~$Data10$~$|$~$Data11$~$|$~$Data12$~$ .....

There are 100 Million rows in the file.

What would be the best way to trim this padding so as to produce a standard PSV?

Thank you very much; any suggestion or shared experience is appreciated.

UPDATE:

The data is received over SFTP and uploaded to Hadoop by IT Data Support (the Unix admins). We only have access to the Hadoop cluster, but if this is an easy job for Data Support, maybe I can convince them to do the preprocessing. Thanks.

mdivk
  • Seems trivial if you were to use a through stream or Unix pipe as a preprocessor. Is that an option? – 4m1r Feb 08 '19 at 02:50
  • Oh? I am not a Unix person, can you shed more light on that? Thank you very much. Keep in mind the volume of the data. OP updated. – mdivk Feb 08 '19 at 03:02
  • Yeah, streams are generally the best way to do something like this, because they process a small buffer at a time or line by line. Essentially, on Unix you have tools like cat and sed, and you can pipe one into the other, e.g. cat file.csv | sed 's/pattern/replacement/'. I'm not a sed or awk expert, but you can likely use them in place, without cat or piping, too. – 4m1r Feb 08 '19 at 13:51
  • Thank you, please post it as an answer and I will accept it once I see it. The sed approach took me nearly an hour to trim the extra characters off the 28 GB of data. – mdivk Feb 10 '19 at 23:36

3 Answers


Here is a pure Spark solution. There might be better-performing solutions.

import org.apache.spark.sql.functions.{col, lit, udf}

// Read the pipe-delimited file as-is; every value still carries the $~$ padding.
val df = spark.read.option("delimiter", "|").csv(filePath)

// UDF that replaces every occurrence of `find` in a value with `replacement`.
val replaceToken = (value: String, find: String, replacement: String) => value.replace(find, replacement)
val replaceUdf = udf(replaceToken)
// Apply the UDF to every column, keeping the original column names.
df.select(
    df.columns.map(c => replaceUdf(col(c), lit("$~$"), lit("")).alias(c)): _*)
  .show

Update: You cannot use $~$ as the quote option or $~$|$~$ as the delimiter in Spark 2.3.0, because those options accept only a single character.
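A minimal workaround sketch for 2.3.x, assuming the literal sequence $~$ never appears inside a field and that filePath points to the raw file: read each line as plain text, strip the outer padding, and split on the full $~$|$~$ token.

import spark.implicits._

// Read each line as a plain string; blank lines, if any, are dropped.
val lines = spark.read.textFile(filePath).filter(_.nonEmpty)

// Strip the outer $~$ padding, then split on the full "$~$|$~$" token.
// The header line is treated like any other row here and would need to be dropped separately.
val parsed = lines.map { line =>
    val fields = line.stripPrefix("$~$").stripSuffix("$~$").split("""\$~\$\|\$~\$""", -1)
    (fields(0), fields(1), fields(2))
  }.toDF("Field1", "Field2", "Field3")

Newer Spark releases (3.0 and later) reportedly accept a multi-character sep for the CSV reader, which would make the padded delimiter readable directly.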

Hedrack

Use regexp_replace with foldLeft to update all the columns. Check this out:

scala> val df = Seq(("$~$Data1$~$","$~$Data2$~$","$~$Data3$~$"), ("$~$Data4$~$","$~$Data5$~$","$~$Data6$~$"), ("$~$Data7$~$","$~$Data8$~$","$~$Data9$~$"),("$~$Data10$~$","$~$Data11$~$","$~$Data12$~$")).toDF("Field1","field2","field3")
df: org.apache.spark.sql.DataFrame = [Field1: string, field2: string ... 1 more field]

scala> df.show(false)
+------------+------------+------------+
|Field1      |field2      |field3      |
+------------+------------+------------+
|$~$Data1$~$ |$~$Data2$~$ |$~$Data3$~$ |
|$~$Data4$~$ |$~$Data5$~$ |$~$Data6$~$ |
|$~$Data7$~$ |$~$Data8$~$ |$~$Data9$~$ |
|$~$Data10$~$|$~$Data11$~$|$~$Data12$~$|
+------------+------------+------------+


scala> val df2 = df.columns.foldLeft(df) { (acc,x) => acc.withColumn(x,regexp_replace(col(x),"""^\$~\$|\$~\$$""","")) }
df2: org.apache.spark.sql.DataFrame = [Field1: string, field2: string ... 1 more field]

scala> df2.show(false)
+------+------+------+
|Field1|field2|field3|
+------+------+------+
|Data1 |Data2 |Data3 |
|Data4 |Data5 |Data6 |
|Data7 |Data8 |Data9 |
|Data10|Data11|Data12|
+------+------+------+


scala>
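To apply the same foldLeft approach to the real file instead of the toy DataFrame above, the read/trim/write could look roughly like this. It is only a sketch: inputPath and outputPath are placeholders, and the header option assumes the padded first row should supply the column names.

import org.apache.spark.sql.functions.{col, regexp_replace}

// Read the padded PSV; at this point the column names themselves still carry the $~$ padding.
val raw = spark.read.option("delimiter", "|").option("header", "true").csv(inputPath)

// Strip the padding from every value...
val trimmed = raw.columns.foldLeft(raw) { (acc, c) =>
  acc.withColumn(c, regexp_replace(col(c), """^\$~\$|\$~\$$""", ""))
}

// ...and from the column names, then write a standard PSV back out (Spark writes a directory of part files).
val cleaned = trimmed.columns.foldLeft(trimmed) { (acc, c) =>
  acc.withColumnRenamed(c, c.stripPrefix("$~$").stripSuffix("$~$"))
}
cleaned.write.option("delimiter", "|").option("header", "true").csv(outputPath)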
stack0114106

tr is probably the faster solution here. Note that you can pipe in any stream, so in this case I'm cat-ing a file on disk, but the input could just as well be a file streamed from SFTP. (tr -d deletes every occurrence of the characters in the given set, here $ and ~, rather than the literal string $~$; that is fine as long as those characters never appear inside the actual data.)

~/Desktop/test $ cat data.txt
$~$Field1$~$|$~$Field2$~$|$~$Field3$~$

$~$Data1$~$|$~$Data2$~$|$~$Data3$~$

$~$Data4$~$|$~$Data5$~$|$~$Data6$~$

$~$Data7$~$|$~$Data8$~$|$~$Data9$~$

# the '>' will open a new file for writing

~/Desktop/test $ cat data.txt | tr -d \$~\$ > output.psv

# see the results here
~/Desktop/test $ cat output.psv 
Field1|Field2|Field3

Data1|Data2|Data3

Data4|Data5|Data6

Data7|Data8|Data9

examples: https://shapeshed.com/unix-tr/#what-is-the-tr-command-in-unix

4m1r
  • Would cat work for such a big file? sed -ie 's/\$\~\$//g' filename is what I used. Thanks – mdivk Feb 11 '19 at 20:22
  • I don't think cat has any file size limitations; it should stream the bytes from the top of the file on down. There may be implementations where you would hit the max file descriptor limit if you piped your output into many files instead of a single file descriptor opened with `>`. I think this tr stream will outperform sed in place too, but sed should be reasonably fast depending on your hardware. Good luck! https://serverfault.com/questions/429352/file-size-limit-exceeded-in-bash/429369#429369 – 4m1r Feb 11 '19 at 20:37
  • Thank you for kindly enlightening me. – mdivk Feb 11 '19 at 20:46