
I have a pipe-delimited file with varying numbers of columns, like this:

id|name|attribute|extraattribute
1|alvin|cool|funny
2|bob|tall
3|cindy|smart|funny

I'm trying to find an elegant way to import this into a DataFrame using PySpark. I could fix the files by adding a trailing | whenever the last column is missing (only the last column can ever be missing), but I'd prefer a solution that doesn't involve changing the input files.

  • One option could be to read it in as one large column, adding a `|` at the end when appropriate, then split into columns as shown in [this post](https://stackoverflow.com/questions/43964554/how-to-split-column-into-multiple-rows-with-pipe-as-separator). – pault May 03 '18 at 15:33
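For reference, a rough sketch of the approach described in that comment: read each line in as a single column, append a `|` where the last field is missing, and then split into columns. The file path, the use of a SparkSession, and the header-filtering step are assumptions; the column names come from the sample header above.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# read every line as a single string column named "value"
raw = spark.read.text("/path/to/file.txt")

# drop the header row, since read.text has no header option
data = raw.filter(F.col("value") != "id|name|attribute|extraattribute")

# append a trailing "|" to lines missing the last field, then split on "|"
padded = F.when(
    F.size(F.split(F.col("value"), r"\|")) < 4,
    F.concat(F.col("value"), F.lit("|")),
).otherwise(F.col("value"))
parts = F.split(padded, r"\|")

df = data.select(
    parts.getItem(0).alias("id"),
    parts.getItem(1).alias("name"),
    parts.getItem(2).alias("attribute"),
    parts.getItem(3).alias("extraattribute"),
)

Note that with this approach the missing field may come back as an empty string rather than null, which is part of why the answer below is simpler.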

1 Answer


You can use the `csv` method in the module `pyspark.sql.readwriter` and set `mode="PERMISSIVE"`:

# PERMISSIVE mode fills the missing trailing column with null
df = sqlCtx.read.csv("/path/to/file.txt", sep="|", mode="PERMISSIVE", header=True)
df.show(truncate=False)
#+---+-----+---------+--------------+
#|id |name |attribute|extraattribute|
#+---+-----+---------+--------------+
#|1  |alvin|cool     |funny         |
#|2  |bob  |tall     |null          |
#|3  |cindy|smart    |funny         |
#+---+-----+---------+--------------+

From the docs:

> PERMISSIVE : sets other fields to null when it meets a corrupted record. When a schema is set by user, it sets null for extra fields.
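To illustrate that last point, here is a rough sketch of the same read with a user-supplied schema (the all-string types below are an assumption); a record like 2|bob|tall still comes back with null for extraattribute:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("attribute", StringType(), True),
    StructField("extraattribute", StringType(), True),
])

# same read as above, but with an explicit schema instead of inferring one
df = sqlCtx.read.csv("/path/to/file.txt", sep="|", mode="PERMISSIVE",
                     header=True, schema=schema)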

This is much easier than what I originally suggested in the comments.

pault