
I have a pipe-delimited file with varying numbers of columns, like this:

id|name|attribute|extraattribute
1|alvin|cool|funny
2|bob|tall
3|cindy|smart|funny

I'm trying to find an elegant way to import this into a DataFrame using PySpark. I could fix the files by adding a trailing | whenever the last column is missing (only the last column can ever be missing), but I'd prefer a solution that doesn't involve changing the input files.

  • One option could be to read it in as one large column, adding a `|` at the end when appropriate, then split into columns as shown in [this post](https://stackoverflow.com/questions/43964554/how-to-split-column-into-multiple-rows-with-pipe-as-separator). – pault May 03 '18 at 15:33
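For reference, a rough sketch of the approach described in that comment: read each line in as a single column, append a `|` where the last field is missing, and then split into columns. The file path, the use of a SparkSession, and the header-filtering step are assumptions; the column names come from the sample header above.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# read every line as a single string column named "value"
raw = spark.read.text("/path/to/file.txt")

# drop the header row, since read.text has no header option
data = raw.filter(F.col("value") != "id|name|attribute|extraattribute")

# append a trailing "|" to lines missing the last field, then split on "|"
padded = F.when(
    F.size(F.split(F.col("value"), r"\|")) < 4,
    F.concat(F.col("value"), F.lit("|")),
).otherwise(F.col("value"))
parts = F.split(padded, r"\|")

df = data.select(
    parts.getItem(0).alias("id"),
    parts.getItem(1).alias("name"),
    parts.getItem(2).alias("attribute"),
    parts.getItem(3).alias("extraattribute"),
)

Note that with this approach the missing field may come back as an empty string rather than null, which is part of why the answer below is simpler.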

1 Answer


You can use the `csv` method in the module `pyspark.sql.readwriter` and set `mode="PERMISSIVE"`:

# PERMISSIVE mode fills the missing trailing column with null
df = sqlCtx.read.csv("/path/to/file.txt", sep="|", mode="PERMISSIVE", header=True)
df.show(truncate=False)
#+---+-----+---------+--------------+
#|id |name |attribute|extraattribute|
#+---+-----+---------+--------------+
#|1  |alvin|cool     |funny         |
#|2  |bob  |tall     |null          |
#|3  |cindy|smart    |funny         |
#+---+-----+---------+--------------+

From the docs:

> PERMISSIVE : sets other fields to null when it meets a corrupted record. When a schema is set by user, it sets null for extra fields.
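To illustrate that last point, here is a rough sketch of the same read with a user-supplied schema (the all-string types below are an assumption); a record like 2|bob|tall still comes back with null for extraattribute:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("attribute", StringType(), True),
    StructField("extraattribute", StringType(), True),
])

# same read as above, but with an explicit schema instead of inferring one
df = sqlCtx.read.csv("/path/to/file.txt", sep="|", mode="PERMISSIVE",
                     header=True, schema=schema)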

This is much easier than what I originally suggested in the comments.

pault