
Suppose I have two files.

file0.txt

field1,field2
1,2
1,2

file1.txt

field2,field1
2,1
2,1

Now, if I write:

spark.read.csv(["./file0.txt", "./file1.txt"], sep=',', header=True, inferSchema=True).show()

Spark reads the following DataFrame:

field1 field2
1 2
1 2
2 1
2 1

but it should have been:

field1 field2
1 2
1 2
1 2
1 2

I tried using inferSchema, but it did not fix this. Since I have many files in the folder, I cannot hardcode the column ordering of the CSVs.


1 Answer


You can read the files one at a time and then union them by column name, something like this:

import glob

path = 'test_data/'
files = glob.glob(path + '*.txt')

final_df = None
for f in files:
    # Read each file on its own so its header determines the column names
    df = spark.read.csv(f, sep=',', header=True, inferSchema=True)
    # unionByName matches columns by name rather than by position
    final_df = df if final_df is None else final_df.unionByName(df)

Output:

+------+------+
|field1|field2|
+------+------+
|     1|     2|
|     1|     2|
|     1|     2|
|     1|     2|
+------+------+
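
A more compact variant of the same idea, assuming the same spark session and the test_data/ folder from the answer above, folds the per-file DataFrames with functools.reduce:

import glob
from functools import reduce

files = glob.glob('test_data/*.txt')

# Read each file independently so each header fixes its own column order,
# then fold the resulting DataFrames pairwise with unionByName
dfs = [spark.read.csv(f, sep=',', header=True, inferSchema=True) for f in files]
final_df = reduce(lambda a, b: a.unionByName(b), dfs)

If some files lack a column entirely, Spark 3.1+ also supports unionByName(df, allowMissingColumns=True), which fills the missing columns with nulls instead of raising an error.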