
Suppose I have two files.

file0.txt

field1,field2
1,2
1,2

file1.txt

field2,field1
2,1
2,1

Now, if I write:

spark.read.csv(["./file0.txt", "./file1.txt"], sep=',', header=True, inferSchema=True).show()

Spark reads the following DataFrame:

field1 field2
1 2
1 2
2 1
2 1

but it should have been:

field1 field2
1 2
1 2
1 2
1 2

I tried using inferSchema, but it did not fix this. Since I have many files in the folder, I cannot hardcode the column ordering of the CSVs.


1 Answer


You can read the files one at a time and then union them by column name, something like this:

import glob

path = 'test_data/'
files = glob.glob(path + '*.txt')

final_df = None
for f in files:
    # Read each file on its own so its header determines the column names
    df = spark.read.csv(f, sep=',', header=True, inferSchema=True)
    # unionByName matches columns by name rather than by position
    final_df = df if final_df is None else final_df.unionByName(df)

Output:

+------+------+
|field1|field2|
+------+------+
|     1|     2|
|     1|     2|
|     1|     2|
|     1|     2|
+------+------+
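
A more compact variant of the same idea, assuming the same spark session and the test_data/ folder from the answer above, folds the per-file DataFrames with functools.reduce:

import glob
from functools import reduce

files = glob.glob('test_data/*.txt')

# Read each file independently so each header fixes its own column order,
# then fold the resulting DataFrames pairwise with unionByName
dfs = [spark.read.csv(f, sep=',', header=True, inferSchema=True) for f in files]
final_df = reduce(lambda a, b: a.unionByName(b), dfs)

If some files lack a column entirely, Spark 3.1+ also supports unionByName(df, allowMissingColumns=True), which fills the missing columns with nulls instead of raising an error.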