I have a fixed-width file as below:
00120181120xyz12341
00220180203abc56792
00320181203pqr25483
And a corresponding JSON file that specifies the schema (note that From is the 1-based start position of each field and, despite its name, To holds the field length, matching the substring(pos, len) calls below):
{"Column":"id","From":"1","To":"3"}
{"Column":"date","From":"4","To":"8"}
{"Column":"name","From":"12","To":"3"}
{"Column":"salary","From":"15","To":"5"}
I read the schema file into a DataFrame using:
SchemaFile = spark.read.json(r'C:\Temp\schemaFile\schema.json')
SchemaFile.show()
#+------+----+---+
#|Column|From| To|
#+------+----+---+
#| id| 1| 3|
#| date| 4| 8|
#| name| 12| 3|
#|salary| 15| 5|
#+------+----+---+
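One wrinkle I noticed: the JSON reader infers From and To as strings, so they need an explicit cast (or an int() conversion later) before they can be used as positions. A small sketch of normalizing them up front:

from pyspark.sql.functions import col

# From/To are inferred as strings by the JSON source; cast them once
# so downstream code can treat them as integers.
SchemaFile = SchemaFile.withColumn("From", col("From").cast("int")) \
                       .withColumn("To", col("To").cast("int"))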
Likewise, I am parsing the fixed-width file into a PySpark DataFrame as below:
File = spark.read.option("header", "false").csv(r"C:\Temp\samplefile.txt")
File.show()
#+-------------------+
#| _c0|
#+-------------------+
#|00120181120xyz12341|
#|00220180203abc56792|
#|00320181203pqr25483|
#+-------------------+
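I could also load it with the text source instead of csv, which never splits on delimiters; the only difference is that the single column comes back named value rather than _c0:

# Alternative: spark.read.text() keeps each fixed-width line intact in
# one column ("value"), even if the data happened to contain commas.
File = spark.read.text(r"C:\Temp\samplefile.txt")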
I can obviously hard-code the positions and lengths of each column to get the desired output:
from pyspark.sql.functions import substring
data = File.select(
    substring(File._c0, 1, 3).alias('id'),
    substring(File._c0, 4, 8).alias('date'),
    substring(File._c0, 12, 3).alias('name'),
    substring(File._c0, 15, 5).alias('salary')
)
data.show()
#+---+--------+----+------+
#| id| date|name|salary|
#+---+--------+----+------+
#|001|20181120| xyz| 12341|
#|002|20180203| abc| 56792|
#|003|20181203| pqr| 25483|
#+---+--------+----+------+
But how can I use the SchemaFile DataFrame to supply the column names, start positions, and widths at run time, so that the schema is applied dynamically instead of being hard-coded?
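The direction I'm leaning toward is to collect the (tiny) schema DataFrame to the driver and build the select list with a comprehension. A minimal sketch of that idea, assuming To really is the field length; the int() calls just guard against From/To still being strings:

from pyspark.sql.functions import substring

# Pull the handful of schema rows to the driver and turn each one into
# a substring(pos, len) expression; "To" is used as the field length,
# matching the hard-coded version above.
fields = SchemaFile.collect()
data = File.select(
    [substring(File._c0, int(f["From"]), int(f["To"])).alias(f["Column"])
     for f in fields]
)
data.show()

Is that a reasonable pattern, or is there a more idiomatic way to drive the column layout from the schema file?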