I can read in my text file and can apply a schema, but when I do, many of the values become null.
Do you know why this may be happening?
# Read in fixed-width text file into DataFrame
from pyspark.sql import functions as f
df = spark.read.text(file)
# Remove header
header = df.first()[0]
df = df.filter(~f.col("value").contains(header))
# Take the fixed width file and split into distinct columns
sorted_df = df.select(
df.value.substr( 1, 6).alias('Entry' ),
df.value.substr( 8, 3).alias('Per' ),
df.value.substr(12, 11).alias('Account' ),
df.value.substr(24, 11).alias('Description'),
)
sorted_df.show()
sorted_df.printSchema()
RETURNS:
+------+---+-----------+-----------+
| Entry|Per| Account|Description|
+------+---+-----------+-----------+
| 16524| 01| 3930621977| TXNPUES |
|191675| 01| 2368183100| OUNHQEX |
|191667| 01| 3714468136| GHAKASC |
|191673| 01| 2632703881| PAHFSAP |
| 80495| 01| 2766389794| XDZANTV |
| 80507| 01| 4609266335| BWWYEZL |
| 80509| 01| 1092717420| QJYPKVO |
| 80497| 01| 3386366766| SOQLCMU |
|191669| 01| 5905893739| FYIWNKA |
|191671| 01| 2749355876| CBMJTLP |
+------+---+-----------+-----------+
root
|-- Entry: string (nullable = true)
|-- Per: string (nullable = true)
|-- Account: string (nullable = true)
|-- Description: string (nullable = true)
Then I do my casting, because right now everything is simply a string, but I want to convert some columns to long & int:
from pyspark.sql import functions as f
cast = [f.col('Entry') .cast('long' ),
f.col('Per' ) .cast('int' ),
f.col('Account') .cast('long' ),
f.col('Description').cast('string')
]
sorted_df = sorted_df.select(cast)
sorted_df.show()
sorted_df.printSchema()
RETURNS:
+------+----+-------+-----------+
| Entry| Per|Account|Description|
+------+----+-------+-----------+
| null|null| null| TXNPUES |
|191675|null| null| OUNHQEX |
|191667|null| null| GHAKASC |
|191673|null| null| PAHFSAP |
| null|null| null| XDZANTV |
| null|null| null| BWWYEZL |
| null|null| null| QJYPKVO |
| null|null| null| SOQLCMU |
|191669|null| null| FYIWNKA |
|191671|null| null| CBMJTLP |
+------+----+-------+-----------+
root
|-- Entry: long (nullable = true)
|-- Per: integer (nullable = true)
|-- Account: long (nullable = true)
|-- Description: string (nullable = true)
What could be causing all of these values to suddenly become null? I have the same issue if I try to cast to int.