
I can read in my text file and apply a schema, but when I do, many of the values become null.

Do you know why this may be happening?

# Read in fixed-width text file into DataFrame
from pyspark.sql import functions as f

df     = spark.read.text(file)

# Remove header
header = df.first()[0]
df     = df.filter(~f.col("value").contains(header))
# Take the fixed width file and split into distinct columns
sorted_df = df.select(
    df.value.substr( 1,  6).alias('Entry'      ),
    df.value.substr( 8,  3).alias('Per'        ),
    df.value.substr(12, 11).alias('Account'    ),
    df.value.substr(24, 11).alias('Description'),
)
sorted_df.show()
sorted_df.printSchema()

RETURNS:
+------+---+-----------+-----------+
| Entry|Per|    Account|Description|
+------+---+-----------+-----------+
| 16524| 01| 3930621977| TXNPUES   |
|191675| 01| 2368183100| OUNHQEX   |
|191667| 01| 3714468136| GHAKASC   |
|191673| 01| 2632703881| PAHFSAP   |
| 80495| 01| 2766389794| XDZANTV   |
| 80507| 01| 4609266335| BWWYEZL   |
| 80509| 01| 1092717420| QJYPKVO   |
| 80497| 01| 3386366766| SOQLCMU   |
|191669| 01| 5905893739| FYIWNKA   |
|191671| 01| 2749355876|   CBMJTLP |
+------+---+-----------+-----------+
root
 |-- Entry: string (nullable = true)
 |-- Per: string (nullable = true)
 |-- Account: string (nullable = true)
 |-- Description: string (nullable = true)

Then I do my casting, because right now everything is simply a string, but I want to convert some columns to long and int:

from pyspark.sql import functions as f
cast = [f.col('Entry').cast('long'),
        f.col('Per').cast('int'),
        f.col('Account').cast('long'),
        f.col('Description').cast('string')]

sorted_df = sorted_df.select(cast)
sorted_df.show()
sorted_df.printSchema()

RETURNS:
+------+----+-------+-----------+
| Entry| Per|Account|Description|
+------+----+-------+-----------+
|  null|null|   null| TXNPUES   |
|191675|null|   null| OUNHQEX   |
|191667|null|   null| GHAKASC   |
|191673|null|   null| PAHFSAP   |
|  null|null|   null| XDZANTV   |
|  null|null|   null| BWWYEZL   |
|  null|null|   null| QJYPKVO   |
|  null|null|   null| SOQLCMU   |
|191669|null|   null| FYIWNKA   |
|191671|null|   null|   CBMJTLP |
+------+----+-------+-----------+
root
 |-- Entry: long (nullable = true)
 |-- Per: integer (nullable = true)
 |-- Account: long (nullable = true)
 |-- Description: string (nullable = true)

What could be causing all of these values to suddenly become null? I have the same issue if I try to cast to int.

Dave Voyles

1 Answer


Your values have leading spaces, so you'll have to trim them before casting.

e.g.

The Entry column is 6 characters wide, so every number with fewer than 6 digits is padded with leading spaces, and the cast to long fails (5 rows in your example).

The same happens with Account, which is 11 characters wide, but the account numbers in your example are 10 digits preceded by a space. Again, the cast fails.

Helio Santos
  • Worked like a charm, thank you! For others who may have to do this, I was able to remove all leading & training text from the dataframe with the information found in this thread, and specifically the answer from user DataDog: https://stackoverflow.com/questions/35155821/trim-string-column-in-pyspark-dataframe – Dave Voyles May 14 '20 at 14:11