2

I am having a problem handling bad records in CSV files. Here is my CSV file:

+------+---+---+----+
|  Name| ID|int|int2|
+------+---+---+----+
| Sohel|  1|  4|  33|
| Sohel|  2|  5|  56|
| Sohel|  3|  6| 576|
| Sohel|  a|  7| 567|
|Sohel2|  c|  7| 567|
+------+---+---+----+

I am reading this file with a predefined schema:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("ID", IntegerType(), True),
    StructField("int", IntegerType(), True),
    StructField("int2", IntegerType(), True),
    StructField("_corrupt_record", StringType(), True)
])

df = spark.read.csv('dbfs:/tmp/test_file/test_csv.csv', header=True, schema=schema,
                    columnNameOfCorruptRecord='_corrupt_record')

And the result is:

+------+----+---+----+---------------+
|  Name|  ID|int|int2|_corrupt_record|
+------+----+---+----+---------------+
| Sohel|   1|  4|  33|           null|
| Sohel|   2|  5|  56|           null|
| Sohel|   3|  6| 576|           null|
| Sohel|null|  7| 567|  Sohel,a,7,567|
|Sohel2|null|  7| 567| Sohel2,c,7,567|
+------+----+---+----+---------------+

It gives me the result I expected, but the problem starts here: I just want to access those "_corrupt_record" rows and make a new DataFrame from them. I filtered df on "_corrupt_record", but it seems the original CSV file does not have a "_corrupt_record" column, which is why it gives me an error.

badRows = df.filter("_corrupt_record is Not Null").show()

Error message:

Error while reading file dbfs:/tmp/test_file/test_csv.csv.
Caused by: java.lang.IllegalArgumentException: _corrupt_record does not exist. Available: Name, ID, int, int2

I am following the Databricks documentation, https://docs.databricks.com/data/data-sources/read-csv.html#read-files, but they also hit the same error, so why did they even put it in the documentation?!

All I want is to access the "_corrupt_record" column and make a new DataFrame from it. Any help or suggestion would be appreciated.

Sohel Reza

4 Answers

0

You need to add enforceSchema=True.

df = spark.read.csv('dbfs:/tmp/test_file/test_csv.csv', header=True, schema=schema,
                    enforceSchema=True, columnNameOfCorruptRecord='_corrupt_record')

This should give you the corrupt record column.

XXavier
  • Have you tried it? Because I already tried it, and it's not working. Also, that is not my problem: it gives me the corrupt record column, but I can't access it. – Sohel Reza Jul 08 '21 at 05:08
0

Try this, as I can see the DF is already created:

from pyspark.sql import functions as F

df = df.filter(F.col("_corrupt_record").isNotNull())

dsk
0

The problem lies in the fact that the DataFrame df you created is not a Delta table.

Try storing the content of df in a Delta table, as shown below.
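
First, note that SQL can only see df once it has been registered as a temporary view. A one-line sketch of that step (the view name df is an assumption, chosen so that the SQL below resolves):

# Expose the PySpark DataFrame to the SQL engine under the name "df"
df.createOrReplaceTempView("df")

Then create the Delta table: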

%sql
CREATE TABLE IF NOT EXISTS df_delta_temp
 USING delta AS
   SELECT *
   FROM df;

If you now query your Delta table df_delta_temp, your data will appear.

Your Delta table will be created in your workspace (it will appear in the Data blade in the left-hand menu). To keep the environment clean, you can drop the Delta table at the end of your processing, so that it is effectively used as a temporary table:

%sql
DROP TABLE IF EXISTS df_delta_temp

This insightful article provides some useful background on the topic: https://python.plainenglish.io/how-to-handle-bad-data-in-spark-sql-5e0276d37ca1

Andrea Baldino
0

You need to cache the DataFrame before you can use the _corrupt_record column. Please refer to: Not able to retain the corrupted rows in pyspark using PERMISSIVE mode
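
A minimal sketch of that workaround, assuming the same read code as in the question. Caching materializes the parsed rows, so the filter no longer forces a fresh scan of the raw CSV (that re-scan is what drops the internal _corrupt_record column):

# Cache the parsed result so that filtering on _corrupt_record
# does not trigger a new scan of the source file
df.cache()

badRows = df.filter("_corrupt_record is not null")
badRows.show()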

SR77