
I have CSV files with misplaced records. I have inferred (final) column datatypes for those files. Whenever a value's datatype does not match the inferred datatype, the file should be read line by line and the mismatched record (row) removed.

I have the following steps as a requirement:

1. Read the CSV line by line.
2. Find the datatype of each column value and match it against the inferred column datatype.
3. If there is a mismatch, remove the record (row); otherwise keep all good rows in a valid_record_csv file.

Inferred (final) column datatypes to compare the files against:

name: string
date: date
phone: long
col1: integer
col2: double
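The three steps above can be sketched with Python's standard csv module, before reaching for pandas. The validator and `filter_csv` names below are illustrative, not from any library, and the per-type rules (e.g. an integer must not contain a decimal point, a double must) are assumptions read off the inferred datatypes and samples:

```python
import csv
import io
from datetime import datetime

# Hypothetical per-datatype validators (names are made up for this sketch).
def is_name(v):          # string: letters and spaces only
    return bool(v) and v.replace(' ', '').isalpha()

def is_date(v):          # date: YYYY-M-D
    try:
        datetime.strptime(v, '%Y-%m-%d')
        return True
    except ValueError:
        return False

def is_long(v):          # long: any integer
    try:
        int(v)
        return True
    except ValueError:
        return False

def is_integer(v):       # integer: an integer with no decimal point
    return '.' not in v and is_long(v)

def is_double(v):        # double: must parse as float AND contain a decimal point
    if '.' not in v:
        return False
    try:
        float(v)
        return True
    except ValueError:
        return False

VALIDATORS = [is_name, is_date, is_long, is_integer, is_double]

def filter_csv(in_file, out_file):
    """Copy only rows whose every field matches its inferred datatype."""
    reader = csv.reader(in_file)
    writer = csv.writer(out_file)
    writer.writerow(next(reader))        # keep the header row
    for row in reader:
        if len(row) == len(VALIDATORS) and all(
                check(field) for check, field in zip(VALIDATORS, row)):
            writer.writerow(row)

sample = """Name,date,phone,col1,col2
124,PANAMA,440894563,PASS,901
BB,2022-9-23,449035667,520,9.08765
BB,GRENADA,BBCC,0.536,PASS
"""
out = io.StringIO()
filter_csv(io.StringIO(sample), out)
print(out.getvalue())    # only the fully valid BB row survives
```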

SAMPLE INPUT csv file

Name   date         phone         col1     col2
124    PANAMA       440894563              0.9800
BB     2022-9-23    449035667              9.08765
BB     GRENADA       BBCC                  PASS

SAMPLE OUTPUT csv file: 
Note: one column is empty/blank but its inferred/final datatype is integer; another column has datatype double.

NAME   DATE       PHONE       col1    col2
BB     2022      449035667            9.08765
Anos

1 Answer

pandas solution:

import re
import pandas as pd

#0. read the csv file (assuming you have a csv file named 'INPUT.csv')

df = pd.read_csv('INPUT.csv')

df
    Name    date        phone       col1    col2
0   124     PANAMA      440894563   PASS    901
1   BB      2022-9-23   449035667   520     9.08765
2   BB      GRENADA     BBCC        0.536   PASS

#1. deal with columns (cleaning):
df['date'] = pd.to_datetime(df['date'], errors='coerce') # deal with date column: keep only valid dates
df['phone'] = pd.to_numeric(df['phone'], errors='coerce').astype(pd.Int64Dtype()) # deal with phone column: keep only valid integers
df['Name'] = df['Name'].map(lambda x: x if isinstance(x, str) and re.match(r"^[a-zA-Z\s]+$", x) else None) # deal with Name column: keep only strings that contain nothing but letters and spaces

#a. cleaning pure integers
def f_integer(x):
    s = str(x)  # x may be a float/NaN after read_csv, so stringify before the '.' test
    if '.' in s:
        return None
    try:
        return int(s)
    except (ValueError, TypeError):
        return None

df['col1'] = df['col1'].map(f_integer).astype(pd.Int64Dtype())

#b. cleaning pure doubles
def f_double(x):
    s = str(x)  # stringify so the '.' membership test works on any input
    if '.' not in s:
        return None  # pure integers are not doubles
    try:
        return float(s)
    except (ValueError, TypeError):
        return None
df['col2'] = df['col2'].map(f_double)

#-->
    Name    date        phone       col1    col2
0   None    NaT         440894563   <NA>    NaN
1   BB      2022-09-23  449035667   520     9.08765
2   BB      NaT         NaN         <NA>    NaN

#2. remove the record(row) with mismatch:
df = df.loc[df['Name'].notna() & df['date'].notna() & df['phone'].notna() & df['col1'].notna() & df['col2'].notna()]

#-->
df
    Name    date        phone       col1    col2
1   BB      2022-09-23  449035667   520     9.08765
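Aside: assuming every column is required to be valid, the chained `notna()` mask above is equivalent to a single `dropna()` call, which drops any row containing a missing value:

```python
import pandas as pd

# toy frame standing in for the cleaned df above
df = pd.DataFrame({
    'Name': ['BB', None],
    'col2': [9.08765, None],
})
df = df.dropna()  # drop any row with at least one missing value
print(len(df))    # 1
```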

#3. save the cleaned dataframe to csv
df.to_csv('OUTPUT.csv', index=False)
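If you have many files with the same inferred datatypes, the per-column steps above can be folded into one schema dictionary mapping each column to its cleaner. This is only a refactoring sketch of the same logic; the `SCHEMA` and `clean` names are made up, not a library API:

```python
import re
import pandas as pd

def to_int(s):     # integer: reject anything containing a decimal point
    s2 = s.map(lambda x: None if '.' in str(x) else x)
    return pd.to_numeric(s2, errors='coerce').astype(pd.Int64Dtype())

def to_double(s):  # double: require a decimal point
    s2 = s.map(lambda x: x if '.' in str(x) else None)
    return pd.to_numeric(s2, errors='coerce')

SCHEMA = {  # column -> cleaning function (hypothetical mapping)
    'Name':  lambda s: s.map(lambda x: x if isinstance(x, str)
                             and re.match(r'^[a-zA-Z\s]+$', x) else None),
    'date':  lambda s: pd.to_datetime(s, errors='coerce'),
    'phone': lambda s: pd.to_numeric(s, errors='coerce').astype(pd.Int64Dtype()),
    'col1':  to_int,
    'col2':  to_double,
}

def clean(df):
    for col, fn in SCHEMA.items():
        df[col] = fn(df[col])
    return df.dropna()  # drop every row with at least one mismatched field

df = pd.DataFrame({
    'Name':  ['124', 'BB', 'BB'],
    'date':  ['PANAMA', '2022-9-23', 'GRENADA'],
    'phone': ['440894563', '449035667', 'BBCC'],
    'col1':  ['PASS', '520', '0.536'],
    'col2':  ['901', '9.08765', 'PASS'],
})
print(clean(df))  # only the fully valid row survives
```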
khaled koubaa
  • Hi @khaledkoubaa, really appreciate your support, thank you! I am trying to execute the same for some double and integer datatypes. Could you please help me with this? Please refer to the "REQUIREMENT". I need to drop rows with mismatched datatypes for these too. – Anos Sep 21 '22 at 17:25
  • I have updated the question above with two more datatypes and an empty column. Could you please help with this use case? – Anos Sep 21 '22 at 17:36
  • col1 needs to be purely integer, right? – khaled koubaa Sep 21 '22 at 17:45
  • Yes, you are right. It should be integer datatype. – Anos Sep 21 '22 at 17:47
  • @Anos I updated the answer, let me know your comments. – khaled koubaa Sep 21 '22 at 18:16
  • Hi @khaledkoubaa, thank you so much! Let me try with the original data. – Anos Sep 21 '22 at 18:36
  • OK, try it and tell me if it works or if you have other cleaning ideas. – khaled koubaa Sep 21 '22 at 18:37
  • I'm getting this error when I try to execute the data-cleaning part for integers: argument of type 'int' is not iterable – Anos Sep 21 '22 at 18:59
  • I think you applied the function to the column twice in a row. If you use Jupyter Notebook, put each step in a separate cell and execute each cell once (if you run one twice, the second execution will error and you will need to start over from the read-csv step). So put each step in a separate cell, execute the cells one by one, and tell me where the errors occur. – khaled koubaa Sep 21 '22 at 19:14
  • If you use the Python interpreter, create a .py file, put the code in it, and execute it in the terminal. – khaled koubaa Sep 21 '22 at 19:15