DataFrame
import pandas as pd
import re
data = {'A':['1','2.0','4b','dog'], 'B':['12','tom','lom','so'],'C':['dog','tom','jerry','55']}
df = pd.DataFrame(data=data)
print(df)
     A    B      C
0    1   12    dog
1  2.0  tom    tom
2   4b  lom  jerry
3  dog   so     55
Now list which columns are expected to hold numbers (int/float) and which are expected to hold strings:
int_col = ['A'] #I am considering col A numerical
string_col = ['B','C'] # col B & C as string
Fun Part
num_pattern = r"^\d+\.?\d*$"  # matches unsigned ints and floats such as 1 or 2.0
int_float = []       # values in numeric columns that look like numbers
errors_num = []      # non-numeric values found in numeric columns
errors_string = []   # numeric values found in string columns
for col in int_col:  # detect errors in the numeric columns
    for i, cell in enumerate(df[col]):
        if re.match(num_pattern, cell):
            int_float.append(cell)
        else:
            errors_num.append({"column": col, "errors": cell, "index": i, 'correct_datatype': 'float'})
for col in string_col:  # detect errors in the string columns
    for i, cell in enumerate(df[col]):
        if re.match(num_pattern, cell):
            errors_string.append({"column": col, "errors": cell, "index": i, 'correct_datatype': 'string'})
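As a quick sanity check on `num_pattern`, the sketch below runs it over a few sample strings. The `samples` list is made up for illustration and is not part of the data above; it only shows which shapes of input the pattern accepts.

```python
import re

num_pattern = r"^\d+\.?\d*$"  # same pattern as above, written as a raw string

# Hypothetical sample values, chosen to probe the edges of the pattern.
samples = ["1", "2.0", "4b", "dog", "-3", ".5", "1e3", "7."]

# Map each sample to True if the pattern accepts it, False otherwise.
results = {s: bool(re.match(num_pattern, s)) for s in samples}

for s, ok in results.items():
    print(f"{s!r:>6} -> {ok}")
```

Note that the pattern rejects negative numbers, values with a leading dot, and scientific notation, so such values would be reported as errors in a numeric column. If your data can contain them, the pattern needs to be extended.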
Now we have two error lists, one for each expected data type. Join them and build a DataFrame that shows the column name, the offending value, its row index, and the correct data type:
pd.DataFrame(data=errors_string + errors_num)
  column errors  index correct_datatype
0      B     12      0           string
1      C     55      3           string
2      A     4b      2            float
3      A    dog      3            float
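The same report can probably be produced without a hand-written regex by letting pandas attempt the conversion. The sketch below is an alternative, not the method used above: `find_type_errors` is a hypothetical helper name, and it relies on `pd.to_numeric(..., errors='coerce')`, which turns unparseable values into NaN. Its idea of "numeric" is broader than `num_pattern` (it also accepts values such as `-3` or `1e3`), and it assumes the columns have no missing values to begin with.

```python
import pandas as pd

data = {'A': ['1', '2.0', '4b', 'dog'],
        'B': ['12', 'tom', 'lom', 'so'],
        'C': ['dog', 'tom', 'jerry', '55']}
df = pd.DataFrame(data=data)
int_col = ['A']          # columns expected to be numeric
string_col = ['B', 'C']  # columns expected to be strings


def find_type_errors(df, numeric_cols, text_cols):
    """Hypothetical helper: one row per value whose type looks wrong."""
    rows = []
    for col in text_cols:
        # In a string column, anything that parses as a number is an error.
        is_number = pd.to_numeric(df[col], errors='coerce').notna()
        for idx, value in df.loc[is_number, col].items():
            rows.append({'column': col, 'errors': value,
                         'index': idx, 'correct_datatype': 'string'})
    for col in numeric_cols:
        # In a numeric column, anything that fails to parse is an error.
        not_number = pd.to_numeric(df[col], errors='coerce').isna()
        for idx, value in df.loc[not_number, col].items():
            rows.append({'column': col, 'errors': value,
                         'index': idx, 'correct_datatype': 'float'})
    return pd.DataFrame(
        rows, columns=['column', 'errors', 'index', 'correct_datatype'])


result = find_type_errors(df, int_col, string_col)
print(result)
```

Here the `index` column holds the DataFrame's index labels rather than a positional counter; with the default index used in this example the two are the same.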