How to find data type error in pandas dataframe?

Question

df1:

  product   product_Id   Price
0 Mobile      G67129     4500
1 Earphone    H56438     8900
2 Heater      K12346     fgdht
3 Kitchen     566578     4500
4 4359        Gh1907     5674
5 plastic     G67129     Dfz67

df2:

  Column_Name   Expected_Dtype
0 product          String
1 product_Id       String
2 Price            int

I need to find out the data type error values from df1 and has column datatype information in df2.

Output:

   column_Name  Value  Exp_dtype index
0  product       4359  String    4
1  product_Id   566578 String    3
2  Price       fgdht    int      2
3  Price       Dfz67    int      5

Remember we don't do things for you; we help you when you're stuck. — Juan C, Dec 16 '19 at 17:59
See this post it would help you https://stackoverflow.com/questions/28654047/pandas-convert-some-columns-into-rows — Alex Ferguson, Dec 16 '19 at 18:25
Does this answer your question? [pandas convert some columns into rows](https://stackoverflow.com/questions/28654047/pandas-convert-some-columns-into-rows) — Alex Ferguson, Dec 16 '19 at 18:26
#AKX , Sorry about that, but i don't know where to start with. — kalyan, Dec 16 '19 at 19:08

Sheng Zhuang · Accepted Answer · 2019-12-16T20:16:25.700

As those types mixed up, all being object, I can only think of using str match with regex pattern to pick out error types.

Here is my solution:

find rows with error types first

bad_product = df['product'].loc[df['product'].str.match(r'[0-9.]+')]
bad_product_ID = df.product_Id.loc[df['product_Id'].str.match(r'[0-9.]+')]
bad_price = df.Price.loc[~df['Price'].str.match(r'[0-9.]+')]

join error rows all together

df3 = pd.concat([bad_product,bad_product_ID,bad_price], axis=1).stack().reset_index()
df3.columns = ['index', 'Column_Name', 'value']

merge it with df2

df2.set_index('Column_Name')
df3.set_index('Column_Name')
result = pd.merge(df3, df2, how='left')

result:


  index Column_Name value   Expected_Dtype
0   2   Price       fgdht   int
1   3   product_Id  566578  String
2   4   product     4359    String
3   5   Price       Dfz67   int

when you have no idea to begin with, try to break it down to small task. Hope this would help.

Sheng Zhuang thanks for the solution. Right now you are checking manually that product is a string type so you performed a check. Ideally it should retrieve data type info from df2 and then perform a check. — kalyan, Dec 17 '19 at 07:16
I know, it is not ideal, but I cannot find a way to check each entry's dtype, .dtypes can only check whole series. — Sheng Zhuang, Dec 18 '19 at 00:56

score 1 · Answer 2 · edited Dec 24 '21 at 18:51

DataFrame

import pandas as pd
import re
data = {'A':['1','2.0','4b','dog'], 'B':['12','tom','lom','so'],'C':['dog','tom','jerry','55']}
df = pd.DataFrame(data=data)
print(df)

    A   B   C
0   1   12  dog
1   2.0 tom tom
2   4b  lom jerry
3   dog so  55

Now assign the list of column which are string or int/float

int_col = ['A'] #I am considering col A numerical
string_col = ['B','C'] # col B & C as string

Fun Part

num_pattern = "^\d+\.?\d*$" #this identify int and float both
int_float = []
errors_num = []
errors_string = []
for col in df[int_col].columns: #here detecting errors in numerical col
  i = 0
  for cells in df[int_col][col]:
    if re.findall(num_pattern,cells):
      int_float.append(cells)
    if cells not in int_float:
      errors_num.append({"column":col, "errors":cells, "index": i,'correct_datatype': 'float'})
    i += 1

for col in df[string_col].columns: #here detecting errors in string col
  i = 0
  for cells in df[string_col][col]:
    if re.findall(num_pattern,cells):
      errors_string.append({"column":col, "errors":cells, "index": i, 'correct_datatype': 'string'})
    i += 1

Now we have 2 errors list on column data type. We will join the list and print DataFrame which will print Col_name, error, index num, correct Data type

pd.DataFrame(data=errors_string + errors_num)

  column    errors  index   correct_datatype
0   B       12         0    string
1   C       55         3    string
2   A       4b         2    float
3   A       dog        3    float

How to find data type error in pandas dataframe?

2 Answers2