21

I am pretty new to Pandas and trying to find out where my code breaks. Say, I am doing a type conversion:

df['x'] = df['x'].astype('int')

...and I get an error: "ValueError: invalid literal for long() with base 10: '1.0692e+06'"

In general, if I have 1000 entries in the DataFrame, how can I find out which entry causes the break? Is there anything in ipdb to output the current location (i.e., where the code broke)? Basically, I am trying to pinpoint which value cannot be converted to int.

rink.attendant.6
user4199637
  • In IPython you can turn on automatic debugging with `%pdb`, execute the command, and then run `%debug`; you will then be able to walk the stack and display the values – EdChum Oct 30 '14 at 18:20
  • @EdChum 's answer is best. You could also loop over the values and wrap in try/except. – exp1orer Oct 30 '14 at 18:34

4 Answers

25

The error you are seeing might be due to the value(s) in the x column being strings:

In [15]: df = pd.DataFrame({'x':['1.0692e+06']})
In [16]: df['x'].astype('int')
ValueError: invalid literal for long() with base 10: '1.0692e+06'

Ideally, the problem can be avoided by making sure the values stored in the DataFrame are already ints, not strings, when the DataFrame is built. How to do that depends, of course, on how you are building the DataFrame; one option is sketched below.
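For example, if the data comes from a CSV file, the conversion can be done at read time. This is a minimal sketch (the file name data.csv is hypothetical; converters is a standard read_csv argument):

import pandas as pd

# Parse column 'x' as a float first, then truncate to int, so that
# scientific-notation strings like '1.0692e+06' are handled.
df = pd.read_csv('data.csv', converters={'x': lambda s: int(float(s))})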

After the fact, the DataFrame could be fixed using applymap:

import ast
df = df.applymap(ast.literal_eval).astype('int')

but calling ast.literal_eval on each value in the DataFrame could be slow, which is why fixing the problem from the beginning is the best alternative.
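As a quick sanity check, continuing the session above, the applymap fix round-trips the example value:

In [17]: import ast
In [18]: df = pd.DataFrame({'x': ['1.0692e+06']})
In [19]: df.applymap(ast.literal_eval).astype('int')
Out[19]:
         x
0  1069200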


Usually you could drop into a debugger when an exception is raised and inspect the problematic value there.

However, in this case the exception happens inside the call to astype, which is a thin wrapper around C-compiled code. The C-compiled code does the looping through the values in df['x'], so the Python debugger is not helpful here -- it won't let you see which value the exception was raised on inside the C-compiled code.

There are many important parts of Pandas and NumPy written in C, C++, Cython or Fortran, and the Python debugger will not take you inside those non-Python pieces of code where the fast loops are handled.

So instead I would revert to a low-brow solution: iterate through the values in a Python loop and use try...except to report the errors:

df = pd.DataFrame({'x': ['1.0692e+06']})
for i, item in enumerate(df['x']):
    try:
        int(item)
    except ValueError:
        print('ERROR at index {}: {!r}'.format(i, item))

yields

ERROR at index 0: '1.0692e+06'
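A vectorized variant of the same idea (a sketch; pd.to_numeric with errors='coerce' is a standard pandas function in recent versions) coerces the column to numbers and lists whatever fails to parse:

import pandas as pd

# 'not-a-number' is a made-up bad value for illustration
df = pd.DataFrame({'x': ['1.0692e+06', 'not-a-number']})

# errors='coerce' turns unparseable entries into NaN instead of raising,
# so the failing rows can be listed directly:
coerced = pd.to_numeric(df['x'], errors='coerce')
print(df['x'][coerced.isna()])   # -> index 1: 'not-a-number'

Note that '1.0692e+06' itself parses fine this way, so the question's original column could also be fixed with pd.to_numeric(df['x']).astype(int).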
unutbu
  • Thank you. That solved my problem. I also tried converting everything to 'float' and it worked as well, even with the string as a value. A generic question though: is there any way to step into the error to pinpoint what value (or current index) is breaking? Tnx – user4199637 Oct 30 '14 at 20:22
  • I've added a suggestion on how to find the problematic value (and index). – unutbu Oct 30 '14 at 20:39
  • Is there a more systematic debug mode that would allow pandas to report which row fails in any exception? – Frederic Bazin Apr 24 '15 at 13:44
  • @FredericBazin, you could arrange for code to [drop to the debugger when an exception is raised](http://stackoverflow.com/q/242485/190597). Or, using IPython, you could use its [%pdb "magic function"](http://ipython.org/ipython-doc/1/interactive/tutorial.html#debugging) to start the debugger whenever an uncaught exception occurs. Once you are in the debugger, you could print the value of the current `row`. – unutbu Apr 24 '15 at 19:23
  • @FredericBazin: Note however that this will only work if the value of the row is introspectable from the frame where the exception occurs. If you are calling a NumPy or Pandas method which runs Cython/C/C++/Fortran code which loops through the rows then the Python debugger won't allow you introspect the state of variables inside the foreign code. That is why in the code above, I made a crude simulation of `astype` *in Python* so the value of the row could be found from within Python. – unutbu Apr 24 '15 at 19:24
  • Nice "low brow" solution! – John Jiang Sep 23 '18 at 06:14
  • @FredericBazin I added an answer which is more general – crypdick May 11 '19 at 23:53
3

I hit the same problem, and since I have a big input file (3 million rows), enumerating all rows would take a long time. Therefore I wrote a binary search to locate the offending row. (This finds one bad row at a time; with several scattered bad values, re-run it after fixing each one it reports.)

import pandas as pd
import sys

def binarySearch(df, l, r, func):
    while l <= r:
        mid = l + (r - l) // 2

        # Check whether the single row at mid raises the exception
        result = func(df, mid, mid + 1)
        if result:
            return mid, result

        # Otherwise probe the left half [l, mid)
        result = func(df, l, mid)
        if result is None:
            # No exception on the left, so ignore the left half
            l = mid + 1
        else:
            r = mid - 1

    # If we reach here, then no offending row was found
    return -1, None

def check(df, start, end):
    result = None

    try:
        # In my case, I want to find out which row causes this failure
        df.iloc[start:end].uid.astype(int)
    except Exception as e:
        result = str(e)

    return result

df = pd.read_csv(sys.argv[1])

index, result = binarySearch(df, 0, len(df), check)
print("index: {}".format(index))
print(result)
Patrick Ng
2

To report all rows that fail to map due to any exception:

df.apply(my_function, axis=1)  # raises various exceptions at unknown rows

# print the exception, index, and row content for every failing row
for i, row in df.iterrows():
    try:
        my_function(row)
    except Exception as e:
        print('Error at index {}: {!r}'.format(i, row))
        print(e)
crypdick
0

Another simple way to find the bad guys:

df[~df["x"].str.isnumeric()]

It outputs all rows whose column 'x' does not contain a valid numeric value.
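A quick check on the question's example value (a sketch):

import pandas as pd

df = pd.DataFrame({'x': ['1.0692e+06', '42']})

# str.isnumeric is True only for plain digit strings, so the
# scientific-notation entry is flagged as invalid:
print(df[~df['x'].str.isnumeric()])   # -> row 0: '1.0692e+06'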

tohv