1

In order to validate a column as hex values of a data frame I am doing something like this:

column = data_df[column_name]
try:
    column.apply(lambda x: int(x, 16))
    ok = True
except ValueError:
    ok = False

The problem here seems to be the lambda, since doing int(x, 16) on the console works for any size of numbers.

I am getting an exception, since some values are too large:

In [1]: df.col.apply(lambda x: int(x, 16))
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
...

OverflowError: Python int too large to convert to C unsigned long

What about performance doing this on large data? Can you think of a better solution?

I am new to python, so please be forgiving :) That was my source of inspiration: convert pandas dataframe column from hex string to int

lukas_o
  • 3,776
  • 4
  • 34
  • 50
  • The bottommost answer in this post seems to offer an alternative solution https://stackoverflow.com/questions/9672615/how-can-i-determine-if-user-input-is-a-valid-hexadecimal-number – mortysporty Mar 28 '18 at 16:43
  • 1
    Which version of Python are you using? I tried int(x,16) with a laaarge hexadecimal. It worked. No problem. In Python 3+ there is no size limitation on ints. – mortysporty Mar 28 '18 at 17:11
  • I know, this works for me as well. But it doesn't using the `lambda`. So I guess that should have been in the question, will edit. Thank! – lukas_o Mar 29 '18 at 07:07

1 Answers1

1

I'm not able to recreate your error, but from how I understand your question you want to validate that all the values in a column are hexadecimals.

Perhaps it would be better to evaluate each hexadecimal for itself rather than trying to do them all at once.

This should work

import pandas as pd

# Function that evaluates if a number is a hex
def is_hex(x):
    try:
        int(x, 16)
        return True
    except ValueError:
        return False

# Testdata
df = pd.DataFrame({'hexes' : ['FF1231AAAD', 
                              '4451481AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA923478413819AFCCCCBBBBBBDDDDDDDD',
                              'XXXXXXXXXX']})
# Apply our function using lambdas
df['validate'] = df['hexes'].apply(lambda x : is_hex(x))
# Alternatively use list comprehension (which I believe is faster than apply)
df['validate'] = [is_hex(x) for x in df['hexes']]
# Check that all values are True
df['validate'].all()

To determine which version is actually faster you can try using timeit

%timeit for x in range(100): df['hexes'].apply(lambda x : is_hex(x))
14.6 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit for x in range(100): [is_hex(x) for x in df['hexes']]
3.04 ms ± 55.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Which seems to confirm that the list comprehension is actually almost 5 times as fast.

mortysporty
  • 2,749
  • 6
  • 28
  • 51