
As part of the Kaggle competition (https://www.kaggle.com/competitions/amex-default-prediction/overview), I'm trying to take advantage of a trick, shared by other competitors in their solutions, that reduces the size of a column by interpreting a hexadecimal string as a base-16 uint64. I'm trying to work out whether this is possible in polars/Rust:

# The Python approach - this is used via .apply in pandas.
string = "0000099d6bd597052cdcda90ffabf56573fe9d7c79be5fbac11a8ed792feb62a"
def func(x):
    # take the last 16 hex digits and parse them as a base-16 integer
    return int(x[-16:], 16)
func(string)
# 13914591055249847850

My attempt at a solution in polars yields nearly the right answer, but the final digits are off, which is a bit confusing:

import polars as pl
def func(x: str) -> int:
    return int(x[-16:], 16)

strings = [
    "0000099d6bd597052cdcda90ffabf56573fe9d7c79be5fbac11a8ed792feb62a",
    "00000fd6641609c6ece5454664794f0340ad84dddce9a267a310b5ae68e9d8e5",
]

df = pl.DataFrame({"id": strings})

result_polars = df.with_column(pl.col("id").apply(func).cast(pl.UInt64)).to_series().to_list()
result_python = [func(x) for x in strings]

result_polars, result_python
# ([13914591055249848320, 11750091188498716672],
#  [13914591055249847850, 11750091188498716901])

I've also tried casting directly from Utf8 to UInt64, but I get the following error (and nulls instead, if I pass strict=False):

df.with_column(pl.col("id").str.slice(-16).cast(pl.UInt64)).to_series().to_list()
### 
ComputeError: strict conversion of cast from Utf8 to UInt64 failed. consider non-strict cast.

                    If you were trying to cast Utf8 to Date,Time,Datetime, consider using `strptime`

1 Answer

The values you return from func are:

13914591055249847850
11750091188498716901

These values are larger than can be represented with a pl.Int64, which is what polars uses for Python's int type. If a value overflows, polars falls back to Float64 instead, and that comes with a loss of precision.
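You can check this in plain Python (a sketch, independent of polars): both values lie between 2^63 and 2^64, where a Float64 (52-bit mantissa) can only represent multiples of 2048, and rounding to the nearest representable value reproduces exactly the numbers polars returned:

```python
# Both parsed values exceed the Int64 maximum, so polars can't store
# them as Int64 and falls back to Float64.
INT64_MAX = 2**63 - 1
values = [13914591055249847850, 11750091188498716901]
assert all(v > INT64_MAX for v in values)

# Round-tripping through Float64 gives exactly the polars output above.
rounded = [int(float(v)) for v in values]
assert rounded == [13914591055249848320, 11750091188498716672]
```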

A better solution

Taking just the last 16 characters of a string throws away a lot of information, meaning you can easily get collisions. It's better to use a hash function that tries to avoid collisions.

You could use the hash expression. This gives you a higher-quality hash, and it will be much faster because no Python code is run.

df.with_columns([
    pl.col("id").hash(seed=0)
])
shape: (2, 1)
┌─────────────────────┐
│ id                  │
│ ---                 │
│ u64                 │
╞═════════════════════╡
│ 478697168017298650  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7596707240263070258 │
└─────────────────────┘
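If you do need the exact parsed value rather than a hash, one workaround (my own sketch, not something polars provides out of the box) is to split the 16 hex characters into two 8-character halves: each half parses to at most 2**32 - 1, which fits in Int64 without any Float64 fallback, and the halves can be recombined when needed:

```python
# Hypothetical workaround: parse the last 16 hex chars as two 32-bit halves,
# each of which fits comfortably in a signed 64-bit integer.
s = "0000099d6bd597052cdcda90ffabf56573fe9d7c79be5fbac11a8ed792feb62a"
hi = int(s[-16:-8], 16)  # upper 32 bits, <= 2**32 - 1
lo = int(s[-8:], 16)     # lower 32 bits, <= 2**32 - 1

# Recombining recovers the full 64-bit value losslessly.
recombined = (hi << 32) | lo
assert recombined == int(s[-16:], 16)  # == 13914591055249847850
```

The two halves could be stored as two UInt32 (or Int64) columns, avoiding the overflow entirely at the cost of an extra column.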