
I am having a problem storing UUIDs in a Pandas DataFrame after reading from CSV. My data is approximately 1 million rows, and the "ID" field is a 16-character UUID.

I checked dtype and memory_usage: that column has "object" dtype and uses 77 MB of RAM. Could you please guide me on how to optimize it? I searched this topic, but the results did not seem satisfactory. Thanks

Best Regards

PS: I am using Python 3.7 and Pandas 0.23.4
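One approach not raised in the thread: if the same UUID repeats across rows, pandas' category dtype stores each distinct string once and keeps only a small integer code per row. A minimal sketch with made-up data (the column name and sizes are hypothetical; this only helps when IDs actually repeat):

```python
import uuid

import numpy as np
import pandas as pd

# Hypothetical sample: 1,000 rows drawn from 100 distinct UUID strings
ids = [str(uuid.uuid4()) for _ in range(100)]
df = pd.DataFrame({"ID": np.random.choice(ids, size=1_000)})

obj_bytes = df["ID"].memory_usage(deep=True)  # object dtype: one Python str per row
df["ID"] = df["ID"].astype("category")
cat_bytes = df["ID"].memory_usage(deep=True)  # category: one small integer code per row

print(obj_bytes, cat_bytes)
```

If every row's ID is unique, category buys nothing, since the full set of strings still has to be stored once.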

  • Without profiling your code you are merely guessing, but once you do that you might have to figure out "paging" or "chunking". –  Sep 28 '18 at 16:39
  • Wow, thanks. I didn't know about "chunking" until your comment. I think I have found a solution for myself. – Thái Lương Sep 28 '18 at 17:02
  • A 16-character UUID could be converted to a 64-bit unsigned integer, but then you would have to deal with [this problem](https://stackoverflow.com/questions/34283319/why-does-pandas-convert-unsigned-int-greater-than-263-1-to-objects). You would also have to convert it back to hex on the way out. – Steven Rumbalski Sep 28 '18 at 17:12
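To illustrate the round trip the comment above describes: 16 hex characters fit in 64 bits, and any value above 2**63 - 1 hits the signed-integer pitfall linked there. A sketch with a made-up ID:

```python
# Hypothetical 16-hex-character ID
hex_id = "89abcdef01234567"

as_int = int(hex_id, 16)       # parse hex -> integer (fits in 64 bits)
back = format(as_int, "016x")  # zero-padded hex on the way out

print(as_int > 2**63 - 1)  # True: this value would trip pandas' signed-int inference
print(back == hex_id)      # True: the round trip is lossless
```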

1 Answer


There is not much practical sense in recoding the UUIDs to integers, as this experiment shows:

import sys

import numpy as np
import pandas as pd
from uuid import UUID

users_ids = orders_df['user_id'].unique().copy()
print("Total user_ids: ", len(np.unique(users_ids)))
print("Total size: ", sys.getsizeof(users_ids))
print("Total size: ", sys.getsizeof(users_ids[0]))

# Map each distinct UUID to a small consecutive integer
uuid_to_int = {}
next_uuid_int = 0
for uuid in users_ids:
    if uuid not in uuid_to_int:
        uuid_to_int[uuid] = next_uuid_int
        next_uuid_int += 1

def recode_uuid_to_int(uuid) -> int:
    return uuid_to_int[uuid]

print("UUID to int: ", recode_uuid_to_int(UUID('0ab7ff82-4ec5-4c71-9627-ca209e27df5f')))
orders_df['user_id_as_int'] = orders_df['user_id'].apply(recode_uuid_to_int)
users_ids_recoded = orders_df['user_id_as_int'].unique().copy()
print("After recoding: ", sys.getsizeof(users_ids_recoded))

gives output:

Total user_ids:  [cut]
Total size:  139992
Total size:  56
UUID to int:  1
After recoding:  139992
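As a side note, pd.factorize does this recoding in one call, and measuring the whole column with memory_usage(deep=True) shows the per-row saving that sys.getsizeof on the unique arrays hides (both unique arrays hold 8 bytes per element, so they come out the same size). A sketch with made-up data:

```python
import uuid

import numpy as np
import pandas as pd

# Hypothetical frame: 10,000 rows drawn from 500 distinct UUID strings
ids = [str(uuid.uuid4()) for _ in range(500)]
orders_df = pd.DataFrame({"user_id": np.random.choice(ids, size=10_000)})

# One-call equivalent of the uuid -> int dictionary loop
codes, uniques = pd.factorize(orders_df["user_id"])
orders_df["user_id_as_int"] = codes

before = orders_df["user_id"].memory_usage(deep=True)        # object: tens of bytes per row
after = orders_df["user_id_as_int"].memory_usage(deep=True)  # int64: 8 bytes per row
print(before, after)
```

The original strings can always be recovered with uniques[codes], so the mapping is lossless.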
beyondfloatingpoint