I have the following code, and I want to calculate the pairwise Hamming distances between the hash strings:

from pandas import DataFrame
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform


df = pd.read_csv("3d_printing.csv", encoding='utf-8', error_bad_lines=False, low_memory=False, names=['file_name', 'phash', 'dhash', 'file_date'])


def hamming_distance(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(el1 != el2 for el1, el2 in zip(s1, s2))

# sort_values returns a new DataFrame, so assign the result back
df = df.sort_values(by='file_date', ascending=False)
x = pd.DataFrame(np.triu(squareform(pdist(df[['phash']], hamming_distance))),
    columns=df.file_name.str.split('_').str[0],
    index=df.file_name.str.split('_').str[0]).replace(0, np.nan)

z = x[x.apply(lambda col: col.index != col.name)].max(1).max(level=0)
z.to_csv("3d_printing_x.csv", mode='a')

When I run the code, I get:

ValueError: could not convert string to float: '002889898888b8a9'

I know that pdist requires numeric (float) input rather than strings, but I don't know how to get my hash strings into a form it accepts.
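
One possible direction, as a minimal sketch rather than a tested fix for the full script: encode each hash string as a row of integer character codes, so that pdist receives a 2-D numeric array, and then use SciPy's built-in "hamming" metric. The sample DataFrame below is hypothetical and stands in for the CSV; the sketch assumes every phash has the same length and counts differing characters (as hamming_distance above does), not differing bits.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical sample data standing in for the 'phash' column of the CSV.
df = pd.DataFrame({
    "file_name": ["a_1.jpg", "b_1.jpg", "c_1.jpg"],
    "phash": ["002889898888b8a9", "002889898888b8a8", "ff2889898888b8a9"],
})

# Turn each hash string into a row of integer character codes.
# Assumes all hashes have the same length, otherwise the array is ragged.
codes = np.array([[ord(ch) for ch in h] for h in df["phash"]])

# metric="hamming" returns the fraction of positions that differ;
# multiplying by the string length gives a count of differing characters.
condensed = pdist(codes, metric="hamming") * codes.shape[1]

# Expand to a square matrix and convert to integer counts.
dist = squareform(condensed).round().astype(int)
print(dist)
```

The resulting square matrix could then be passed to np.triu and wrapped in a DataFrame in the same way as in the original code.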

Cenk_Mitir