0

I understand the title might not be very clear, but please hear me out.

I have a pandas dataframe column of ~850 unique display sizes, for e.g.

1   320x480
2   480x320
3   382x215
4   676x320
5   694x320
6   1080x2123
7   2094x1080
8   1080x2020

I want to match them/ convert them to the closest possible standard display sizes (there are around ~20 of them provided in the use-case dataset).

320x350
320x480
480x320
640x360
800x600
1024x768
1280x720
1280x800
1280x1024
1360x768
1366x768
1440x900
1536x864
1600x900 

I tried separating height and width into separate columns and rounding them up but its still creating a lot of non-standard display sizes (for my use case).

How can I achieve this?

lightyagami96
  • 336
  • 1
  • 4
  • 14

1 Answers1

1

Idea is convert both columns/Series to DataFrames by Series.str.split, then use cross join by DataFrame.merge, get differences and sum them , get rows with minimal diffs by DataFrameGroupBy.idxmin and DataFrame.loc, last join together with DataFrame.pop for use and drop columns:

df11 = df1['col'].str.split('x', expand=True).astype(int)
df22 = df2['col'].str.split('x', expand=True).astype(int)

df = df11.assign(a=1).merge(df22.assign(a=1), on='a')
df['diff'] = df['0_x'].sub(df['0_y']).abs() + df['1_x'].sub(df['1_y']).abs()

df = df.loc[df.groupby(['0_x','1_x'])['diff'].idxmin()]
df['a'] = df.pop('0_x').astype(str).str.cat(df.pop('0_y').astype(str), 'x')
df['b'] = df.pop('1_x').astype(str).str.cat(df.pop('1_y').astype(str), 'x')
print (df)
             a  diff          b
1      320x320     0    480x480
28     382x320   197    215x350
16     480x480     0    320x320
45     676x640    76    320x360
59     694x640    94    320x360
106  1080x1280  1196  2020x1024
78   1080x1280  1299  2123x1024
97   2094x1600   674   1080x900

Similar idea with euclidean distance, with sample data same output:

df11 = df1['col'].str.split('x', expand=True).astype(int)
df22 = df2['col'].str.split('x', expand=True).astype(int)

df = df11.assign(a=1).merge(df22.assign(a=1), on='a')
df['diff'] = np.sqrt(df['0_x'].sub(df['0_y']) ** 2 + df['1_x'].sub(df['1_y']) ** 2)

df = df.loc[df.groupby(['0_x','1_x'])['diff'].idxmin()]
df['a'] = df.pop('0_x').astype(str).str.cat(df.pop('0_y').astype(str), 'x')
df['b'] = df.pop('1_x').astype(str).str.cat(df.pop('1_y').astype(str), 'x')
print (df)
             a         diff          b
1      320x320     0.000000    480x480
30     382x480   143.627992    215x320
16     480x480     0.000000    320x320
45     676x640    53.814496    320x360
59     694x640    67.201190    320x360
106  1080x1280  1015.881883  2020x1024
78   1080x1280  1117.050133  2123x1024
97   2094x1600   525.771814   1080x900

Another numpy solution:

df11 = df1['col'].str.split('x', expand=True).astype(int)
df22 = df2['col'].str.split('x', expand=True).astype(int)

a1 = np.sqrt(np.square(df11[0].to_numpy()[:, None] - df22[0].to_numpy()) + 
             np.square(df11[1].to_numpy()[:, None] - df22[1].to_numpy()))

df1['b1'] = df2['col'].to_numpy()[np.argmin(a1, axis=1)]


a2 = (np.abs(df11[0].to_numpy()[:, None] - df22[0].to_numpy()) + 
      np.abs(df11[1].to_numpy()[:, None] - df22[1].to_numpy()))

df1['b2'] = df2['col'].to_numpy()[np.argmin(a2, axis=1)]
print (df1)
         col         b1         b2
1    320x480    320x480    320x480
2    480x320    480x320    480x320
3    382x215    480x320    320x350
4    676x320    640x360    640x360
5    694x320    640x360    640x360
6  1080x2123  1280x1024  1280x1024
7  2094x1080   1600x900   1600x900
8  1080x2020  1280x1024  1280x1024
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252