Objective
To merge df_labelled (a file containing a labelled portion of the points) into df (which contains all the points).
What I have tried
Referring to "Simple way to Dask concatenate (horizontal, axis=1, columns)", I tried the code below:
df = df.repartition(npartitions=200)
df = df.reset_index(drop=True)
df_labelled = df_labelled.repartition(npartitions=200)
df_labelled = df_labelled.reset_index(drop=True)
df = df.assign(label = df_labelled.label)
df.head()
But I get the error:
ValueError: Not all divisions are known, can't align partitions. Please use set_index to set the index.
I also tried a left join of the two tables, but I got NaN for every label. Can you explain what I did wrong?
import dask.dataframe as dd
result = dd.merge(df, df_labelled, on=['x', 'y', 'z', 'R', 'G', 'B'], how="left")
result.head()
x y z R G B label
0 39020.470000 33884.200003 36.445701 25 39 26 NaN
1 39132.740002 33896.049994 30.405700 19 24 18 NaN
2 39221.059994 33787.050001 26.605700 115 145 145 NaN
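One possible cause, which is only an assumption and not something the output above confirms, is that the float columns x, y, z differ between the two frames beyond the six decimals that get printed, so an exact-equality join finds no match. Below is a minimal sketch in plain pandas on toy data (the values, the helper name add_rounded_keys, and the choice of 3 decimals are all made up for illustration) that reproduces the effect and shows one workaround, joining on rounded integer keys:

```python
import pandas as pd

# Toy frames (made-up data): the x values look identical when printed to
# 6 decimals, but the underlying float64 values differ slightly.
df = pd.DataFrame({"x": [39047.700012, 39029.030000], "R": [7, 152]})
df_labelled = pd.DataFrame({"x": [39047.7000120001], "R": [7], "label": [14]})

# Exact-equality join on the float column: no row matches, so label is all NaN.
exact = pd.merge(df, df_labelled, on=["x", "R"], how="left")

def add_rounded_keys(frame, cols, decimals=3):
    """Hypothetical helper: add integer key columns built from rounded floats."""
    out = frame.copy()
    for col in cols:
        out[col + "_key"] = (out[col] * 10**decimals).round().astype("int64")
    return out

left = add_rounded_keys(df, ["x"])
# Drop the raw float column on the right so it is not duplicated in the result.
right = add_rounded_keys(df_labelled, ["x"]).drop(columns=["x"])

# Join on the integer keys instead of the raw floats.
fixed = pd.merge(left, right, on=["x_key", "R"], how="left").drop(columns=["x_key"])
print(exact)
print(fixed)
```

Here the first row of fixed gets its label and the second stays NaN. The same idea should carry over to Dask by building the key columns with assign before calling dd.merge, though values that sit very close to a rounding boundary can still end up with different keys. It is also worth noting that the three rows shown by result.head() are not among the labelled rows listed under Data, so NaN in those particular rows may be correct; counting the non-null labels over the whole result would tell the two cases apart.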
Is there any way I can achieve the expected result shown below? I can't do this in Pandas because the number of points is large enough to cause memory issues.
Data
df (This file has all points)
x y z R G B
0 39047.700012 33861.890015 48.115704 7 18 12
1 39044.110016 33860.150024 47.135700 14 28 15
2 39049.280029 33861.950073 49.405701 30 58 33
3 39029.030000 33937.689993 48.425700 152 154 143
4 39066.980000 33937.870001 49.725699 209 218 225
5 39069.810002 33795.460001 42.405699 113 136 154
df_labelled (This file contains a portion of labelled points)
x y z R G B label
0 39047.700012 33861.890015 48.115704 7 18 12 14
1 39044.110016 33860.150024 47.135700 14 28 15 14
2 39049.280029 33861.950073 49.405701 30 58 33 14
Expected outcome
x y z R G B label
0 39047.700012 33861.890015 48.115704 7 18 12 14
1 39044.110016 33860.150024 47.135700 14 28 15 14
2 39049.280029 33861.950073 49.405701 30 58 33 14
3 39029.030000 33937.689993 48.425700 152 154 143 NaN
4 39066.980000 33937.870001 49.725699 209 218 225 NaN
5 39069.810002 33795.460001 42.405699 113 136 154 NaN