I have the following Dask dataframe, grouped by the Problem column.

| Problem | Items   | Min_Dimension | Max_Dimension | Cost  |
|-------- |------   |---------------|-------------- |------ |
| A       | 7       | 2             | 15            | 23    |
| A       | 5       | 2             | 15            | 38    |
| A       | 15      | 2             | 15            | 23    |
| B       | 11      | 6             | 10            | 54    |
| B       | 10      | 6             | 10            | 48    |
| B       | 18      | 6             | 10            | 79    |
| C       | 50      | 8             | 25            | 120   |
| C       | 50      | 8             | 25            | 68    |
| C       | 48      | 8             | 25            | 68    |
| ...     | ...     | ...           | ...           | ...   |

The goal is to create a new dataframe containing every row whose Cost value is the minimum within its Problem group. So we want the following result:

| Problem | Items   | Min_Dimension | Max_Dimension | Cost  |
|-------- |------   |---------------|-------------- |------ |
| A       | 7       | 2             | 15            | 23    |
| A       | 15      | 2             | 15            | 23    |
| B       | 10      | 6             | 10            | 48    |
| C       | 50      | 8             | 25            | 68    |
| C       | 48      | 8             | 25            | 68    |
| ...     | ...     | ...           | ...           | ...   |

How can I achieve this result? I already tried using idxmin(), as mentioned in another question on here, but then I get a ValueError: Not all divisions are known, can't align partitions. Please use set_index to set the index.

Pieterism

1 Answer

What if you create another dataframe that holds the minimum Cost for each Problem group? Let's say the new column is called cost_min.

df1 = df.groupby('Problem')['Cost'].min().reset_index().rename(columns={'Cost': 'cost_min'})

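Then merge this new cost_min column back into the original dataframe. Use the dataframe's own merge method here, since pd.merge expects pandas objects rather than Dask dataframes.

df2 = df.merge(df1, how='left', on='Problem')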

From there, keep only the rows where Cost equals the group minimum:

df_new = df2.loc[df2['Cost'] == df2['cost_min']]

Just wrote some pseudocode, but I think that all works with Dask.

David Erickson