Python automation, filling in one dataframe based on another dataframe

Question

I am writing an automation script for work and the key feature of it is hanging me up. I need to fill in one spreadsheet based on information from another spreadsheet. I've created a simplified scenario that replicates my problem.

One sheet, my reference sheet, has the lookup information spread among several columns:

https://www.dropbox.com/s/wby8yyljkyz1o74/lookup_ref.csv?dl=0

The fill-in sheet needs to have the relevant number filled in based on another matching value:

https://www.dropbox.com/s/h9knreacqamjvbg/fill_in_d_number.csv?dl=0

Here's my code:

import pandas as pd

df_ref = pd.read_csv('lookup_ref.csv')
df_fillin = pd.read_csv('fill_in_d_number.csv')

def get_d_number(food):
    return df_ref[df_ref.eq(food).any(1)].d_number.item()

df_fillin.insert(0, 'd_number', get_d_number(df_fillin['food']))

and I'm getting this:

ValueError: can only convert an array of size 1 to a Python scalar

Can anyone help with this? It seems like something that should be really simple, and if it worked it would save me a LOT of work.

welcome to SO! please have a look here: [how to ask a question](https://stackoverflow.com/help/how-to-ask) — Rabinzel, May 21 '22 at 23:23

score 1 · Answer 1 · answered May 22 '22 at 00:23

I am not sure if I really understand what you are trying to do. What would you have given as input and what do you want the output DataFrame to look like?

How I understand it: You have a lookup DataFrame and a fill-in DataFrame and you want to add a column 'd_number' into the fill-in DataFrame where the value of each row is the 'd_number' value of the row in the lookup DataFrame where any column matches the lookup DataFrames 'food' value in that row.

import pandas as pd
import numpy as np

lookup = pd.DataFrame({"d_number": [1, 2, 3, 4, 5],
                       "a": ["apple", np.NaN, np.NaN, np.NaN, np.NaN],
                       "b": [np.NaN, "banana", np.NaN, np.NaN, np.NaN],
                       "c": ["steak", "peanut", np.NaN, np.NaN, np.NaN],
                       "d": ["bread", np.NaN, np.NaN, "sausage", np.NaN],
                       })

fillin = pd.DataFrame({"food": ["apple", "banana", "steak", "peanut", "bread", "sausage"]})

One quick and dirty way to solve this would be as follows:

1 - Merge the four columns into a list column, skipping NA-values.

lookup["merged"] = [[v for v in row if v==v] for row in lookup[["a", "b", "c", "d"]].values.tolist()]

2 - Iterate over a flattened version of that list column and that rows d_number, assign the d_number as value to this food as key in a dictionary.

mapping = {val: id_ for id_, row in zip(lookup["d_number"], lookup["merged"]) for val in row}

3 - Use the resulting mapping dictionary to create a new column by looking up the d_number for each food.

fillin["d_number"] = fillin["food"].apply(lambda f: mapping[f])

Which results in this output:

      food  d_number
0    apple         1
1   banana         2
2    steak         1
3   peanut         2
4    bread         1
5  sausage         4

Python automation, filling in one dataframe based on another dataframe

1 Answers1