How do I select the specific data in a data frame based on thee contents of other columns?

Question

I'm new to pandas and I'm currently trying to use it on a data set I have on my tablet using qPython (temporary situation, laptop's being fixed). I have a csv file with a set of data organised by country, region, market and item label, with additional columns price, year and month. These are set out in the following manner:

Country | Region |   Market   | Item Label | ... | Price | Year | Month |
Canada  | Quebec | Market No. | Item Name  | ... |  $$$  | 2002 |   1   |
Canada  | Quebec | Market No. | Item Name  | ... |  $$$  | 2002 |   2   |
Canada  | Quebec | Market No. | Item Name  | ... |  $$$  | 2002 |   3   |
Canada  | Quebec | Market No. | Item Name  | ... |  $$$  | 2002 |   4   |

and so on. I'm looking for a way to plot these prices against time (I've taken to adding the month/12 to the year to effectively merge the last columns).

Originally I had a code to take the csv data and put it in a Dictionary, like so:

{Country_Name: {Region_Name: {Market_Name: {Item_Name: {"Price": price_list, "Time": time_list}}}}}

and used for loops over the keys to access each price and time list.

However, I'm having difficulty using pandas to get a similar result: I've tried a fair few different approaches, such as iloc, data[data.Country == "Canada"][data.Region == "Quebec"][..., etc. to filter the data for each country, region, market and item, but all of them were particularly slow. The data set is fairly hefty (approx. 12000 by 12), so I wouldn't expect instant results, but is there something obvious I'm missing? Or should I just wait til I have my laptop back?

Edit: to try and provide more context, I'm trying to get the prices over the course of the years and months, to plot how the prices fluctuate. I want to separate them based on the country, region, market and item lael, so each line plotted will be a different item in a market in a region in a country. So far, I have the following code:

def abs_join_paths(*args):
    return os.path.abspath(os.path.join(*args))

def get_csv_data_frame(*path, memory = True):
    return pandas.read_csv(abs_join_paths(*path[:-1], path[-1] + ".csv"), low_memory = memory)

def get_food_data(*path):
    food_price_data = get_csv_data_frame(*path, memory = False)
    return food_price_data[food_price_data.cm_name != "Fuel (diesel) - Retail"]

food_data = get_food_data(data_path, food_price_file_name)

def plot_food_price_time_data(data, title, ylabel, xlabel, plot_style = 'k-'):
    plt.clf()
    plt.hold(True)
    data["mp_year"] += data["mp_month"]/12
    for country in data["adm0_name"].unique():
        for region in data[data.adm0_name == country]["adm1_name"].unique():
            for market in data[data.adm0_name == country][data.adm1_name == region]["mkt_name"]:
                for item_label in data[data.adm0_name == country][data.adm1_name == region][data.mkt_name == market]["cm_name"]:
                    current_data = data[data.adm0_name == country][data.adm1_name == region][data.mkt_name == market][data.cm_name == item_label]
                    #year = list(current_data["mp_year"])
                    #month = list(current_data["mp_month"])
                    #time = [float(y) + float(m)/12 for y, m in zip(year, month)]
                    plt.plot(list(current_data["mp_year"]), list(current_data["mp_price"]), plot_style)
                    print(list(current_data["mp_price"]))
    plt.savefig(abs_join_paths(imagepath, title + ".png"))

Edit2/tl;dr: I have a bunch of prices and times, one after the other in one long list. How do I use pandas to split them up based on the contents of the other columns?

Cheers!

Welcome to stack overflow! Unfortunately, your question is unclear. The title asks how to parse the csv, but the text of the question indicates that it's already parsed in a dataframe and you're trying to operate on it. Please [edit] to clarify what you want the actual output to look like. Based on your description, I'm _guessing_ you want something with a `df.groupby(['Country','Region',...])...` and `.plot()` but it's tough to say — G. Anderson, Dec 02 '19 at 20:34
The question title seems to not match the issue being adressed in the description (or am I getting something wrong?). Also when you say plot theses prices against time, you mean each item label, for each market, for region for each country? Or how 'far' should the grouping go? Do you have an example output of what will your ideal solution be? (Doesn't matter if its hardcoded or anything) — Celius Stingher, Dec 02 '19 at 20:35
Apologies, I have tried to fix the title. i have updated the question with the code I have written. I would like the price and time data plotted against each other for each country, region, market and item, but my for loops are slow. Every other method I tried didn't work, this is the only one that came close (i assume, as it takes so long to run that I've never gotten it to plot before cancelling). — JSDRayne, Dec 03 '19 at 10:10

score 0 · Accepted Answer · answered Dec 02 '19 at 21:31

I hesitate to guess, but it seems that you are probably iterating through rows (you said you were using iloc). This is the slowest operation in pandas. Pandas data frames are optimized for series access.

If your plotting you can use matplotlib directly with pandas data frames and use the groupby method to combine data, without having to iterate through the rows of your data frame.

Without more information it's difficult to answer your question specifically. Please take a look at the comments on your question.

Thank you for your answer. I've tried to edit and address the problems with my question. — JSDRayne, Dec 03 '19 at 17:33

score 0 · Answer 2 · answered Dec 04 '19 at 13:16

The groupby function did the trick:

def plot_food_price_time_data(data, title, ylabel, xlabel, plot_style = 'k-'):
    plt.clf()
    plt.hold(True)
    group_data = data.groupby(["adm0_name", "adm1_name", "mkt_name", "cm_name"])
    for i in range(len(data)):
        print(data.iloc[i, [1, 3, 5, 7]])
        specific_data = group_data.get_group(tuple(data.iloc[i, [1, 3, 5, 7]]))
        plt.plot(specific_data["mp_price"], specific_data["mp_year"] + specific_data["mp_month"]/12)

How do I select the specific data in a data frame based on thee contents of other columns?

2 Answers2