I'm new to pandas and I'm currently trying to use it on a data set I have on my tablet using qPython (temporary situation, laptop's being fixed). I have a csv file with a set of data organised by country, region, market and item label, with additional columns price, year and month. These are set out in the following manner:
Country | Region | Market | Item Label | ... | Price | Year | Month |
Canada | Quebec | Market No. | Item Name | ... | $$$ | 2002 | 1 |
Canada | Quebec | Market No. | Item Name | ... | $$$ | 2002 | 2 |
Canada | Quebec | Market No. | Item Name | ... | $$$ | 2002 | 3 |
Canada | Quebec | Market No. | Item Name | ... | $$$ | 2002 | 4 |
and so on. I'm looking for a way to plot these prices against time (I've taken to adding the month/12 to the year to effectively merge the last columns).
Originally I had a code to take the csv data and put it in a Dictionary, like so:
{Country_Name: {Region_Name: {Market_Name: {Item_Name: {"Price": price_list, "Time": time_list}}}}}
and used for loops over the keys to access each price and time list.
However, I'm having difficulty using pandas to get a similar result: I've tried a fair few different approaches, such as iloc, data[data.Country == "Canada"][data.Region == "Quebec"][..., etc. to filter the data for each country, region, market and item, but all of them were particularly slow. The data set is fairly hefty (approx. 12000 by 12), so I wouldn't expect instant results, but is there something obvious I'm missing? Or should I just wait til I have my laptop back?
Edit: to try and provide more context, I'm trying to get the prices over the course of the years and months, to plot how the prices fluctuate. I want to separate them based on the country, region, market and item lael, so each line plotted will be a different item in a market in a region in a country. So far, I have the following code:
def abs_join_paths(*args):
return os.path.abspath(os.path.join(*args))
def get_csv_data_frame(*path, memory = True):
return pandas.read_csv(abs_join_paths(*path[:-1], path[-1] + ".csv"), low_memory = memory)
def get_food_data(*path):
food_price_data = get_csv_data_frame(*path, memory = False)
return food_price_data[food_price_data.cm_name != "Fuel (diesel) - Retail"]
food_data = get_food_data(data_path, food_price_file_name)
def plot_food_price_time_data(data, title, ylabel, xlabel, plot_style = 'k-'):
plt.clf()
plt.hold(True)
data["mp_year"] += data["mp_month"]/12
for country in data["adm0_name"].unique():
for region in data[data.adm0_name == country]["adm1_name"].unique():
for market in data[data.adm0_name == country][data.adm1_name == region]["mkt_name"]:
for item_label in data[data.adm0_name == country][data.adm1_name == region][data.mkt_name == market]["cm_name"]:
current_data = data[data.adm0_name == country][data.adm1_name == region][data.mkt_name == market][data.cm_name == item_label]
#year = list(current_data["mp_year"])
#month = list(current_data["mp_month"])
#time = [float(y) + float(m)/12 for y, m in zip(year, month)]
plt.plot(list(current_data["mp_year"]), list(current_data["mp_price"]), plot_style)
print(list(current_data["mp_price"]))
plt.savefig(abs_join_paths(imagepath, title + ".png"))
Edit2/tl;dr: I have a bunch of prices and times, one after the other in one long list. How do I use pandas to split them up based on the contents of the other columns?
Cheers!