
I have a random algorithm that produces .csv files. The content of the files looks as follows:

module, coverage, timestamp
examples.monkey, 32.142857142857146, 1546513589.59586
examples.monkey, 35.714285714285715, 1546513589.609822
examples.monkey, 35.714285714285715, 1546513589.617172
...
util.container, 27.586206896551722 ,1546513594.559889
util.container, 27.586206896551722 ,1546513594.579989
util.container, 27.586206896551722 ,1546513594.598491

I have between 30 and 100 of these files, with an average length of a couple of thousand lines.

My final goal is to plot a graph for each measurement and also an additional graph depicting the mean value of all measurements at a given time. For this I need to calculate the mean of all runs per timestamp. (Of course, if a file does not have an entry for a certain timestamp I would simply ignore it.)

So far I read all the .csv files and concatenate them into a new dataframe:

import glob
import pandas as pd

allFiles = glob.glob("out/*.csv")
dfs = []

for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0)
    dfs.append(df)

keys = ["Run " + str(i) for i in range(len(dfs))]
glued = pd.concat(dfs, axis=1, keys=keys)

This results in a dataframe that looks as follows:

             Run 0                 ...           Run 4              
            module   coverage      ...        coverage     timestamp
0  examples.monkey  32.142857      ...       32.142857  1.546514e+09
1  examples.monkey  35.714286      ...       32.142857  1.546514e+09
2  examples.monkey  35.714286      ...       32.142857  1.546514e+09
3  examples.monkey  35.714286      ...       35.714286  1.546514e+09
4  examples.monkey  35.714286      ...       35.714286  1.546514e+09

Now my initial idea was to simply group across all runs, grouping by the modules and timestamps at level=1, across axis=1, like this:

grouped = glued.groupby(by=["module", "timestamp"], level=1, axis=1)

However, this does not work, as I get a KeyError saying that module and timestamp are missing. Clearly I have some misconceptions about how to work with combined dataframes like this.

So how do I best go about getting the mean coverage per module and timestamp across multiple files?

FChris
  • Is there some reason for not using `glued = pd.concat(dfs, axis=0, keys=keys)` to end up with a `3`-column DataFrame, if the column names are the same in each file? – jezrael Jan 03 '19 at 12:38
  • Sorry, I missed a line in my code. I provide keys as another index to separate them; I edited my question. I based my approach on the answer here: https://stackoverflow.com/questions/11174367/averaging-data-from-multiple-data-files-in-python-with-pandas – FChris Jan 03 '19 at 12:41
  • Yes, `keys` are a good idea. But I think you should concat by `axis=0`, not by `axis=1`. Is that possible for you? Also, are the column names the same in each file? – jezrael Jan 03 '19 at 12:45
  • Yeah, which axis I use should not matter; I think I could transpose later at any time. The files are the same column-wise, only the entries may differ. – FChris Jan 03 '19 at 12:48

2 Answers


I'm not certain why your attempt doesn't work; I'm not very familiar with groupby on multi-level columns. I can propose an answer that should work. As @jezrael pointed out in his comments, you concatenated your dataframes by adding new columns instead of concatenating along the rows. If you do:

glued = pd.concat(dfs)  # concatenate along the rows (axis=0, the default)
grouped = glued.groupby(["module", "timestamp"])
mean_by_group = grouped.mean()

mean_by_group should be a dataframe with a row for each module/timestamp pair and one column containing the mean of the coverage over all the entries corresponding to that pair.
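
For the plotting half of the question, here is a minimal follow-up sketch (matplotlib is my assumption, not part of the original answer); `unstack` pivots the module index level into one mean-coverage column per module:

import matplotlib.pyplot as plt

#pivot the 'module' index level into columns: one mean-coverage series per module
ax = mean_by_group["coverage"].unstack(level="module").plot()
ax.set_xlabel("timestamp")
ax.set_ylabel("mean coverage")
plt.show()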

Statistic Dean

You can use concat with axis=0, which is the default parameter and so can be omitted, then convert the first index level to a column Run and aggregate the mean:

allFiles = glob.glob("out/*.csv")

#list comprehension instead of a loop, but your solution works nicely too
#skipinitialspace=True handles the stray spaces around the commas in your files
dfs = [pd.read_csv(fp, skipinitialspace=True) for fp in allFiles]
keys = ["Run " + str(i) for i in range(len(dfs))]

glued = pd.concat(dfs, keys=keys).reset_index(level=0).rename(columns={'level_0':'Run'})
#convert the float column to datetimes
glued['timestamp'] = pd.to_datetime(glued['timestamp'], unit='s')
#print (glued)

#specify the columns you need
grouped = glued.groupby(by=["Run","module", "timestamp"], as_index=False).mean()
#print (grouped)

grouped1 = glued.groupby(by=["module", "timestamp"], as_index=False).mean()
#print (grouped1)
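
For the plotting goal from the question, one possible sketch using `grouped` and `grouped1` from above (matplotlib and the concrete module name `examples.monkey` are my assumptions, not part of the original answer):

import matplotlib.pyplot as plt

module = "examples.monkey"  #hypothetical example; any module present in the data works

fig, ax = plt.subplots()
#one faint line per run
for run, g in grouped[grouped["module"] == module].groupby("Run"):
    ax.plot(g["timestamp"], g["coverage"], alpha=0.4, label=run)
#the mean across all runs, from grouped1
m = grouped1[grouped1["module"] == module]
ax.plot(m["timestamp"], m["coverage"], color="black", linewidth=2, label="mean")
ax.set_xlabel("timestamp")
ax.set_ylabel("coverage")
ax.legend()
plt.show()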
jezrael
  • I am still not sure why it works when I operate on rows and not columns, but the results are exactly what I need. Could you elaborate on why there is this difference? – FChris Jan 03 '19 at 13:13
  • @FChris - There is a difference because it is not possible to pass column names there; with `axis=1` you can only group by an entire column level, like `.groupby(level=1, axis=1).mean()`, but even that fails because it needs all-numeric columns. So if you use `grouped = glued.loc[:, pd.IndexSlice[:, ['coverage']]].groupby(level=1, axis=1).mean()`, you first select all the `coverage` levels and then aggregate the `mean`. But this way it is not possible to also specify `module`. – jezrael Jan 03 '19 at 13:23
  • I see, thank you very much for the additional explanation. – FChris Jan 03 '19 at 13:28
  • If you use `glued = pd.concat(dfs, keys=keys)`, then `grouped = glued.groupby(by=["module", "timestamp"], level=1, axis=1)` follows the same principle as trying to group by `Run 0` and `Run 1`. – jezrael Jan 03 '19 at 13:28
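
For completeness, the workaround from the comment above expanded into a runnable sketch against the original `axis=1` concatenation (the name `glued_wide` is mine, to avoid clashing with `glued` from the answer):

#rebuild the original wide frame with multi-level columns
glued_wide = pd.concat(dfs, axis=1, keys=keys)

#select only the 'coverage' columns of every run, then group the columns
#by level 1 (the original column names) and take the row-wise mean
coverage_mean = (glued_wide.loc[:, pd.IndexSlice[:, ["coverage"]]]
                           .groupby(level=1, axis=1)
                           .mean())
#print (coverage_mean)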