I have a randomized algorithm that produces .csv files whose content looks as follows:
module, coverage, timestamp
examples.monkey, 32.142857142857146, 1546513589.59586
examples.monkey, 35.714285714285715, 1546513589.609822
examples.monkey, 35.714285714285715, 1546513589.617172
...
util.container, 27.586206896551722 ,1546513594.559889
util.container, 27.586206896551722 ,1546513594.579989
util.container, 27.586206896551722 ,1546513594.598491
I have between 30 and 100 of these files, with an average length of a couple of thousand lines.
My final goal is to plot a graph for each measurement, plus an additional graph showing the mean value of all measurements at each timestamp. For this I need to calculate the mean of all runs per timestamp. (Of course, if a file has no entry for a certain timestamp, I would simply ignore it.)
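To make the plotting part concrete, here is a minimal sketch of what I have in mind for a single run (the file name is a placeholder and I assume matplotlib; skipinitialspace takes care of the stray blanks after the commas in the header):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("out/run0.csv", skipinitialspace=True)

# one line per module: coverage over time
for module, grp in df.groupby("module"):
    plt.plot(grp["timestamp"], grp["coverage"], label=module)
plt.xlabel("timestamp")
plt.ylabel("coverage")
plt.legend()
plt.show()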
So far, I read all the .csv files and concatenate them into a new dataframe:
import glob

import pandas as pd

allFiles = glob.glob("out/*.csv")
dfs = []
for file_ in allFiles:
    # each file is one run; the first row holds the column names
    df = pd.read_csv(file_, index_col=None, header=0)
    dfs.append(df)

# concatenate side by side, one "Run i" column block per file
keys = ["Run " + str(i) for i in range(len(dfs))]
glued = pd.concat(dfs, axis=1, keys=keys)
This results in a dataframe that looks as follows:
Run 0 ... Run 4
module coverage ... coverage timestamp
0 examples.monkey 32.142857 ... 32.142857 1.546514e+09
1 examples.monkey 35.714286 ... 32.142857 1.546514e+09
2 examples.monkey 35.714286 ... 32.142857 1.546514e+09
3 examples.monkey 35.714286 ... 35.714286 1.546514e+09
4 examples.monkey 35.714286 ... 35.714286 1.546514e+09
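For reference, the columns of glued form a two-level MultiIndex, with the run key on the first level and the original column names on the second:

print(glued.columns)
# ('Run 0', 'module'), ('Run 0', 'coverage'), ('Run 0', 'timestamp'),
# ..., ('Run 4', 'coverage'), ('Run 4', 'timestamp')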
Now my initial idea was to simply group across all runs, by module and timestamp at level=1 along axis=1, like this:
grouped = glued.groupby(by=["module", "timestamp"], level=1, axis=1)
However, this does not work: I get a KeyError saying that module and timestamp are missing. Clearly I have some misconceptions about how to work with combined dataframes like this.
So how do I best go about getting the mean coverage per module and timestamp across multiple files?
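In case it helps to reproduce, here is a quick sketch that writes a handful of dummy files in the same format (module name, value ranges, and file names are made up; the header is written without the stray spaces for simplicity):

import csv
import os
import random
import time

os.makedirs("out", exist_ok=True)
start = time.time()
for run in range(5):
    with open("out/run{}.csv".format(run), "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["module", "coverage", "timestamp"])
        for i in range(1000):
            writer.writerow(["examples.monkey", random.uniform(25, 40), start + i * 0.01])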