-1

I have data from a csv file that looks like this:

,jobID,hum_starttime,hum_endtime,duration,exit_status,CPU,energy,memory,virt_mem,wall_time
0,525231,29/05/2015 11:53:47,29/05/2015 14:09:16,8129.0,0.0,28:54:56,0,4682480kb,16036608kb,01:13:59
1,504231,08/05/2015 07:46:59,08/05/2015 07:48:55,116.0,0.0,00:00:49,0,2421756kb,2807020kb,00:00:51

I want to plot the exit_status count (i.e the amount of times exit_status == 1 or exit_status == -11) versus start_time in bins of 1 hour. Since there are several distinct exit_status codes, I need to plot it in the form of a stacked bar chart where each distinct exit status is given a different color.

Can anyone please help me? I've been stuck on this for 2 days!! Thanks!

martineau
  • 119,623
  • 25
  • 170
  • 301
PyRsquared
  • 6,970
  • 11
  • 50
  • 86
  • [`csv.reader`](https://docs.python.org/3/library/csv.html) or `csv.DictReader` is likely to be your friend. You also probably want `datetime`. – NightShadeQueen Jul 15 '15 at 14:34
  • What part are you stuck on, reading and analyzing the data, or creating a bar chart with matplotlib from that? Might be better to split your question into two separate ones if you're having trouble with both. – martineau Jul 15 '15 at 15:39

1 Answers1

1

Here's how I would solve it:

  1. Read the csv-file. This can be done using the csv module for python
  2. Read and/or convert the date stamps according to your bin size, and iterate through each line, adding to the correct hour bin. I just do it the dirty way and cut the minutes and seconds: row[0][:-5] returns 15/07/2015 11, a date and hour to work with.

You'll end up with a list status_records which consists of two dicts, representing the two status options, which then contain the hour bins:

  • "1" : {"15/07/2015 11": 3, ...}
  • "-11" : {"15/07/2015 11": 0, ...}

Here's a sample data.csv with some more data (so that you can actually see something, which is difficult with just your 2 entries - I'm using the same date format and the status codes you mentioned):

start_time,exit_status
15/07/2015 11:53:47,1
15/07/2015 11:53:47,1
15/07/2015 11:54:56,1
15/07/2015 12:23:26,-11
15/07/2015 12:27:31,1
15/07/2015 14:01:47,-11
15/07/2015 14:11:56,1
15/07/2015 14:52:47,1
15/07/2015 15:53:23,1
15/07/2015 15:55:11,1

And here's my code (you'll have to change row[0] etc. to the according rows to work with your csv):

#!/usr/bin/env python
import numpy as np
import matplotlib.pyplot as plt
import csv

# 1. reading the csv
status_records = {'1': {}, '-11': {}}

with open('data.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile)
    # 2. iterate through csv
    for row in reader:
        if row[0] == 'start_time': continue # first line
        hour = row[0][:-5]
        status = row[1]

        # if hour not present, add empty 'slot' in each status bin
        if hour not in status_records[status].keys():
            status_records['1'][hour] = 0
            status_records['-11'][hour] = 0
            status_records[status][hour] = 1 # add the status we just read
        else:
            status_records[status][hour] += 1 # update status-hour bin

status1   = status_records['1'].values()
status2 = status_records['-11'].values()

print status1, status2

N = len(status1)
ind = np.arange(N)
width = 0.35

p1 = plt.bar(ind, status1, width, color='g')
p2 = plt.bar(ind, status2, width, color='r', bottom=status1)

plt.ylabel('# of exit status')
plt.title('Exit status in comparison with time')
plt.yticks(np.arange(0,11,10))
plt.legend((p1[0], p2[0]), ('1', '-11'))
plt.show()

Output:

graph

Improvements: You may want to add some useful labels, and decide wether to show time in which nothing happens (which will maybe clutter the chart with gaps). Also, be aware that as-is the dates should be sorted in the csv, else you have to sort them yourself in the code.

Anyway, this should give you something to start with.

adrianus
  • 3,141
  • 1
  • 22
  • 41
  • hey adrianus thanks so much, that really helped. i appreciate it... my next problem :P i actually have so many different exit_status' (-12, -11,... 0, 1, 2, ..., 271 etc.) that i need to modify the code so that i have a for loop inside your second if statement and a for loop that plots all the different exit_status' and assigns a unique color to each distinct exit_status. could you help me out? thanks again! (also, and this is extra so dont worry if its too much, i want to normalise all the data to 1. i.e. the #number of exit status' should have a maximum at 1) – PyRsquared Jul 16 '15 at 14:18
  • How much different status codes are we talking about? Over 271? With `matplotlib.colors` you can generate colors via rgb or Hex-codes. If you only have, let's say a dozen codes, just make a `dict` like `status_colors = {'-12': 'r', '-11': 'y', '0': 'r', ...}` and use it like `color = status_colors[status]` – adrianus Jul 16 '15 at 14:25
  • For the normalizing part: Just loop through each status, go through each bin, sum up the items, and then divide each bin by that sum. – adrianus Jul 16 '15 at 14:28
  • cool thanks. yeah there actually are over 300 status codes so i need a for loop – PyRsquared Jul 16 '15 at 14:43
  • Maybe you want to categorize the status codes into different categories with a color each? A chart with 300 colors is hardly readable :-) You're welcome, consider making it the accepted answer if it helped you. – adrianus Jul 17 '15 at 11:29
  • thanks, i made it the acceptable answer. i learned a lot of python in this process and i appreciate the help. but in the end i just decided to use ggplot in R... im sorry, i hope youre not annoyed, but it literally takes one line of code in R versus 50 lines and multiple for loops in python – PyRsquared Jul 17 '15 at 13:21
  • I think you're exaggerating a bit, but using the right tool for the right job is the same I would do too... – adrianus Jul 20 '15 at 08:00