CDF in Python not displaying correctly

Question

Good morning,

In Python, I have a dictionary (called packet_size_dist) with the following values:

34  =>  0.00909909009099
42  =>  0.02299770023
54  =>  0.578742125787
58  =>  0.211278872113
62  =>  0.00529947005299
66  =>  0.031796820318
70  =>  0.0530946905309
74  =>  0.0876912308769

Notice that the sum of the values == 1.
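
A quick way to confirm this, as a sketch: the dictionary literal below is copied from the listing above, and the comparison uses a tolerance because a floating-point sum of these values is unlikely to equal 1 exactly.

```python
import math

packet_size_dist = {
    34: 0.00909909009099,
    42: 0.02299770023,
    54: 0.578742125787,
    58: 0.211278872113,
    62: 0.00529947005299,
    66: 0.031796820318,
    70: 0.0530946905309,
    74: 0.0876912308769,
}

total = sum(packet_size_dist.values())

# Compare with a tolerance instead of `total == 1`, because rounding in the
# individual probabilities and in the summation leaves a tiny residual.
print(total, math.isclose(total, 1.0, rel_tol=1e-9))
```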

I am attempting to generate a CDF, which I successfully do, but it looks wrong and I am wondering if I am going about generating it incorrectly. The code in question is:

sorted_p = sorted(packet_size_dist.items(), key=operator.itemgetter(0))
yvals = np.arange(len(sorted_p))/float(len(sorted_p))
plt.plot(sorted_p, yvals)
plt.show()

But the resulting graph looks like this:

This doesn't seem to match the values in the dictionary. Any ideas? I also see a faint green line towards the left of the graph, and I don't know what it is. For example, the graph shows a packet size of 70 occurring about 78% of the time, whereas in my dictionary it occurs about 5% of the time.

I've tried to clarify the first part of my answer. — Bill Bell, Apr 23 '17 at 14:13

Bill Bell · Answer 1 · 2017-04-23T14:12:54.410

This is NOT a direct answer to your question. However, your data arise from a discrete random variable rather than a continuous one, so representing them with a series of connected line segments could be misleading in some contexts. The representation in cumulative distribution function might be overkill, so I offer the following simplification.

An 'x' represents truncation. A dot represents the closed end of a closed-open interval.

Here's the code. I didn't think to use np.cumsum!

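import matplotlib.pyplot as plt
from matplotlib import collections as mc

# Probabilities for packet sizes 34, 42, ..., 74, in ascending order of size
p = [0.00909909009099, 0.02299770023, 0.578742125787, 0.211278872113,
     0.00529947005299, 0.031796820318, 0.0530946905309, 0.0876912308769]
# Running totals, with a leading 0 for the region below the smallest size
cumSums = [0] + [sum(p[:i]) for i in range(1, len(p) + 1)]
# Packet sizes, padded with 30 and 80 so the first and last steps have some width
counts = [30, 34, 42, 54, 58, 62, 66, 70, 74, 80]

# One horizontal segment per step: from counts[i] to counts[i+1] at height cumSums[i]
lines = [[(counts[i], cumSums[i]), (counts[i + 1], cumSums[i])]
         for i in range(len(counts) - 1)]

lc = mc.LineCollection(lines, linewidths=2)
fig, ax = plt.subplots()
ax.add_collection(lc)

# 'x' marks where the plot is truncated; dots mark the closed (left) end of each step
ax.plot([30, 80], [0, 1], 'bx')
ax.plot(counts[1:-1], cumSums[1:], 'bo')

ax.autoscale()
ax.margins(0.1)

plt.show()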
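
For comparison, here is a rough sketch of the same step picture built with np.cumsum and matplotlib's step drawing instead of a hand-made LineCollection. The padding values 30 and 80 are taken from the code above; the output file name is just a placeholder, and the figure is saved rather than shown so the snippet also runs without a display.

```python
import numpy as np
import matplotlib.pyplot as plt

sizes = np.array([34, 42, 54, 58, 62, 66, 70, 74])
p = np.array([0.00909909009099, 0.02299770023, 0.578742125787,
              0.211278872113, 0.00529947005299, 0.031796820318,
              0.0530946905309, 0.0876912308769])

# Cumulative probability at each packet size.
cdf = np.cumsum(p)

# Pad both ends so the curve is 0 below the smallest size and stays at its
# final value above the largest one (30 and 80 are arbitrary padding values).
x = np.concatenate(([30], sizes, [80]))
y = np.concatenate(([0.0], cdf, [cdf[-1]]))

fig, ax = plt.subplots()
# where="post" holds y[i] on the interval [x[i], x[i+1]), which gives the
# closed-open steps of a discrete CDF.
ax.step(x, y, where="post", linewidth=2)
ax.plot(sizes, cdf, "bo")  # dots mark the closed (left) end of each step
ax.margins(0.1)

fig.savefig("cdf_steps.png")  # placeholder name; use plt.show() interactively
```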

This is more like the plot you appear to want. (Corrected, I hope.)

Here is the code for it.

import matplotlib.pyplot as plt
from matplotlib import collections as mc

p = [0.00909909009099, 0.02299770023, 0.578742125787, 0.211278872113,
     0.00529947005299, 0.031796820318, 0.0530946905309, 0.0876912308769]
# Cumulative probability at each packet size
cumSums = [sum(p[:i]) for i in range(1, len(p) + 1)]
counts = [34, 42, 54, 58, 62, 66, 70, 74]

# Join consecutive (size, cumulative probability) points with straight segments
lines = [[(counts[i], cumSums[i]), (counts[i + 1], cumSums[i + 1])]
         for i in range(len(p) - 1)]

lc = mc.LineCollection(lines, linewidths=2)
fig, ax = plt.subplots()
ax.add_collection(lc)
ax.autoscale()
ax.margins(0.1)

plt.show()

This solution is pretty strange. You invented some numbers (33 and 90), which are not in the data and render the shown CDF completely wrong. — ImportanceOfBeingErnest, Apr 22 '17 at 15:09
@ImportanceOfBeingErnest: Expressed with your usual tact. Anyway, not really. — Bill Bell, Apr 22 '17 at 15:30
So you want to say that the curve in the second picture correctly represents the data from the question? — ImportanceOfBeingErnest, Apr 22 '17 at 15:34

score 1 · Accepted Answer · answered Apr 22 '17 at 15:04

Using numpy makes this a lot easier. First convert your dictionary to a 2-column numpy array, then sort it by its first column. Finally, calculate the cumulative sum of the second column and plot it against the first.

dic = { 34  :  0.00909909009099,
        42  :  0.02299770023,
        54  :  0.578742125787,
        58  :  0.211278872113,
        62  :  0.00529947005299,
        66  :  0.031796820318,
        70  :  0.0530946905309,
        74  :  0.0876912308769 }

import numpy as np
import matplotlib.pyplot as plt

data = np.array([[k, v] for k, v in dic.items()])  # on Python 2, dic.iteritems() also works
data = data[data[:,0].argsort()]
cdf = np.cumsum(data[:,1])

plt.plot(data[:,0], cdf)

plt.show()
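
As for why the plot in the question looked wrong, the following sketch (no plotting, just the numbers) suggests two causes. First, sorted_p is a list of (size, probability) pairs, which matplotlib treats as a two-column array and draws as two lines against yvals; the second line uses the probabilities as x values, so it hugs the left edge, which is probably the faint green line. Second, yvals is just the rank i/n of each entry and never uses the probabilities at all.

```python
import operator
import numpy as np

packet_size_dist = {
    34: 0.00909909009099, 42: 0.02299770023, 54: 0.578742125787,
    58: 0.211278872113, 62: 0.00529947005299, 66: 0.031796820318,
    70: 0.0530946905309, 74: 0.0876912308769,
}

# Same steps as in the question.
sorted_p = sorted(packet_size_dist.items(), key=operator.itemgetter(0))
yvals = np.arange(len(sorted_p)) / float(len(sorted_p))

# What plt.plot(sorted_p, yvals) actually receives as its x argument:
as_array = np.asarray(sorted_p)
print(as_array.shape)  # (8, 2): column 0 = sizes, column 1 = probabilities

sizes = as_array[:, 0]
cdf = np.cumsum(as_array[:, 1])  # the cumulative sum used in this answer

# Compare the rank-based y values from the question with the real CDF.
for size, rank, c in zip(sizes, yvals, cdf):
    print(int(size), rank, round(float(c), 4))
```

At size 70 the rank-based value is 6/8 = 0.75, close to the "about 78%" read off the graph in the question, while the cumulative sum there is roughly 0.91.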

ImportanceOfBeingErnest

Thanks! However, doesn't this graph show the value "74" occurring about 0.95 of the time? I thought that with a CDF the total needs to be 1.0, which it is if you sum the dictionary's values; the graph just doesn't seem to represent that well. But your code reproduces fine on my end! – Nicholas Apr 24 '17 at 18:44
No, a cumulative distribution function (CDF) gives you the probability of finding a value less than or equal to some value x. The graph thus tells you that 100% of all values are less than or equal to 74; or, a better example, the probability of finding a value less than or equal to 58 is ~82%. If you actually wanted a probability density function (PDF; strictly a probability mass function for discrete data like this), which gives you the probability of finding a value at or around some value x, you can just plot your data, because it is already normalized. – ImportanceOfBeingErnest Apr 25 '17 at 07:18
Ahh ok, thank you! No, CDF is what I wanted, I was just switching their meaning in my mind. Much appreciated. – Nicholas Apr 25 '17 at 14:40
Great. So in that case you can choose one of the two answers to [accept](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work). You may additionally upvote both or any of them, in case they've helped you. You might also consider the same for your previous questions in case they are solved. – ImportanceOfBeingErnest Apr 25 '17 at 15:51
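
To make the reading of the CDF discussed in these comments concrete, here is a small sketch that evaluates P(X <= x) for the data in the question. The helper name cdf_at is made up for illustration; it is not part of numpy or of either answer.

```python
import numpy as np

sizes = np.array([34, 42, 54, 58, 62, 66, 70, 74])
p = np.array([0.00909909009099, 0.02299770023, 0.578742125787,
              0.211278872113, 0.00529947005299, 0.031796820318,
              0.0530946905309, 0.0876912308769])
cdf = np.cumsum(p)


def cdf_at(x):
    """Return P(X <= x) for the discrete distribution above (hypothetical helper)."""
    # With side="right", searchsorted returns how many sizes are <= x.
    k = np.searchsorted(sizes, x, side="right")
    return 0.0 if k == 0 else float(cdf[k - 1])


print(round(cdf_at(58), 4))  # 0.8221, the "~82%" mentioned above
print(cdf_at(30))            # 0.0, nothing is smaller than 34

# The probability of a single size is the jump in the CDF at that size.
print(cdf_at(70) - cdf_at(66))
```

The last line recovers the dictionary entry for 70 (about 0.053), which is the "5% of the time" figure from the question.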
