I have a huge data set that I would like to bin and plot, because plotting the raw data gives a very ugly plot (image not included).

Based on this, I computed the mean, std and size of `ratio` in bins of width 1, dropped the rows containing NaN values, and reset the index with the following code:

import numpy as np
import pandas as pd
test = df.groupby(pd.cut(df['value'], bins=np.arange(160900)))['ratio'].agg(['mean', 'std', 'size'])
test_filtered = test[test[['mean', 'std', 'size']].notnull().all(axis=1)]
test_filtered.reset_index(level=0, inplace=True)

After that I get this

               value       mean       std  size
0   (160088, 160089] 17.5080464 0.0777015    43
1   (160089, 160090] 17.5167586 0.0637891    25
2   (160188, 160189] 17.5099577 0.0892071    13
3   (160189, 160190] 17.4971442 0.0917634    60
4   (160288, 160289] 17.5440752 0.0659020    51
5   (160289, 160290] 17.5638237 0.0615202    64
6   (160290, 160291] 17.5382187 0.0294264     2
7   (160388, 160389] 17.5282669 0.1120136     2
8   (160389, 160390] 17.5479696 0.0794665    64
9   (160390, 160391] 17.5716048 0.0892945    15
10  (160391, 160392] 17.4969686 0.0284094     2
11  (160488, 160489] 17.5587446 0.0449601     5
12  (160489, 160490] 17.5566764 0.0636091    62
13  (160490, 160491] 17.5279026 0.0561810     2
14  (160588, 160589] 17.5922320 0.0126914     2
15  (160589, 160590] 17.5832962 0.0733587    25
16  (160590, 160591] 17.5607141 0.0706487    32
17  (160688, 160689] 17.5186035 0.0773348     6
18  (160689, 160690] 17.5234588 0.0816204    51
19  (160690, 160691] 17.4688810 0.0981311     4
20  (160788, 160789] 17.5797546 0.0264994     6
21  (160789, 160790] 17.5517244 0.0470787    51
22  (160790, 160791] 17.5600856 0.0720480     2
23  (160889, 160890] 17.5355430 0.0584237    34

So now the question is: how do I plot the mean over the value? I tried some code, but I only get a bunch of errors. Also, the bin width is fixed to 1, but I may need a different width. Do you know how to specify a "bin window" other than 1?

Alternatively, do you know a better way to bin the data with a specific "bin window"?

Thanks in advance if you know how to fix the problem. ;)

Greets

Klamsi

2 Answers

0

If you convert your DataFrame to a NumPy array, you can use NumPy's `histogram` to control the bin size. NaNs can also be filtered out of a NumPy array with `where`.
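
As a rough sketch of this approach, assuming the `value` column has already been pulled into a NumPy array: the snippet below drops NaNs with a boolean mask built from `np.isnan` (a swapped-in alternative to `np.where`) and passes explicit bin edges to `np.histogram`, so the bin width is set directly. The data is synthetic, and the names `values`, `clean` and `bin_width` are made up for illustration.

```python
import numpy as np

# Synthetic stand-in for the 'value' column (hypothetical data, not the asker's)
rng = np.random.default_rng(0)
values = rng.uniform(160088, 160900, size=627)
values[::50] = np.nan  # sprinkle in a few NaNs

# Drop NaNs with a boolean mask
clean = values[~np.isnan(values)]

# Explicit edges give direct control over the bin width
bin_width = 1
edges = np.arange(160088, 160900 + bin_width, bin_width)

# counts[i] is the number of samples that fall between edges[i] and edges[i + 1]
counts, bin_edges = np.histogram(clean, bins=edges)
```

Note that `np.histogram` only counts samples per bin. It does not compute a per-bin mean of a second column such as `ratio`.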

mTesseracted
  • I tried your approach of converting the DataFrame into a NumPy array. Before the conversion I filtered out the NaNs with `dropna(subset=['value'])`. I also used a smaller data set to get familiar with the code. Currently the data has 627 "rows". With bins = 160900 I should get a bin window of 1, but with this setting it took quite a long time to calculate everything and the resulting plot shows nothing. If I decrease the bins to, say, 5000, I get a plot. My values start at 160088 and end at 160900. How do I get a correct bin window of 1 and a visible plot? – Klamsi May 28 '20 at 11:30
  • Or is there existing Python code that lets me define a binning from 160087 to 160900 in steps of 1? – Klamsi May 28 '20 at 12:58
  • @Klamsi look at the documentation for [histogram](https://numpy.org/doc/1.18/reference/generated/numpy.histogram.html), you can set the keyword arg `range=(160088,160900)` to get a histogram for that range. If you want the bins to be 1 wide then set the keyword arg `bins` to the size of your range divided by 1, so (160900-160088)/1=812. Your call will look something like `hist, bin_edges = numpy.histogram(data_array, bins=812, range=(160088,160900) )` – mTesseracted May 28 '20 at 18:18
  • Also, to address your other question about plotting the mean over the value: do you just want to plot the mean of the data like in this [example](https://matplotlib.org/3.1.1/gallery/recipes/placing_text_boxes.html#sphx-glr-gallery-recipes-placing-text-boxes-py)? Or do you want the mean per bin? You can do that with custom labels per bin, but that would be a pain. You could instead make a separate [errorbar plot](https://matplotlib.org/1.2.1/examples/pylab_examples/errorbar_demo.html) that shows the average and standard deviation. – mTesseracted May 28 '20 at 18:35
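
The two suggestions in the comments above (choose the bin width explicitly, then show the per-bin mean with error bars) could be combined roughly as follows. This is only a sketch under stated assumptions: the data is synthetic, the helper names `bin_stats` and `plot_binned_mean` are hypothetical, and the column names `value` and `ratio` are taken from the question.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def bin_stats(df, bin_width, lo, hi):
    """Mean, std and count of 'ratio' per 'value' bin of the given width."""
    edges = np.arange(lo, hi + bin_width, bin_width)
    bins = pd.cut(df["value"], bins=edges, include_lowest=True)
    # observed=True keeps only the bins that actually contain data
    stats = df.groupby(bins, observed=True)["ratio"].agg(["mean", "std", "size"])
    stats = stats.dropna()  # bins with a single sample have no std
    stats["center"] = [interval.mid for interval in stats.index]
    return stats.reset_index(drop=True)


def plot_binned_mean(stats, ax=None):
    """Plot the per-bin mean against the bin center, with std as error bars."""
    if ax is None:
        ax = plt.gca()
    ax.errorbar(stats["center"], stats["mean"], yerr=stats["std"], fmt="o", capsize=3)
    ax.set_xlabel("value (bin center)")
    ax.set_ylabel("mean ratio")
    return ax


# Hypothetical data in the same range as the question's values
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "value": rng.uniform(160088, 160900, size=5000),
    "ratio": rng.normal(17.53, 0.07, size=5000),
})

# bin_width is the "bin window"; 10 is an arbitrary choice here
stats = bin_stats(df, bin_width=10, lo=160088, hi=160900)
```

Calling `plot_binned_mean(stats)` followed by `plt.show()` draws the figure. Changing `bin_width` is the only adjustment needed to try a different bin window.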
0
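from matplotlib import pyplot as plt

ax = plt.gca()
# Draw grouped bars for the numeric columns (mean, std, size) on the current axes
test_filtered.plot.bar(ax=ax)
# Label each group with its bin interval instead of the integer index
plt.xticks(ticks=test_filtered.index, labels=test_filtered.value)
plt.show()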
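
If only the mean is of interest, a variant of the snippet above might draw one bar per bin and use the std as error bars. The sketch below is an assumption about what is wanted, not the answerer's method: it calls matplotlib's `Axes.bar` directly, rebuilds the first three rows of the question's table by hand (with the intervals kept as plain strings), and the function name `plot_mean_bars` is hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# First three rows of the aggregated table from the question
test_filtered = pd.DataFrame({
    "value": ["(160088, 160089]", "(160089, 160090]", "(160188, 160189]"],
    "mean": [17.5080464, 17.5167586, 17.5099577],
    "std": [0.0777015, 0.0637891, 0.0892071],
    "size": [43, 25, 13],
})


def plot_mean_bars(frame):
    """Bar chart of the per-bin mean with the std drawn as error bars."""
    fig, ax = plt.subplots()
    ax.bar(frame["value"], frame["mean"], yerr=frame["std"], capsize=3)
    # Zoom in on the data: the means are all close to 17.5
    pad = 2 * frame["std"].max()
    ax.set_ylim(frame["mean"].min() - pad, frame["mean"].max() + pad)
    ax.set_xlabel("value bin")
    ax.set_ylabel("mean ratio")
    ax.tick_params(axis="x", labelrotation=45)
    return ax
```

Calling `plot_mean_bars(test_filtered)` and then `plt.show()` displays the chart.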

avloss
  • Welcome to Stack Overflow. Code dumps without any explanation are rarely helpful. Stack Overflow is about learning, not providing snippets to blindly copy and paste. Please [edit] your question and explain how it works better than what the OP provided. See [answer]. – ChrisGPT was on strike May 28 '20 at 01:11
  • Having seen this on two of my answers, I can't tell if this is an automatic response. – avloss May 28 '20 at 02:11
  • The comment isn't automated, but it's one that I re-use frequently. Both of your answers where I used this tonight were flagged as potentially low-quality, probably because they are purely code. I came across them in that review queue. Please consider explaining your code when you answer. This will reduce the likelihood that your answers get flagged, which also reduces the likelihood that they get downvoted etc. – ChrisGPT was on strike May 28 '20 at 02:39