I am trying to plot a frequency distribution of data loaded into Jupyter notebook from a csv file in my AWS S3 bucket.

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
bucket = "a_bucket"
data_key = "ice_freq.csv"
data = f's3://{bucket}/{data_key}'
load = pd.DataFrame(pd.read_csv(data))
load

The dataframe displays fine and all data loads as expected. There are 373909 rows of data and only one column, titled Data, which contains floats ranging from -7.80 to 4.5.

I then use the following to count the occurrences of each float and plot them to a bar chart.

fig, ax = plt.subplots()
r = load['Data'].value_counts()
r.plot(ax=ax, kind ='bar')

(The output of value_counts() when run on its own is shown in the linked image, "Value counts output".)

However, the bar chart I get (see the linked image, "Faulty Bar Chart") clearly doesn't display correctly: all the x values look like they have been redacted for some reason. It's also not close to the correct distribution for the data.

It's odd, because if I change 'bar' to 'line' in the code I get a line graph (linked image, "Line Chart") that I know matches the distribution of the data. So I can see the distribution using the 'line' chart. Why can't matplotlib plot the bar chart?

Avlento

1 Answer

It is a bit unclear what exactly your data looks like. For value_counts() to make sense, the values shouldn't be arbitrary floats but should come from a limited set, for example values rounded to 1 or 2 decimals.

Pandas' bar plot is "categorical" and creates one bar for each row in the series r. The index (used for the x-axis) is the value, and the height is its count. The bars are placed at positions 0, 1, 2, ... and the index values only serve as text labels. The order is from most frequent to least frequent. You can use .sort_index() to sort by the index values instead.

Pandas' line plot uses a numeric x-axis, but it also uses the given order of the index to decide how to connect the values. So .sort_index() also helps the line plot produce a more conventional curve.
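To make the ordering concrete, here is a minimal sketch on a tiny made-up series (the six values are invented for illustration and are not the asker's data). It only prints the index order before and after .sort_index():

```python
import pandas as pd

# Tiny stand-in for the real column; these values are made up.
data = pd.Series([-0.5, 0.2, 0.2, 0.2, -0.5, 1.0], name="Data")

# value_counts() orders rows by count, most frequent first.
counts = data.value_counts()
print(counts.index.tolist())    # [0.2, -0.5, 1.0]

# sort_index() reorders the rows by the float values themselves.
by_value = counts.sort_index()
print(by_value.index.tolist())  # [-0.5, 0.2, 1.0]
```

Both the bar plot and the line plot follow whichever of these two row orders they are given.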

Here is an example starting from dummy data:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

load = pd.DataFrame({'Data': np.random.normal(0, 1, 10000).round(1)})
r = load['Data'].value_counts().sort_index()
r.plot(kind='bar')
plt.show()

bar plot from sorted index
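With many distinct values, the labels of a categorical bar plot overlap into an unreadable band. One possible workaround, sketched here on seeded dummy data rather than the asker's file, is to keep only every n-th label. The step of 10 is an arbitrary choice for this example:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Seeded dummy data, not the asker's file.
rng = np.random.default_rng(0)
load = pd.DataFrame({"Data": rng.normal(0, 1, 10_000).round(1)})
r = load["Data"].value_counts().sort_index()

fig, ax = plt.subplots()
r.plot(ax=ax, kind="bar")

# The bars sit at integer positions 0, 1, 2, ...; the floats are only labels.
positions = ax.get_xticks()
labels = [str(v) for v in r.index]

# Keep every 10th tick and its label so the axis stays readable.
step = 10
ax.set_xticks(positions[::step])
ax.set_xticklabels(labels[::step], rotation=0)
```

Because the positions are plain integers, the axis starts at 0 even when the labels include negative values.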

For a numeric x-axis, you could use seaborn's histplot starting from the original data (optionally with kde=True for a smooth approximation of the probability distribution):

import seaborn as sns

sns.histplot(x=load['Data'], bins=30, kde=True)
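If you would rather keep the value counts and stay with matplotlib, a numeric x-axis is also possible by passing the index values as bar positions to ax.bar. This is a sketch on seeded dummy data; the width of 0.08 assumes the values are spaced 0.1 apart, as they are after .round(1):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Seeded dummy data rounded to one decimal, not the asker's file.
rng = np.random.default_rng(0)
load = pd.DataFrame({"Data": rng.normal(0, 1, 10_000).round(1)})
r = load["Data"].value_counts().sort_index()

fig, ax = plt.subplots()
# The x positions are the float values themselves, so the axis is numeric
# and negative values appear to the left of zero.
# width=0.08 assumes a spacing of 0.1 between successive values.
bars = ax.bar(r.index.to_numpy(), r.to_numpy(), width=0.08)
ax.set_xlabel("Data")
ax.set_ylabel("count")
```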

Note that when your data is discrete and you want lots of bins, you should leave out bins= and set binwidth= to a small multiple of the distance between successive values, e.g. sns.histplot(x=load['Data'], binwidth=0.2). Otherwise some bins cover one more distinct value than their neighbors, which produces an artificial up-and-down pattern in the bar heights.
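The binning effect can be reproduced without any plotting. The sketch below uses np.histogram as a stand-in for the counting step (it is not a claim about how seaborn bins internally) on made-up data: 40 distinct values spaced 0.1 apart, each occurring 100 times, so a faithful histogram should be flat.

```python
import numpy as np

# Made-up discrete data: -2.0, -1.9, ..., 1.9, each value 100 times.
values = np.repeat(np.arange(-20, 20) / 10, 100)

# 30 equal-width bins over 40 distinct values: bin width is 0.13, so some
# bins catch one distinct value and others catch two.
uneven, _ = np.histogram(values, bins=30)
print(uneven.min(), uneven.max())  # 100 200

# Bin width 0.2 (twice the spacing), with edges halfway between values:
# every bin catches exactly two distinct values.
edges = np.arange(-205, 196, 20) / 100
even, _ = np.histogram(values, bins=edges)
print(even.min(), even.max())      # 200 200
```

The first histogram shows a spurious pattern even though every value is equally frequent; the second is flat, as it should be.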

JohanC
  • Thanks JohanC - I have added an image of what r = load['Data'].value_counts() outputs if run separately. I guess that the "redacted" appearance of the x-axis is due to so many distinct values being on the axis, but I still don't know why the x-axis begins at 0 when clearly there are negative values being plotted. The Seaborn solution does work - I set binwidth to 0.01 and the output accurately describes the distribution of the data. So thank you very much. I would still like to know what's wrong with the original bar plot. – Avlento May 30 '21 at 13:28
  • Well, as explained in the second paragraph of the answer, your original plot gets an x-axis with texts as labels, sorted the same as the dataframe (which is sorted from high count to low). You need `.sort_index()` if you want the dataframe sorted by index. Nothing is wrong with the original plot, it shows what you asked it to do: show a bar for each row (with the count on the y-axis, and the row-index on the x-axis, in its original order). – JohanC May 30 '21 at 15:45
  • Ahh I didn't fully grasp what you meant by sort in that first comment. Simple, got it now. Thanks for your help. – Avlento May 31 '21 at 11:34