I have a data frame that looks like the following: Dataframe Snapshot

I would like to make a scatter plot with just points on the graph, and I want the points to line up in columns, where each column is a month (Jan, Feb, Mar, etc.) on the x-axis. The actual data values will be plotted on the y-axis.

When I do

df.plot.scatter()

it of course wants me to declare an x and a y value. I can't really do this, as you can see from the dataframe picture I attached. How can I plot the data so that all the points for each month are stacked vertically above that month's label on the x-axis? I have also tried:

df.plot.box

This basically gives me what I want, but I only want the points and not the box/whiskers it also attempts to plot. I just want points.

JMP0629

1 Answer

I don't believe you will be able to use pandas to draw a scatter plot against a categorical variable. You could assign a numeric value to each month that you want to plot, or you could just use matplotlib directly.

Create a test data set:

import numpy as np
import pandas as pd
data = np.random.randn(4, 3)
df = pd.DataFrame(data, columns=['Jan', 'Feb', 'Mar'])

Convert this to long form:

df = df.melt()

When you plot, you need to specify the x location of each category. I use enumerate, although you could also create a new column with numeric values.

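import matplotlib.pyplot as plt

# sort=False keeps the months in column order instead of alphabetical order
groups = df.groupby('variable', sort=False)
fig, ax = plt.subplots()
x_ticks = []
x_ticklabels = []
for i, (name, group) in enumerate(groups):
    y = group['value']
    x = [i] * len(y)  # same x position for every point in this month
    ax.scatter(x, y)
    x_ticks.append(i)
    x_ticklabels.append(name)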

Then you can set your tick labels to match your x-values:

ax.set_xticks(x_ticks)   
ax.set_xticklabels(x_ticklabels);
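
The other option mentioned above, a new column with numeric values, might look something like the following. This is only a sketch under assumptions: the data is randomly generated test data, and the names `wide`, `long_df`, `positions`, and `month_num` are illustrative. Once x is numeric, pandas' own `plot.scatter` can be used.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical wide-form test data: one column per month.
rng = np.random.default_rng(0)
wide = pd.DataFrame(rng.standard_normal((4, 3)), columns=["Jan", "Feb", "Mar"])

# Long form: one row per observation, with columns "variable" and "value".
long_df = wide.melt()

# Map each month name to its column position (0, 1, 2, ...).
# "month_num" is an illustrative column name.
positions = {name: i for i, name in enumerate(wide.columns)}
long_df["month_num"] = long_df["variable"].map(positions)

# With a numeric x column, pandas can draw the scatter plot itself.
ax = long_df.plot.scatter(x="month_num", y="value")
ax.set_xticks(list(positions.values()))
ax.set_xticklabels(list(positions.keys()))
```

Building the mapping from `wide.columns` keeps the months in their original column order rather than alphabetical order.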


Update: I like to work with data in long form, since each row becomes a single observation. However, I realize it is more concise to loop through the columns of the original wide DataFrame without transforming the data:

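# df here is the original wide DataFrame (before melt)
fig, ax = plt.subplots()
# DataFrame.iteritems() was removed in pandas 2.0; items() is the replacement
for i, (name, value) in enumerate(df.items()):
    ax.scatter([i] * len(value), value)
ax.set_xticks(range(len(df.columns)))
ax.set_xticklabels(df.columns);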
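
If the data contains NaNs, or you want to control the figure size, a variation along these lines may help. It is a sketch rather than a definitive fix: the values in `wide` are made up for illustration, the `plotted` dictionary exists only to count how many points were drawn per month, and the `figsize` of 8 by 4 inches is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical wide-form data with some missing values.
wide = pd.DataFrame({
    "Jan": [0.5, 1.2, np.nan, 2.0],
    "Feb": [2.5, 1.1, 0.3, np.nan],
    "Mar": [1.0, np.nan, 0.7, 1.9],
})

# figsize is (width, height) in inches.
fig, ax = plt.subplots(figsize=(8, 4))

plotted = {}  # illustrative bookkeeping: points drawn per month
for i, (name, values) in enumerate(wide.items()):
    values = values.dropna()  # drop missing entries explicitly
    ax.scatter([i] * len(values), values)
    plotted[name] = len(values)

ax.set_xticks(range(len(wide.columns)))
ax.set_xticklabels(wide.columns)
```

Dropping NaNs per column makes it easy to check how many real values each month contributes, which can help when the plotted points do not seem to match the table.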
johnchase
  • This did work to plot data, however, the data plotted appears to be erroneous. I checked the table to see what values it was pulling to plot, but I have no idea where it is getting some values. For example, the highest data point for Feb should be 2.5, but it has MANY MANY data points above 3. I looked at the data it is supposedly reading and it is not reflected in that. I do not know where the data is coming from that it is plotting. – JMP0629 Nov 08 '17 at 19:35
  • There are a fair number of NaNs in my dataframe. When I convert it to long form and then run the for loop, could this be what is messing things up? – JMP0629 Nov 08 '17 at 19:38
  • Can you please provide a small sample of your data, similar to the manner that I created a dataset and add that to your question? That way we know we are using the same data set. Without having an example of your data it is impossible to know what the expected outcome should be – johnchase Nov 08 '17 at 19:39
  • I did in my original question. It is linked in the first sentence of my original question. – JMP0629 Nov 09 '17 at 17:06
  • You mentioned that your data has null values and values above 3, neither of which is present in the screenshot, which suggests that the data you are running is not the data you linked to. Additionally, screenshots are typically [discouraged](https://meta.stackoverflow.com/questions/303812/discourage-screenshots-of-code-and-or-errors). If you haven't already, check out the [MCVE](https://stackoverflow.com/help/mcve) guidance. My answer works on the data I provided, so you need to provide an example of data where it does not work. – johnchase Nov 09 '17 at 17:25
  • Exactly what I need as well. Just one question, how to set figsize? – Hristo Stoychev Nov 24 '17 at 11:59