I have been trying to draw a line plot with ggplot.

My data looks something like this:

        I04 F04 I05 F05 I06 F06
CAT     3   12  2   6   6   20
DOG     0   0   0   0   0   0
BIEBER  1   0   0   1   0   0

and can be found here.

Basically, we have a certain number of CATs (or other creatures) initially in a year (this is I04), and a certain number of CATs at the end of the year (this is F04). This goes on for some time.

I can plot something like this fairly simply using the code below, and get this:

enter image description here

This is fantastic, but it doesn't work very well for me. After all, I have starting and ending inventories for each year, so I am interested in seeing how the initial values (I04, I05, I06) change over time. For each animal, I would like to create two different lines, one for the initial quantity and one for the final quantity (F04, F05, F06). It seems to me that I now have to consider two factors.

This is really difficult given the way my data is set up. I'm not sure how to tell ggplot that all the I prefixed years are one factor, and all the F prefixed years are another factor. When the dataframe gets melted, it's too late. I'm not sure how to control this situation.

Any advice on how I can separate these values or perhaps another, better way to tackle this situation?

Here is the code I have:

library(ggplot2)
library(reshape2)

DF <- read.csv("mydata.csv", stringsAsFactors=FALSE)

## cleaning up, converting factors to numeric, etc
text_names <- data.frame(as.character(DF$animals))
names(text_names) <- c("animals")
numeric_cols <- DF[, -c(1)]
numeric_cols <- sapply(numeric_cols, as.numeric)
plot_me <- data.frame(cbind(text_names, numeric_cols))
plot_me$animals <- as.factor(plot_me$animals)
meltedDF <- melt(plot_me)

p <- ggplot()
p <- p + geom_line(aes(seq(1:36), meltedDF$value, group=meltedDF$animals, color=meltedDF$animals))
p
tumultous_rooster

2 Answers

Using your original data from the link:

nd <- reshape(mydata, idvar = "animals", direction = "long", varying = names(mydata)[-1], sep = "")
ggplot(nd, aes(x = time, y = I, group = animals, colour = animals)) + geom_line() + ggtitle("Development of initial inventories")

enter image description here

ggplot(nd, aes(x = time, y = F, group = animals, colour = animals)) + geom_line() + ggtitle("Development of final inventories")

enter image description here
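To see what the `reshape` call produces, here is a minimal sketch. It assumes a data frame `mydata` built only from the three sample rows shown in the question, not the full linked file, so the real data will have more rows.

```r
# Sample rows copied from the question; this stands in for the linked file.
mydata <- data.frame(
  animals = c("CAT", "DOG", "BIEBER"),
  I04 = c(3, 0, 1), F04 = c(12, 0, 0),
  I05 = c(2, 0, 0), F05 = c(6, 0, 1),
  I06 = c(6, 0, 0), F06 = c(20, 0, 0)
)

# With sep = "", reshape splits each column name between the letter and
# the digits: the letter becomes the measurement column (I or F) and the
# digits become the values of the generated `time` column.
nd <- reshape(mydata, idvar = "animals", direction = "long",
              varying = names(mydata)[-1], sep = "")

names(nd)        # should include animals, time, I and F
unique(nd$time)  # the year suffixes, converted to numbers
nrow(nd)         # one row per animal and year
```

The result is the "semi-long" layout: one row per animal and year, with the initial and final quantities side by side in `I` and `F`.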

DatamineR
  • where did 'time' come from? – tumultous_rooster Jan 24 '15 at 04:35
  • The `time` is the result of applying the `reshape`. The function guesses this variable using the supplied value for `sep`, in our case just `""`. Is the plot what you expected? – DatamineR Jan 24 '15 at 04:45
  • Why did you choose to use `reshape` instead of `melt`? – tumultous_rooster Jan 24 '15 at 05:18
  • Because that way the `I`s and the `F`s are now separated, and you want separate plots for them, right? – DatamineR Jan 24 '15 at 05:27
  • @MattO'Brien, there are times when `reshape` may make more sense than `melt` (until "data.table" version 1.9.8 comes out with its own version of `melt`). In particular, `reshape` allows you to get a "semi-long" version of your dataset, where you have a set of "id" columns, and one column for each type of measurement. `melt`, on the other hand, might squash all those different measurement variables into a single long variable. – A5C1D2H2I1M1N2O1R2T1 Jan 25 '15 at 07:11
  • @RStudent, you may also be interested in `merged.stack` from my "splitstackshape" package. Usage here would be something like: `merged.stack(DF, var.stubs = c("I", "F"), sep = "var.stubs")`. – A5C1D2H2I1M1N2O1R2T1 Jan 25 '15 at 07:12

From a data analyst's perspective, I think the following approach might provide better insight.

For each animal we visualize the initial and the final quantity in a separate panel. Moreover, each subplot has its own y scale because the values of the different animal types are radically different. This way, differences within and across animal types are easier to spot.

Given the current structure of your data, we do not need two different factors. After the `gather` call, the `indicator` column contains values like I04, F04, etc. We just need to separate the first character from the rest, resulting in two columns, `type` and `time`. We can use `type` as the argument for `color` in the `ggplot` call, while `time` provides a unified x-axis across all animal types.

library(tidyr)
library(dplyr)
library(ggplot2)

data %>%
  gather(indicator, value, -animals) %>%
  separate(indicator, c('type', 'time'), sep = 1) %>%
  mutate(time = as.numeric(time)) %>%
  ggplot(aes(time, value, color = type)) +
  geom_line() +
  facet_grid(animals ~ ., scales = "free_y")

enter image description here
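In newer tidyr versions `gather` is superseded by `pivot_longer`, which can do the gathering and the splitting in one step. The following is only a sketch, assuming tidyr 1.0.0 or later and a hypothetical data frame `wide` built from the three sample rows in the question:

```r
library(tidyr)

# Hypothetical stand-in for the real data: the sample rows from the question.
wide <- data.frame(
  animals = c("CAT", "DOG", "BIEBER"),
  I04 = c(3, 0, 1), F04 = c(12, 0, 0),
  I05 = c(2, 0, 0), F05 = c(6, 0, 1),
  I06 = c(6, 0, 0), F06 = c(20, 0, 0)
)

# names_sep = 1 splits every column name after its first character,
# so "I04" becomes type = "I" and time = "04".
long <- pivot_longer(
  wide,
  cols = -animals,
  names_to = c("type", "time"),
  names_sep = 1
)

# The year suffix arrives as text; convert it for a numeric x-axis.
long$time <- as.numeric(long$time)

head(long)
```

The resulting `long` table has the same `animals`, `type`, `time` and `value` columns that the `gather` and `separate` pipeline passes on to `ggplot`.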

Of course, you might also do it the other way round, namely using one subplot for the initial quantities and one for the final quantities, like this:

data %>%
  gather(indicator, value, -animals) %>%
  separate(indicator, c('type', 'time'), sep = 1) %>%
  mutate(time = as.numeric(time)) %>%
  ggplot(aes(time, value, color = animals)) +
  geom_line() +
  facet_grid(type ~ ., scales = "free_y")

enter image description here

But as described above, I would not recommend that because the y scale varies too much across animal types.
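If you really want both lines for every animal in a single panel, as asked in the question, one possible variant is to map the animal to `color` and the type to `linetype`. This is a sketch only, again assuming a hypothetical data frame `wide` that holds just the sample rows from the question:

```r
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for the real data: the sample rows from the question.
wide <- data.frame(
  animals = c("CAT", "DOG", "BIEBER"),
  I04 = c(3, 0, 1), F04 = c(12, 0, 0),
  I05 = c(2, 0, 0), F05 = c(6, 0, 1),
  I06 = c(6, 0, 0), F06 = c(20, 0, 0)
)

# Same reshaping steps as in the pipelines above, written without pipes.
long <- gather(wide, indicator, value, -animals)
long <- separate(long, indicator, c("type", "time"), sep = 1)
long$time <- as.numeric(long$time)

# One line per animal/type combination: color identifies the animal,
# linetype distinguishes initial (I) from final (F) quantities.
p <- ggplot(long, aes(time, value,
                      color = animals, linetype = type,
                      group = interaction(animals, type))) +
  geom_line()
p
```

With a shared y scale the small series may still be hard to read, which is why I prefer the faceted version.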

alex23lemm