Producing your plot takes two steps, as you already noticed: first prepare the data, then create the plot.
Preparing the data
You have already shown your attempt at bringing the data into a suitable form, but let me propose an alternative.
First, I have to make sure that the Category column of the data frame is a character vector and not a factor. I also store a vector of all the categories that appear in the data frame:
DF$Category <- as.character(DF$Category)
cats <- unique(unlist(strsplit(DF$Category, ",")))
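To see what these two lines do, here is a minimal sketch on a made-up data frame. The values in DF below are hypothetical stand-ins, not the data from the question:

```r
# Hypothetical stand-in for DF; the real data come from the question
DF <- data.frame(
  Comp = c("Comp1", "Comp3", "Comp1", "Comp3", "Comp2", "Comp4"),
  Category = c("1", "4", "1,2", "2,3", "1", "1,3"),
  stringsAsFactors = TRUE
)

# Factor -> character, because strsplit() only accepts character input
DF$Category <- as.character(DF$Category)

# Split every entry at the commas, flatten the list, keep each category once
cats <- unique(unlist(strsplit(DF$Category, ",")))
cats
## [1] "1" "4" "2" "3"
```

The order of cats is simply the order in which the categories first appear in the column.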
I then need to summarise the data. For this purpose, I need a function that gives, for each value in Comp, the share of each category, scaled such that the values sum to the number of rows in the original data with that Comp.
The following function returns this information for the entire data frame in the form of another data frame (the output needs to be a data frame because I want to use the function with do() later).
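cat_perc <- function(cats, vec) {
  # split each entry into its individual categories
  parts <- strsplit(vec, ",")
  # count the entries containing each category (exact match, so "1" does not also match "11")
  nums <- sapply(cats, function(ct) sum(sapply(parts, function(p) ct %in% p)))
  # share of each category among all matches
  perc <- nums / sum(nums)
  # rescale so that the values sum to the number of entries
  final <- perc * length(vec)
  df <- as.data.frame(as.list(final))
  # restore the original category names (as.data.frame() turns "1" into "X1")
  names(df) <- cats
  return(df)
}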
Running the function on the complete data frame gives:
cat_perc(cats, DF$Category)
##          1         4        2        3
## 1 2.666667 0.6666667 1.333333 1.333333
The values sum up to six, which is indeed the total number of rows in the original data frame.
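The scaling can be checked by hand. As a sketch, assume the four categories occur in 4, 1, 2 and 2 of the six rows; these counts are inferred from the output above, not read from the original data:

```r
# Assumed per-category row counts (inferred from the printed output)
nums <- c("1" = 4, "4" = 1, "2" = 2, "3" = 2)
n_rows <- 6

# Share of each category among all category occurrences
perc <- nums / sum(nums)

# Rescale so that the values add up to the number of rows
final <- perc * n_rows
final
sum(final)
```

Because perc sums to one, multiplying by the number of rows guarantees that the rescaled values add up to that number, whatever the individual counts are.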
Now we want to run that function for each value of Comp, which can be done using the dplyr package:
library(dplyr)
plot_data <-
  group_by(DF, Comp) %>%
  do(cat_perc(cats, .$Category))
plot_data
## Source: local data frame [4 x 5]
## Groups: Comp [4]
##
##     Comp        1         4         2         3
##   (fctr)    (dbl)     (dbl)     (dbl)     (dbl)
## 1  Comp1 1.333333 0.0000000 0.6666667 0.0000000
## 2  Comp2 1.000000 0.0000000 0.0000000 0.0000000
## 3  Comp3 0.000000 0.6666667 0.6666667 0.6666667
## 4  Comp4 0.500000 0.0000000 0.0000000 0.5000000
This first groups the data by Comp and then applies the function cat_perc separately to each subset of the data frame that shares a given Comp.
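In more recent dplyr releases, do() is marked as superseded. One possible alternative, sketched here under the assumption that dplyr 0.8.1 or later is installed, is group_modify(), which likewise expects a function that returns a data frame. The DF below is a made-up stand-in, and the compact variant of cat_perc() is repeated only to keep the snippet self-contained:

```r
library(dplyr)

# Hypothetical stand-in data, not the data from the question
DF <- data.frame(
  Comp = c("Comp1", "Comp3", "Comp1", "Comp3", "Comp2", "Comp4"),
  Category = c("1", "4", "1,2", "2,3", "1", "1,3"),
  stringsAsFactors = FALSE
)
cats <- unique(unlist(strsplit(DF$Category, ",")))

# Compact variant of the summary function from this answer
cat_perc <- function(cats, vec) {
  parts <- strsplit(vec, ",")
  nums <- sapply(cats, function(ct) sum(sapply(parts, function(p) ct %in% p)))
  df <- as.data.frame(as.list(nums / sum(nums) * length(vec)))
  names(df) <- cats
  df
}

# .x is the subset of rows for one value of Comp (without the Comp column)
plot_data <- DF %>%
  group_by(Comp) %>%
  group_modify(~ cat_perc(cats, .x$Category)) %>%
  ungroup()
plot_data
```

The result has one row per value of Comp and one column per category, just like the do() version.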
I will plot the data with the ggplot2 package, which requires the data to be in the so-called long format. This means that each data point to be plotted should correspond to a row in the data frame. (As it is now, each row contains 4 data points.) This can be done with the tidyr package as follows:
library(tidyr)
plot_data <- gather(plot_data, Category, value, -Comp)
head(plot_data)
## Source: local data frame [6 x 3]
## Groups: Comp [4]
##
##     Comp Category    value
##   (fctr)    (chr)    (dbl)
## 1  Comp1        1 1.333333
## 2  Comp2        1 1.000000
## 3  Comp3        1 0.000000
## 4  Comp4        1 0.500000
## 5  Comp1        4 0.000000
## 6  Comp2        4 0.000000
As you can see, there is now a single data point per row, characterised by Comp, Category and the corresponding value.
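gather() still works, but the tidyr documentation now recommends pivot_longer() for this kind of reshaping. A sketch of the equivalent call, assuming tidyr 1.0.0 or later and using a hand-typed copy of the wide table above as stand-in input:

```r
library(tidyr)

# Wide table copied from the output above (values rounded as printed)
wide <- data.frame(
  Comp = c("Comp1", "Comp2", "Comp3", "Comp4"),
  `1` = c(1.333333, 1, 0, 0.5),
  `4` = c(0, 0, 0.6666667, 0),
  `2` = c(0.6666667, 0, 0.6666667, 0),
  `3` = c(0, 0, 0.6666667, 0.5),
  check.names = FALSE  # keep "1", "4", ... as column names
)

# Every column except Comp becomes a (Category, value) pair
long <- pivot_longer(wide, cols = -Comp,
                     names_to = "Category", values_to = "value")
head(long)
```

Unlike gather(), pivot_longer() orders the rows by Comp first and by category second. This does not affect the plot.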
Plotting the data
Now that everything is ready, we can plot the data using ggplot:
library(ggplot2)
ggplot(plot_data, aes(x = Comp, y = value, fill = Category)) +
  geom_bar(stat = "identity")
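If your ggplot2 version is 2.2.0 or later, geom_col() is a shorthand for geom_bar(stat = "identity"). The following self-contained sketch uses a hypothetical long-format data frame whose values mirror the output shown earlier:

```r
library(ggplot2)

# Hypothetical long-format stand-in for plot_data
long <- data.frame(
  Comp = rep(c("Comp1", "Comp2", "Comp3", "Comp4"), times = 4),
  Category = rep(c("1", "4", "2", "3"), each = 4),
  value = c(4/3, 1, 0, 0.5,
            0, 0, 2/3, 0,
            2/3, 0, 2/3, 0,
            0, 0, 2/3, 0.5)
)

# geom_col() uses the y values as bar heights; bars are stacked by fill
p <- ggplot(long, aes(x = Comp, y = value, fill = Category)) +
  geom_col()
p
```

Each bar is as tall as the number of rows with that Comp, and the coloured segments show how that height is shared among the categories.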
