Hi all,
I am assigning a label to each sentence in an article, and I am trying to generate a stacked area plot that shows the percentage of each label at a given location.
The location is calculated as sentence_index / total_number_of_sentences.
The percentage of label A at location X is calculated as (number of sentences at X with label A) / (total number of sentences at X).
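As a minimal sketch of these two definitions on made-up data: the data frame `toy` and its columns `article`, `sentence_index` and `label` are hypothetical, and rounding the location to two decimals is an assumption based on the `loc` values in the real data.

```r
# Hypothetical toy data: two articles, one label per sentence.
toy <- data.frame(
  article        = c(1, 1, 1, 1, 2, 2),
  sentence_index = c(1, 2, 3, 4, 1, 2),
  label          = c("A", "B", "A", "A", "A", "B")
)

# Location = sentence_index / total number of sentences in that article.
n_sentences <- ave(toy$sentence_index, toy$article, FUN = length)
toy$loc <- round(toy$sentence_index / n_sentences, 2)  # rounding is an assumption

# Percentage of each label at each location: row-wise proportions of the
# loc x label contingency table, so every row sums to 1.
pct <- prop.table(table(toy$loc, toy$label), margin = 1)
print(pct)
```

Note that the contingency table keeps a cell (with value 0) for every label at every location, even where a label never occurs.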
Here is an example of my data: a complete subset for loc in (0.24, 0.28]. I have checked that at each location the percentages sum to 1.
> area_df[area_df$loc>0.24,]
    label percentage  loc
186    B1      0.195 0.25
187    C1      0.111 0.25
188    E1      0.006 0.25
189    G1      0.075 0.25
190    H1      0.008 0.25
191    M1      0.125 0.25
192    M2      0.064 0.25
193    M3      0.084 0.25
194    O1      0.070 0.25
195    O2      0.053 0.25
196    R1      0.209 0.25
197    B1      0.500 0.26
198    M2      0.250 0.26
199    M3      0.250 0.26
200    B1      0.166 0.27
201    C1      0.177 0.27
202    E1      0.015 0.27
203    G1      0.100 0.27
204    H1      0.011 0.27
205    M1      0.114 0.27
206    M2      0.048 0.27
207    M3      0.059 0.27
208    O1      0.074 0.27
209    O2      0.026 0.27
210    R1      0.210 0.27
211    B1      0.125 0.28
212    C1      0.250 0.28
213    G1      0.125 0.28
214    H1      0.125 0.28
215    M1      0.125 0.28
216    O1      0.125 0.28
217    O2      0.125 0.28
I want a stacked area plot that represents the overall percentages, so I expect a solidly filled graph with y ranging over [0, 1]. However, in my geom_area plot the stacked total exceeds 1 at some locations. When I set ylim(0, 1), strange blank (white) lines appear in the area plot.
I am not sure what causes this problem.
Here is my code without and with ylim:
# all data stored in area_df
normal_loc_uniq <- sort(unique(data$normal_loc))
area_df <- data.frame(matrix(ncol = 3, nrow = 0))
colnames(area_df) <- c("loc", "label", "percentage")
# for each location, calculate the percentage of each label
for (one_loc in normal_loc_uniq) {
  # rows at this location (named loc_rows to avoid masking base::subset)
  loc_rows <- data[data$normal_loc == one_loc, ]
  subset_count <- as.data.frame(round(prop.table(table(loc_rows$normal_label, useNA = "no")), 5))
  names(subset_count) <- c("label", "percentage")
  subset_count$loc <- as.numeric(one_loc)
  subset_count$percentage <- round(subset_count$percentage, 3)
  # report locations whose percentages do not sum to (approximately) 1
  if (sum(subset_count$percentage) < 0.98 || sum(subset_count$percentage) > 1.02) {
    print("error. total percentage is not 1")
  }
  area_df <- rbind(area_df, subset_count)
}
library(ggplot2)
colors <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b",
            "#e377c2", "#7f7f7f", "#bcbd22", "#17becf", "#aaffc3")
ggplot(area_df, aes(x = loc, y = percentage, fill = label)) +
  geom_area(na.rm = TRUE, position = "stack") +
  scale_fill_manual(values = colors) +
  labs(x = "Relative Location", y = "Percentage", fill = "Label") +
  theme_bw()
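One possible explanation, offered as a sketch rather than a confirmed diagnosis: not every label has a row at every location. In the subset above, loc 0.26 has rows only for B1, M2 and M3. geom_area draws each label as its own series, so a label with no row at 0.26 is not treated as 0 there. Its area is instead carried across from the neighbouring locations (how exactly depends on the ggplot2 version), and the stacked total can then exceed 1. ylim(0, 1) sets scale limits and drops values outside them, which would explain the white gaps. The base-R sketch below adds explicit zero rows for the missing label/location combinations; `demo_df`, `grid` and `filled` are hypothetical names, and `demo_df` is a made-up stand-in for area_df.

```r
# Made-up stand-in for area_df: label "C1" has no row at loc 0.26.
demo_df <- data.frame(
  label      = c("B1", "C1", "B1", "B1", "C1"),
  percentage = c(0.6, 0.4, 1.0, 0.5, 0.5),
  loc        = c(0.25, 0.25, 0.26, 0.27, 0.27)
)

# Every label/location combination that should exist.
grid <- expand.grid(
  label = unique(demo_df$label),
  loc   = unique(demo_df$loc),
  stringsAsFactors = FALSE
)

# Left-join the observed percentages; combinations with no row become NA,
# which is then replaced by an explicit 0.
filled <- merge(grid, demo_df, by = c("label", "loc"), all.x = TRUE)
filled$percentage[is.na(filled$percentage)] <- 0
filled <- filled[order(filled$loc, filled$label), ]
print(filled)
```

With the real data, `demo_df` would be replaced by area_df and `filled` passed to ggplot. If the y axis still needs to be restricted, coord_cartesian(ylim = c(0, 1)) zooms the view without dropping data, unlike ylim(0, 1).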
Edit 1: added a complete subset of the data.