I have a dataset consisting of 4 variables: CR, EN, LC and VU.

Here are the first few values of my dataset:

CR = c(2, 9, 10, 14, 24, 27, 29, 30, 34, 43, 50, 74, 86, 105, 140, 155, 200, …)

EN = c(24, 52, 86, 110, 144, 154, 206, 242, 300, 302, 366, 403, 422, 427, 427, 434, 448, …)

LC = c(447, 476, 543, 580, 647, 685, 745, 763, 819, 821, 863, 904, 908, 926, 934, 951, 968, …)

VU = c(75, 96, 97, 217, 297, 498, 511, 551, 560, 564, 570, 575, 609, 673, 681, 700, 755, …)

I want to plot histograms of these variables, grouped in a single plot in R, that show the normal distribution and density, similar to the plot below.

[Image: my desired graph example]

Could you please help me?

Sina Kh.

1 Answer

Here are the distributions, a clear-cut use of geom_density.

But first, to address "grouping", we need to pivot/reshape the data so that ggplot2 can automatically handle grouping. This will result in a column with a character (or factor) for each of the "CR", "EN", "LC", or "VU", and another column with the particular value. When pivoting, there is typically one or more columns that are preserved (an id, an x-value, a time/date, or something similar), but we don't have any data that would suggest something to preserve.

longdat <- tidyr::pivot_longer(dat, everything())
longdat
# # A tibble: 68 × 2
#    name  value
#    <chr> <dbl>
#  1 CR        2
#  2 EN       24
#  3 LC      447
#  4 VU       75
#  5 CR        9
#  6 EN       52
#  7 LC      476
#  8 VU       96
#  9 CR       10
# 10 EN       86
# # … with 58 more rows
# # ℹ Use `print(n = ...)` to see more rows

library(ggplot2)
ggplot(longdat, aes(x = value, group = name, fill = name)) +
  geom_density(alpha = 0.2)

[Image: ggplot density plot]
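The question also asks for a histogram with a normal curve, which the density plot above does not show. A minimal sketch of one way to add both, assuming ggplot2 3.3.0 or later (for `after_stat`) and the same 17-row sample as in the Data section: the normal curve is just `dnorm` evaluated with each group's own mean and standard deviation. The names `groups`, `normdat` and `p`, the bin count, and the choice to facet are illustrative assumptions, not something taken from the desired graph.

```r
library(ggplot2)

# Same 17-row sample as in the Data section of this answer.
dat <- data.frame(
  CR = c(2, 9, 10, 14, 24, 27, 29, 30, 34, 43, 50, 74, 86, 105, 140, 155, 200),
  EN = c(24, 52, 86, 110, 144, 154, 206, 242, 300, 302, 366, 403, 422, 427, 427, 434, 448),
  LC = c(447, 476, 543, 580, 647, 685, 745, 763, 819, 821, 863, 904, 908, 926, 934, 951, 968),
  VU = c(75, 96, 97, 217, 297, 498, 511, 551, 560, 564, 570, 575, 609, 673, 681, 700, 755)
)
longdat <- tidyr::pivot_longer(dat, everything())

# For each group, evaluate a normal density with that group's mean and sd
# on a grid of 100 points spanning the observed range.
groups <- split(longdat$value, longdat$name)
normdat <- do.call(rbind, lapply(names(groups), function(g) {
  v <- groups[[g]]
  x <- seq(min(v), max(v), length.out = 100)
  data.frame(name = g, value = x, dens = dnorm(x, mean = mean(v), sd = sd(v)))
}))

p <- ggplot(longdat, aes(x = value, fill = name)) +
  # Histogram on the density scale so it is comparable with the curves.
  geom_histogram(aes(y = after_stat(density)), bins = 10,
                 alpha = 0.4, colour = "grey30") +
  # Kernel density estimate.
  geom_density(alpha = 0.2) +
  # Fitted normal curve (dashed).
  geom_line(data = normdat, aes(y = dens), linetype = "dashed") +
  facet_wrap(~ name, scales = "free")
print(p)
```

Faceting is used here only because the four variables cover quite different ranges; dropping `facet_wrap` would overlay them on one panel instead.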

tidyr::pivot_longer works; alternatively, one can use melt from either reshape2 or data.table:

longdat <- reshape2::melt(dat, c())
## names are 'variable' and 'value' instead of 'name' and 'value'
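For the data.table variant mentioned above, a sketch might look like the following. It assumes the data.table package is installed and converts the frame first, since data.table's `melt` method is meant for `data.table` objects; `longdt` is just an illustrative name.

```r
library(data.table)

# Same 17-row sample as in the Data section of this answer.
dat <- data.frame(
  CR = c(2, 9, 10, 14, 24, 27, 29, 30, 34, 43, 50, 74, 86, 105, 140, 155, 200),
  EN = c(24, 52, 86, 110, 144, 154, 206, 242, 300, 302, 366, 403, 422, 427, 427, 434, 448),
  LC = c(447, 476, 543, 580, 647, 685, 745, 763, 819, 821, 863, 904, 908, 926, 934, 951, 968),
  VU = c(75, 96, 97, 217, 297, 498, 511, 551, 560, 564, 570, 575, 609, 673, 681, 700, 755)
)

# Treat every column as a measure variable; there are no id columns to keep.
longdt <- melt(as.data.table(dat), measure.vars = names(dat))
## names are 'variable' and 'value', as with reshape2::melt
```

Unlike `pivot_longer`, which interleaves the columns row by row, `melt` stacks them one after another (all of CR, then all of EN, and so on); this makes no difference to the plot.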

Data

dat <- structure(list(CR = c(2, 9, 10, 14, 24, 27, 29, 30, 34, 43, 50, 74, 86, 105, 140, 155, 200), EN = c(24, 52, 86, 110, 144, 154, 206, 242, 300, 302, 366, 403, 422, 427, 427, 434, 448), LC = c(447, 476, 543, 580, 647, 685, 745, 763, 819, 821, 863, 904, 908, 926, 934, 951, 968), VU = c(75, 96, 97, 217, 297, 498, 511, 551, 560, 564, 570, 575, 609, 673, 681, 700, 755)), class = "data.frame", row.names = c(NA, -17L))
r2evans
  • Thank you. How can I change the x and y labels, make the background white, and make the legend smaller? – Sina Kh. Jan 30 '23 at 18:20
  • See [`?labs`](https://ggplot2.tidyverse.org/reference/labs.html) for the labels and [`?theme`](https://ggplot2.tidyverse.org/reference/theme.html) for the background and such; you might go with `... + labs(x="some x", y="some y") + theme_bw()` for a starter, then look into the `legend.*` portions within `theme(..)`. – r2evans Jan 30 '23 at 18:30
  • Hi @r2evans, I got this error when running `longdat <- tidyr::pivot_longer(dat, everything())`: Error in `vec_interleave()`: ! Can't recycle `..1` (size 720) to match `..2` (size 796). I think this error occurs because my columns do not have the same number of rows; for example, one column has 700 rows and another has 400. – Sina Kh. Feb 21 '23 at 04:41
  • The _only_ way in R that a `data.frame` can have columns with different apparent number of rows is using list-columns, where there may be `NULL` values in the middle ... but even then, the list-columns all have the same length as `nrow(theframe)`. I don't know what your real data looks like, but 700 and 400 rows ... can't happen in an R frame. – r2evans Feb 21 '23 at 14:29
  • Can I use a value of zero to make the columns the same size? Is this correct from a statistical and technical point of view? This is my data... https://filetransfer.io/data-package/YRFYQPVh#link – Sina Kh. Feb 21 '23 at 18:41
  • Your _spreadsheet_ may have different rows per column, but that's not how R sees it when you import. It's not clear if that worksheet has four separate `data.frame`s or if you get one frame with four columns. If the latter, then you don't have 400-vs-700 rows, you have 0 `NA`s in the longest column (`EN`) and between 76 and 426 `NA`s in the others. Also, depending on how you are reading it into R, you may have seven columns, not four, where three are all-`NA`. – r2evans Feb 21 '23 at 18:53
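
The points raised in the comments can be put together in one sketch. The unequal-length vectors below are a hypothetical stand-in built from the first few sample values, not the asker's real file, and the names `cols`, `padded` and `p` are illustrative. The shorter columns are padded with `NA` rather than zero, because a zero would be counted as a real observation and would distort the density; `values_drop_na = TRUE` then discards the padding when pivoting. The `labs`/`theme_bw` calls follow the suggestion in the comment above, and the specific legend sizes are arbitrary assumptions.

```r
library(ggplot2)

# Hypothetical columns of unequal length (NOT the asker's real data).
cols <- list(
  CR = c(2, 9, 10, 14, 24),
  EN = c(24, 52, 86, 110, 144, 154, 206),
  LC = c(447, 476, 543, 580),
  VU = c(75, 96, 97, 217, 297, 498)
)

# Pad every vector with NA up to the length of the longest one,
# so that they fit into a single data.frame.
n <- max(lengths(cols))
padded <- as.data.frame(lapply(cols, function(v) c(v, rep(NA_real_, n - length(v)))))

# Drop the NA padding while reshaping to long format.
longdat <- tidyr::pivot_longer(padded, everything(), values_drop_na = TRUE)

p <- ggplot(longdat, aes(x = value, fill = name)) +
  geom_density(alpha = 0.2) +
  labs(x = "some x", y = "some y") +   # axis labels
  theme_bw() +                         # white background
  theme(                               # smaller legend
    legend.title = element_text(size = 8),
    legend.text = element_text(size = 7),
    legend.key.size = unit(0.4, "cm")
  )
print(p)
```

If the data is imported from a spreadsheet as one frame, the `NA` padding is typically already there, so only the `values_drop_na = TRUE` step would be needed.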