I have a question about visualizing data with ggplot in R, specifically about scaling the y-axis when the data contain outliers.

Let's start with a sample dataset with observations from 31 IDs. 30 IDs are in an expected range and there is one outlier:

```r
# Load libraries
library(tidyverse)
library(ggbeeswarm)
library(data.table)

# Set seed for reproducibility
set.seed(123)

# Create dataset: 30 IDs with baseline values around 50
ID <- sprintf("ID-%s", 1:30)
baseline <- rnorm(30, mean = 50, sd = 3)

# Note: each rnorm(1, ...) draws a single offset that is applied to all IDs
df <- data.frame(ID, baseline) %>%
  mutate(`1` = baseline - rnorm(1, mean = 5, sd = 4),
         `2` = `1` - rnorm(1, mean = 3, sd = 5),
         `3` = `2` - rnorm(1, mean = 1, sd = 3))

# Add outlier (ID-31) with values close to zero at every time point
df <- as.data.frame(rbindlist(list(df, list("ID-31", 0.01, 0.02, 0.03, 1))))

# Reshape to long format: one row per ID and time point
df <- df %>%
  pivot_longer(-ID) %>%
  rename(time = name) %>%
  mutate(time = as.factor(time))

# Plot
ggplot(data = df, aes(x = time, y = value)) +
  geom_quasirandom() +
  theme_classic() +
  scale_x_discrete(limits = c("baseline", "1", "2", "3")) +
  labs(x = "Time", y = "Value")
```

[Image: plot produced by the code above]

Expected output

Since the variation in the upper part of the graph is not clearly visible, I would like to scale the y-axis in a way that shows all values but focuses on a certain part of the plot (in this case, values between 20 and 50).

[Image: expected output]

Question

Is it possible to scale the y-axis in such a way?

Additional info

I am specifically not looking for a data transformation solution. Furthermore, I am aware of the `scale_y_continuous()` function in ggplot and its `limits` argument, but this omits part of the data.

user213544
  • `coord_cartesian()` is probably preferable to `scale_y_continuous()` in such a case, but the values at 0 would still not be visible. Zooming in on the upper range using the [ggforce](https://ggforce.data-imaginist.com/) package might be an option. Or `scale_y_log10()`. Or two facets? Or a broken axis, which is usually and rightfully discouraged. – hplieninger May 11 '20 at 12:32
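The first two options from the comment above can be sketched roughly as follows. This is a minimal illustration, not a tested solution for the exact plot in the question: the toy data frame only mimics the shape of the question's long-format `df`, the zoom range `c(20, 60)` is an assumed choice, and `geom_point()` with jitter stands in for `geom_quasirandom()` to keep dependencies small. The `ggforce::facet_zoom()` part only runs if ggforce is installed.

```r
library(ggplot2)

# Toy data in the same long format as the question's df.
# All values are illustrative, not the question's simulated numbers.
times <- c("baseline", "1", "2", "3")
df <- data.frame(
  ID    = rep(sprintf("ID-%s", 1:30), times = 4),
  time  = factor(rep(times, each = 30), levels = times),
  value = rep(c(50, 45, 42, 40), each = 30) +
    rep(seq(-4.5, 4.5, length.out = 30), times = 4)
)

# One outlying ID with values close to zero at every time point
outlier <- data.frame(
  ID    = "ID-31",
  time  = factor(times, levels = times),
  value = c(0.01, 0.02, 0.03, 1)
)
df <- rbind(df, outlier)

p <- ggplot(df, aes(x = time, y = value)) +
  geom_point(position = position_jitter(width = 0.15, height = 0, seed = 1)) +
  theme_classic() +
  labs(x = "Time", y = "Value")

# coord_cartesian() only changes the visible window; no rows are dropped,
# but the points near 0 fall outside the view
p_zoom <- p + coord_cartesian(ylim = c(20, 60))

# ggforce::facet_zoom() keeps the full view and adds a zoomed panel next to it
if (requireNamespace("ggforce", quietly = TRUE)) {
  p_both <- p + ggforce::facet_zoom(ylim = c(20, 60))
}
```

Printing `p_zoom` shows only the upper range, while `p_both` (if created) shows the full range next to the zoomed range, which is closer to "show everything but focus on one part".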

1 Answer

I don't know anything about creating a broken y-axis with ggplot, but this achieves something similar, provided you can specify in advance which ID is the outlier.

```r
library(tidyverse)
library(ggbeeswarm)
library(data.table)

# Set seed for reproducibility
set.seed(123)

# Create dataset (same as in the question)
ID <- sprintf("ID-%s", 1:30)
baseline <- rnorm(30, mean = 50, sd = 3)

df <- data.frame(ID, baseline) %>%
  mutate(`1` = baseline - rnorm(1, mean = 5, sd = 4),
         `2` = `1` - rnorm(1, mean = 3, sd = 5),
         `3` = `2` - rnorm(1, mean = 1, sd = 3))

# Add outlier
df <- as.data.frame(rbindlist(list(df, list("ID-31", 0.01, 0.02, 0.03, 1))))

# Reshape to long format and flag the known outlier ID
df <- df %>%
  pivot_longer(-ID) %>%
  rename(time = name) %>%
  mutate(time = as.factor(time),
         is_outlier = (as.character(ID) == "ID-31"))

# One facet row per group (FALSE = regular IDs, TRUE = outlier),
# each with its own y-axis range
ggplot(data = df, aes(x = time, y = value)) +
  geom_point() +
  facet_grid(rows = vars(is_outlier),
             scales = "free_y",
             switch = "y") +
  theme_classic() +
  scale_x_discrete(limits = c("baseline", "1", "2", "3")) +
  labs(x = "Time", y = "Value")
```
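If the outlying ID is not known in advance, one possible variation is to flag it from the data instead of hard-coding `"ID-31"`. The sketch below assumes Tukey's 1.5 × IQR rule applied within each time point, which is only one of several reasonable criteria; the helper `is_out()` and the toy data are hypothetical and not part of the code above.

```r
library(ggplot2)

# Toy data in the same long format as above (illustrative values only)
times <- c("baseline", "1", "2", "3")
df <- data.frame(
  ID    = rep(sprintf("ID-%s", 1:30), times = 4),
  time  = factor(rep(times, each = 30), levels = times),
  value = rep(c(50, 45, 42, 40), each = 30) +
    rep(seq(-4.5, 4.5, length.out = 30), times = 4)
)
df <- rbind(df, data.frame(
  ID    = "ID-31",
  time  = factor(times, levels = times),
  value = c(0.01, 0.02, 0.03, 1)
))

# Hypothetical helper: TRUE for values beyond 1.5 * IQR from the quartiles
is_out <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Apply the rule within each time point (ave() returns 0/1 here)
df$value_flagged <- ave(df$value, df$time, FUN = is_out) == 1

# An ID goes to the outlier panel if any of its values is flagged
df$is_outlier <- df$ID %in% df$ID[df$value_flagged]

p_auto <- ggplot(df, aes(x = time, y = value)) +
  geom_point() +
  facet_grid(rows = vars(is_outlier), scales = "free_y", switch = "y") +
  theme_classic() +
  labs(x = "Time", y = "Value")
```

With real data the rule may flag more than one ID or none at all, so it is worth inspecting `unique(df$ID[df$is_outlier])` before plotting.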
Ben