I am working in R with the following dataset of sentences taken from books. Each row holds the book ID, the book's cover colour (Colour), and a sentence ID that links the sentence to its book.

My dataset
    Book ID| sentence ID| Colour      | Sentences
    1      | 1          | Blue        | Text goes here
    1      | 2          | Blue        | Text goes here
    1      | 3          | Blue        | Text goes here
    2      | 4          | Red         | Text goes here
    2      | 5          | Red         | Text goes here
    3      | 6          | Green       | Text goes here
    4      | 7          | Orange      | Text goes here
    4      | 8          | Orange      | Text goes here
    4      | 9          | Orange      | Text goes here
    4      | 10         | Orange      | Text goes here
    4      | 11         | Orange      | Text goes here
    5      | 12         | Blue        | Text goes here
    5      | 13         | Blue        | Text goes here
    6      | 14         | Red         | Text goes here
    6      | 15         | Red         | Text goes here
    .

I would like to take four randomized subsamples (each containing 25% of the original data) under the following conditions:

1) the distribution of book-colours should remain the same as in the original dataset. If there were 10% blue books, this should also be reflected in the subsamples

2) the subsample should not be taken/split by number of rows (which is the sentence ID) but by "Book ID". This means if Book ID 4 is sampled, then all sentences 7,8,9,10,11 should be in the sample dataset.

3) Also, each Book ID should only be in one of the 4 sub samples - this means if I decided to merge all 4 subsamples, I want to end up with the original dataset again.

What would be the best solution to split my dataset in the way described above?

lole_emily
  • Perhaps this helps you: https://stats.stackexchange.com/a/326015/282453 – Martin Gal Jun 15 '20 at 15:14
  • Maybe I've not understood well, but if the proportions of colors should be equal in each sample and in the full dataset (1), but you do not want repetitions in samples(3), if you put the green book (`book ID` == 3) in the first sample (for example), what do you expect about it in the remaining three samples? – s__ Jun 23 '20 at 12:55
  • How large is your dataset? In regards to your first specification, you want to reflect the distribution of color in each sample. If you want a truly randomized sample, you would not adjust for this, instead relying on a large enough sample size that your subsample would represent the rest of your samples (or at least **sufficiently** represent, depending on your hypothesis and statistical test). – mhovd Jun 23 '20 at 14:35

2 Answers

Here is the short version:

library(tidyverse)

df <- tribble(
    ~Book_ID, ~sentence_ID, ~Colour, ~Sentences
    ,1      , 1, "Blue", "Text goes here"
    ,1      , 2, "Blue", "Text goes here"
    ,1      , 3, "Blue", "Text goes here"
    ,2      , 4, "Red", "Text goes here"
    ,2      , 5, "Red", "Text goes here"
    ,3      , 6, "Green", "Text goes here"
    ,4      , 7, "Orange", "Text goes here"
    ,4      , 8, "Orange", "Text goes here"
    ,4      , 9, "Orange", "Text goes here"
    ,4      , 10, "Orange", "Text goes here"
    ,4      , 11, "Orange", "Text goes here"
    ,5      , 12, "Blue", "Text goes here"
    ,5      , 13, "Blue", "Text goes here"
    ,6      , 14, "Red", "Text goes here"
    ,6      , 15, "Red", "Text goes here"
)

df %>%
    left_join(
        df %>%
            distinct(Book_ID, Colour) %>%
            group_by(Colour) %>%
            mutate(sub_sample = sample.int(4, size = n(), replace = TRUE))
        , by = c("Book_ID", "Colour"))

This will give you:

# A tibble: 15 x 5
   Book_ID sentence_ID Colour Sentences        sub_sample
     <dbl>       <dbl> <chr>  <chr>                 <int>
 1       1           1 Blue   "Text goes here"          3
 2       1           2 Blue   "Text goes here"          3
 3       1           3 Blue   "Text goes here"          3
 4       2           4 Red    "Text goes here"          1
 5       2           5 Red    "Text goes here"          1
 6       3           6 Green  "Text goes here"          1
 7       4           7 Orange "Text goes here"          2
 8       4           8 Orange "Text goes here"          2
 9       4           9 Orange "Text goes here"          2
10       4          10 Orange "Text goes here"          2
11       4          11 Orange "Text goes here"          2
12       5          12 Blue   "Text goes here"          2
13       5          13 Blue   "Text goes here"          2
14       6          14 Red    "Text goes here"          3
15       6          15 Red    "Text goes here"          3

And a short explanation of the code:

Let's start with the nested part

# take the dataframe
df %>%
    # ...and extract the distinct combinations of book and colour
    distinct(Book_ID, Colour) %>%
    # and now for each colour...
    group_by(Colour) %>%
    # ...provide random numbers from 1 to 4
    mutate(sub_sample = sample.int(4, size = n(), replace = TRUE))

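Because the numbers are drawn within each colour, every colour is spread over the four samples with equal probability, so the samples have the same colour distribution on average (exactly equal shares are not guaranteed).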

The result of this is now left_joined to the original dataframe on the two columns we "distincted" before - which ensures that there can be no duplicates.
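As a rough check of conditions 2 and 3, the labelled data can be split into a list of data frames and stacked again. This is only a sketch: the names `assignment`, `labelled`, `sub_samples` and the fixed seed are assumptions for illustration, and the data frame is a small stand-in for the question's data.

```r
library(dplyr)

set.seed(42)  # assumption: fixed seed only so the sketch is reproducible

# Small stand-in for the question's data (hypothetical values)
df <- data.frame(
  Book_ID     = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4, 5, 5, 6, 6),
  sentence_ID = 1:15,
  Colour      = c(rep("Blue", 3), rep("Red", 2), "Green",
                  rep("Orange", 5), rep("Blue", 2), rep("Red", 2)),
  Sentences   = "Text goes here"
)

# One row per book, with a subsample number drawn within each colour
assignment <- df %>%
  distinct(Book_ID, Colour) %>%
  group_by(Colour) %>%
  mutate(sub_sample = sample.int(4, size = n(), replace = TRUE)) %>%
  ungroup()

# Attach the book-level number to every sentence of that book
labelled <- left_join(df, assignment, by = c("Book_ID", "Colour"))

# One data frame per subsample number that actually occurs
sub_samples <- split(labelled, labelled$sub_sample)

# Conditions 2 and 3: every book sits in exactly one subsample
books_per_sample <- labelled %>%
  distinct(Book_ID, sub_sample) %>%
  count(Book_ID)
all_books_once <- all(books_per_sample$n == 1)

# Condition 3: stacking the pieces gives back the original rows
recombined <- bind_rows(sub_samples) %>% arrange(sentence_ID)
same_rows <- identical(recombined$sentence_ID, df$sentence_ID)
```

With so few books, some of the four subsample numbers may not be drawn at all, in which case `sub_samples` has fewer than four elements.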


One addition

To have the same colour distribution in the subsamples you of course need a sufficient number of books for each colour. For example, with only 20 green books, random assignment is very unlikely to place exactly five in each subsample, so the colour shares will differ between subsamples. In that case you would probably want to group colours together for the sampling. However, that is a statistics question and beyond the scope of a programming forum.
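One way to see how far the subsamples drift from the overall colour shares is to tabulate books by colour and subsample. The following is a minimal sketch in base R on made-up data: the 200 books, the colour probabilities, and the seed are all assumptions for illustration.

```r
set.seed(1)  # assumption: fixed seed for reproducibility only

# Hypothetical book-level table: 200 books in four colours
books <- data.frame(
  Book_ID = 1:200,
  Colour  = sample(c("Blue", "Red", "Green", "Orange"), 200,
                   replace = TRUE, prob = c(0.4, 0.3, 0.1, 0.2))
)

# Independent random assignment of each book to one of four subsamples
books$sub_sample <- sample.int(4, size = nrow(books), replace = TRUE)

# Number of books per colour (rows) and subsample (columns)
counts <- table(books$Colour, books$sub_sample)

# Colour shares within each subsample (each column sums to 1)
shares_by_sample <- prop.table(counts, margin = 2)

# Colour shares in the full data, for comparison
shares_overall <- prop.table(table(books$Colour))

print(round(shares_by_sample, 2))
print(round(shares_overall, 2))
```

Comparing the columns of `shares_by_sample` with `shares_overall` shows how closely each subsample matches the full data; rare colours will typically show the largest deviations.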

Georgery

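This should work. Books are grouped by colour, and each book is then assigned a number from 1 to 4, drawn without replacement from a pool of repeated 1:4 whose length is the next multiple of 4 at or above the number of books in that colour. This keeps the split close to even. The data frame is then split by sample number.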
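The pool-size arithmetic `(n() + 3) %/% 4 * 4` used in the pipeline can be checked on its own. The sketch below is base R; `pool_length` is a hypothetical helper name introduced only for this illustration.

```r
# Hypothetical helper: pool length for a colour group with n books,
# i.e. the smallest multiple of 4 that is >= n
pool_length <- function(n) (n + 3) %/% 4 * 4

# Pool length for groups of 1 to 9 books
n_books <- 1:9
pool_sizes <- data.frame(n_books = n_books, pool = pool_length(n_books))
print(pool_sizes)

# For a group of 5 books the pool is 1:4 repeated twice (length 8)
pool <- rep_len(1:4, pool_length(5))

# Drawing 5 values without replacement: no number can occur more than twice
draw <- sample(pool, 5, replace = FALSE)
print(table(draw))
```

Because each number appears only `pool_length(n) / 4` times in the pool, no subsample can receive more than that many books of one colour.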

library(readr)
library(dplyr)
library(tidyr)

books <- read_delim(
'Book ID| sentence ID| Colour      | Sentences
    1      | 1          | Blue        | Text goes here
    1      | 2          | Blue        | Text goes here
    1      | 3          | Blue        | Text goes here
    2      | 4          | Red         | Text goes here
    2      | 5          | Red         | Text goes here
    3      | 6          | Green       | Text goes here
    4      | 7          | Orange      | Text goes here
    4      | 8          | Orange      | Text goes here
    4      | 9          | Orange      | Text goes here
    4      | 10         | Orange      | Text goes here
    4      | 11         | Orange      | Text goes here
    5      | 12         | Blue        | Text goes here
    5      | 13         | Blue        | Text goes here
    6      | 14         | Red         | Text goes here
    6      | 15         | Red         | Text goes here', 
'|', trim_ws = TRUE)

books %>%
  # sampling is done on a book ID level. We group by book 
  # and nest the sentences, to get only one row per book.
  group_by(`Book ID`) %>% 
  nest(book_data = c(`sentence ID`, Sentences)) %>% 

  # We want to split colours evenly. We therefore draw a sample number from 1:4
  # for each book within a colour group. To keep the split even, we draw without
  # replacement from a vector that repeats 1:4 until its length is the first
  # multiple of 4 that is >= the number of books in the group.
  group_by(Colour) %>%
  mutate(sample = sample(rep_len(1:4, (n() + 3) %/% 4 * 4), n(), replace = FALSE)) %>% 

  # Unnest the sentences again.
  unnest(book_data) %>% 

  # Split the data frame into lists by the sample number.
  split(.$sample) 

This returns a list with one tibble per sample:

$`1`
# A tibble: 4 x 5
# Groups:   Colour [2]
  `Book ID` Colour `sentence ID` Sentences      sample
      <dbl> <chr>          <dbl> <chr>           <int>
1         5 Blue              12 Text goes here      1
2         5 Blue              13 Text goes here      1
3         6 Red               14 Text goes here      1
4         6 Red               15 Text goes here      1

$`2`
# A tibble: 2 x 5
# Groups:   Colour [1]
  `Book ID` Colour `sentence ID` Sentences      sample
      <dbl> <chr>          <dbl> <chr>           <int>
1         2 Red                4 Text goes here      2
2         2 Red                5 Text goes here      2

$`3`
# A tibble: 1 x 5
# Groups:   Colour [1]
  `Book ID` Colour `sentence ID` Sentences      sample
      <dbl> <chr>          <dbl> <chr>           <int>
1         3 Green              6 Text goes here      3

$`4`
# A tibble: 8 x 5
# Groups:   Colour [2]
  `Book ID` Colour `sentence ID` Sentences      sample
      <dbl> <chr>          <dbl> <chr>           <int>
1         1 Blue               1 Text goes here      4
2         1 Blue               2 Text goes here      4
3         1 Blue               3 Text goes here      4
4         4 Orange             7 Text goes here      4
5         4 Orange             8 Text goes here      4
6         4 Orange             9 Text goes here      4
7         4 Orange            10 Text goes here      4
8         4 Orange            11 Text goes here      4

JohannesNE
  • I am not a statistician, but I think that the sampling part `sample(rep_len(1:4, (n() + 3) %/% 4 * 4 ), n(), replace = F)` is not random. Each book must be *randomly (!)* assigned to a group, but this does not work here. Think about the logic here: in a 4-book colour, if books are already assigned to sample 1, 2 and 3 the last book can *only* be assigned to 4 - this is not random. Hence, maybe you should change the answer to `sample(1:4, n(), replace = TRUE)`. But just a suggestion. :) – Georgery Jun 25 '20 at 08:21
  • I gave this solution to comply with condition 1 (all samples should have the same distribution of colours). You are right that this constraint reduces the randomness. – JohannesNE Jun 25 '20 at 11:11
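The trade-off discussed in these comments can be made concrete by simulating the assignment for one colour group. This is a sketch in base R, not part of either answer: `assign_balanced` is a hypothetical third variant, and the group size of 10 books, the 1000 repetitions, and the seed are assumptions.

```r
set.seed(123)  # assumption: fixed seed so the sketch is reproducible

# Independent draws: each book gets a number from 1:4 regardless of the others
assign_independent <- function(n) sample.int(4, size = n, replace = TRUE)

# Pool draws: n values without replacement from repeated 1:4
assign_pool <- function(n) {
  sample(rep_len(1:4, (n + 3) %/% 4 * 4), n, replace = FALSE)
}

# Hypothetical variant: shuffle rep_len(1:4, n), then relabel the four
# subsamples at random so that no subsample number is favoured
assign_balanced <- function(n) {
  labels <- sample(4)
  labels[sample(rep_len(1:4, n))]
}

# Difference between the largest and smallest subsample for one assignment
spread <- function(x) diff(range(tabulate(x, nbins = 4)))

# Largest spread seen over 1000 simulated groups of 10 books
n_books <- 10
worst_spread <- c(
  independent = max(replicate(1000, spread(assign_independent(n_books)))),
  pool        = max(replicate(1000, spread(assign_pool(n_books)))),
  balanced    = max(replicate(1000, spread(assign_balanced(n_books))))
)
print(worst_spread)
```

With 10 books, the balanced variant never differs by more than one book between subsamples and the pool draw by no more than two, while independent draws have no such bound. The tighter the bound, the less independent the individual assignments are.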