1

I'm currently reading "Practical Statistics for Data Scientists" and following along in R as they demonstrate some code. There is one chunk of code I'm particularly struggling to follow the logic of and was hoping someone could help. The code in question is creating a dataframe with 1000 rows where each observation is the mean of 5 randomly drawn income values from the dataframe loans_income. However, I'm getting confused about the logic of the code as it is fairly complicated with a tapply() function and nested rep() statements.

The code to create the dataframe in question is as follows:

samp_mean_5 <- data.frame(income = tapply(sample(loans_income$income,1000*5),
                                          rep(1:1000,rep(5,1000)),
                                          FUN = mean),
           type='mean_of_5')

In particular, I'm confused about the nested rep() statements and the 1000*5 portion of the sample() function. Any help understanding the logic of the code would be greatly appreciated!

For reference, the original dataset loans_income simply has a single column of 50,000 income values.

cliftjc1
  • 43
  • 6

1 Answers1

0

You have 50,000 loans_income in a single vector. Let's break your code down:

tapply(sample(loans_income$income,1000*5),
       rep(1:1000,rep(5,1000)),
       FUN = mean)

I will replace 1000 with 10 and income with random numbers, so it's easier to explain. I also set set.seed(1) so the result can be reproduced.

  1. sample(loans_income$income,1000*5)
    We 50 random incomes from your vector without replacement. They are (temporarily) put into a vector of length 50, so the output looks like this:
> sample(runif(50000),10*5)
 [1] 0.73283101 0.60329970 0.29871173 0.12637654 0.48434952 0.01058067 0.32337850
 [8] 0.46873561 0.72334215 0.88515494 0.44036341 0.81386225 0.38118213 0.80978822
[15] 0.38291273 0.79795343 0.23622492 0.21318431 0.59325586 0.78340477 0.25623138
[22] 0.64621658 0.80041393 0.68511759 0.21880083 0.77455662 0.05307712 0.60320912
[29] 0.13191926 0.20816298 0.71600799 0.70328349 0.44408218 0.32696205 0.67845445
[36] 0.64438336 0.13241312 0.86589561 0.01109727 0.52627095 0.39207860 0.54643661
[43] 0.57137320 0.52743012 0.96631114 0.47151170 0.84099503 0.16511902 0.07546454
[50] 0.85970500
  1. rep(1:1000,rep(5,1000))
    Now we are creating an indexing vector of length 50:
> rep(1:10,rep(5,10))
[1]  1  1  1  1  1  2  2  2  2  2  3  3  3  3  3  4  4  4  4  4  5  5  5  5  5  6  6  6
[29]  6  6  7  7  7  7  7  8  8  8  8  8  9  9  9  9  9 10 10 10 10 10

Those indices "group" the samples from step 1. So basically this vector tells R that the first 5 entries of your "sample vector" belong together (index 1), the next 5 entries belong together (index 2) and so on.

  1. FUN = mean
    Just apply the mean-function on the data.

  2. tapply
    So tapply takes the sampled data (sample-part) and groups them by the second argument (the rep()-part) and applies the mean-function on each group.

If you are familiar with data.frames and the dplyr package, take a look at this (only the first 10 rows are displayed):

set.seed(1)
df <- data.frame(income=sample(runif(5000),10*5), index=rep(1:10,rep(5,10)))
       income index
1  0.42585569     1
2  0.16931091     1
3  0.48127444     1
4  0.68357403     1
5  0.99374923     1
6  0.53227877     2
7  0.07109499     2
8  0.20754511     2
9  0.35839481     2
10 0.95615917     2

I attached the an index to the random numbers (your income). Now we calculate the mean per group:

df %>% 
  group_by(index) %>%
  summarise(mean=mean(income))

which gives us

# A tibble: 10 x 2
   index  mean
   <int> <dbl>
 1     1 0.551
 2     2 0.425
 3     3 0.827
 4     4 0.391
 5     5 0.590
 6     6 0.373
 7     7 0.514
 8     8 0.451
 9     9 0.566
10    10 0.435

Compare it to

set.seed(1)
tapply(sample(runif(5000),10*5),
       rep(1:10,rep(5,10)),
       mean)

which yields basically the same result:

        1         2         3         4         5         6         7         8         9 
0.5507529 0.4250946 0.8273149 0.3905850 0.5902823 0.3730092 0.5143829 0.4512932 0.5658460 
       10 
0.4352546
Martin Gal
  • 16,640
  • 5
  • 21
  • 39
  • Thank you for your response. (1.) Is there any convention for doing '1000*5' as the second argument of the sample function as opposed to just typing '5000'? (or 50, in your example) – cliftjc1 Jun 12 '20 at 15:53
  • (2.) what argument of the rep() function is the second, nested rep() satisfying? Looking at the documentation, is it the 'times" argument? I understand that the overall purpose of this step is to create an indexing vector, but i guess i'm struggling to wrap my head around how it's doing that – cliftjc1 Jun 12 '20 at 15:57
  • I've been meaning to dive into the dplyr package, as I only have a pretty superficial understanding of it so far. But it looks incredibly useful! thank you for the suggestion – cliftjc1 Jun 12 '20 at 16:01
  • @cliftjc1 (1) I don't know any convention. I guess doing `1000*5` should explicit show that you are drawing each 5 samples for the 1000 observations. Breaking it down to `1000*5` enables you to replace it easier (for the purpose of modelling) by `n*m` with a variable number of samples and observations. – Martin Gal Jun 12 '20 at 19:23
  • (2) The second argument of `rep()` is the times argument. So for each entry of the first argument it contains the number of repetitions (in this case: every entry will repeatet 5 times). Take a look at `rep(1:10, 1:10)`. – Martin Gal Jun 12 '20 at 19:27
  • (3) The `dplyr`-package, the whole `tidyverse`-package including `tidyr`, `stringr`, `ggplot2`is really awesome and worth to be used and understood. This crazy `%>%`-style of coding (imho) makes code easier to read and I really like it very much. – Martin Gal Jun 12 '20 at 19:30
  • Last but not least: Feel free to accept the answer if you found it useful. Just click on the tick mark next to the answer. – Martin Gal Jun 12 '20 at 19:32
  • Another question. Since we're selecting random samples of income, is there any functional difference between doing ```tapply(sample(loans_income$income,1000*5), rep(1:10,rep(5,10)), FUN = mean)``` and ```tapply(sample(loans_income$income,1000*5), rep(1:10,times=50), FUN = mean)```? – cliftjc1 Jun 15 '20 at 20:25
  • ```rep(1:10,rep(5,10))``` produces a vector like 1,1,1,1,1,2,2,2,2,2,3,...,10 whereas ```rep(1:10,50)``` produces a vector like 1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,....,10. In terms of the problem (which is showing how by the CLT sampling distributions of means become more normal at bigger sample sizes), there is no difference right? we just end up with different averages since we are grouping different incomes using different indexing vectors – cliftjc1 Jun 15 '20 at 20:30
  • In this case I don't see any real difference between the two ways. – Martin Gal Jun 15 '20 at 21:10
  • Awesome! I tested it by plotting the histograms doing it the two different ways and came to the same conclusion as you! thanks again – cliftjc1 Jun 15 '20 at 22:07