
I have a data frame with 3 columns:

ObjectID: the unique identifier of a polygon (or row)
AvgWTRisk: probability (0-1) of a disturbance in a forest; ~0.11 is the highest value
HA: area of a polygon in the forest

I want to develop a function to create a random sample from the data frame, based on the probability value. Here's an example of the data structure:

data

      OBJECTID AvgWTRisk        HA
32697    32697 0.0008456 7.7465000
36480    36480 0.0050852 7.9329797
13805    13805 0.0173463 0.7154995
38796    38796 0.0026580 0.2882192
8494      8494 0.0089310 6.4686595
23609    23609 0.0090647 6.1246000

`dput` output:

structure(list(OBJECTID = c(32697L, 36480L, 13805L, 38796L, 8494L, 
23609L), AvgWTRisk = c(0.0008456, 0.0050852, 0.0173463, 0.002658, 
0.008931, 0.0090647), HA = c(7.7465, 7.9329797, 0.7154995, 0.2882192, 
6.4686595, 6.1246)), row.names = c(32697L, 36480L, 13805L, 38796L, 
8494L, 23609L), class = "data.frame")

I am attempting to do this using the sample() function in R.

Is there any way to use the sum of area as my `size =` target, as opposed to a number of rows, like this:

Landscape_WTDisturbed <- Landscape_WTRisk[sample(1:nrow(Landscape_WTRisk),
                                                 size = sum(HA >= 100*0.95 && HA <= 100*1.05),
                                                 prob = WTProb, replace = FALSE),]

where WTProb is a vector of AvgWTRisk, i.e. `WTProb <- as.vector(Landscape_WTRisk$AvgWTRisk)`, and HA is the area column from the data frame.

The sample selection above gives me a data frame with all of the columns but no rows.

As opposed to:

Landscape_WTDisturbed <- Landscape_WTRisk[sample(1:nrow(Landscape_WTRisk),
                                                 size = 10,
                                                 prob = WTProb, replace = FALSE),]

which works, providing a sample of 10 rows. However, I have no control over the total area being selected.

Should I try to achieve this with a while loop, where the summed area of the selected rows is the stopping criterion, and small selections of rows are incrementally added together until the target is reached?
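Something like this is what I have in mind (an untested sketch only; `target_HA` is a placeholder value, and I've used the small example data above to stand in for the full table):

```r
# Reproduce the example data from the question
Landscape_WTRisk <- data.frame(
  OBJECTID  = c(32697L, 36480L, 13805L, 38796L, 8494L, 23609L),
  AvgWTRisk = c(0.0008456, 0.0050852, 0.0173463, 0.002658, 0.008931, 0.0090647),
  HA        = c(7.7465, 7.9329797, 0.7154995, 0.2882192, 6.4686595, 6.1246)
)

set.seed(1)
target_HA <- 15  # placeholder target area in hectares
pool   <- Landscape_WTRisk
picked <- pool[0, ]  # empty data frame with the same columns

# Draw one probability-weighted row at a time until the summed
# area reaches the target (or the pool is exhausted)
while (sum(picked$HA) < target_HA && nrow(pool) > 0) {
  i <- sample(seq_len(nrow(pool)), size = 1, prob = pool$AvgWTRisk)
  picked <- rbind(picked, pool[i, ])
  pool   <- pool[-i, ]
}
```

My worry is that sampling one row per iteration like this could be slow on the full data frame, which is why I'm asking whether there is a cleaner way.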

Thank you in advance!

TWRB
    From your description, I don't understand what you're trying to do. Can you please try to clarify how you want to use "the probability value" (`AvgWTRisk`?) in the sampling process? – ulfelder Jan 05 '20 at 16:23
  • I'm trying to select rows using the sample function where AvgWTRisk is the value for 'prob'. I had to turn 'AvgWTRisk' into a vector, as per the requirements of the sample function, hence the use of 'WTProb'. However, the issue with the sample function for my purpose is the inability to control the number of samples by anything other than the size criteria (which is simply the number of rows). I tried to use the sample function, but want to control the size of the sample by summed value of selected samples in the 'HA' column, as opposed to the number of rows. – TWRB Jan 05 '20 at 16:39
  • The final output I want is a new data frame, called "Landscape_WTDisturbed", for example, of sampled rows from the full data frame that totals up to a specific amount of area (within 5%). For clarity, the AvgWTRisk is the risk of a natural disturbance selecting a forest stand, thus why I need to use this as the probability. – TWRB Jan 05 '20 at 16:42

2 Answers


I hope I understand what you are asking. The following code first creates a permutation of your data such that rows with higher AvgWTRisk tend to end up closer to the top of the table. In a second step, rows in the middle of the table are selected based on their cumulative sum of HA falling in a certain range.

set.seed(123)
WTProb <- Landscape_WTRisk$AvgWTRisk
Landscape_WTDisturbed <- Landscape_WTRisk[sample(1:nrow(Landscape_WTRisk),
                                                 size = nrow(Landscape_WTRisk),
                                                 prob = WTProb, replace = FALSE),]
Landscape_WTDisturbed$HA.sum = cumsum(Landscape_WTDisturbed$HA)
HA.sum.min = 10
HA.sum.max = 25
Landscape_WTDisturbed = Landscape_WTDisturbed[
    Landscape_WTDisturbed$HA.sum >= HA.sum.min &
    Landscape_WTDisturbed$HA.sum <= HA.sum.max,]
Landscape_WTDisturbed
##       OBJECTID AvgWTRisk        HA   HA.sum
## 23609    23609 0.0090647 6.1246000 14.77308
## 38796    38796 0.0026580 0.2882192 15.06130
## 32697    32697 0.0008456 7.7465000 22.80780
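To tie this to the "within 5% of a target area" requirement in the question, the window can be derived from a target value instead of being hard-coded. A sketch with a placeholder `target` of 20 ha; the final line checks whether the tolerance was actually met, since with small data a single large polygon near the cut-off can leave the total short of the lower bound:

```r
# Same permutation-plus-cumsum idea, with the cut-off derived from a
# target area and a 5% tolerance (the target value is a placeholder)
set.seed(123)
Landscape_WTRisk <- data.frame(
  OBJECTID  = c(32697L, 36480L, 13805L, 38796L, 8494L, 23609L),
  AvgWTRisk = c(0.0008456, 0.0050852, 0.0173463, 0.002658, 0.008931, 0.0090647),
  HA        = c(7.7465, 7.9329797, 0.7154995, 0.2882192, 6.4686595, 6.1246)
)
target <- 20  # placeholder target area in hectares

# Weighted permutation: higher-risk rows tend to land near the top
perm <- Landscape_WTRisk[sample(nrow(Landscape_WTRisk),
                                prob = Landscape_WTRisk$AvgWTRisk), ]
perm$HA.sum <- cumsum(perm$HA)

# Keep rows from the top while the running total stays under the
# upper tolerance bound
Landscape_WTDisturbed <- perm[perm$HA.sum <= target * 1.05, ]
sum(Landscape_WTDisturbed$HA) >= target * 0.95  # did we reach the lower bound?
```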
BigFinger
  • This is a good start for me. The size of the sample however carries over all of the rows from the original data frame. What's the purpose of using the probability if also selecting all of the rows? Is there any way I could use, for example, a while loop to select only a small number of rows at a time and bind them together then subset during each iteration of the while loop? – TWRB Jan 05 '20 at 19:10
  • You could, of course, select a smaller number of rows, but unless you are concerned about your dataset being too large, there would be no benefit. The purpose of using the probability in the sample is that you are not getting a random permutation of the data. The rows with higher probabilities will tend to be towards the top of the data frame and those with lower probabilities towards the bottom. As long as you are keeping only the middle rows, whether you compute the bottom rows or not will not change your results. – BigFinger Jan 05 '20 at 22:38

I've attempted as such:

WTProb <- Landscape_WTRisk$AvgWTRisk
Landscape_WTDisturbed <- Landscape_WTRisk[sample(1:nrow(Landscape_WTRisk),
                                                 size = 1000,
                                                 prob = WTProb, replace = FALSE),]
Landscape_WTDisturbed$HA.sum = cumsum(Landscape_WTDisturbed$HA)

Landscape_WTDisturbed <- Landscape_WTDisturbed[Landscape_WTDisturbed$HA.sum<=DisturbanceArea*1.05,]

This uses the cumsum value to add up the HA column and then selects all of the rows that together add up to the target. I can confirm that this approach, derived from the one recommended by BigFinger (thank you), produces appropriate results. See below:

1) The full dataset's distribution of risk

summary(Landscape_WTRisk$AvgWTRisk)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0000286 0.0013508 0.0030834 0.0061175 0.0072636 0.121604

2) The sample distribution of risk

summary(Landscape_WTDisturbed$AvgWTRisk)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.002977 0.006563 0.010800 0.014997 0.015196 0.045924

As you can tell, the distribution was influenced by the probability weighting: the sample of 1000 rows has substantially higher AvgWTRisk values than the distribution in the original dataset.

This approach would not work if more than 1000 rows were needed to reach the cumulative sum target. I am still not sure how to make it work more dynamically: if the 'DisturbanceArea' target were to grow beyond what a sample of 1000 rows can cover, this approach would fall apart.
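One way I could make it more dynamic (a sketch only; I've substituted the small example data from the question for the full table, and `DisturbanceArea` gets a placeholder value): keep doubling the sample size until the weighted sample's cumulative HA covers the target, then trim with cumsum as before:

```r
# Dynamic version: grow the sample until its total HA covers the target,
# then trim with cumsum (sketch; example data stands in for the full table)
set.seed(42)
Landscape_WTRisk <- data.frame(
  OBJECTID  = c(32697L, 36480L, 13805L, 38796L, 8494L, 23609L),
  AvgWTRisk = c(0.0008456, 0.0050852, 0.0173463, 0.002658, 0.008931, 0.0090647),
  HA        = c(7.7465, 7.9329797, 0.7154995, 0.2882192, 6.4686595, 6.1246)
)
DisturbanceArea <- 15  # placeholder target area in hectares

n <- 2
repeat {
  # Double the sample size each round, capped at the number of rows
  n <- min(n * 2, nrow(Landscape_WTRisk))
  idx <- sample(seq_len(nrow(Landscape_WTRisk)), size = n,
                prob = Landscape_WTRisk$AvgWTRisk)
  samp <- Landscape_WTRisk[idx, ]
  samp$HA.sum <- cumsum(samp$HA)
  if (max(samp$HA.sum) >= DisturbanceArea || n == nrow(Landscape_WTRisk)) break
}
Landscape_WTDisturbed <- samp[samp$HA.sum <= DisturbanceArea * 1.05, ]
```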

TWRB