I have a data frame in which adjacent rows that share the same value should be grouped together and assigned a numeric identifier. The first group should get the value 1, the next group the value 2, and so on. I wrote a for loop to do this, but it takes too long to execute. Here's an example of what the data looks like:

Day    Weather
1       Rainy
2       Rainy
3       Sunny
4       Sunny
5       Sunny
6       Rainy
7       Rainy
8       Windy
9       Windy

I would like to add the following column:

Day    Weather    Change.in.Weather
1       Rainy             1
2       Rainy             1
3       Sunny             2
4       Sunny             2
5       Sunny             2
6       Rainy             3
7       Rainy             3
8       Windy             4
9       Windy             4

Here is the loop I'm currently using:

dataset$Change.in.Weather <- 1
for (i in 2:nrow(dataset)) {
  # Same weather as the previous row: stay in the current group
  if (dataset$Weather[i] == dataset$Weather[i - 1]) {
    dataset$Change.in.Weather[i] <- dataset$Change.in.Weather[i - 1]
  } else {
    # Weather changed: start the next group
    dataset$Change.in.Weather[i] <- dataset$Change.in.Weather[i - 1] + 1
  }
}

Since my dataset has over 1 million rows, the for loop takes too long to execute, so I'm looking for a faster solution. Thanks!

campbeb2

1 Answer

This would be faster with data.table. Convert the data.frame to a data.table by reference (setDT), then create the new column by assignment (:=) after applying the run-length ID function (rleid) to the column of interest:

library(data.table)
setDT(dataset)[, Change.in.Weather := rleid(Weather)]
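
If you would rather not add a dependency, the same run IDs can probably be built in base R. This is only a sketch: the sample data below is made up to mirror the question's example, and it assumes the column has at least one row and no NA values. It flags each row that differs from the previous one and takes the cumulative sum of those flags.

```r
# Hypothetical sample data mirroring the example in the question.
dataset <- data.frame(
  Day = 1:9,
  Weather = c("Rainy", "Rainy", "Sunny", "Sunny", "Sunny",
              "Rainy", "Rainy", "Windy", "Windy"),
  stringsAsFactors = FALSE
)

w <- dataset$Weather

# TRUE wherever a row differs from the row before it. The first row
# always starts a new run, so it is flagged TRUE explicitly.
is_new_run <- c(TRUE, w[-1] != w[-length(w)])

# The running count of run starts is the group identifier.
dataset$Change.in.Weather <- cumsum(is_new_run)

# Equivalent formulation using run-length encoding from base R.
runs <- rle(w)
ids_from_rle <- rep(seq_along(runs$lengths), times = runs$lengths)
```

Both versions are vectorized, so they avoid the row-by-row loop. Whether they keep up with rleid on a million rows is worth timing on your own data.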
akrun