0

I need to generate bins from a data.frame based on the values of one column. I have tried the function "cut".

For example: I want to create bins of air temperature values in the column "AirTDay" in a data frame:

AirTDay (oC)    
8.16           
10.88          
5.28

19.82
23.62
13.14


28.84
32.21
17.44
31.21

I need the bin intervals to include all values in a range of 2 degrees centigrade from that initial value (i.e. 8-9.99, 10-11.99, 12-13.99...), to be labelled with the average value of the range (i.e. 9.5, 10.5, 12.5...), and to respect blank cells, returning "NA" in the bins column.

The output should look as:

Air_T (oC)    TBins 
8.16          8.5 
10.88         10.5 
5.28          NA 
              NA
19.82         20.5
23.62         24.5
13.14         14.5
              NA
              NA
28.84         28.5
32.21         32.5
17.44         18.5
31.21         32.5

I've gotten as far as:

setwd('C:/Users/xxx') 

temp_data <- read.csv("temperature.csv", sep = ",", header = TRUE)
TAir <- temp_data$AirTDay

Tmin <- round(min(TAir, na.rm = FALSE), digits = 0) # is start at minimum value
Tmax <- round(max(TAir, na.rm = FALSE), digits = 0)  
int <- 2 # bin ranges 2 degrees
mean_int <- int/2

int_range <- seq(Tmin, Tmax + int, int) # generate bin sequence 
bin_label <- seq(Tmin + mean_int, Tmax + mean_int, int) # generate labels

temp_data$TBins <- cut(TAir, breaks = int_range, ordered_result = FALSE, labels = bin_label)

The output table looks correct, but for some reason it shows a sequential additional column, shifts column names, and collapse all values eliminating blank cells. Something like this:

Air_T (oC)    TBins 
    1         8.16          8.5 
    2         10.88         10.5 
    3         5.28          NA 
    4         19.82         20.5
    5         23.62         24.5
    6         13.14         14.5
    7         28.84         28.5
    8         32.21         32.5
    9         17.44         18.5
   10         31.21         32.5

Any ideas on where am I failing and how to solve it?

ebb
  • 274
  • 2
  • 6
  • 16

1 Answers1

1
v<-ceiling(max(dat$V1,na.rm=T))
breaks<-seq(8,v,2)
labels=seq(8.5,length.out=length(s)-1,by=2)
transform(dat,Tbins=cut(V1,breaks,labels))
      V1 Tbins
1   8.16   8.5
2  10.88  10.5
3   5.28  <NA>
4     NA  <NA>
5  19.82  18.5
6  23.62  22.5
7  13.14  12.5
8     NA  <NA>
9     NA  <NA>
10 28.84  28.5
11 32.21  <NA>
12 17.44  16.5
13 31.21  30.5

This result follows the logic given: we have

paste(seq(8,v,2),seq(9.99,v,by=2),sep="-")
 [1] "8-9.99"   "10-11.99" "12-13.99" "14-15.99" "16-17.99" "18-19.99" "20-21.99"
 [8] "22-23.99" "24-25.99" "26-27.99" "28-29.99" "30-31.99"

From this we can tell that 19.82 will lie between 18 and 20 thus given the value 18.5, similar to 10.88 being between 10-11.99 thus assigned the value 10.5

Onyambu
  • 67,392
  • 3
  • 24
  • 53
  • thanks for your answer. I also solved the "NA" issue by adding specifications to the read comand `temp_data <- read.csv("temperature.csv", sep = ",", header = TRUE, na.strings = "NA", blank.lines.skip = FALSE)` Still haven't been able to solve shift in the column name. When I save the new table including the bins, the code inserts a sequential column (1 to n) with the name of the temperature column, the temperature column then gets the TBin heading and the TBin column does not have any heading. Is there any way to avoid inserting the sequential column? – ebb Jan 26 '18 at 19:40
  • I cant tell why the shift but you can rectify that by `setNames(FINALDATA[-1],names(FINALDATA))`. Just curious. Why do you claim 8.5 is the average of 8-9.99?? is the average not 9?? – Onyambu Jan 26 '18 at 19:58
  • I tries your suggestion and still don't get it to work; there must be something I'm doing wrong. I'll keep playing with it anyway. Thanks. You're right, it is 9 for the example I gave in my question, sorry about that. I copied and pasted from some other source as an example, with a 1 C increments instead of two. Thanks again for your help – ebb Jan 26 '18 at 20:12
  • 1
    If the error involves length then you will have to do `setNames(FINALDATA[-1],head(names(FINALDATA)),-1)` – Onyambu Jan 26 '18 at 20:16
  • I just figured it out. It was just a silly mistake. The initial column was the row names, that when saving directly from the data frame was printed as an additional column, shifting all column names. I just set `row.name` to `FALSE`, and now the output is correct. In any case, I learned a lot with your suggestions, and will apply them on future codes, thanks! – ebb Jan 26 '18 at 21:07