2

Good day,

I will present two [likely] very puny problems for your excellent review.

Problem #1

I have a relatively tidy df (dat) with dim 10299 x 563. The 563 variables common to both datasets [that created] dat are 'subject' (numeric), 'label' (numeric), 3:563 (variable names from a text file). Observations 1:2947 are from a 'test' dataset whereas observations 2948:10299 are from a 'training' dataset.

I'd like to insert a column (header = 'type') into dat that is basically rows 1:2947 comprised of string test and rows 2948:10299 of string train that way I can group later on dataset or other similar aggregate functions in dplyr/tidyr.

I created a test df (testdf = 1:10299: dim(testdf) = 102499 x 1) and then:

testdat[1:2947 , "type"] <- c("test")
testdat[2948:10299, "type"] <- c("train")
> head(ds, 2);tail(ds, 2)
  X1.10299 type
1        1 test
2        2 test
      X1.10299  type
10298    10298 train
10299    10299 train

So I really don't like that there is now a column of X1.10299.

Questions:

  • Is there a better and more expedient way to create a column that has what I'm looking for based upon my use case above?
  • What is a good way to actually insert that column into 'dat' so that I can use it later for grouping with dplyr?

Problem #2

The way I arrived at my [nearly] tidy df (dat) from above was to two take dfs (test and train) of the form dim(2947 x 563 and 7352 x 563), respectively, and rbinding them together.

I confirm that all of my variable names are present after the binding effort by something like this:

test.names <- names(test)
train.names <- names(train)
identical(test.names, train.names)
> TRUE

What is interesting and of primary concern is that if I try to use the bind_rows function from 'dplyr' to perform the same binding exercise:

dat <- bind_rows(test, train)

It returns a dataframe that apparently keeps my all of my observations (x: 10299) but now my variable count is reduced from 563 to 470!

Question:

  • Does anyone know why my variables are being chopped?
  • Is this the best way to combine two dfs of the same structure for later slicing/dicing with dplyr/

tidyr?

Thank you for your time and consideration of these matters.

Sample test/train dfs for review (the left most numeric are df indices):

test df test[1:10, 1:5]

   subject labels tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z
1        2      5         0.2571778       -0.02328523       -0.01465376
2        2      5         0.2860267       -0.01316336       -0.11908252
3        2      5         0.2754848       -0.02605042       -0.11815167
4        2      5         0.2702982       -0.03261387       -0.11752018
5        2      5         0.2748330       -0.02784779       -0.12952716
6        2      5         0.2792199       -0.01862040       -0.11390197
7        2      5         0.2797459       -0.01827103       -0.10399988
8        2      5         0.2746005       -0.02503513       -0.11683085
9        2      5         0.2725287       -0.02095401       -0.11447249
10       2      5         0.2757457       -0.01037199       -0.09977589

train df train[1:10, 1:5]

   subject label tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z
1        1     5         0.2885845      -0.020294171        -0.1329051
2        1     5         0.2784188      -0.016410568        -0.1235202
3        1     5         0.2796531      -0.019467156        -0.1134617
4        1     5         0.2791739      -0.026200646        -0.1232826
5        1     5         0.2766288      -0.016569655        -0.1153619
6        1     5         0.2771988      -0.010097850        -0.1051373
7        1     5         0.2794539      -0.019640776        -0.1100221
8        1     5         0.2774325      -0.030488303        -0.1253604
9        1     5         0.2772934      -0.021750698        -0.1207508
10       1     5         0.2805857      -0.009960298        -0.1060652

Actual Code (ignore the function calls/I'm doing most of the testing via console).

[http://archive.ics.uci.edu/ml/machine-learning-databases/00240/]The data set I'm using with this code. 1

run_analysis <- function () {
    #Vars available for use throughout the function that should be preserved
    vars <- read.table("features.txt", header = FALSE, sep = "")
    lookup_table <- data.frame(activitynum = c(1,2,3,4,5,6), 
                               activity_label = c("walking", "walking_up", 
                                                  "walking_down", "sitting", 
                                                  "standing", "laying"))
    test <- test_read_process(vars, lookup_table)
    train <- train_read_process(vars, lookup_table)
}

test_read_process <- function(vars, lookup_table) {
    #read in the three documents for cbinding later
    test.sub <- read.table("test/subject_test.txt", header = FALSE)
    test.labels <- read.table("test/y_test.txt", header = FALSE)
    test.obs <- read.table("test/X_test.txt", header = FALSE, sep = "")

    #cbind the cols together and set remaining colNames to var names in vars
    test.dat <- cbind(test.sub, test.labels, test.obs)  
    colnames(test.dat) <- c("subject", "labels", as.character(vars[,2]))

    #Use lookup_table to set the "test_labels" string values that correspond
    #to their integer IDs
    #test.lookup <- merge(test, lookup_table, by.x = "labels", 
    #               by.y ="activitynum", all.x = T)

    #Remove temporary symbols from globalEnv/memory
    rm(test.sub, test.labels, test.obs)

    #return
    return(test.dat)
}

train_read_process <- function(vars, lookup_table) {
    #read in the three documents for cbinding
    train.sub <- read.table("train/subject_train.txt", header = FALSE)
    train.labels <- read.table("train/y_train.txt", header = FALSE)
    train.obs <- read.table("train/X_train.txt", header = FALSE, sep = "")

    #cbind the cols together and set remaining colNames to var names in vars
    train.dat <- cbind(train.sub, train.labels, train.obs)    
    colnames(train.dat) <- c("subject", "label", as.character(vars[,2]))

    #Clean up temporary symbols from globalEnv/memory
    rm(train.sub, train.labels, train.obs, vars)

    return(train.dat)
}
Zach
  • 1,316
  • 2
  • 14
  • 21
  • 1
    Hi Zach, this is definitely curious as nothing about what you're doing jumps out to me as something that would produce what you're seeing. Can you duplicate the behavior with a subset of data that you could reproduce here (smaller number of rows/columns that can be `dput()` or read in? – Forrest R. Stevens May 11 '15 at 17:32
  • @ForrestR.Stevens Updated the question with a reproducible [hopefully] dataset. – Zach May 11 '15 at 17:46

1 Answers1

1

The problem that you're facing stems from the fact that you have duplicated names in the variable list that you're using to create your data frame objects. If you ensure that the column names are unique and shared between the objects the code will run. I've included a fully working example based on the code you used above (with fixes and various edits noted in the comments):

vars <- read.table(file="features.txt", header=F, stringsAsFactors=F)

##  FRS: This is the source of original problem:
duplicated(vars[,2])
vars[317:340,2]
duplicated(vars[317:340,2])
vars[396:419,2]

##  FRS: I edited the following to both account for your data and variable
##    issues:
test_read_process <- function() {
  #read in the three documents for cbinding later
  test.sub <- read.table("test/subject_test.txt", header = FALSE)
  test.labels <- read.table("test/y_test.txt", header = FALSE)
  test.obs <- read.table("test/X_test.txt", header = FALSE, sep = "")

  #cbind the cols together and set remaining colNames to var names in vars
  test.dat <- cbind(test.sub, test.labels, test.obs)  
  #colnames(test.dat) <- c("subject", "labels", as.character(vars[,2]))
  colnames(test.dat) <- c("subject", "labels", paste0("V", 1:nrow(vars)))

  return(test.dat)
}

train_read_process <- function() {
  #read in the three documents for cbinding
  train.sub <- read.table("train/subject_train.txt", header = FALSE)
  train.labels <- read.table("train/y_train.txt", header = FALSE)
  train.obs <- read.table("train/X_train.txt", header = FALSE, sep = "")

  #cbind the cols together and set remaining colNames to var names in vars
  train.dat <- cbind(train.sub, train.labels, train.obs)    
  #colnames(train.dat) <- c("subject", "labels", as.character(vars[,2]))
  colnames(train.dat) <- c("subject", "labels", paste0("V", 1:nrow(vars)))

  return(train.dat)
}


test_df <- test_read_process()
train_df <- train_read_process()

identical(names(test_df), names(train_df))


library("dplyr")

## FRS: These could be piped together but I've kept them separate for clarity:
train_df %>%
  mutate(test="train") -> 
  train_df

test_df %>%
  mutate(test="test") -> 
  test_df

test_df %>% 
  bind_rows(train_df) -> 
  out_df

head(out_df)
out_df

##  FRS: You can set your column names to those of the original 
##    variable list but you still have duplicates to deal with:
names(out_df) <- c("subject", "labels", as.character(vars[,2]), "test")

duplicated(names(out_df))
Forrest R. Stevens
  • 3,435
  • 13
  • 21
  • very interesting! You think the reason why I have a column delta between rbind and bind_rows is because bind_rows is dropping the columns with non-alphanumerics? – Zach May 11 '15 at 18:48
  • I don't know without seeing actual excerpts from your code and the way that all of your data are being read in. But from your checks above with `identical()` that didn't seem to be the case? Like I said, I was puzzled by that outcome and couldn't see an obvious reason why you'd be experiencing it especially if all data types were the same, etc (`bind_rows()` will barf if column data types are mixed). – Forrest R. Stevens May 11 '15 at 19:22
  • Updated original post to include the actual code I've written thus far through console testing. – Zach May 11 '15 at 20:00
  • Well, I'm still at a loss to explain what you're observing, without the original data files I may be stuck. I've edited the code above to include you original column names, thinking that it could be strangeness in `dplyr` name checking, but I still can't see why the `bind_rows()` would proceed without error but end up dropping columns. Typically, if column names don't match the columns are retained and NAs filled. – Forrest R. Stevens May 11 '15 at 20:21
  • Hi @Forrest I added the link to the dataset I'm using right before the code block. It is unusual behavior. – Zach May 11 '15 at 20:48
  • I reworked my code to reflect what was going on, namely that the variable list you were using to rename the data as it was being read in (the features.txt file) contains duplicate column names. – Forrest R. Stevens May 11 '15 at 22:03
  • Thanks for taking a look into that. I would have never expected that the variable names would have been duplicates in the source file. What a confounding variable (pun fully intended) that took what appeared to be a pretty straight forward tidying exercise toward adventure. I'll explore your solutions and see what I can leverage for my purposes toward cleaning up the variables. – Zach May 11 '15 at 22:44
  • It just struck me that dplyr may have recognized the duplicate columns and dropped them by default. Do you think that is the case? – Zach May 12 '15 at 00:40
  • I've tested that on dplyr 0.4.1 and it throws an `Error: duplicated column name` when I try the `bind_rows()` call. I honestly have no idea why you saw the dropped columns. Maybe others do? – Forrest R. Stevens May 12 '15 at 03:00
  • Wow. I've tried it in two different environments with RStudio in Ubuntu (64 bit) and Windows 7 (32 bit) using the same code and it performs exactly the same with the same results. – Zach May 12 '15 at 04:25