I would like to add a new column to an xdf file. I tested both transforms and transformFunc in rxDataStep.

This line of code works fine for me:

rxDataStep(nyc_jan_xdf,transforms = list(newCol5=ifelse(payment_type==1,10,20)))

But if I use transformFunc:

CashVsCard<-function(x)
{
  if(x$payment_type==1){
    x$newCol13=10
  } else {
    x$newCol13=20
  }
  return(x)
}
rxDataStep(nyc_jan_xdf,transformFunc = CashVsCard)

it doesn't work and returns this error:

Error in doTryCatch(return(expr), name, parentenv, handler) : 
  The variable 'newCol13' has a different number of rows than other columns in the data: 1 vs. 10
In addition: Warning message:
In if (x$payment_type == 1) { :
  the condition has length > 1 and only the first element will be used

Why doesn't transformFunc work?

An example of my data:

structure(list(VendorID = c(2L, 2L, 2L, 1L, 1L, 1L), tpep_pickup_datetime = c("2016-01-01 00:00:00", 
"2016-01-01 00:00:00", "2016-01-01 00:00:03", "2016-01-01 00:00:04", 
"2016-01-01 00:00:05", "2016-01-01 00:00:06"), tpep_dropoff_datetime = c("2016-01-01 00:00:00", 
"2016-01-01 00:00:00", "2016-01-01 00:15:49", "2016-01-01 00:14:32", 
"2016-01-01 00:14:27", "2016-01-01 00:04:44"), passenger_count = c(5L, 
1L, 6L, 1L, 2L, 1L), trip_distance = c(4.90000009536743, 10.539999961853, 
2.4300000667572, 3.70000004768372, 2.20000004768372, 1.70000004768372
), pickup_longitude = c(-73.9807815551758, -73.9845504760742, 
-73.9693298339844, -74.0043029785156, -73.9919967651367, -73.9821014404297
), pickup_latitude = c(40.7299118041992, 40.6795654296875, 40.7635383605957, 
40.7422409057617, 40.718578338623, 40.7746963500977), RatecodeID = c(1L, 
1L, 1L, 1L, 1L, 1L), store_and_fwd_flag = c("N", "N", "N", "N", 
"N", "Y"), dropoff_longitude = c(-73.9444732666016, -73.9502716064453, 
-73.9956893920898, -74.0073623657227, -74.0051345825195, -73.9709396362305
), dropoff_latitude = c(40.7166786193848, 40.7889251708984, 40.7442512512207, 
40.7069358825684, 40.7399444580078, 40.7967071533203), payment_type = c(1L, 
1L, 1L, 1L, 1L, 1L), fare_amount = c(18, 33, 12, 14, 11, 7), 
    extra = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5), mta_tax = c(0.5, 
    0.5, 0.5, 0.5, 0.5, 0.5), tip_amount = c(0, 0, 3.99000000953674, 
    3.04999995231628, 1.5, 1.64999997615814), tolls_amount = c(0, 
    0, 0, 0, 0, 0), improvement_surcharge = c(0.300000011920929, 
    0.300000011920929, 0.300000011920929, 0.300000011920929, 
    0.300000011920929, 0.300000011920929), total_amount = c(19.2999992370605, 
    34.2999992370605, 17.2900009155273, 18.3500003814697, 13.8000001907349, 
    9.94999980926514)), .Names = c("VendorID", "tpep_pickup_datetime", 
"tpep_dropoff_datetime", "passenger_count", "trip_distance", 
"pickup_longitude", "pickup_latitude", "RatecodeID", "store_and_fwd_flag", 
"dropoff_longitude", "dropoff_latitude", "payment_type", "fare_amount", 
"extra", "mta_tax", "tip_amount", "tolls_amount", "improvement_surcharge", 
"total_amount"), row.names = c(NA, 6L), class = "data.frame")
Kaja
  • It kind of gives it away, doesn't it? `The variable 'newCol13' has a different number of rows than other columns in the data: 1 vs. 10`. I suggest you check your input. Also, for us to help you properly, an example of your data would be nice. – Erik Schutte May 15 '17 at 09:08
  • Did you give me the head of `nyc_jan_xdf`? What are your dimensions? – Erik Schutte May 15 '17 at 09:31
  • I gave you the result of this code: `dput(head(rxDataStep(nyc_jan_xdf,transforms = list(newCol5=ifelse(payment_type==1,10,20)))))` – Kaja May 15 '17 at 09:32
  • `The number of rows (10906858) times the number of columns (20)` – Kaja May 15 '17 at 09:33
  • Since I don't want to switch R versions atm (I'm on 3.2.2 and that version does not work with `RevoScaleR`), I can recommend [this](https://msdn.microsoft.com/microsoft-r/scaler-user-guide-data-transform); it clearly shows how you can subset your data and transform it. – Erik Schutte May 15 '17 at 09:42

1 Answer

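I have found it. The function passed to transformFunc receives a whole chunk of rows at once, so x$payment_type is a vector, and if() only looks at its first element. That is why the new column ended up with 1 value instead of 10. Looping over the elements is not the best solution, but it works. I only had to change the function like this: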

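CashVsCard <- function(x)
{
  # x is one chunk of the data: a list of column vectors, not a single row
  p <- length(x$payment_type)
  # seq_len(p) is safe for an empty chunk, whereas 1:p would give c(1, 0)
  for (i in seq_len(p))
  {
    if (x$payment_type[i] == 1)
    {
      x$cash_vs_Card4[i] <- "Card"
    } else {
      x$cash_vs_Card4[i] <- "Others"
    }
  }
  return(x)
}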
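
A vectorised alternative is probably simpler than the loop. The sketch below assumes, as the error message suggests, that transformFunc hands the function a list of equal-length column vectors for each chunk. It uses ifelse(), as in the transforms call from the question, so the condition is evaluated for every row. The chunk here is a hand-made list with made-up values, used only so the function can be tried without RevoScaleR.

```r
# Vectorised version: ifelse() tests every element of payment_type,
# so the new column has the same length as the other columns in the chunk.
CashVsCard <- function(x) {
  x$cash_vs_Card4 <- ifelse(x$payment_type == 1, "Card", "Others")
  x
}

# Simulated chunk (illustrative values, not taken from the real xdf file):
# a plain list of column vectors, which is the assumed shape of the input.
chunk <- list(payment_type = c(1L, 2L, 1L, 3L))
result <- CashVsCard(chunk)
print(result$cash_vs_Card4)

# With RevoScaleR loaded, the call from the question stays the same:
# rxDataStep(nyc_jan_xdf, transformFunc = CashVsCard)
```

Because the whole chunk is handled in one call, this avoids both the "condition has length > 1" warning and the row-count mismatch.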
Kaja