I have a dataset for classifying won cases (14399) versus lost cases (8677), with 912 predictor variables. I am trying to oversample the lost cases so that both classes end up roughly the same size (about 14399 cases each).

TARGET is the column with lost (0) and won (1) cases:

table(dat_train$TARGET)

    0     1 
 8677 14399 

Now I am trying to balance them using ovun.sample from the ROSE package:

dat_train_bal <- ovun.sample(dat_train$TARGET~., data = dat_train, p=0.5, seed = 1, method = "over")

I get this error:

Error in parse(text = x, keep.source = FALSE) : 
  <text>:1:17538: unexpected symbol
1: PPER_409030143+BP_RESPPER_9639064007+BP_RESPPER_7459058285+BP_RESPPER_9339059882+BP_RESPPER_9339058664+BP_RESPPER_5209073603+BP_RESPPER_5209061378+CRM_CURRPH_Initiation+Quotation+CRM_CURRPH_Ne

Can anyone help? Thanks :-)

  • Welcome to SO; please spend a minute to see how to properly format your code & error messages (done it for you this time). – desertnaut Nov 08 '19 at 12:53

1 Answer

Reproducing your code with a toy example, I found an error in your formula: `dat_train$TARGET~.` should be written as `TARGET~.`, because the column is already looked up in the data frame passed via `data`.

dframe <- tibble::tibble(val = sample(c("a", "b"), size = 100, replace = TRUE, prob = c(.1, .9))
                         , xvar = rnorm(100)
                         )

# Use oversampling
dframe_os <- ROSE::ovun.sample(formula = val ~ ., data = dframe, p=0.5, seed = 1, method = "over")

table(dframe_os$data$val)
cbo
  • Thanks for your answer. I tried what you suggested and got exactly the same error: `dat_train_bal <- ovun.sample(formula = TARGET~., data = dat_train, method = "over")` gives `Error in parse(text = x, keep.source = FALSE) : <text>:1:17528: unexpected symbol 1: PPER_409030143+BP_RESPPER_9639064007+BP_RESPPER_7459058285+BP_RESPPER_9339059882+BP_RESPPER_9339058664+BP_RESPPER_5209073603+BP_RESPPER_5209061378+CRM_CURRPH_Initiation+Quotation+CRM_CURRPH_Ne` –  Nov 09 '19 at 16:25
  • Can you try the above using the toy data? If that works, the problem comes from your data. In that case you should clean your data set first and look for special characters, for example in the column names. – cbo Nov 11 '19 at 14:40
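
As a follow-up to the suggestion about special characters, here is a minimal sketch of one way to check the column names. It assumes the parse error is triggered by column names that are not syntactically valid in R (for example names containing spaces), which the truncated error message does not confirm. The data frame `dat_toy` and its column names are hypothetical stand-ins for `dat_train`; only base R functions are used.

```r
# Hypothetical stand-in for dat_train; the column names are invented
# for illustration and are not taken from the real data set.
dat_toy <- data.frame(
  TARGET      = c(0, 1, 1),
  `phase one` = c(1, 0, 0),   # contains a space, so not a valid R name
  resp_1      = c(0, 1, 0),
  check.names = FALSE         # keep the names exactly as given
)

# Any name that make.names() would change is not syntactically valid
# and may break a formula that is assembled from the column names.
bad <- names(dat_toy)[names(dat_toy) != make.names(names(dat_toy))]
print(bad)

# Replace the names with valid, unique ones before building the formula.
names(dat_toy) <- make.names(names(dat_toy), unique = TRUE)
print(names(dat_toy))
```

If `bad` is empty for the real data, the column names are probably not the cause and the values themselves would be the next thing to inspect. If it is not empty, renaming the columns this way and then calling `ovun.sample(TARGET ~ ., data = dat_train, ...)` again might be worth a try.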