
I am new to R and am writing R code for a personal project/exercise. The data come from a survey on the ethnic identity of people in Hong Kong; I used the 2019 data from http://data.hkupop.hku.hk/v3/hkupop/ethnic_identity/ch.html.

After removing NA values and keeping only the columns I need (which reduced the data from 1015 observations to 573), I noticed that the data are highly imbalanced, so I tried under-sampling, ROSE and SMOTE.

I removed the following columns (by index) from the set:

df_f <- df[,-c(1,2,5,6,8,9,11,12,14,15,17,18,20,21,25,26,27,29,32,33,34,35,37)]

However, the target is not binary, so I collapsed the levels of eth_id into two groups: 0 = 1 & 3 (Hong Konger and Hong Kong Chinese) and 1 = 2 & 4 (Chinese and Chinese Hong Kong citizen).

How I combined the factors

library(car)   # recode() with this string syntax is car::recode
library(plyr)  # revalue()

df_p$eth_id <- car::recode(df_p$eth_id, "c('1', '3')='1+3';c('2', '4') = '2+4'")
df_p$eth_id <- revalue(df_p$eth_id, c('1+3' = "0", '2+4' = "1"))
  • 0 = Hong Kong Citizen + Hong Kong Chinese Citizen
  • 1 = Chinese Citizen + Chinese Hong Kong Citizen
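
For reference, the same collapsing can be sketched in base R, without car or plyr. This is a minimal sketch on a made-up vector (`toy` is hypothetical, not the survey data), assuming the original codes are stored as the strings "1" to "4".

```r
# Hypothetical toy vector standing in for the original eth_id codes.
toy <- factor(c("1", "2", "3", "4", "1", "3"))

# Assigning a named list to levels() merges old levels into new ones:
# the list names are the new levels, the values are the old levels.
levels(toy) <- list("0" = c("1", "3"), "1" = c("2", "4"))

# Counts per merged class.
table(toy)
```

Doing the merge in one step avoids the intermediate '1+3' and '2+4' labels, so there is one less place for a level to be mistyped.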

How I renamed the columns

df_f <- df_f %>%
  rename(
    eth_id     = Q001,
    HongKonger = Q002A,
    Chinese    = Q003A,
    PRC        = Q004A,
    CH_race    = Q005A,
    Asian      = Q006A,
    global     = Q007A,
    class1     = mid,
    housing1   = type,
    housing2   = housingv2,
    pi         = inclin
  )

How I processed my NAs and unnecessary outliers

For columns [,2:7], I replaced NAs with 0, for example df_f$HongKonger <- ifelse(is.na(df_f$HongKonger),0,df_f$HongKonger), and so on for the other columns.
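
The column-by-column replacement can also be written once for all rating columns. A minimal sketch, assuming a made-up data frame (`toy` and `rating_cols` are hypothetical names; only two of the six rating columns are shown):

```r
# Hypothetical toy data frame; column names mirror the question, values are made up.
toy <- data.frame(
  eth_id     = factor(c("0", "1", "0")),
  HongKonger = c(9, NA, 2),
  Chinese    = c(NA, 9, 1)
)

# Assumed stand-in for the rating columns (columns 2:7 in the real data).
rating_cols <- c("HongKonger", "Chinese")

# Replace NA with 0 in each rating column, leaving other columns untouched.
toy[rating_cols] <- lapply(toy[rating_cols], function(x) ifelse(is.na(x), 0, x))

toy
```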

And for the others, I removed the NAs like this:

df_p <- na.omit(df_p, cols= c("eth_id","sex","agegp","edugp","occgp","class","class2","housing1","housing2","pi"), invert=FALSE) 

At this point, I was left with 14 columns, which I renamed (please see above). The final structure of the data I used for ROSE and SMOTE is shown below :-)

Furthermore, I also removed rows that were outliers, such as an unidentifiable ethnic identity (8881, or level 5):

df_f <- df_f[!df_f$Q001 == "8881",]
table(df_f$Q001)
df_f <- df_f[!df_f$eth_id == "Don't know / hard to say",] 
  • The column name in this code depends on when it is run: use Q001 before the renaming and eth_id after it.
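
One detail of this kind of filter may be worth checking: if the column being tested contains NA, a logical subset such as `df[!df$x == "8881", ]` returns an all-NA row for each missing value. The following is a minimal sketch on made-up data (`toy`, `kept_naive` and `kept_safe` are hypothetical names); whether it applies to the survey data depends on whether Q001 still had NAs at that step.

```r
# Hypothetical toy data: one unidentifiable code and one missing answer.
toy <- data.frame(
  Q001 = factor(c("1", "8881", NA, "2")),
  sex  = c(1, 2, 1, 2)
)

# Logical subsetting produces an all-NA row wherever the condition is NA.
kept_naive <- toy[!toy$Q001 == "8881", ]

# %in% never returns NA, so the row with a missing Q001 keeps its other
# columns and no all-NA row is created.
kept_safe <- toy[!toy$Q001 %in% "8881", ]

# Drop the now-unused "8881" level from the factor.
kept_safe$Q001 <- droplevels(kept_safe$Q001)

anyNA(kept_naive$sex)
anyNA(kept_safe$sex)
```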

Now, I keep getting this error when I run ROSE: Error in [<-.data.frame(*tmp*, , indY, value = c(1L, 1L, 1L, 1L, 1L, : missing values are not allowed in subscripted assignments of data frames.

This is confusing, because I made sure to remove the NA values completely (all the existing questions about this error trace it to NAs, which does not seem to apply to my data), and I even converted all my factor columns to numeric, in case the function was not handling the factor values.

I am also getting this error message for SMOTE: Error in names(dn) <- dnn : attempt to set an attribute on NULL. This makes me even more confused, to the point that I doubt whether the data are suitable for machine learning at all.
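
A quick way to confirm that no missing values remain is to count them per column. This is a minimal sketch on made-up data (`toy`, `na_per_column` and `complete` are hypothetical names), not output from the survey data:

```r
# Hypothetical toy data with one NA in the target and one in a predictor.
toy <- data.frame(
  eth_id = factor(c("0", "1", NA, "0")),
  sex    = c(1, 2, 1, NA),
  agegp  = c(6, 5, 2, 2)
)

# Number of missing values per column.
na_per_column <- colSums(is.na(toy))

# Rows with no missing value in any column.
complete <- toy[complete.cases(toy), ]

# Class counts, showing NA as its own category if any remain.
table(complete$eth_id, useNA = "ifany")
```

Running the same two checks on train_data just before the resampling call would show whether any NA survived the earlier steps.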

Here is the final structure of my data for your reference: 
'data.frame':   573 obs. of  14 variables:
 $ eth_id    : Factor w/ 2 levels "0","1": 2 2 1 2 1 1 1 1 1 1 ...
 $ HongKonger: num  9 0 0 0 0 2 0 2 0 8 ...
 $ Chinese   : num  9 9 1 3 7 0 7 9 0 0 ...
 $ PRC       : num  8 9 1 3 7 3 1 0 1 0 ...
 $ CH_race   : num  12 10 0 3 7 3 0 7 3 4 ...
 $ Asian     : num  0 7 6 0 0 2 2 0 0 6 ...
 $ global    : num  0 0 0 0 0 3 7 0 10 0 ...
 $ sex       : num  1 2 2 1 2 1 1 2 1 2 ...
 $ agegp     : num  6 5 2 2 6 5 2 4 6 1 ...
 $ edugp     : num  2 3 2 3 1 2 2 2 3 3 ...
 $ class1    : num  3 3 3 5 3 3 4 4 4 3 ...
 $ housing1  : num  1 1 2 2 1 2 1 2 1 1 ...
 $ housing2  : num  3 3 1 4 3 1 2 1 3 3 ...
 $ pi        : num  3 2 1 2 1 1 1 4 1 1 ...
 - attr(*, "na.action")= 'omit' Named int  14 24 46 52 58 67 77 84 94 129 ...
  ..- attr(*, "names")= chr  "25" "44" "82" "90" ...

How I divided the data into train and test sets

    library(caret)  # createDataPartition()

    set.seed(123)
    index <- createDataPartition(df_p$eth_id, p = 0.7, list = FALSE)
    train_data <- df_p[index, ]
    test_data  <- df_p[-index, ]

    head(test_data)
    str(train_data)

How I used ROSE for under-sampling

library(ROSE)
ovun.sample(formula = train_data$eth_id ~ ., data = train_data, method="under", N = 250,seed = 123)$data

How I used ROSE for "both"

ovun.sample(formula = train_data$eth_id ~ . , data = train_data, method="both",
            na.action=options("na.omit")$na.action,p=0.5,seed = 123)$data

How I used SMOTE

SMOTE(form = train_data$eth_id ~., data = train_data, perc.over = 100, k = 5, perc.under = 200)

I keep getting: 1) for ROSE: Error in [<-.data.frame(*tmp*, , indY, value = c(1L, 1L, 1L, 1L, 1L, : missing values are not allowed in subscripted assignments of data frames

2) for SMOTE: Error in names(dn) <- dnn : attempt to set an attribute on NULL
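
One thing that might be worth checking is how the response is written in the formula. The sketch below uses only base R's model.frame on made-up data (`toy`, `mf_dollar` and `mf_bare` are hypothetical names) to show how `data$column ~ .` and `column ~ .` are interpreted differently. Neither ovun.sample nor SMOTE is called here, and whether this explains the two errors above is an assumption, not something confirmed.

```r
# Hypothetical toy data with the same kind of columns as train_data.
toy <- data.frame(
  eth_id = factor(c("0", "1", "0", "0")),
  sex    = c(1, 2, 2, 1),
  agegp  = c(6, 5, 2, 2)
)

# Response written as data$column: the first column of the model frame is
# named "toy$eth_id", and "." still expands to every column, including eth_id.
mf_dollar <- model.frame(toy$eth_id ~ ., data = toy)

# Response written as a bare column name: it is looked up in `data` by name,
# and "." expands to the remaining columns only.
mf_bare <- model.frame(eth_id ~ ., data = toy)

names(mf_dollar)
names(mf_bare)
```

If this is relevant, the thing to try would be `eth_id ~ .` in place of `train_data$eth_id ~ .` in the calls above, with `data = train_data` unchanged.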

  • I am also unsure whether converting all the factors to numeric values still leaves the data valid.

Thank you in advance for sharing your knowledge.

  • Can you upload the dataset df? I cannot find the column eth_id when I download the data. I think the problem lies with the re-factoring – StupidWolf Oct 25 '19 at 10:49
  • @StupidWolf - Thank you for the comment. Yes, I had changed the column names, and sorry for not including them in the question. I edited my question above and added a section, "How I renamed the columns". I also agree that there might be errors in how I did the re-factoring; the problem is that I am still a novice in R and statistics and I can't figure out "where" huhu. Thank you for your kind heart! – Hyelim_kim1028 Oct 27 '19 at 23:30
  • how did you remove NAs? if you do this: df_f <- df[,-c(1,2,5,6,8,9,11,12,14,15,17,18,20,21,25,26,27,29,32,33,34,35,37)] ; dim(df_f[rowSums(is.na(df_f))==0,]) then you are only left with 14 rows – StupidWolf Oct 28 '19 at 11:59
  • Hi @StupidWolf! I edited my question section so that it will be more convenient for you to read. I was left with 14 columns after removing those columns and I removed the unnecessary rows which are deemed outliers (2 rows), I changed the NA values from columns 2:7 (of changed df) to 0 since they are evaluations of their ethnic identity and then I omitted NAs from columns 8:14. I treated the columns 2:7 as one set and 8:14 other set! – Hyelim_kim1028 Oct 29 '19 at 00:45
