I am working on a data set that has 21 attributes: 16 are categorical, 3 are ordinal factors and 2 are date/time (the target variable). The data set has 14512 rows.

What I want to achieve: This data set is about daily office incidents closed by different teams, and we are trying to predict the time it will take to close an incident, given certain predictor variables.

I am using RStudio for the analysis.

Work done: I decided to use kNN, so I converted all predictors to binary dummy variables and the target variable to a categorical variable with classes A, B and C.

Issue: When I apply the knn function, for example:

RPS_test_pred <- knn(train = RPS_train, test = RPS_test, cl = RPS_train_labels, k = 1121)

with k set to 1121 (we have 14513 rows in the data set, and the training and test data are split in a 70:30 ratio),

RStudio crashes and closes, stating that a fatal error occurred.

Please suggest another way to process this data, or another modelling technique better suited to this type of data, with an example.

Abhinav Sharma
  • Pro-tips for posting: (a) use your Shift key at the start of sentences, and also when referring to yourself ("I"); (b) use paragraphs to split up sentences for increased readability; (c) don't beg volunteers for priority attention; (d) format code using the editor tools provided; (e) read [ask] and [mcve] in order to see what sort of questions work well here. – halfer Nov 27 '17 at 20:08
  • Please read [Under what circumstances may I add “urgent” or other similar phrases to my question, in order to obtain faster answers?](//meta.stackoverflow.com/q/326569) - the summary is that this is not an ideal way to address volunteers, and is probably counterproductive to obtaining answers. Please refrain from adding this to your questions. – halfer Nov 27 '17 at 20:08
  • Hi, can this please be processed now? I have made some modifications and made my requirement clearer. – Abhinav Sharma Nov 29 '17 at 13:49
  • I've cast a reopen vote, so it should be seen in the Reopen Queue. If this is not reopened in a few hours, you can ask a question [in this chatroom](https://chat.stackoverflow.com/rooms/41570/so-close-vote-reviewers). – halfer Nov 29 '17 at 13:57

1 Answer

In the past I have worked with datasets containing many ordinal and categorical variables and have found success in doing some transformations to make them numerical. Here are some examples from work with housing price data.

Ordinal variables: I would start by converting your ordinal variables into numerical values based on their relative order:

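# Create the numeric column first, then map each label to its rank (higher = better)
train$Exter.Quality <- NA_integer_
train$Exter.Quality[train$ExterQual == "Excellent"] <- 4L
train$Exter.Quality[train$ExterQual == "Good"] <- 3L
train$Exter.Quality[train$ExterQual == "Nominal"] <- 2L
train$Exter.Quality[train$ExterQual == "Fair"] <- 1L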
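
The same mapping can also be written with a factor whose levels are listed in rank order. This is only a sketch: the toy data frame and the column name `Exter.Quality.Num` are made up for illustration, and the level names are the ones used above.

```r
# Toy data standing in for the housing data (illustrative values only)
train <- data.frame(
  ExterQual = c("Good", "Fair", "Excellent", "Nominal", "Good"),
  stringsAsFactors = FALSE
)

# Levels listed from worst to best, so the integer codes follow the ranking
quality_levels <- c("Fair", "Nominal", "Good", "Excellent")

# factor() assigns codes 1..4 in the order of `levels`; as.integer() extracts them.
# Labels that are not in quality_levels become NA instead of raising an error.
train$Exter.Quality.Num <- as.integer(
  factor(train$ExterQual, levels = quality_levels, ordered = TRUE)
)

print(train)
```

Keeping the level order in one vector means adding or reordering a level only needs a change in one place.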

Categorical variables: What has worked for me is to use group rankings based on the mean of the response variable you are looking at (sale price in my case):

library(dplyr)

# Mean sale price per neighborhood, stored under an explicit column name
nbhdprice <- summarize(group_by(train, Neighborhood),
                       mean_price = mean(SalePrice, na.rm = TRUE))

# Split the neighborhoods into low / medium / high price bands
nbhdprice_lo  <- filter(nbhdprice, mean_price < 140000)
nbhdprice_med <- filter(nbhdprice, mean_price >= 140000 & mean_price < 200000)
nbhdprice_hi  <- filter(nbhdprice, mean_price >= 200000)

train$nbhd_price_level[train$Neighborhood %in% nbhdprice_lo$Neighborhood] <- 1
train$nbhd_price_level[train$Neighborhood %in% nbhdprice_med$Neighborhood] <- 2
train$nbhd_price_level[train$Neighborhood %in% nbhdprice_hi$Neighborhood] <- 3
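
The grouping and binning above can be collapsed into a single lookup. The following is a rough base-R sketch under stated assumptions: the toy data are invented, and the thresholds are simply the ones from my housing example, so they would need to be chosen again for a different response variable.

```r
# Toy data (illustrative values only)
train <- data.frame(
  Neighborhood = c("A", "A", "B", "B", "C", "C"),
  SalePrice    = c(100000, 120000, 150000, 170000, 250000, 300000),
  stringsAsFactors = FALSE
)

# Mean of the response for each level of the categorical variable
nbhd_mean <- tapply(train$SalePrice, train$Neighborhood, mean, na.rm = TRUE)

# Bin the group means: [-Inf, 140000) -> 1, [140000, 200000) -> 2, [200000, Inf) -> 3
nbhd_level <- cut(nbhd_mean,
                  breaks = c(-Inf, 140000, 200000, Inf),
                  right  = FALSE,
                  labels = FALSE)
names(nbhd_level) <- names(nbhd_mean)

# Look up each row's band by its neighborhood name
train$nbhd_price_level <- unname(nbhd_level[train$Neighborhood])

print(train)
```

Because each level is replaced by a single number, the result is one numeric column no matter how many levels the variable has, instead of one dummy column per level.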

More examples can be found in the code space here: https://www.kaggle.com/skirmer/fun-with-real-estate-data/code

Dannellyz
  • My categorical data has a lot of levels. One attribute has 1000 levels, one has 120 levels, etc. – Abhinav Sharma Nov 28 '17 at 04:50
  • What would be the best approach for a data set with 14k+ rows and 15+ attributes, comprising ordinal and categorical predictors and a date-time target variable? – Abhinav Sharma Dec 01 '17 at 10:07