I have an easy task, but I'm not able to solve my problem.

I have a huge data frame and want to run a KNN on it, but I can't because I get the following error:

Error: factor predictors must have at most 32 levels

So far so good. My idea was to aggregate the column so that the factor has fewer levels.

str(only_savings_medium$MaterialGroupCode)

Factor w/ 40 levels "1A","1B","1C",..: 11 11 11 15 15 15 15 15 15 15 ...

There are 40 levels of "Codes" in the form "1A", "1B", ..., "2B", "2D", ..., "3A", ..., "3D", "4B", "4C", ..., "5A", ..., "5Z". Basically I want to check whether the code starts with 1, 2, 3, 4 or 5 and write that digit to a new column. All codes of the form 1(any letter) would be assigned to 1, 2(any letter) to 2, and so on. In the end there should be a new column that is a factor with only 5 levels, each level covering all the original codes that start with that digit. I'm not sure how best to explain this and hope you understand my problem.

Edit: I'll try to expand my explanation. Here is a part of the data frame:

[Screenshot: part of the data frame]

As you can see, there is a column with different Material Group Codes. It has 40 levels. What I need is a new column in this data frame that has 5 levels (1, 2, 3, 4 or 5). Taking the rows in my screenshot as an example, the new column would contain the values 2,2,2,2,2,1,1,1,1,1,1,3,3,3,3,3, ..., 3. Basically every code from 1A to 1Z is assigned to level 1 of the new column, every code from 2A to 2Z is assigned to level 2, and so on.

warunapww

3 Answers

Like so?

MGC <- as.factor(c("1A", "2Y", "1e", "5e"))

firstplace <- function(x) strsplit(as.character(x), "")[[1]][1]
sapply(MGC, firstplace)

This extracts the first character (in your case: a digit) of each element of a vector, which could just as well be a column of a data.frame. Right now the result is of type character. Wrap it in as.factor() if you need a factor.
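To apply the same idea directly to a data frame column, substr() can be used without sapply(), since it is vectorised. This is only a sketch under assumptions: df is a toy stand-in for the real data, and MaterialGroup is an invented name for the new column.

```r
# Toy data standing in for the real data frame
# (assumption: every code looks like "1A", "2B", ... with the group digit first)
df <- data.frame(
  MaterialGroupCode = factor(c("2B", "2D", "1A", "1C", "3A", "5Z", "4B"))
)

# Take the first character of each code; as.character() makes it explicit
# that we work on the labels of the factor
first_char <- substr(as.character(df$MaterialGroupCode), 1, 1)

# New column (hypothetical name) holding only the leading digit, as a factor
df$MaterialGroup <- factor(first_char)

levels(df$MaterialGroup)
# Cross-tabulate old codes against new groups to check the assignment
table(df$MaterialGroupCode, df$MaterialGroup)
```

The original column stays untouched, so the cross-table can be used to verify that each code landed in the intended group.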

Bernhard
  • I don't understand how to apply this to my problem. Would I create 5 vectors, MGC1 <- as.factor(c("1A", "1B", "1C", "1D", "1E")), MGC2, ..., MGC5, analogously? And then? I would still have to create a new column and check which row belongs to MGC1 and which, for example, to MGC5. – Pixelements Aug 18 '16 at 13:11

Basically you want to reduce the number of levels. Here are some guidelines (since you don't provide a reproducible example):

  1. Create a correspondence data.frame that maps the original factor with 40 levels to a new factor with fewer levels.
  2. Use merge() to merge your data with this correspondence data.frame.

Here is an example:

## the long factor; in your case the one with 40 levels
origin_factors <- LETTERS[1:15]
## the target one
dest_factors <- c("l1", "l2", "l3")
## the correspondence data.frame: x is the original level, nx the new, coarser one
corrs <- data.frame(
  x  = origin_factors,
  nx = rep(dest_factors, each = 5)
)
## create a reproducible example
ex <- sample(origin_factors, 100, replace = TRUE)
dat <- data.frame(x = ex)
## merge on the shared column x to reduce the number of levels
## (note: merge() sorts the result by x, so the original row order is not kept)
merge(dat, corrs)
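By default merge() sorts its result by the key column, so rows come back in a different order. If the original row order matters, a lookup with match() is one alternative. The sketch below rebuilds the same kind of toy data so that it runs on its own; it is an illustration, not the only way to do this.

```r
# Correspondence table: x is the original level, nx the coarser one
corrs <- data.frame(
  x  = LETTERS[1:15],
  nx = rep(c("l1", "l2", "l3"), each = 5)
)

set.seed(1)  # only to make the toy data repeatable
dat <- data.frame(x = sample(LETTERS[1:15], 100, replace = TRUE))

# match() gives, for each row of dat, the position of its value in corrs$x;
# indexing corrs$nx with those positions keeps dat in its original row order
dat$nx <- factor(corrs$nx[match(dat$x, corrs$x)])

nlevels(dat$nx)
head(dat)
```

Any value of dat$x that is missing from corrs$x would come out as NA in dat$nx, which makes gaps in the correspondence table easy to spot with anyNA(dat$nx).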
agstudy

OK, I was finally able to solve my problem. Since I'm a beginner, the code you provided was too complicated for me. Here is what I did:

I copied the whole "MaterialGroupCode" column and bound it to the same data frame under a different name. So basically I had the same data frame plus a copy of the "MaterialGroupCode" column named "MDC".

my_df$MDC <- substring(my_df$MDC,1 ,1)

So I took a substring, since I only had to drop the letter. The result was a character vector, so the only thing left to do was:

my_df$MDC <- as.factor(my_df$MDC)

Now I have a new column MDC, which is a factor with 5 levels: 1A ... 1Z map to 1, 2B ... 2Z map to 2, and so on.
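For what it's worth, the same result can also be reached by renaming the levels of the copied factor instead of converting to character and back. This is a sketch on toy data (my_df here is a small stand-in, not my real data frame), and it assumes every code starts with the group digit.

```r
# Toy stand-in for my_df
my_df <- data.frame(
  MaterialGroupCode = factor(c("1A", "1B", "2B", "2D", "3A", "3D", "4B", "5A", "5Z"))
)

# Copy the factor, then rename its levels to their first character;
# levels that end up with the same name are merged into one
my_df$MDC <- my_df$MaterialGroupCode
levels(my_df$MDC) <- substring(levels(my_df$MDC), 1, 1)

nlevels(my_df$MDC)  # number of levels after collapsing
str(my_df)
```

This only touches the 40 level labels rather than every row, which may matter on a huge data frame.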