I'm working on clustering for mixed data. To test my algorithm, I need to run some simulations on generated data. I know how to generate a numerical attribute with rnorm, and for a categorical attribute I could use sample on letters. The problem is creating a relationship between the columns (numerical and categorical attributes). I cannot just generate random values so that the attributes have no relationship to each other; the relationship must make sense. For example, if I just generate random values for, say, a product variable and a price variable, I could get:

product  price
pen      $500

That doesn't make sense, right? The relationship would be messed up. Any suggestions?

I wrote this code, but it doesn't seem good enough:

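n   <- 500    # rows per numeric cluster (3 clusters, 3 * n rows in total)
prb <- 0.90   # probability of the dominant level in x1

# Cluster centres for the numeric attributes, each drawn once at random
c1 <- sample(2:5, 1)
c2 <- sample(7:10, 1)
c3 <- sample(12:15, 1)

# x1: two blocks of 1.5 * n rows, "A" dominant in the first, "B" in the second
x1 <- sample(c("A","B"), 1.5*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("A","B"), 1.5*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)

# x2: three blocks of n rows, each with a different dominant level
x2 <- sample(c("C","D","E"), n, replace = TRUE, prob = c(0.90, 0.05, 0.05))
x2 <- c(x2, sample(c("C","D","E"), n, replace = TRUE, prob = c(0.05, 0.9, 0.05)))
x2 <- c(x2, sample(c("C","D","E"), n, replace = TRUE, prob = c(0.05, 0.05, 0.9)))
x2 <- as.factor(x2)

# x3: two blocks of 1.5 * n rows with only a weak preference for "X" or "Y"
x3 <- sample(c("X","Y"), 1.5*n, replace = TRUE, prob = c(0.6, 0.4))
x3 <- c(x3, sample(c("X","Y"), 1.5*n, replace = TRUE, prob = c(0.4, 0.6)))
x3 <- as.factor(x3)

# x4, x5: three blocks of n rows, normal with sd = 1 around each cluster centre
x4 <- c(rnorm(n, mean = c1), rnorm(n, mean = c2), rnorm(n, mean = c3))
x5 <- c(rnorm(n, mean = c1+20), rnorm(n, mean = c2+30), rnorm(n, mean = c3+40))

# All five columns have 3 * n = 1500 rows
x <- data.frame(x1,x2,x3,x4,x5)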

1 Answer

Your question mentions two variables, product and price. Your code above creates a data.frame with 5 variables. I am not 100% sure what you are after, but I think that you need something like this.

For each product, you can generate a mean and standard deviation. You can pick products at random and then use the appropriate mean and standard deviation to generate a value from the distribution for that product.
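As a rough sketch of that idea when no real data is at hand, the per-product means and standard deviations can simply be written down by hand. The product names and numbers in `params` below are made up for illustration and are not taken from the question.

```r
set.seed(1)  # make the simulation reproducible

# Hypothetical per-product price parameters (invented for this sketch)
params <- data.frame(
  product = c("pen", "notebook", "laptop"),
  mean    = c(2, 5, 900),
  sd      = c(0.5, 1, 150)
)

n_new <- 1000

# Pick a product for every row, then draw its price from that product's distribution
idx <- sample(nrow(params), n_new, replace = TRUE)
sim <- data.frame(
  product = params$product[idx],
  price   = rnorm(n_new, mean = params$mean[idx], sd = params$sd[idx])
)

# Average simulated price per product
aggregate(price ~ product, data = sim, FUN = mean)
```

Because rnorm is vectorised over mean and sd, each row gets the parameters of its own product. A normal distribution can produce negative prices, so a different distribution may suit real price data better.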

You do not provide any data, so I will illustrate using the iris data. Think Species = product and Petal.Length = price.

## First collect statistics from the original data
MEANS = aggregate(iris$Petal.Length, list(iris$Species), mean)
SD = aggregate(iris$Petal.Length, list(iris$Species), sd)
NumSpecies = length(levels(iris$Species))

Now we can randomly generate a Species and generate a Petal.Length from the distribution for that Species.

NumNew = 10
RS = sample(NumSpecies, NumNew, replace=TRUE)
NewSpecies     = levels(iris$Species)[RS]
NewPetalLength = rnorm(NumNew, MEANS$x[RS], SD$x[RS])
NewData = data.frame(NewSpecies, NewPetalLength)
NewData
   NewSpecies NewPetalLength
1   virginica       5.826106
2  versicolor       3.711405
3   virginica       5.136330
4  versicolor       3.979712
5  versicolor       3.379810
6  versicolor       4.017866
7  versicolor       4.141408
8   virginica       5.817107
9      setosa       1.563924
10  virginica       5.456761
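When there is no original data to collect statistics from and there are more than two columns, one possible approach is to draw a hidden cluster label first and then generate every column conditional on that label, so all columns are related through it. This is only a sketch under assumed parameters: the number of clusters, the probability table `p_x1` and the mean vectors `mu_x2` and `mu_x3` are hypothetical values chosen for illustration.

```r
set.seed(42)  # make the simulation reproducible

n <- 300
k <- 3

# Hidden cluster label for every row; all columns below depend on it
cluster <- sample(k, n, replace = TRUE)

# Categorical column: row g of p_x1 holds the level probabilities for cluster g
p_x1 <- rbind(c(0.90, 0.05, 0.05),
              c(0.05, 0.90, 0.05),
              c(0.05, 0.05, 0.90))
x1 <- vapply(cluster,
             function(g) sample(c("C", "D", "E"), 1, prob = p_x1[g, ]),
             character(1))

# Numeric columns: the mean depends on the cluster, sd is left at 1
mu_x2 <- c(3, 8, 13)
mu_x3 <- c(23, 38, 53)
x2 <- rnorm(n, mean = mu_x2[cluster])
x3 <- rnorm(n, mean = mu_x3[cluster])

sim <- data.frame(cluster = factor(cluster), x1 = factor(x1), x2, x3)

# Check how the categorical levels line up with the clusters
table(sim$cluster, sim$x1)
```

Keeping the `cluster` column also gives a ground truth to compare a clustering result against. Drop it before running the algorithm.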
G5W
  • I mentioned two variables only as an example to show the effect of random generation. You have explained it well: use the distribution of the old data to make new data. But in my case there may be more than two variables, and I don't have any old data, so I can't estimate a distribution to build new data from. So in my case the data has to be fictitious, generated with R functions. The problem is how to keep the columns correlated with one another so that the data still makes sense even though it is fictitious (in the case of more than two variables). – Jack shephard May 04 '18 at 17:36