
I tried to search this online, but couldn't exactly figure out what my issue was. Here is my code:

n = 10000
x1 <- runif(n,0,100) 
x2 <- runif(n,0,100) 
y1 <- 10*sin(x1/10) + 10 + rnorm(n, sd = 1)
y2 <- x2 * cos(x2) - 2 * rnorm(n, sd = 2)
x <- c(x1, x2)
y <- c(y1, y2)
start1 = list(a = 10, b = 5)
start2 = list(a = 30, b = 5)
library(flexmix)
library(flexmixNL)

modelNL <- flexmix(y~x, k =2, 
                   model = FLXMRnlm(formula = y ~ a*x/(b+x), 
                                    family = "gaussian", 
                                    start = list(start1, start2))) 

plot(x, y, col = clusters(modelNL))

Before it gets to the plot, it gives me this error:

Error in matrix(1, nrow = sum(groups$groupfirst)) : data is too long

I checked Google for similar errors, but I don't understand what in my own code causes this error.

As you can already tell, I am very new to R, so please explain it in the most layman terms possible. Thank you in advance.

Semzem
  • Looks to me that the helper function requires start parameters. You should review the help pages and run through the examples therein. – IRTFM Jun 25 '21 at 16:08
  • I used it with start as well, let me update the question if that causes confusion. It still gives the same error. – Semzem Jun 25 '21 at 16:11
  • I had run it with start values that were different and a data argument and gotten an error message about singular gradient that made me think this example data was a poor fit to this method. Your starting value succeeded in letting it run. – IRTFM Jun 25 '21 at 17:00

1 Answer


Ironically (given an error message saying the data is "too long"), I think the proximate cause of that error is the missing data argument. If you supply the data as a data frame, you can still get an error (I did, with different starting values), but it's not the same one you are experiencing. When you plot the data, you get a rather bizarre set of values, at least from a statistical-distribution standpoint, and it's not clear why you are trying to model them with this formula. Nonetheless, with those starting values and a data frame passed to data, one sees results.

> modelNL <- flexmix(y~x, k =2,  data=data.frame(x=x,y=y),
+                    model = FLXMRnlm(formula = y ~ a*x/(b+x), 
+                                     family = "gaussian", 
+                                     start = list(start1, start2)))
> modelNL

Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~ 
    a * x/(b + x), family = "gaussian", start = list(start1, start2)))

Cluster sizes:
    1     2 
 6664 13336 

convergence after 20 iterations
> summary(modelNL)

Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~ 
    a * x/(b + x), family = "gaussian", start = list(start1, start2)))

       prior  size post>0 ratio
Comp.1 0.436  6664  20000 0.333
Comp.2 0.564 13336  16306 0.818

'log Lik.' -91417.03 (df=7)
AIC: 182848.1   BIC: 182903.4 

Most R regression functions first look for the names used in a formula among the columns of the data= argument. Apparently this function fails when it has to go out to the global environment to match the formula's variables.
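To illustrate that lookup order, here is a minimal sketch using base R's lm() rather than flexmix (the variable names and toy data are made up for the illustration). It shows that columns in data= take precedence over same-named objects in the workspace, and that lm() falls back to the workspace only when data= is omitted:

```r
# Hypothetical toy data, not the data from the question.
set.seed(1)
x <- 1:10
y <- 2 * x + rnorm(10)        # workspace copy: slope close to +2

# A data frame with the same column names but a different relationship.
dat <- data.frame(x = 1:10, y = -3 * (1:10))   # slope exactly -3

# No data= argument: lm() finds x and y in the formula's environment.
fit_workspace <- lm(y ~ x)

# With data=: the columns of 'dat' are used instead of the workspace objects.
fit_data <- lm(y ~ x, data = dat)

slope_workspace <- coef(fit_workspace)[["x"]]
slope_data <- coef(fit_data)[["x"]]
print(c(workspace = slope_workspace, data = slope_data))
```

Passing a data frame explicitly, as in the flexmix call above, avoids depending on that fallback at all.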

I tried a formula suggested by the plot of the data and get convergent results:

> modelNL <- flexmix(y~x, k =2,  data=data.frame(x=x,y=y),
+                    model = FLXMRnlm(formula = y ~ a*x*cos(x+b), 
+                                     family = "gaussian", 
+                                     start = list(start1, start2)))
> modelNL

Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~ 
    a * x * cos(x + b), family = "gaussian", start = list(start1, start2)))

Cluster sizes:
    1     2 
 9395 10605 

convergence after 17 iterations
> summary(modelNL)

Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~ 
    a * x * cos(x + b), family = "gaussian", start = list(start1, start2)))

       prior  size post>0 ratio
Comp.1 0.521  9395  18009 0.522
Comp.2 0.479 10605  13378 0.793

'log Lik.' -78659.85 (df=7)
AIC: 157333.7   BIC: 157389 

The reduction in AIC seems huge compared to the first formula.
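As a rough check on that comparison, the AIC and BIC values can be recomputed from the log-likelihoods and degrees of freedom printed above, using the usual definitions AIC = -2*logLik + 2*df and BIC = -2*logLik + log(n)*df. This sketch assumes n is the 20000 observations in the combined data; the variable names are just for illustration:

```r
# Values copied from the two summary() outputs above.
loglik <- c(first = -91417.03, second = -78659.85)
n_par  <- 7        # df reported for both fits
n_obs  <- 20000    # assumed: total number of observations (2 * n)

aic <- -2 * loglik + 2 * n_par
bic <- -2 * loglik + log(n_obs) * n_par

print(round(aic, 1))   # should agree with the AIC values printed above
print(round(bic, 1))   # should agree with the BIC values printed above

# Difference in AIC between the two formulas (positive favours the second).
aic_drop <- aic[["first"]] - aic[["second"]]
print(aic_drop)
```

Both fits use the same data and the same number of parameters, so the AIC difference here is simply twice the difference in log-likelihood.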

IRTFM
  • The formula is odd because I was just trying to get the code to work, so I played it safe by copying and pasting from the documentation. The original flexmix package works fine without specifying the data, so this behavior seems odd to me. Thank you very much! – Semzem Jun 25 '21 at 16:59
  • The `FLXMRnlm` function seems to be an S4 creature. They sometimes seem more picky about how they access data. – IRTFM Jun 25 '21 at 17:07