Data is too long Error in R FlexmixNL package

Question

I tried to search this online, but couldn't exactly figure out what my issue was. Here is my code:

n = 10000
x1 <- runif(n,0,100) 
x2 <- runif(n,0,100) 
y1 <- 10*sin(x1/10) + 10 + rnorm(n, sd = 1)
y2 <- x2 * cos(x2) - 2 * rnorm(n, sd = 2)
x <- c(x1, x2)
y <- c(y1, y2)
start1 = list(a = 10, b = 5)
start2 = list(a = 30, b = 5)
library(flexmix)
library(flexmixNL)

modelNL <- flexmix(y~x, k =2, 
                   model = FLXMRnlm(formula = y ~ a*x/(b+x), 
                                    family = "gaussian", 
                                    start = list(start1, start2))) 

plot(x, y, col = clusters(modelNL))

and before the plot, it gives me this error:

Error in matrix(1, nrow = sum(groups$groupfirst)) : data is too long

I checked google for similar errors, but I don't quite understand what is wrong with my own code that results in this error.

As you can already tell, I am very new to R, so please explain it in the most layman terms possible. Thank you in advance.

Looks to me that the helper function requires start parameters. You should review the help pages and run through the examples therein. — IRTFM, Jun 25 '21 at 16:08
I used it with start as well, let me update the question if that causes confusion. It still gives the same error. — Semzem, Jun 25 '21 at 16:11
I had run it with start values that were different and a data argument and gotten an error message about singular gradient that made me think this example data was a poor fit to this method. Your starting value succeeded in letting it run. — IRTFM, Jun 25 '21 at 17:00

IRTFM · Accepted Answer · 2021-06-25T17:06:44.367

Ironically (given an error message saying the data is "too long"), I think the proximate cause of that error is the missing data argument. If you supply the data as a data frame, you still get an error, but it's not the same one you are experiencing. When you plot the data, you see a rather bizarre set of values, at least from a statistical distribution standpoint, and it's not clear why you are trying to model them with this formula. Nonetheless, with those starting values and a data frame passed to the data argument, one sees results.

> modelNL <- flexmix(y~x, k =2,  data=data.frame(x=x,y=y),
+                    model = FLXMRnlm(formula = y ~ a*x/(b+x), 
+                                     family = "gaussian", 
+                                     start = list(start1, start2)))
> modelNL

Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~ 
    a * x/(b + x), family = "gaussian", start = list(start1, start2)))

Cluster sizes:
    1     2 
 6664 13336 

convergence after 20 iterations
> summary(modelNL)

Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~ 
    a * x/(b + x), family = "gaussian", start = list(start1, start2)))

       prior  size post>0 ratio
Comp.1 0.436  6664  20000 0.333
Comp.2 0.564 13336  16306 0.818

'log Lik.' -91417.03 (df=7)
AIC: 182848.1   BIC: 182903.4

Most R regression functions first look for the names used in a formula among the columns of the data= argument. Apparently this function fails when it needs to go out to the global environment to match formula tokens.
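To illustrate that lookup order, here is a minimal sketch using base R's lm rather than flexmixNL. The toy variables and the data frame df are made up for illustration and are not from the question. It only shows that columns in data= take precedence over objects of the same name in the workspace; it does not reproduce the flexmixNL error itself.

```r
# Hypothetical toy data; none of these names come from the question.
x <- 1:10
noise <- rep(c(-0.1, 0.1), times = 5)
y <- 2 * x + noise                      # global y, slope about 2

# A data frame whose y deliberately differs from the global y (slope about 5).
df <- data.frame(x = x, y = 5 * x + noise)

# No data argument: x and y are resolved from the formula's environment
# (here the environment where the formula was written).
slope_global <- unname(coef(lm(y ~ x))["x"])

# With data = df: the columns of df take precedence over the global x and y.
slope_data <- unname(coef(lm(y ~ x, data = df))["x"])

print(c(global = slope_global, from_data = slope_data))
```

Building the data frame once and passing it through data= keeps the fit independent of whatever else happens to be in the workspace.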

I tried a formula suggested by the plot of the data and get convergent results:

> modelNL <- flexmix(y~x, k =2,  data=data.frame(x=x,y=y),
+                    model = FLXMRnlm(formula = y ~ a*x*cos(x+b), 
+                                     family = "gaussian", 
+                                     start = list(start1, start2)))
> modelNL

Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~ 
    a * x * cos(x + b), family = "gaussian", start = list(start1, start2)))

Cluster sizes:
    1     2 
 9395 10605 

convergence after 17 iterations
> summary(modelNL)

Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~ 
    a * x * cos(x + b), family = "gaussian", start = list(start1, start2)))

       prior  size post>0 ratio
Comp.1 0.521  9395  18009 0.522
Comp.2 0.479 10605  13378 0.793

'log Lik.' -78659.85 (df=7)
AIC: 157333.7   BIC: 157389

The reduction in AIC seems huge compared to the first formula.

The reason for such an odd formula is that I was just trying to get the code to work, so I played it safe by copying and pasting from the documentation. The original flexmix package works fine without specifying the data, so this is odd behavior to me. Thank you very much! — Semzem, Jun 25 '21 at 16:59
The `FLXMRnlm` function seems to be an S4 creature. They sometimes seem more picky about how they access data. — IRTFM, Jun 25 '21 at 17:07
