I'm trying to deploy a GBDT model with SynapseML LightGBM [0.9.5] on Google Dataproc [2.0-debian10]. I use Spark's StringIndexer to index the string categorical columns and then assemble all columns into a single vector. With the categorical feature settings enabled, the model error doesn't converge and there are lots of warnings:
DEFAULT [LightGBM] [Warning] Met negative value in categorical features, will convert it to NaN
This is strange, because I checked that all categorical feature values are in [0.0, 72234.0], which is well within the Int32 range (see https://github.com/microsoft/LightGBM/issues/1359).
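
For reference, here is a minimal sketch of the kind of range check described above, written in plain Python so it can be run on a sample of values collected from one categorical column. The function name `check_categorical_values` and the sample values are hypothetical and not part of my actual job; the checks only cover what I would expect a valid category code to be (non-negative, integer-valued, not NaN, no larger than the Int32 maximum).

```python
import math

INT32_MAX = 2**31 - 1  # largest value a 32-bit signed integer can hold


def check_categorical_values(values):
    """Count values that are not valid non-negative Int32 category codes."""
    report = {"nan": 0, "negative": 0, "too_large": 0, "non_integer": 0}
    for v in values:
        if v is None or math.isnan(v):
            report["nan"] += 1          # missing or NaN
        elif v < 0:
            report["negative"] += 1     # the case the warning complains about
        elif v > INT32_MAX:
            report["too_large"] += 1    # also catches +inf
        elif v != int(v):
            report["non_integer"] += 1  # e.g. 2.5 is not a category code
    return report


# Hypothetical sample: three valid codes and three problematic values.
sample = [0.0, 3.0, 72234.0, -1.0, float("nan"), 2.5]
print(check_categorical_values(sample))
```

If every counter is zero on the driver side but the warning still appears, that would suggest the values change somewhere between the DataFrame and the native library rather than in the input data itself.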
Then I removed the categorical metadata and treated all features as numeric. The warning is gone, but the metric still looks weird.
The model works in a local Spark environment, so I suspect something is going wrong with the data passed from the JVM to the native C code on Dataproc. Can anybody help?