Questions tagged [dask-ml]

79 questions
2
votes
0 answers

Is sklearn learning_curve function supported by dask?

I'm computing learning curves for random forests using sklearn. I need to do this for a lot of different RFs, so I want to use a cluster and Dask to reduce the time of the RF fits. Currently I have implemented the following algorithm: from…
H4dr1en
1
vote
1 answer

Sagemaker Notebook instance error AttributeError: 'MaterializedLayer' object has no attribute 'pack_annotations'

I have a dask cluster active: `from dask.distributed import Client, progress`, `client = Client()`, `client`. When I try to encode my data I get the error: `AttributeError: 'MaterializedLayer' object has no attribute 'pack_annotations'`. I encoded the data…
1
vote
1 answer

Dask still slower than pandas on large dataset (3.2 GB)

I am currently trying Dask locally (parallel processing) for the first time on a large dataset (3.2 GB). I am comparing Dask's speed with pandas on simple computations. Using Dask seems to result in slower execution time for every task besides reading…
1
vote
1 answer

Apply dask QuantileTransformer to a calculated field in the same dataframe

I'm trying to apply a dask-ml QuantileTransformer transformation to a percentage field, and create a new field percentage_qt in the same dataframe. But I get the error Array assignment only supports 1-D arrays. How to make this work? import pandas…
ps0604
1
vote
1 answer

Compute list of dask delayed object

I have gone through all similar questions and the solutions provided, but I am not getting the desired output. I have a list of dask delayed objects. for y in ys: projection = Projection(data, X, y) fi = projection.decode() var.append(fi) where Projection class…
ipj
1
vote
1 answer

Issues with dask compute() on labels predicted by KMeans

I am trying to use sklearn MiniBatchKMeans to cluster a fairly large dataset (150k samples and 150k features). I thought I could make things much faster using Incremental from dask_ml to fit my data in chunks. Here is a snippet of my code on a dummy…
coolbeans
1
vote
1 answer

Dask with TensorFlow is failing with `CRITICAL - Failed to Serialize` error

I have installed dask[complete], tensorflow, scikeras, deplayed, dask-ml. I am running the same example (link) locally. There are no stack traces in the worker logs either. Please help me with inputs to debug further. The code is failing with…
1
vote
0 answers

Dask ML - GaussianNB returns length mismatch error

I am trying to predict my test set using a GaussianNB classifier with Dask. This is what my setup looks like: X_train = pd.DataFrame.sparse.from_spmatrix(vectorizer.fit_transform(training['X_trn'])) y_train =…
mendy
1
vote
1 answer

KilledWorker Exception

I am using coiled to spin up a cluster and using dask to do some manipulation on a csv read from an S3 bucket. However, at some point my workers are getting killed. When I inspected the logs, the following task is killing them. distributed.scheduler…
QuantNoob
1
vote
0 answers

Dask-ml LabelEncoder.fit_transform() threw AttributeError: 'bool' object has no attribute 'astype'

So I tried to apply LabelEncoder() function to columns that have object dtype on my Dask dataframe: le = dm.LabelEncoder() #dm is dask-ml module for column in df.columns: if df[column].dtype == type(object): df[column]…
1
vote
1 answer

Impute mean of single column in dask-ml

Calculating and imputing the mean using dask-ml works fine when changing all the columns that are np.nan: imputer = impute.SimpleImputer(strategy='mean') data = [[100, 2], [np.nan, np.nan], [70, 7]] df = pd.DataFrame(data, columns = ['Weight',…
ps0604
1
vote
1 answer

Installing dask-ml throws "Solving Environment" error

I'm getting the following errors when trying to install dask-ml with conda. Any ideas how to fix this? (env3) C:\>conda install -c conda-forge dask-ml Collecting package metadata (current_repodata.json): done Solving environment: failed with initial…
ps0604
1
vote
1 answer

Problems implementing Dask MinMaxScaler

I am having problems normalizing a dask.dataframe.core.DataFrame using Dask.dask_ml.preprocessing.MinMaxScaler. I am able to use sklearn.preprocessing.MinMaxScaler; however, I wish to use dask to scale up. Minimal, Reproducible Example: # Get data ddf…
AmyChodorowski
1
vote
0 answers

How to reduce the `dask_ml.xgboost` worker memory consumption?

I've been testing the dask_ml.xgboost regressor on a synthetic 10GB dataset. When training, the memory usage of the workers exceeds the amount available on my local laptop. I am aware that I can try running on an online dask cluster with larger…
Joseph
1
vote
0 answers

How much memory is needed for an XGBoost model?

Background: Training set with 100M rows and about 50 columns, and I have cast the dtypes to the minimum types. Still, the dataframe is about 8-10 GB when loaded. I run training on AWS EC2 instances (one is 36 CPU + 72 RAM, another is 16 CPU +…
Argos.LEE