Questions tagged [cudf]

Use this tag for questions specifically related to the cuDF library or to cuDF DataFrame manipulations.

From PyPI: The RAPIDS cuDF library is a GPU DataFrame manipulation library based on Apache Arrow that accelerates loading, filtering, and manipulation of data for model training data preparation. The RAPIDS GPU DataFrame provides a pandas-like API that will be familiar to data scientists, so they can now build GPU-accelerated workflows more easily.

146 questions
0 votes, 1 answer

how to use tqdm progress bar in dask_cudf and cudf

I can use a tqdm progress bar in pandas, for example: tqdm.pandas(); df = df['var'].progress_apply(lambda x: something(x)). Can I do the same thing in cudf or dask_cudf? If not, how can I use a tqdm progress bar with them?
user14954599
0 votes, 1 answer

searching index with cudf dataframe doesn't work with numpy

I just loaded the CSV file with cudf (rapidsai) to reduce the time it takes. An issue comes up when I try to search the index with a condition where df['X'] = A. Here is my code example: import cudf, io, requests df = cudf.read_csv('fileA.csv') # X…
Brian Lee
0 votes, 1 answer

Gaps in nvvp timeline when running rapids with spark

I'm running an SQL query against a CSV file generated with tpch-dbgen. I am running it with one thread/task for simplicity, and I see gaps in the timeline, as shown in the attached image. Is it disk operations? Can this overhead be somehow relaxed or…
0 votes, 1 answer

Out of memory error with Dask and cudf loop

I am using Dask and Rapidsai to run an xgboost model on a large (6.9GB) dataset. The hardware is 4x 2080 TIs with 11 GB of memory each. The raw dataset has a few dozen target columns that have been one-hot encoded, so I am trying to run a loop that…
datahappy
0 votes, 1 answer

AttributeError: 'cupy.core.core.ndarray' object has no attribute 'iloc'

I am trying to split data into training and validation sets. For this I am using train_test_split from the cuml.preprocessing.model_selection module, but got an…
Sudhanshu
0 votes, 1 answer

RAPIDS: How to use one dataframe in a UDF called with apply_rows of another dataframe?

For each row in dataframe A, I need to query DF B. I need to do something like this: filter B rows by values in column b1 (B.b1) which are in a range defined by columns A.a1 and A.a2 and assign combined values to column A.a3. In pandas that would be…
Peter
0 votes, 1 answer

cuDF: an alternative of Pandas Groupby + Shift?

I have a DataFrame on which I want to use Groupby + Shift. I can do this in pandas, but I cannot do it in cuDF because it is not implemented yet: see Issue #7183. The feature was requested long ago, so it seems like they will not implement this in the…
Minh-Long Luu
0 votes, 1 answer

hdbscan error when inside rapids container

I am using RAPIDS UMAP in conjunction with HDBSCAN inside a rapidsai Docker container: rapidsai/rapidsai-core:0.18-cuda11.0-runtime-ubuntu18.04-py3.7 import cudf import cupy from cuml.manifold import UMAP import hdbscan from sklearn.datasets…
Igna
0 votes, 2 answers

TypeError: data must be list or dict-like in CUDF

I am using cuDF to speed up my Python process. First, I imported cuDF, removed the multiprocessing code, and initialized the variables with cuDF. After switching to cuDF, it gives a dictionary error. How can I remove these loops to make effective…
Khawar Islam
0 votes, 1 answer

from numba import cuda, numpy_support and ImportError: cannot import name 'numpy_support' from 'numba'

I am switching from pandas to cudf to speed up aggregation and reduce processing time. I found one library that works on the GPU with pandas. "CUDF LINK" https://github.com/rapidsai/cudf When I entered the below to install it in my project it…
Khawar Islam
0 votes, 1 answer

cuDF low GPU utilization

I have a task that involves running many queries on a dataframe. I compared the performance of running these queries on a Xeon CPU (Pandas) vs. RTX 2080 (CUDF). For a dataframe of 100k rows, GPU is faster but not by much. Looking at nvidia-smi…
Yuriy S
0 votes, 0 answers

RAPIDS out of memory when merging cuda dataframe and distance calculations

I'm trying out RAPIDS cudf and cuspatial, and I wonder what better ways there are to cross join two dataframes when the result has 27 billion rows. I've got two datasets - one from New York City taxi trip data (14.7 million rows) containing longitude/latitude of pick…
byc
0 votes, 1 answer

Pandas DF - Cut time b/w 2 timestamps into hour bins

Say I have data of this format in a df:
id     sta                  end                  dur
40433  2020-01-08 05:06:01  2020-01-08 05:08:14  133
40433  2020-09-22 12:01:26  2020-09-22 12:31:34  1808
40433  2020-09-22 12:05:00  2020-09-22…
oompaloompa
0 votes, 1 answer

Python modified groupby ngroup in cuDF with list comprehension

I am trying to write a function that does something similar to pandas's groupby().ngroup() function. The difference is that I want each subgroup count to restart at 0. So given the following data:
| EVENT_1 | EVENT_2 |
| ------- | ------- |
| …
Kyle
0 votes, 1 answer

Memory allocation error on worker 0: std::bad_alloc: CUDA error

DESCRIPTION: I am just trying to give a training and a test set to the model, but I get the following errors. 1st data package - train_data = xgboost.DMatrix(data=X_train, label=y_train) Up until I run just this and do training and anything with,…
sogu