Questions tagged [cudf]

Use this tag for questions specifically related to the cuDF library or to cuDF DataFrame manipulations.

From PyPI: The RAPIDS cuDF library is a GPU DataFrame manipulation library based on Apache Arrow that accelerates loading, filtering, and manipulation of data for model training data preparation. The RAPIDS GPU DataFrame provides a pandas-like API that will be familiar to data scientists, so they can now build GPU-accelerated workflows more easily.

146 questions
1
vote
3 answers

cuDF for text / string

I am new to cuDF and may not have understood the purpose of the construct, so this is a very generic question. I have a dataset that has mostly string columns and I was hoping to use apply_rows to perform the processing of the strings,…
Mayukh
  • 117
  • 1
  • 4
1
vote
1 answer

"Expected a bytes object, got a 'int' object" error with cudf

I have a pandas DataFrame in which all the columns are of object type. I am trying to convert it to cuDF by calling cudf.from_pandas(df), but I get this error: ArrowTypeError: Expected a bytes object, got a 'int' object I don't understand why even that…
el abed houssem
  • 350
  • 1
  • 7
  • 16
1
vote
1 answer

`pip install cudf-cuda100` results in "ERROR: No matching distribution found for cudf-cuda100"

I run Windows 10 and have installed Anaconda. I am trying to install cudf but I repeatedly fail: (tf2) C:\WINDOWS\system32>pip install cudf-cuda100 ERROR: Could not find a version that satisfies the requirement cudf-cuda100 (from versions:…
user8270077
  • 4,621
  • 17
  • 75
  • 140
1
vote
1 answer

How to install a library on a Google Cloud Platform AI Platform notebook instance

I am currently a data science undergraduate student trying to use a Google Cloud Platform AI Platform notebook instance for a data science project. The following image shows what I am talking about. I have no problem running the instance and…
Rui
  • 49
  • 1
  • 10
1
vote
1 answer

How to ensure number of `partitions` is equally distributed across workers with dask and dask-cudf?

I am trying to run a basic ETL workflow on large files using dask-cudf across a large number of workers. Problem: Initially the scheduler assigns an equal number of partitions to be read by each worker, but during the pre-processing…
Vibhu Jawa
  • 88
  • 9
1
vote
1 answer

cuDF error processing a large number of parquet files

I have 2000 parquet files in a directory. Each parquet file is roughly 20MB in size. The compression used is SNAPPY. Each parquet file has rows that look like the following: +------------+-----------+-----------------+ | customerId | productId |…
chochim
  • 1,710
  • 5
  • 17
  • 30
1
vote
1 answer

Convert cuDF data frame column to 1 or 0 for “true”/“false” values

I am using RAPIDS (0.9 release) docker container. How can I do the following with RAPIDS cuDF? df['new_column'] = df['column_name'] > condition df[['new_column']] *= 1
rnyai
  • 25
  • 3
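A minimal sketch of one way to get 1/0 values from a comparison, not necessarily the accepted answer: cast the boolean result with `astype`. The sample values, the `threshold` variable, and the pandas fallback are assumptions added for illustration, so the snippet also runs on a machine without a GPU.

```python
# Sketch only: the sample data and `threshold` are hypothetical.
try:
    import cudf as xdf  # GPU path, needs a CUDA-capable device
except ImportError:
    import pandas as xdf  # CPU stand-in; the calls used here exist in both

df = xdf.DataFrame({"column_name": [1, 5, 10]})
threshold = 4  # hypothetical stand-in for `condition` in the question

# The comparison yields a boolean column; astype turns True/False into 1/0.
df["new_column"] = (df["column_name"] > threshold).astype("int8")
print(df)
```

Casting once with `astype` avoids the separate in-place multiplication step from the pandas idiom in the question.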
1
vote
1 answer

How to use cudf.Series.applymap()?

Can someone please provide a few examples of how to use the applymap method on a cuDF Series? Below is copied from the docs and here is a link to the documentation. applymap(self, udf, out_dtype=None) Apply a elemenwise function to transform the…
gumdropsteve
  • 70
  • 1
  • 14
1
vote
3 answers

How to apply an if condition in a GPU DataFrame (cuDF) to filter the DataFrame?

I'd like to filter a cuDF data frame based on a column value, and then create a new column based on a condition specified. Basically, how can I apply the following in cuDF? df.loc[df.column_name condition, 'new column name'] = 'value if condition is…
rnyai
  • 25
  • 3
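One possible approach, sketched under assumptions: filter with a boolean mask, and build the conditional column with `Series.where`, which keeps values where the mask is true and substitutes a fallback elsewhere. The column names, the cutoff of 50, and the pandas fallback are hypothetical and chosen only so the example is self-contained.

```python
# Sketch only: column names and the cutoff value are hypothetical.
try:
    import cudf as xdf  # GPU path, needs a CUDA-capable device
except ImportError:
    import pandas as xdf  # CPU stand-in; the calls used here exist in both

df = xdf.DataFrame({"score": [10, 60, 30, 90]})

# Boolean mask: True for rows that satisfy the condition.
mask = df["score"] > 50

# Filtering: keep only the rows where the mask is True.
filtered = df[mask]

# Conditional column: keep the score where the mask holds, otherwise use 0.
df["high_score"] = df["score"].where(mask, 0)
print(filtered)
print(df)
```

This sketch uses numeric values throughout; whether a string value can be assigned the same way may depend on the cuDF release in use.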
1
vote
2 answers

How to drop columns with NA using cudf?

Pandas: data = data.dropna(axis = 'columns') I am trying to do something similar using a cuDF DataFrame, but the APIs don't offer this functionality. My solution is to convert to a pandas df, do the above command, then re-convert to a cudf. Is…
Sterls
  • 723
  • 12
  • 22
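A sketch of a workaround that avoids the round trip through pandas, assuming the installed release has no column-wise `dropna`: count the nulls per column and keep only the columns with none. The sample data and the pandas fallback are assumptions for illustration; newer cuDF releases may accept an `axis` argument to `dropna` directly, so it is worth checking the documentation for the version in use.

```python
# Sketch only: sample data is hypothetical.
try:
    import cudf as xdf  # GPU path, needs a CUDA-capable device
except ImportError:
    import pandas as xdf  # CPU stand-in; the calls used here exist in both

df = xdf.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [1.0, None, 3.0],  # contains a missing value
        "c": ["x", "y", "z"],
    }
)

# Keep a column only if it has no missing entries.
keep = [name for name in df.columns if df[name].isnull().sum() == 0]
clean = df[keep]
print(list(clean.columns))
```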
0
votes
1 answer

Memory leak error in Dask when running a job on multiple GPUs

I would like to process some textual data with “sentence-transformers” (generating embeddings for textual data) on multiple GPUs (2 T4, 15 GB per GPU) and 16 vCPUs (with 60 GB RAM) on GCP from a Jupyter notebook. The data size is not large but the…
mtnt
  • 31
  • 5
0
votes
2 answers

Error accessing an attribute of a dask_cudf Series when it is called from a user-defined function

My question is related to my previous one, 'Error of using parallelizing data processing by "sentence_transformers" on 2 GPUs from Jupyter notebook'. I tried a new solution because the proposed one raised an error. I would like to use…
mtnt
  • 31
  • 5
0
votes
0 answers

Error adding a new column to a dask_cudf DataFrame from a 2-D numpy.ndarray

I would like to assign a new column to a dask_cudf DataFrame from a Jupyter notebook. The new column is a two-dimensional numpy.ndarray. My code: import cudf import dask_cudf import numpy as np from random import random df = cudf.DataFrame( { …
mtnt
  • 31
  • 5
0
votes
1 answer

Troubleshooting cudf.tokenize(): 'Length Mismatch' error with non-space delimiters

Cudf Tokenize Element Length Mismatch This is the expected result for tokenize(' ') on space character: 0 Due 0 to 0 being 0 on 0 FMLA …
0
votes
0 answers

Calculating dispersion_norm using cuDF

I've been working on building a GPU-accelerated package based on Scanpy using the CUDA toolkit (cudf=23.02, cuml=23.02, cugraph=23.02, cudatoolkit=11.8). I'm currently implementing the highly variable genes function but I'm running into some strange…