Change Pandas code into CUDF for GPU utilization

Question

I am making pairs of images by mixing positive and negative pairs. This process is computationally expensive and takes a lot of RAM and CPU time. To speed it up, I want to use the GPU and convert my pandas code to cuDF. The cuDF documentation is very limited, and I want to convert the code below to cuDF.

import itertools
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import cpu_count
from pathlib import Path

import more_itertools
import pandas as pd
from tqdm import tqdm

# `identities` (defined elsewhere) maps each identity to a list of its image files.

# Positive pairs: every 2-combination of files within one identity.
positive_frames = []
for value in tqdm(identities.values(), desc="Positives"):
    positive_frames.append(pd.DataFrame(itertools.combinations(value, 2), columns=["file_x", "file_y"]))
positives = pd.concat(positive_frames, ignore_index=True)
positives["decision"] = "Yes"
print(positives)


def compute_cross_samples(x):
    # x is a pair of file lists; return every cross-identity (file_x, file_y) pair.
    return pd.DataFrame(itertools.product(*x), columns=["file_x", "file_y"])


if __name__ == "__main__":
    if Path("positives_negatives.csv").exists():
        df = pd.read_csv("positives_negatives.csv")
    else:
        negative_frames = []
        with ProcessPoolExecutor() as pool:
            # take cpu_count() identity pairs at a time from identities.values()
            for combos in tqdm(more_itertools.ichunked(itertools.combinations(identities.values(), 2), cpu_count())):
                # compute the cross product of each identity pair in a worker process
                negative_frames.extend(pool.map(compute_cross_samples, combos))
        negatives = pd.concat(negative_frames, ignore_index=True)
        negatives["decision"] = "No"

        # Sample as many negatives as there are positives, then cache the result.
        negatives = negatives.sample(positives.shape[0])
        df = pd.concat([positives, negatives]).reset_index(drop=True)
        df.to_csv("positives_negatives.csv", index=False)

Multiprocessing Pools don't apply to CUDA. cudf array has a method to convert from pandas. — Sergey Bushmanov, Feb 05 '21 at 12:03
No problem, you can delete the multiprocessing code; I just want to run the code on the GPU. Multiprocessing takes 9 days and then gives an error. I have been facing this problem for the last 2 months. Help required — Khawar Islam, Feb 05 '21 at 12:07
The problem is that I have to construct a very big list, and creating it takes a lot of time. I have to minimize that time through GPU utilization. — Khawar Islam, Feb 05 '21 at 13:08
The community may be better able to help you if you create a minimal, complete, reproducible example. https://stackoverflow.com/help/minimal-reproducible-example — Nick Becker, Feb 05 '21 at 15:43

TaureanDyerNV · Answer 1 · 2021-02-05T18:26:35.630

With your code, there are a few things you need to consider:

Due to the API similarity, the first place to start is importing cudf. Then, wherever you use pd (your pandas import alias), replace it with cudf. While this is a start, please check out this guide, which will help you understand the basics of the transition. Coding-wise, begin with the cuDF and Dask cuDF tutorial notebooks, especially this one.
As the comments say, on top of removing your CPU multiprocessing code, you want to refactor your functions so they do not require for loops. cuDF and the other RAPIDS libraries do a lot under the hood to parallelize your code for the GPU. Adding for loops makes the process serial and slows you down.
Finally, please read our official docs here, which should help with your CPU -> GPU refactor: https://docs.rapids.ai/api/cudf/stable/api.html
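
As a rough illustration of the loop-free refactor described above, here is a minimal sketch, not a tested port of the question's code. It assumes `identities` is a dict mapping an identity label to a list of file names. The function name `build_pairs` and the helper columns `idx` and `key` are made up for this example. It flattens the dict into one long table, self-joins it on a constant key to get every pair, and then splits the pairs by whether the two identities match. The sketch falls back to pandas when cudf is not installed, and it has only been reasoned through on that CPU path. Note that the self-join materializes on the order of n² rows for n files, so on a large dataset it would likely need to be run in chunks to fit in GPU memory.

```python
import pandas as pd

try:
    import cudf as xdf  # GPU DataFrame library from RAPIDS, if installed
except ImportError:
    xdf = pd  # CPU fallback so the sketch still runs without a GPU


def build_pairs(identities):
    """Return (positives, negatives) pair tables without per-identity loops.

    `identities` maps an identity label to a list of file names. The name
    `build_pairs` and the columns `idx` and `key` are illustrative only.
    """
    # Flatten the dict into one long table: one row per (identity, file).
    labels = [label for label, files in identities.items() for _ in files]
    files = [f for fs in identities.values() for f in fs]
    n = len(files)
    long = xdf.DataFrame(
        {
            "identity": labels,
            "file": files,
            "idx": list(range(n)),  # row position, used to keep each pair once
            "key": [0] * n,         # constant key turns the merge into a cross join
        }
    )

    # Self-join on the constant key: every row paired with every row.
    pairs = long.merge(long, on="key", suffixes=("_x", "_y"))

    # Keep each unordered pair once and drop self-pairs.
    pairs = pairs[pairs["idx_x"] < pairs["idx_y"]]

    # Same identity -> positive pair, different identity -> negative pair.
    same = pairs["identity_x"] == pairs["identity_y"]
    positives = pairs[same][["file_x", "file_y"]].reset_index(drop=True)
    negatives = pairs[~same][["file_x", "file_y"]].reset_index(drop=True)
    positives["decision"] = "Yes"
    negatives["decision"] = "No"
    return positives, negatives
```

With the two tables in hand, the balancing step from the question would stay the same idea: sample `len(positives)` rows from `negatives` and concatenate. If you already have pandas frames, `cudf.from_pandas`, as mentioned in the comments, is another way to move them to the GPU.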

I removed multiprocessing and changed everything, but I am still getting an error: https://stackoverflow.com/questions/66073491/typeerror-data-must-be-list-or-dict-like-in-cudf — Khawar Islam, Feb 06 '21 at 04:11
