I am building image pairs by mixing positive and negative pairs. This process is computationally expensive and uses a lot of RAM and CPU. To speed it up, I want to use the GPU and convert my pandas code to cuDF. The cuDF documentation is very limited, and I want to convert the code below to cuDF.

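import itertools
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import cpu_count
from pathlib import Path

import more_itertools
import pandas as pd
from tqdm import tqdm

# `identities` is defined elsewhere; from its use below it is assumed to be
# a dict mapping each identity to a list of image file names.
positives = pd.DataFrame()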
for value in tqdm(identities.values(), desc="Positives"):
    positives = positives.append(pd.DataFrame(itertools.combinations(value, 2), columns=["file_x", "file_y"]),
                                 ignore_index=True)
positives["decision"] = "Yes"
print(positives)
samples_list = list(identities.values())
negatives = pd.DataFrame()
######################====================Functions=============##############

def compute_cross_samples(x):
    return pd.DataFrame(itertools.product(*x), columns=["file_x", "file_y"])

####################################
if __name__ == "__main__":
    if Path("positives_negatives.csv").exists():
        df = pd.read_csv("positives_negatives.csv")
    else:
        with ProcessPoolExecutor() as pool:
            # take cpu_count combinations from identities.values
            for combos in tqdm(more_itertools.ichunked(itertools.combinations(identities.values(), 2), cpu_count())):
                # for each combination iterator that comes out, calculate the cross
                for cross_samples in pool.map(compute_cross_samples, combos):
                    # for each product iterator "cross_samples", iterate over its values and append them to negatives
                    negatives = negatives.append(cross_samples)

        negatives["decision"] = "No"

negatives = negatives.sample(positives.shape[0])
df = pd.concat([positives, negatives]).reset_index(drop=True)
df.to_csv("positives_negatives.csv", index=False)
Khawar Islam

1 Answer

With your code, there are a few things you need to consider:

  1. Due to the API similarity, the first place to start is importing cudf. Then, wherever you use pd (your pandas import alias), replace it with cudf. While this is a start, please check out this guide, which will help you understand the basics of the transition. Coding-wise, begin with the cuDF and Dask cuDF tutorial notebooks, especially this one.

  2. As the comments say, on top of removing your CPU multiprocessing code, you want to refactor your functions so they do not require for loops. cuDF and the other RAPIDS libraries do a lot under the hood to parallelize your code for the GPU. Adding for loops makes the process serial and slows you down.

  3. Finally, please read our official docs here, which should help with your CPU -> GPU refactor: https://docs.rapids.ai/api/cudf/stable/api.html
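
To make points 1 and 2 concrete, here is a minimal sketch of one loop-free way to build the pairs, using merges instead of itertools. It is an illustration under assumptions, not a tested port of the code in the question: `build_pairs`, the `xdf` alias, and the helper columns `idx` and `key` are hypothetical names, and `identities` is assumed to be a dict mapping each identity to a list of file names. The import falls back to pandas when cuDF is not installed, since both libraries accept the calls used here.

```python
try:
    import cudf as xdf  # GPU path, if RAPIDS is installed
except ImportError:
    import pandas as xdf  # CPU fallback; the calls below are the same


def build_pairs(identities, seed=0):
    """Build positive and negative file pairs with merges instead of loops."""
    # Flatten {identity: [files]} into one long table. This host-side loop
    # only walks the dict once; all pair generation happens in the merges.
    ident_ids, files = [], []
    for ident_id, file_list in enumerate(identities.values()):
        ident_ids.extend([ident_id] * len(file_list))
        files.extend(file_list)
    n_files = len(files)
    long = xdf.DataFrame({
        "identity": ident_ids,
        "file": files,
        "idx": list(range(n_files)),  # running row number, used to drop mirrored pairs
        "key": [0] * n_files,         # constant key, used to emulate a cross join
    })

    # Positives: pair files that share an identity, keeping each pair once.
    same = long.merge(long, on="identity", suffixes=("_x", "_y"))
    same = same[same["idx_x"] < same["idx_y"]]
    positives = same[["file_x", "file_y"]].reset_index(drop=True)
    positives["decision"] = "Yes"

    # Negatives: pair every file with every file, then keep pairs whose
    # identities differ (again once per pair).
    cross = long.merge(long, on="key", suffixes=("_x", "_y"))
    keep = (cross["identity_x"] != cross["identity_y"]) & (cross["idx_x"] < cross["idx_y"])
    negatives = cross[keep][["file_x", "file_y"]].reset_index(drop=True)
    negatives["decision"] = "No"

    # Balance the classes; min() guards against having fewer negatives than positives.
    n = min(len(positives), len(negatives))
    negatives = negatives.sample(n=n, random_state=seed)
    return xdf.concat([positives, negatives]).reset_index(drop=True)


if __name__ == "__main__":
    demo = {"a": ["a1.jpg", "a2.jpg", "a3.jpg"], "b": ["b1.jpg", "b2.jpg"]}
    print(build_pairs(demo))
```

Note that the cross join materializes one row per pair of files before filtering, so with many images it may not fit in GPU memory; in that case you would need to process the identities in chunks. Row order after a merge is also not guaranteed to match the order itertools would produce.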

TaureanDyerNV
  • I removed multiprocessing and changed everything, but I am still getting an error: https://stackoverflow.com/questions/66073491/typeerror-data-must-be-list-or-dict-like-in-cudf – Khawar Islam Feb 06 '21 at 04:11