
Here is my code for comparing cudf and pandas performance:

import cudf
import numpy as np
import pandas as pd

gpuDF2 = cudf.DataFrame({'col_1': np.arange(0, 10_000_000), 'col_2': np.arange(0, 10_000_000)})
pandasDF2 = pd.DataFrame({'col_1': np.arange(0, 10_000_000), 'col_2': np.arange(0, 10_000_000)})
gpuDF2['log_2'] = np.log(gpuDF2['col_1'])
pandasDF2['log_1'] = np.log(pandasDF2['col_1'])


How can I get consistent results between the two computations?

SultanOrazbayev
fransua

1 Answer


I can reproduce the discrepancy from the original post. For consistent results, apply cupy.log rather than np.log to the cudf column. With that change, both computations give the same answer:

import cudf
import cupy
import numpy as np
import pandas as pd

gpuDF2 = cudf.DataFrame({'col_1': np.arange(0, 10_000_000), 'col_2': np.arange(0, 10_000_000)})
pandasDF2= pd.DataFrame({'col_1':np.arange(0,10_000_000), 'col_2':np.arange(0,10_000_000)})
gpuDF2['log_2'] = cupy.log(gpuDF2['col_1'])
pandasDF2['log_1'] = np.log(pandasDF2['col_1'])

# this passes
cupy.testing.assert_array_almost_equal(pandasDF2['log_1'], gpuDF2['log_2'])
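
If you would rather do the comparison on the host, the check above can be sketched roughly as follows. This is only an illustration under some assumptions: `logs_match` is a hypothetical helper (not part of cudf or cupy), the tolerance of 1e-6 is an arbitrary choice, and the fallback branch exists only so the snippet still runs on a machine without a GPU stack.

```python
import numpy as np
import pandas as pd


def logs_match(cpu_series, other, rtol=1e-6):
    """Hypothetical helper: True if two log columns agree within rtol."""
    # cudf Series expose to_pandas(); plain pandas Series are used as-is.
    if hasattr(other, "to_pandas"):
        other = other.to_pandas()
    a = np.asarray(cpu_series, dtype="float64")
    b = np.asarray(other, dtype="float64")
    # Infinities of the same sign in the same position (log(0) = -inf)
    # compare as equal; equal_nan=True does the same for NaN.
    return bool(np.allclose(a, b, rtol=rtol, atol=0.0, equal_nan=True))


pandas_df = pd.DataFrame({"col_1": np.arange(0, 1_000)})
with np.errstate(divide="ignore"):  # log(0) is -inf; silence the warning
    pandas_df["log_1"] = np.log(pandas_df["col_1"])

try:
    import cudf
    import cupy

    gpu_df = cudf.DataFrame({"col_1": np.arange(0, 1_000)})
    gpu_df["log_2"] = cupy.log(gpu_df["col_1"])
    other = gpu_df["log_2"]
except Exception:
    # Assumption: no usable GPU stack here, so compare the pandas column
    # with a copy of itself just to keep the sketch runnable.
    other = pandas_df["log_1"].copy()

result = logs_match(pandas_df["log_1"], other)
print(result)
```

A relative tolerance is used here instead of exact equality because CPU and GPU math libraries are not guaranteed to round the last bits of a logarithm identically.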
SultanOrazbayev