
I want to convert the following code (which runs in pandas) to code that runs in cuDF.

Sample data from .head() of the Series being manipulated is plugged into the original code in the third code cell below, so you should be able to copy, paste, and run it.

Original code in pandas

# both are float columns now
# rawcensustractandblock
s_rawcensustractandblock = df_train['rawcensustractandblock'].apply(lambda x: str(x))

# adjust/set new tract number 
df_train['census_tractnumber'] = s_rawcensustractandblock.str.slice(4,11)

# adjust block number
df_train['block_number'] = s_rawcensustractandblock.str.slice(start=11)
df_train['block_number'] = df_train['block_number'].apply(lambda x: x[:4]+'.'+x[4:]+'0' )
df_train['block_number'] = df_train['block_number'].apply(lambda x: int(round(float(x),0)) )
df_train['block_number'] = df_train['block_number'].apply(lambda x: str(x).ljust(4,'0') )

Data being manipulated

# series of values from df_train['rawcensustractandblock'].head()
data = pd.Series([60371066.461001, 60590524.222024, 60374638.00300401, 
                  60372963.002002, 60590423.381006])

Code adjusted to start with this sample data

Here's how the code looks when using the sample data provided above instead of the entire dataframe.

Based on the errors I encountered when trying to convert, the issue is at the Series level, so converting the cell below to run in cuDF should solve the problem.

import pandas as pd

# series of values from df_train['rawcensustractandblock'].head()
data = pd.Series([60371066.461001, 60590524.222024, 60374638.00300401, 
                  60372963.002002, 60590423.381006])

# how the first line looks using the series
s_rawcensustractandblock = data.apply(lambda x: str(x))

# adjust/set new tract number 
census_tractnumber = s_rawcensustractandblock.str.slice(4,11)

# adjust block number
block_number = s_rawcensustractandblock.str.slice(start=11)
block_number = block_number.apply(lambda x: x[:4]+'.'+x[4:]+'0' )
block_number = block_number.apply(lambda x: int(round(float(x),0)) )
block_number = block_number.apply(lambda x: str(x).ljust(4,'0') )

Expected changes (output)

df_train['census_tractnumber'].head()

# out
0    1066.46
1    0524.22
2    4638.00
3    2963.00
4    0423.38
Name: census_tractnumber, dtype: object

df_train['block_number'].head()

0    1001
1    2024
2    3004
3    2002
4    1006
Name: block_number, dtype: object
gumdropsteve

2 Answers

You can use cuDF string methods (via nvStrings) for almost everything you're trying to do. You will lose some precision converting these floats to strings in cuDF (though it may not matter in your example above), so for this example I've simply converted them in pandas beforehand. If possible, I would recommend creating rawcensustractandblock as a string column from the start rather than as a float column.

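import cudf
import pandas as pd

# same sample data as in the question
pd_data = pd.Series([60371066.461001, 60590524.222024, 60374638.00300401,
                     60372963.002002, 60590423.381006])

# convert to strings in pandas first, then move the Series to the GPU
gdata = cudf.from_pandas(pd_data.astype('str'))

tractnumber = gdata.str.slice(4,11)
blocknumber = gdata.str.slice(11)
# insert a decimal point after the fourth digit of the block number
blocknumber = blocknumber.str.slice(0,4).str.cat(blocknumber.str.slice(4), '.')
blocknumber = blocknumber.astype('float').round(0).astype('int')
blocknumber = blocknumber.astype('str').str.ljust(4, '0')
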
tractnumber
0    1066.46
1    0524.22
2    4638.00
3    2963.00
4    0423.38
dtype: object

blocknumber
0    1001
1    2024
2    3004
3    2002
4    1006
dtype: object
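
For comparison, the same chain of steps can be sketched in plain pandas with the `.apply(lambda ...)` calls from the question replaced by vectorized string and dtype methods. This is only a CPU-side sketch: the variable names are illustrative, and it assumes the float-to-string step is done with Python's `str()`, as in the question. It makes no claim about how cuDF formats floats.

```python
import pandas as pd

# Sample values from the question.
raw = pd.Series([60371066.461001, 60590524.222024, 60374638.00300401,
                 60372963.002002, 60590423.381006])

# Element-wise str(), matching the question's first step.
s = raw.map(str)

# Characters 4..10 hold the tract number, e.g. '1066.46'.
tract = s.str.slice(4, 11)

# Everything from character 11 onward is the raw block number.
block = s.str.slice(11)
# Insert a decimal point after the fourth digit, then round to an integer.
block = block.str.slice(0, 4) + "." + block.str.slice(4) + "0"
block = block.astype(float).round(0).astype(int)
# Back to a string, right-padded with zeros to four characters.
block = block.astype(str).str.ljust(4, "0")

print(tract.tolist())
print(block.tolist())
```

Removing the lambdas first on the CPU makes it easier to compare each intermediate Series against the original output before moving the data to the GPU.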
Nick Becker

for loop solution

pandas (original code)

import pandas as pd

# data from df_train.rawcensustractandblock.head()
pd_data = pd.Series([60371066.461001, 60590524.222024, 60374638.00300401, 
                     60372963.002002, 60590423.381006])

# using series instead of dataframe
pd_raw_block = pd_data.apply(lambda x: str(x))

# adjust/set new tract number 
pd_tractnumber = pd_raw_block.str.slice(4,11)

# set/adjust block number
pd_block_number = pd_raw_block.str.slice(11)
pd_block_number = pd_block_number.apply(lambda x: x[:4]+'.'+x[4:]+'0')
pd_block_number = pd_block_number.apply(lambda x: int(round(float(x),0)))
pd_block_number = pd_block_number.apply(lambda x: str(x).ljust(4,'0'))


# print(list(pd_tractnumber))
# print(list(pd_block_number))

cuDF (solution code)

import cudf

# data from df_train.rawcensustractandblock.head()
cudf_data = cudf.Series([60371066.461001, 60590524.222024, 60374638.00300401, 
                         60372963.002002, 60590423.381006])

# using series instead of dataframe
cudf_tractnumber = cudf_data.values_to_string()
# adjust/set new tract number
for i in range(len(cudf_tractnumber)):
  funct = slice(4,11)
  cudf_tractnumber[i] = cudf_tractnumber[i][funct]

# using series instead of dataframe
cudf_block_number = cudf_data.values_to_string()
# set/adjust block number
for i in range(len(cudf_block_number)):
  funct = slice(11, None)
  cudf_block_number[i] = cudf_block_number[i][funct]
  cudf_block_number[i] = cudf_block_number[i][:4]+'.'+cudf_block_number[i][4:]+'0'
  cudf_block_number[i] = int(round(float(cudf_block_number[i]), 0))
  cudf_block_number[i] = str(cudf_block_number[i]).ljust(4,'0')


# print(cudf_tractnumber)
# print(cudf_block_number)
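
The per-element steps inside the two loops can also be pulled out into a small plain-Python helper, which makes them easy to check against the expected output on the CPU. The function name `split_tract_block` is hypothetical, and the sketch assumes each value has already been turned into a string with `str()`.

```python
def split_tract_block(raw: str) -> tuple[str, str]:
    """Return (tract_number, block_number) for one stringified raw value."""
    # Characters 4..10 are the tract number, e.g. '1066.46'.
    tract = raw[4:11]
    # The remaining characters are the unrounded block number.
    digits = raw[11:]
    # Place a decimal point after the fourth digit and round to an integer.
    rounded = int(round(float(digits[:4] + "." + digits[4:] + "0"), 0))
    # Right-pad with zeros so the block number is at least four characters.
    block = str(rounded).ljust(4, "0")
    return tract, block


values = [60371066.461001, 60590524.222024, 60374638.00300401,
          60372963.002002, 60590423.381006]
results = [split_tract_block(str(v)) for v in values]

for tract, block in results:
    print(tract, block)
```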
gumdropsteve