I'm trying to generate codes, part of which must follow certain predefined rules (see the commentary below). I only need as many codes as there are rows in the df passed in; each code is assigned back to that same df on a simple per-row basis. Returning a list seems less ideal than assigning directly to the df within the function, but I've not been able to achieve this. Unfortunately I need to pass in 3 dfs separately due to other processing constraints elsewhere, but each time they will have a different single-character suffix (e.g. X|Y|Z). The codes do not need to be sequential between the different dfs, although having some sequencing within each could be useful, and that is the approach I've attempted so far.
However, my current 'working' attempt here, though functional, takes far too long. I am hoping someone can point out some possible wins for optimising any part of it. Each df is typically under 500k rows, more usually 100-200k.
Generate an offer code
Desired outcome:
Sequence that takes the format: YrCodeMthCode+AAAA+99+[P|H|D] Where:
- YrCode and MthCode are supplied*
- AAAA is a generated pseudo-unique char sequence*
- 99 should not contain zeros, and is always 2 digits* (Any, Incl non-sequential)
- P|H|D is a defined identifier argument, must be passed in
- Typically the df.shape[0] dimensions are never more than 65. But happy to create blank/new and merge with existing if faster.
*The uniqueness of YrCodeMthCode+AAAA+99 only needs to cover 500k records each month (as MthCode will change/refresh x12)
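To make the format above concrete, here is a small illustration of one code being assembled and checked. The letter and digit values, the regex, and the helper name `is_valid_offer_code` are assumptions made for this example only (with AAAA taken as exactly 4 letters); they are not part of the attempt further down.

```python
import re

# Illustrative values only: the AAAA and 99 parts are made up for this example.
year_code = "G"    # 2021
month_code = "A"   # January
letters = "QWER"   # hypothetical 4-letter AAAA part
digits = "47"      # two digits, neither of them zero
suffix = "P"       # one of P | H | D

offer_code = year_code + month_code + letters + digits + suffix
print(offer_code)  # GAQWER47P

# Hypothetical checker for the format: 2 date letters, 4 letters,
# 2 digits drawn from 1-9, then the P/H/D identifier.
OFFER_CODE_RE = re.compile(r"[A-Z]{2}[A-Z]{4}[1-9]{2}[PHD]")

def is_valid_offer_code(code):
    return OFFER_CODE_RE.fullmatch(code) is not None

# Distinct AAAA+99 combinations available within one month.
combinations_per_month = 26 ** 4 * 9 ** 2
print(combinations_per_month)  # 37015056
```

Note that 26**4 on its own is 456,976, so the four letters alone would not quite cover 500k records; the two digits have to contribute to uniqueness as well.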
import itertools
import random
from random import randint

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(200, 3), columns=list('ABC'))

offerCodeLength = 6
allowedOfferCodeChars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
campaignMonth = 'January'
campaignYear = 2021
yearCodesDict = {2021:'G',2022:'H',2023:'I', 2024:'J', 2025:'K', 2026:'L', 2027:'M'}
monthCodesDict = {'January':'A','February':'B','March':'C',
                  'April':'D','May':'E','June':'F',
                  'July':'G', 'August':'H','September':'I',
                  'October':'J','November':'K','December':'L'}
OfferCodeDateStr = str(yearCodesDict[campaignYear])+str(monthCodesDict[campaignMonth])
iterator = 0
breakPoint = df.shape[0]

def generateOfferCode(OfferCodeDateStr, offerCodeLength, breakPoint, OfferCodeSuffix):
    allowedOfferCodeChars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    iterator = 0 # limit amount generated
    offerCodesList = []
    for item in itertools.product(allowedOfferCodeChars, repeat=offerCodeLength):
        # generate a 2 digit number, with NO zeros (to avoid 0 vs o call centre issues)
        psuedoRandNum = str(int(''.join(random.choices('123456789',k=randint(10,99))))%10**2)
        if iterator < breakPoint: # breakpoint as length of associated dataframe/number of codes required
            OfferCodeString = "".join(item)
            OfferCodeString = OfferCodeDateStr+OfferCodeString+psuedoRandNum+OfferCodeSuffix # join Yr,Mth chars to generated rest
            offerCodesList.append(OfferCodeString)
            iterator +=1
    return offerCodesList

generateOfferCode(OfferCodeDateStr, offerCodeLength, breakPoint, 'P')
- Pretty sure 'k=randint(10,99))))%10**2' is less than ideal, but I'm unsure how to better optimise it... a sliced string?
- I'm only defining the breakpoint outside because when I used .shape[0] directly it appeared even slower.
- I'm aware that my loop use is probably poor, and there must be a more vectorised solution that creates only what I need and applies it directly back to the passed df.
Example timings on my machine (offerCodeLength set to just 4):
x100 : 5.99 s ± 227 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) Wall time: 47.5 s
x1000 : 5.87 s ± 243 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) Wall time: 46.4 s
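For reference, a minimal sketch of what "only creating what I need and applying it directly back to the passed df" could look like. The x100 and x1000 timings being nearly identical suggests the loop above appears to keep walking the full `itertools.product` even after `breakPoint` is reached. This sketch swaps in a different technique from the attempt above: each row gets a sequential integer id, which is decoded with NumPy into four letters plus two non-zero digits, so uniqueness comes from the id rather than from random digits. The function name `add_offer_codes`, the `start` offset, and the `OfferCode` column name are assumptions made for illustration, and it has not been timed against the original.

```python
import numpy as np
import pandas as pd

LETTERS = np.array(list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"))
DIGITS = np.array(list("123456789"))  # zero deliberately excluded


def add_offer_codes(df, date_str, suffix, start=0, n_letters=4, column="OfferCode"):
    """Hypothetical helper: write one unique code per row into df[column], in place."""
    ids = np.arange(start, start + len(df), dtype=np.int64)
    if len(ids) and ids[-1] >= 26 ** n_letters * 81:
        raise ValueError("not enough distinct codes for this many rows")

    # Low-order part of the id becomes the two digits (base 9, mapped onto 1-9).
    digits = np.char.add(DIGITS[(ids % 81) // 9], DIGITS[ids % 9])

    # Remaining part of the id becomes the letters (base 26, most significant first).
    rest = ids // 81
    letters = LETTERS[rest % 26]
    for _ in range(n_letters - 1):
        rest //= 26
        letters = np.char.add(LETTERS[rest % 26], letters)

    body = np.char.add(letters, digits)
    df[column] = np.char.add(np.char.add(date_str, body), suffix)


df = pd.DataFrame(np.random.randn(200, 3), columns=list("ABC"))
add_offer_codes(df, "GA", "P")
print(df["OfferCode"].head(3).tolist())  # ['GAAAAA11P', 'GAAAAA12P', 'GAAAAA13P']
```

Passing a different `start` for each of the three dfs (for example 0, 500_000 and 1_000_000) would keep their YrCodeMthCode+AAAA+99 parts disjoint within a month, on the assumption that no single df exceeds 500k rows.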