IS it possible to optimise this sequence/code generation within the given format/syntax constraints?

Question

I'm attempting to generate a code, some of which has to follow certain predefined rules(see commentary). I only need as many as there are df rows passed in - the code is assigned back to that same df on a simple per-row basis. Returning a list 'seems' less ideal than assigning directly to the df within the function, but i've not been able to achieve this. Unfortunately i need to pass in 3 df's separately due to other constraints in processing elsewhere, but each time they will have a different single character suffix(e.g. X|Y|Z). The codes do not 'need' to be sequential between the different df's, although having some sequencing in for each could be useful...and is the way i've attempted thus far.

However, my current 'working' attempt here, though functional....takes far too long. I am hopeful that someone can point out some possible wins for optimising any part of this. Typically each df is <500k, more usually 100-200k.

Generate an offer code

Desired outcome:

Sequence that takes the format: YrCodeMthCode+AAAA+99+[P|H|D] Where:

YrCode and Mth code are supplied*
AAAA a generated psuedo unique char sequence*
99 should not contain zeros, and is always 2 digits* (Any, Incl non-sequential)
P|H|D is a defined identifier argument, must be passed in
Typically the df.shape[0] dimensions are never more than 65. But happy to create blank/new and merge with existing if faster.

*The uniqueness of YrCodeMthCode+AAA+99 only needs to cover 500k records each month(as MthCode will change/refresh x12)

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(200, 3), columns=list('ABC'))



offerCodeLength = 6
allowedOfferCodeChars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

campaignMonth = 'January'
campaignYear = 2021

yearCodesDict = {2021:'G',2022:'H',2023:'I', 2024:'J', 2025:'K', 2026:'L', 2027:'M'}
monthCodesDict = {'January':'A','Febuary':'B','March':'C',
                  'April':'D','May':'E','June':'F',
                  'July':'G', 'August':'H','September':'I',
                  'October':'J','November':'K','December':'L'}


OfferCodeDateStr = str(yearCodesDict[campaignYear])+str(monthCodesDict[campaignMonth])

iterator = 0
breakPoint = df.shape[0]



def generateOfferCode(OfferCodeDateStr, offerCodeLength, breakPoint, OfferCodeSuffix):
    
    allowedOfferCodeChars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    
    iterator = 0 # limit amount generated
    offerCodesList = []
    for item in itertools.product(allowedOfferCodeChars, repeat=offerCodeLength): 
        
        # generate a 2 digit number, with NO zeros (to avoid 0 vs o call centre issues)
        psuedoRandNumc = str(int(''.join(random.choices('123456789',k=randint(10,99))))%10**2)
        
        if iterator < breakPoint: # breakpoint as length of associated dataframe/number of codes required
            OfferCodeString = "".join(item)
            OfferCodeString = OfferCodeDateStr+OfferCodeString+psuedoRandNum+OfferCodeSuffix # join Yr,Mth chars to generated rest

            offerCodesList.append(OfferCodeString) 

            iterator +=1 

    return offerCodesList

generateOfferCode(OfferCodeDateStr, offerCodeLength, breakPoint, 'P')

Pretty sure this is less than ideal as'k=randint(10,99))))%10**2' but unsure as to how to better optimise....sliced string?
I'm only defining the breakpoint outside as when i used .shape[0] directly it appeared even slower.
I'm aware that my loop use is probably poor, and there has to be a more vectorised solution in only creating what i need and applying it directly back to the passed df.

Example timings on mine: (OffercodeLength set to just 4) x100 : 5.99 s ± 227 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) Wall time: 47.5 s

x1000 : 5.87 s ± 243 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) Wall time: 46.4 s

Why use `k=randint(10,99))))%10**2` instead of k=2 if you just need a 2-digit number? — not_speshal, Nov 22 '21 at 18:16
In truth, oversight due to poor understanding. But yes, i see that now. Thanks. — John, Nov 23 '21 at 09:56

score 1 · Accepted Answer · answered Nov 22 '21 at 18:35

IIUC, you could try:

def generateOfferCode(OfferCodeDateStr, offerCodeLength, breakPoint, offerCodeSuffix):
    seen = set()
    offerCodesList = list()
    
    for i in range(breakPoint):
        psuedoRandNum = ''.join(random.choices('123456789', k=2))
        OfferCodeString = "".join(random.choices("ABCDEFGHIJKLMNOPQRSTUVWXYZ", k=6))
        while OfferCodeString in seen:
            OfferCodeString = "".join(random.choices("ABCDEFGHIJKLMNOPQRSTUVWXYZ", k=6))
        seen.add(OfferCodeString)
        offerCodesList.append(f"{OfferCodeDateStr}{OfferCodeString}{psuedoRandNum}{offerCodeSuffix}")
    return offerCodesList

df["offerCode"] = generateOfferCode(YrCodeMthCode, 6, df.shape[0], 'P')

>>> df
            A         B         C    offerCode
0    1.764052  0.400157  0.978738  GAZGCPGE28P
1    2.240893  1.867558 -0.977278  GADYNNWU69P
2    0.950088 -0.151357 -0.103219  GAEQUFPI48P
3    0.410599  0.144044  1.454274  GAUCSCHW76P
4    0.761038  0.121675  0.443863  GAFMVTBP28P
..        ...       ...       ...          ...
195 -0.470638 -0.216950  0.445393  GAOXGTOU88P
196 -0.392389 -3.046143  0.543312  GAXPQOFI25P
197  0.439043 -0.219541 -1.084037  GACBKIJV93P
198  0.351780  0.379236 -0.470033  GAVYQEQL46P
199 -0.216731 -0.930157 -0.178589  GALNKYVE23P

Performance

>>> %timeit generateOfferCode(YrCodeMthCode, 6, 500000, 'P')
%timeit generateOfferCode(YrCodeMthCode, 6, df.shape[0]*1000, 'P')
829 ms ± 22.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is a massive improvement on process time! Amongst other things i can better see that the sequential char generation was taking it's toll on proc time even with your inclusion of the while-seen in counteracting that aspect. This is really helpful, thank you. — John, Nov 23 '21 at 12:10

IS it possible to optimise this sequence/code generation within the given format/syntax constraints?

Generate an offer code

Desired outcome:

1 Answers1

Performance