
I am facing a weird issue with pandas-on-Spark (pyspark.pandas). I am trying to use regex to replace abbreviations with their full counterparts. The function I am using is the following (simplified a bit):

import pyspark.pandas as pspd

def resolve_abbreviations(job_list: pspd.Series) -> pspd.Series:
    """
    The job titles contain a lot of abbreviations for common terms.
    We write them out to create a more standardized job title list.

    :param job_list: df.SchoneFunctie during processing steps
    :return: SchoneFunctie where abbreviations are written out in words
    """
    abbreviations_dict = {
        "1e": "eerste",
        "1ste": "eerste",
        "2e": "tweede",
        "2de": "tweede",
        "3e": "derde",
        "3de": "derde",
        "ceo": "chief executive officer",
        "cfo": "chief financial officer",
        "coo": "chief operating officer",
        "cto": "chief technology officer",
        "sr": "senior",
        "tech": "technisch",
        "zw": "zelfstandig werkend"
    }

    # Create a list of abbreviations
    abbreviations_pob = list(abbreviations_dict.keys())

    # For each abbreviation in this list
    for abb in abbreviations_pob:
        # define patterns to look for
        patterns = [fr'((?<=( ))|(?<=(^))|(?<=(\\))|(?<=(\())){abb}((?=( ))|(?=(\\))|(?=($))|(?=(\))))',
                    fr'{abb}\.']
        # actual recoding of abbreviations to written out form
        value_to_replace = abbreviations_dict[abb]
        for patt in patterns:
            job_list = job_list.replace(to_replace=fr'{patt}', value=f'{value_to_replace} ', regex=True)

    return job_list

When I call this function with a pyspark.pandas.series.Series, like so:

df['CleanedUp'] = resolve_abbreviations(df['SchoneFunctie'])

The following error is thrown:

Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm 2021.3\plugins\python\helpers\pydev\pydevd.py", line 1496, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm 2021.3\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:\path_to_python_file\python_file.py", line 180, in <module>
    df['SchoneFunctie'] = resolve_abbreviations(df['SchoneFunctie'])
  File "C:\path_to_python_file\python_file.py", line 164, in resolve_abbreviations
    job_list = job_list.replace(to_replace=fr'{patt}', value=f'{value_to_replace} ', regex=True)
  File "C:\Users\MyUser\.conda\envs\Anaconda3.9\lib\site-packages\pyspark\pandas\series.py", line 4492, in replace
    raise NotImplementedError("replace currently not support for regex")
NotImplementedError: replace currently not support for regex
python-BaseException

But when I look at the pyspark.pandas.Series documentation, I do see a replace function that should be implemented, and I am fairly certain I am using it correctly. Link to the documentation: https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.Series.replace.html

I am using pyspark version 3.3.1.

What is going wrong here?

Psychotechnopath

1 Answer


Series.replace in pyspark.pandas does not support regex=True as of pyspark 3.3: it raises NotImplementedError, which is exactly what your traceback shows. You can use str.replace instead, which does support regex patterns:

for patt in patterns:
    job_list = job_list.str.replace(patt, f'{value_to_replace} ')
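For reference, here is a minimal sketch of the whole function rewritten this way. It assumes pyspark 3.3, where str.replace defaults to regex=True (passing it explicitly makes the intent clear), and the abbreviation dictionary is shortened for brevity:

import pyspark.pandas as pspd

def resolve_abbreviations(job_list: pspd.Series) -> pspd.Series:
    # Shortened abbreviation map for illustration; use the full dict in practice
    abbreviations_dict = {
        "1e": "eerste",
        "ceo": "chief executive officer",
        "sr": "senior",
    }

    for abb, value_to_replace in abbreviations_dict.items():
        # Same two patterns as the original: the abbreviation bounded by
        # space/start/end/backslash/parenthesis, or followed by a period
        patterns = [fr'((?<=( ))|(?<=(^))|(?<=(\\))|(?<=(\())){abb}((?=( ))|(?=(\\))|(?=($))|(?=(\))))',
                    fr'{abb}\.']
        for patt in patterns:
            # str.replace applies the pattern with Python's re semantics,
            # so the lookbehind/lookahead groups keep working as before
            job_list = job_list.str.replace(patt, f'{value_to_replace} ', regex=True)

    return job_list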
tamarajqawasmeh