I am facing a weird issue with the pandas API on Spark (pyspark.pandas). I am trying to use regex to replace abbreviations with their full counterparts. The function I am using is the following (simplified a bit):
import pyspark.pandas as pspd


def resolve_abbreviations(job_list: pspd.Series) -> pspd.Series:
    """
    The job titles contain a lot of abbreviations for common terms.
    We write them out to create a more standardized job title list.

    :param job_list: df.SchoneFunctie during processing steps
    :return: SchoneFunctie where abbreviations are written out in words
    """
    abbreviations_dict = {
        "1e": "eerste",
        "1ste": "eerste",
        "2e": "tweede",
        "2de": "tweede",
        "3e": "derde",
        "3de": "derde",
        "ceo": "chief executive officer",
        "cfo": "chief financial officer",
        "coo": "chief operating officer",
        "cto": "chief technology officer",
        "sr": "senior",
        "tech": "technisch",
        "zw": "zelfstandig werkend"
    }
    # Create a list of abbreviations
    abbreviations_pob = list(abbreviations_dict.keys())
    # For each abbreviation in this list
    for abb in abbreviations_pob:
        # Define the patterns to look for
        patterns = [fr'((?<=( ))|(?<=(^))|(?<=(\\))|(?<=(\())){abb}((?=( ))|(?=(\\))|(?=($))|(?=(\))))',
                    fr'{abb}\.']
        # Actual recoding of abbreviations to their written-out form
        value_to_replace = abbreviations_dict[abb]
        for patt in patterns:
            job_list = job_list.replace(to_replace=fr'{patt}', value=f'{value_to_replace} ', regex=True)
    return job_list
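For context, the first pattern is meant to match the abbreviation only when it stands alone: preceded by a space, the start of the string, a backslash, or an opening parenthesis, and followed by a space, a backslash, the end of the string, or a closing parenthesis. A quick plain-`re` sketch of the intended substitution (the sample strings here are made up for illustration):

```python
import re

abb = "cto"
pattern = fr'((?<=( ))|(?<=(^))|(?<=(\\))|(?<=(\())){abb}((?=( ))|(?=(\\))|(?=($))|(?=(\))))'

# A standalone "cto" is matched and replaced (the trailing space mirrors
# the function's value=f'{value_to_replace} ')
print(re.sub(pattern, "chief technology officer ", "cto gezocht"))

# "cto" embedded inside another word is left alone
print(re.sub(pattern, "chief technology officer ", "octopus"))
```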
When I call this function with a pyspark.pandas.series.Series, like so:

df['CleanedUp'] = resolve_abbreviations(df['SchoneFunctie'])

the following error is thrown:
Traceback (most recent call last):
File "C:\Program Files\JetBrains\PyCharm 2021.3\plugins\python\helpers\pydev\pydevd.py", line 1496, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "C:\Program Files\JetBrains\PyCharm 2021.3\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:\path_to_python_file\python_file.py", line 180, in <module>
df['SchoneFunctie'] = resolve_abbreviations(df['SchoneFunctie'])
File "C:\path_to_python_file\python_file.py", line 164, in resolve_abbreviations
job_list = job_list.replace(to_replace=fr'{patt}', value=f'{value_to_replace} ', regex=True)
File "C:\Users\MyUser\.conda\envs\Anaconda3.9\lib\site-packages\pyspark\pandas\series.py", line 4492, in replace
raise NotImplementedError("replace currently not support for regex")
NotImplementedError: replace currently not support for regex
python-BaseException
However, when I look at the pyspark.pandas.Series documentation, I do see a replace function that should be implemented, and I am fairly certain I am using it correctly. Link to the documentation: https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.Series.replace.html
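For what it's worth, the equivalent call on a plain pandas Series does perform the regex substitution without complaint (a minimal sketch with made-up sample data, assuming plain pandas is installed):

```python
import pandas as pd

s = pd.Series(["cto gezocht", "sr developer"])

# Plain pandas accepts regex=True and substitutes the matched portion
out = s.replace(to_replace=r'^sr(?= )', value='senior', regex=True)
print(out.tolist())  # -> ['cto gezocht', 'senior developer']
```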
I am using pyspark version 3.3.1.
What is going wrong here?