I'm processing text data in a PySpark DataFrame. So far I have managed to tokenize the data into a column of arrays, producing the table below:
print(df.schema)
StructType(List(StructField(_c0,IntegerType,true),StructField(pageid,IntegerType,true),StructField(title,StringType,true),StructField(text,ArrayType(StringType,true),true)))
df.show(5)
+---+------+-------------------+--------------------+
|_c0|pageid| title| text|
+---+------+-------------------+--------------------+
| 0|137277| Sutton, Vermont|[sutton, is, town...|
| 1|137278| Walden, Vermont|[walden, is, town...|
| 2|137279| Waterford, Vermont|[waterford, is, t...|
| 3|137280|West Burke, Vermont|[west, burke, is,...|
| 4|137281| Wheelock, Vermont|[wheelock, is, to...|
+---+------+-------------------+--------------------+
only showing top 5 rows
Then I tried to lemmatize it with UDFs:
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    """
    Map a Treebank POS tag to the WordNet POS tag (a, n, r, v) used by the lemmatizer.
    """
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # The default POS in lemmatization is noun
        return wordnet.NOUN
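For context, the WordNet POS constants are just one-letter strings (`wordnet.ADJ == 'a'`, `wordnet.VERB == 'v'`, `wordnet.NOUN == 'n'`, `wordnet.ADV == 'r'`), so the mapping above can be sanity-checked in plain Python without NLTK or Spark (the function name here is my own, for illustration only):

```python
def treebank_to_wordnet(treebank_tag):
    # The first letter of the Treebank tag decides the WordNet POS class;
    # these one-letter codes are the values of wordnet.ADJ/VERB/NOUN/ADV.
    prefix_map = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}
    return prefix_map.get(treebank_tag[:1], 'n')  # noun is the default

print(treebank_to_wordnet('VBD'))  # past-tense verb -> 'v'
print(treebank_to_wordnet('JJ'))   # adjective -> 'a'
print(treebank_to_wordnet('DT'))   # determiner -> 'n' (default)
```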
def postagger(p):
    import nltk
    x = list(nltk.pos_tag(p))
    return x

sparkPosTagger = udf(lambda z: postagger(z), ArrayType(StringType()))
def lemmer(postags):
    import nltk
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    x = [lemmatizer.lemmatize(word, get_wordnet_pos(pos_tag))
         for (word, pos_tag) in nltk.pos_tag(postags)]
    return x

sparkLemmer = udf(lambda z: lemmer(z), ArrayType(StringType()))
#df = df.select('_c0','pageid','title','text', sparkPosTagger("text").alias('lemm'))
df = df.select('_c0','pageid','title','text', sparkLemmer("text").alias('lems'))
which returns this error:
PicklingError: args[0] from __newobj__ args has the wrong class
I believe the error comes primarily from an incompatibility with the object that nltk.pos_tag(postags) produces: given a list of tokens, nltk.pos_tag() normally returns a list of tuples.
I am stuck on working out a workaround, though. As you can see from the code, I tried splitting the process up by POS-tagging separately first, only to receive the same error.
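To illustrate the shape mismatch (plain Python, no Spark; the tags below are hardcoded for illustration, not a live tagger run): pos_tag returns (token, tag) pairs, while a UDF declared as ArrayType(StringType()) promises a list of plain strings. One way to make the shapes agree is to flatten each pair into a single "word/TAG" string:

```python
# What nltk.pos_tag would produce for tokens like ['sutton', 'is', 'town']
# (hardcoded sample output, for illustration only):
tagged = [('sutton', 'NNP'), ('is', 'VBZ'), ('town', 'NN')]

# Flatten each (word, tag) pair into one string, so the result is a plain
# list of strings matching a declared ArrayType(StringType()):
flattened = ['{}/{}'.format(word, tag) for word, tag in tagged]
print(flattened)  # ['sutton/NNP', 'is/VBZ', 'town/NN']
```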
Is there a way to make this work?