
I'm processing text data in a PySpark DataFrame. So far I have managed to tokenize the data into a column of arrays and produce the table below (a sketch of the tokenization step follows the table):

print(df.schema)

StructType(List(StructField(_c0,IntegerType,true),StructField(pageid,IntegerType,true),StructField(title,StringType,true),StructField(text,ArrayType(StringType,true),true)))

df.show(5)

+---+------+-------------------+--------------------+
|_c0|pageid|              title|                text|
+---+------+-------------------+--------------------+
|  0|137277|    Sutton, Vermont|[sutton, is, town...|
|  1|137278|    Walden, Vermont|[walden, is, town...|
|  2|137279| Waterford, Vermont|[waterford, is, t...|
|  3|137280|West Burke, Vermont|[west, burke, is,...|
|  4|137281|  Wheelock, Vermont|[wheelock, is, to...|
+---+------+-------------------+--------------------+
only showing top 5 rows
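
For reference, a column like this can be produced with e.g. pyspark.ml.feature.RegexTokenizer. A simplified sketch, not necessarily the exact pipeline used here; 'raw_text' is a placeholder for the original string column:

from pyspark.ml.feature import RegexTokenizer

# RegexTokenizer lowercases by default and splits on the given pattern
tokenizer = RegexTokenizer(inputCol='raw_text', outputCol='text', pattern='\\W+')
df = tokenizer.transform(df)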

Then I tried to lemmatize it with UDF functions:


from nltk.corpus import wordnet
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def get_wordnet_pos(treebank_tag):
    """
    Return the WordNet POS tag (a, n, r, v) that corresponds to a
    Penn Treebank tag, for use in WordNet lemmatization.
    """
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # noun is the default POS for lemmatization
        return wordnet.NOUN


def postagger(p):
    import nltk
    return list(nltk.pos_tag(p))

sparkPosTagger = udf(lambda z: postagger(z), ArrayType(StringType()))

def lemmer(postags):
    import nltk
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word, get_wordnet_pos(pos_tag))
            for word, pos_tag in nltk.pos_tag(postags)]

sparkLemmer = udf(lambda z: lemmer(z), ArrayType(StringType()))

#df = df.select('_c0', 'pageid', 'title', 'text', sparkPosTagger("text").alias('lemm'))
df = df.select('_c0', 'pageid', 'title', 'text', sparkLemmer("text").alias('lems'))


which returns this error:

PicklingError: args[0] from __newobj__ args has the wrong class

I believe the error primarily comes from an incompatibility with the object that nltk.pos_tag(postags) produces. Normally, when given a list of tokens, nltk.pos_tag() produces a list of tuples, which does not match the declared return type of ArrayType(StringType()).

I am stuck on working out a workaround, though. As you can see from the code, I tried to split the process up beforehand by POS-tagging separately, only to receive the same error. (If the tuples are the mismatch, a return type describing (word, tag) pairs might be needed; a sketch of what I mean follows.)
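
For instance, something along these lines (an untested sketch; the field names 'word' and 'tag' are my own choice):

from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# a schema matching pos_tag's list of (word, tag) tuples
pos_schema = ArrayType(StructType([
    StructField("word", StringType()),
    StructField("tag", StringType()),
]))

sparkPosTagger = udf(lambda z: postagger(z), pos_schema)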

Is there a way to make this work?


2 Answers


Contrary to what I suspected, the problem was actually due to the initial function:

def get_wordnet_pos(treebank_tag):
    """
    Return the WordNet POS tag (a, n, r, v) that corresponds to a
    Penn Treebank tag, for use in WordNet lemmatization.
    """
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # noun is the default POS for lemmatization
        return wordnet.NOUN

which works fine in regular Python. In PySpark, however, there is drama when importing nltk: wordnet is a lazily loaded corpus object, and referencing it from a UDF seems to fail when Spark tries to pickle the closure. Others have run into similar issues when attempting to import stopwords:

pickle.PicklingError: args[0] from __newobj__ args has the wrong class with hadoop python

Whilst I haven't solved the root cause, I have redesigned the code, based on what I saw online, into a practical workaround that removes the references to wordnet (which were unnecessary anyway):

def get_wordnet_pos(treebank_tag):
    """
    Return the WordNet POS tag (a, n, r, v) that corresponds to a
    Penn Treebank tag, as a plain string instead of a wordnet attribute.
    """
    if treebank_tag.startswith('J'):
        return 'a'
    elif treebank_tag.startswith('V'):
        return 'v'
    elif treebank_tag.startswith('N'):
        return 'n'
    elif treebank_tag.startswith('R'):
        return 'r'
    else:
        # noun is the default POS for lemmatization
        return 'n'


def lemmatize1(tokens):
    # expects a list of tokens (the ArrayType 'text' column), not a raw string
    import nltk
    from nltk.stem import WordNetLemmatizer
    lmtzr = WordNetLemmatizer()
    tagged_words = nltk.pos_tag(tokens)
    lemmas = [lmtzr.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged_words]
    # rebuild a single space-separated string from the lemmas
    return ' '.join(lemmas)

sparkLemmer1 = udf(lambda x: lemmatize1(x), StringType())
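
Applying it to the DataFrame would then look roughly like this (column names taken from the question):

df = df.withColumn('lems', sparkLemmer1('text'))
df.select('title', 'lems').show(5, truncate=False)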

Nice answer by Saleem Khan! I'd just add that it is often better to keep the lemmatized output in array format:

sparkLemmer1 = udf(lambda x: lemmatize1(x), ArrayType(StringType()))

instead of this:

sparkLemmer1 = udf(lambda x: lemmatize1(x), StringType())

to be able to create e.g. n-grams and do further preprocessing in PySpark. Note that lemmatize1 must then return the list of lemmas instead of a joined string; a sketch of that variant follows.
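
A possible variant (lemmatize_to_list is my own name for it), together with a hypothetical NGram step downstream:

from pyspark.ml.feature import NGram

def lemmatize_to_list(tokens):
    # same logic as lemmatize1, but keeps the lemmas as a list
    import nltk
    from nltk.stem import WordNetLemmatizer
    lmtzr = WordNetLemmatizer()
    return [lmtzr.lemmatize(word, get_wordnet_pos(tag))
            for word, tag in nltk.pos_tag(tokens)]

sparkLemmer1 = udf(lemmatize_to_list, ArrayType(StringType()))

# the array output feeds straight into pyspark.ml's NGram
df = df.withColumn('lems', sparkLemmer1('text'))
ngram = NGram(n=2, inputCol='lems', outputCol='bigrams')
df = ngram.transform(df)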
