I'm new to Python and also new to PySpark. I'm trying to run a line of code that takes each pair (kv[0], kv[1]) and then runs an ngrams() function on kv[1].
Here is a sample of the layout of the mentions data that the code works on:
Out[12]:
[{'_id': u'en.wikipedia.org/wiki/Kamchatka_Peninsula',
'source': 'en.wikipedia.org/wiki/Warthead_sculpin',
'span': (100, 119),
'text': u' It is native to the northern.'},
{'_id': u'en.wikipedia.org/wiki/Warthead_sculpin',
'source': 'en.wikipedia.org/wiki/Warthead_sculpin',
'span': (4, 20),
'text': u'The warthead sculpin ("Myoxocephalus niger").'}]
This is the code that I'm working with:
def build(self, mentions, idfs):
    m = mentions\
        .map(lambda (source, target, span, text): (target, text))\
        .flatMapValues(lambda v: ngrams(v, self.max_ngram))\
        .map(lambda v: (v, 1))\
        .reduceByKey(add)\
How should the data from the previous step be structured to resolve this error? Any help or guidance would be truly appreciated.
I'm using Python 2.7 and PySpark 2.3.0.
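To make the intended result concrete, here is a minimal plain-Python sketch (no Spark) of what the chain seems meant to compute on the sample records. It rests on several assumptions that are not confirmed by the original code: that '_id' plays the role of the target, that ngrams() yields whitespace-separated word n-grams of length 1 up to max_ngram (the real ngrams() is not shown, so the version here is a hypothetical stand-in), and that max_ngram is 2. The sample records are dicts rather than 4-tuples, so the sketch reads the fields by key instead of unpacking them positionally.

```python
from collections import Counter


def ngrams(text, max_ngram):
    # Hypothetical stand-in for the real ngrams(): word n-grams of
    # length 1..max_ngram, tokenised on whitespace.
    tokens = text.split()
    for n in range(1, max_ngram + 1):
        for i in range(len(tokens) - n + 1):
            yield ' '.join(tokens[i:i + n])


# The two sample records from above, as plain dicts.
mentions = [
    {'_id': 'en.wikipedia.org/wiki/Kamchatka_Peninsula',
     'source': 'en.wikipedia.org/wiki/Warthead_sculpin',
     'span': (100, 119),
     'text': ' It is native to the northern.'},
    {'_id': 'en.wikipedia.org/wiki/Warthead_sculpin',
     'source': 'en.wikipedia.org/wiki/Warthead_sculpin',
     'span': (4, 20),
     'text': 'The warthead sculpin ("Myoxocephalus niger").'},
]

# Step 1: build (target, text) pairs by key lookup.
# Assumption: '_id' is the target.
pairs = [(m['_id'], m['text']) for m in mentions]

# Step 2: plain-Python analogue of flatMapValues: one (target, ngram)
# pair per n-gram. Assumption: max_ngram = 2.
expanded = [(target, gram)
            for target, text in pairs
            for gram in ngrams(text, 2)]

# Step 3: analogue of map(lambda v: (v, 1)).reduceByKey(add):
# count how often each (target, ngram) pair occurs.
counts = Counter(expanded)
```

If mentions is an RDD of dicts shaped like the sample, the Spark counterpart of step 1 would presumably be mentions.map(lambda m: (m['_id'], m['text'])), which yields the (key, value) pairs that flatMapValues expects.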
Thank you,