TypeError: tuple indices must be integers, not str using pyspark and RDD

Question

I'm new to Python and to PySpark. I'm trying to run code that takes (kv[0], kv[1]) pairs and then applies an ngrams() function to kv[1].

Here is a sample of the mentions data that the code works on:

Out[12]: 
[{'_id': u'en.wikipedia.org/wiki/Kamchatka_Peninsula',
  'source': 'en.wikipedia.org/wiki/Warthead_sculpin',
  'span': (100, 119),
  'text': u' It is native to the northern.'},
 {'_id': u'en.wikipedia.org/wiki/Warthead_sculpin',
  'source': 'en.wikipedia.org/wiki/Warthead_sculpin',
  'span': (4, 20),
  'text': u'The warthead sculpin ("Myoxocephalus niger").'}]

This is the code that I'm working with:

    def build(self, mentions, idfs):
        m = (mentions
             .map(lambda (source, target, span, text): (target, text))
             .flatMapValues(lambda v: ngrams(v, self.max_ngram))
             .map(lambda v: (v, 1))
             .reduceByKey(add))

How should the data from the previous step be structured to resolve this error? Any help or guidance would be greatly appreciated.

I'm using python 2.7 and pyspark 2.3.0.

Thank you,

score 1 · Accepted Answer · edited May 18 '18 at 20:03

mapValues and flatMapValues can be applied only to an RDD of (key, value) pairs, that is, an RDD where each element is a tuple of length 2 or some object that behaves like one (see "How to determine if object is a valid key-value pair in PySpark").
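To illustrate the requirement, here is a minimal plain-Python sketch of the idea, not PySpark's actual code. It assumes each element is read by position as kv[0] and kv[1], as the question describes. The helper name flat_map_values, the sample strings, and the use of str.split in place of ngrams() are all hypothetical.

```python
def flat_map_values(pairs, f):
    """Local stand-in for flatMapValues: keep the key, emit one pair per item of f(value)."""
    out = []
    for kv in pairs:
        # Each element is indexed by position, so it must behave like a 2-tuple.
        for item in f(kv[1]):
            out.append((kv[0], item))
    return out


# Hypothetical data; str.split stands in for the ngrams() function from the question.
pairs = [("page_a", "native to the north"), ("page_b", "warthead sculpin")]
expanded = flat_map_values(pairs, lambda text: text.split())

# A dict record has no positional index 0 or 1, so the same call fails.
try:
    flat_map_values([{"_id": "page_a", "text": "native"}], lambda text: text.split())
    dict_error = None
except KeyError as exc:
    dict_error = exc
```

Each key is repeated once per item produced from its value, which is why the elements have to be pairs in the first place.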

Your data is a dictionary, so it doesn't qualify. It is not clear what you expect there, but I suspect you want:

    from operator import itemgetter

    (mentions
      .map(itemgetter("_id", "text"))
      .flatMapValues(lambda v: ngrams(v, self.max_ngram))
      .map(lambda v: (v, 1)))
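As a quick local check that needs no Spark, the sketch below applies the same itemgetter to one record copied from the sample in the question. The as_tuple value is a hypothetical case, built here only to show what happens if the elements are already tuples rather than dicts.

```python
from operator import itemgetter

# One record copied from the sample in the question.
record = {
    "_id": "en.wikipedia.org/wiki/Kamchatka_Peninsula",
    "source": "en.wikipedia.org/wiki/Warthead_sculpin",
    "span": (100, 119),
    "text": " It is native to the northern.",
}

to_pair = itemgetter("_id", "text")
pair = to_pair(record)  # a 2-tuple: (_id, text)

# Hypothetical case: the element is already a tuple, so string keys cannot index it.
as_tuple = (record["source"], record["_id"], record["span"], record["text"])
try:
    to_pair(as_tuple)
    tuple_error = None
except TypeError as exc:
    tuple_error = exc
```

With dict records the getter yields a valid (key, value) pair. With tuple records it raises a TypeError about tuple indices, so it is worth checking which shape the RDD actually holds. If it does hold 4-tuples in the field order used by the question's lambda, a positional getter such as itemgetter(1, 3) would be the analogous choice.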

Alper t. Turker

answered May 18 '18 at 15:14

May I ask why you suggested using flatMapValues and then mapValue? I added the rest of the method to give a clearer picture (it computes TF-IDF). Thank you in advance – user3446905 May 18 '18 at 15:18
Thank you for your help. I did what you suggested and it caused a type error. I edited the question to show the triggered error. – user3446905 May 19 '18 at 12:29
