
I'm a newbie to Python, just trying to understand why the output looks like this.

1)

strr = ['asdfasdf', 'asdf', 'sdf', 'sdf']
# list() is needed on Python 3, where map returns a lazy iterator
stak = list(map(lambda l: l.split(','), strr))
print(stak)

When I execute the first snippet, the output has the format below, which is understandable: split returns a list for each element, and those results are themselves gathered into a list, hence the lists inside the list.

[['asdfasdf'], ['asdf'], ['sdf'], ['sdf']]

2)

s = 'asdfasdf,asdf,sdf,sdf'  # renamed from str to avoid shadowing the built-in
# sc is the SparkContext
result = sc.parallelize(s).map(lambda l: l.split(',')).collect()
print(result)

Now look at the second case. I expected it to behave like the first and give similar output, but instead it gives the output below. What I don't get is why the characters have been separated into individual lists. Could anyone please explain the difference between the outputs of 1 and 2?

[['a'],
 ['s'],
 ['d'],
 ['f'],
 ['a'],
 ['s'],
 ['d'],
 ['f'],
 ['', ''],
 ['a'],
 ['s'],
 ['d'],
 ['f'],
 ['', ''],
 ['s'],
 ['d'],
 ['f'],
 ['', ''],
 ['s'],
 ['d'],
 ['f']]

1 Answer

  1. split returns a list in Python. Because you are mapping split over every element of the list, and no element contains a comma (,), each call returns a one-element list holding the original value. The output is therefore a list of lists.

  2. Because you are calling parallelize on a string, it breaks the string into an RDD of single characters, and map then applies split to each character. That again gives a list of one-element lists, as explained in #1. Only when the character is the comma itself does the split actually happen, producing a list with two empty strings. For more info, check this
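The two points above can be reproduced without Spark. The following is only a plain-Python sketch of the idea, not what Spark does internally: it assumes parallelize treats its argument as a collection and iterates over it, which for a bare string means one element per character. The names words, case1, case2 and fixed are made up for illustration.

```python
# Plain-Python sketch (no Spark needed) of what each case iterates over.
s = 'asdfasdf,asdf,sdf,sdf'

# Case 1: iterating over a list of strings yields whole strings.
words = ['asdfasdf', 'asdf', 'sdf', 'sdf']
case1 = [w.split(',') for w in words]

# Case 2: iterating over a string yields single characters,
# which mirrors what happens when a bare string is parallelized.
case2 = [ch.split(',') for ch in s]

# Possible fix: wrap the string in a list so it counts as one element.
# With Spark this would correspond to sc.parallelize([s]),
# assuming an existing SparkContext named sc.
fixed = [elem.split(',') for elem in [s]]

print(case1)
print(case2[:9])  # first nine characters, ending with the first comma
print(fixed)
```

If the goal was the same result as in case 1, splitting the string first and parallelizing the resulting list (s.split(',')) would be another option.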

satyam soni