
I was practicing with Apache Spark and tried doing some computations. Although I was able to achieve my desired result, I had to try two different methods before it worked.

I have an existing dataset from which I created an RDD.

"RT @NigeriaNewsdesk: Chibok schoolgirls were swapped for 5 Boko Haram commanders via @todayng"

I wanted to filter out the words that start with @, so I created an RDD from the existing dataset.

usernameFile = sc.parallelize(tweets)
username = usernameFile.flatMap(lambda line: line.split()).filter(lambda x: x.startswith('@')).collect()
print(username)

I got something like this:

[u'R', u'T', u' ', u'@', u'N', u'i', u'g', u'e', u'r', u'i', u'a', u'N', u'e', u'w', u's', u'd', u'e', u's', u'k', u':', u' ', u'C', u'h', u'i', u'b', u'o', u'k', u' ', u's', u'c', u'h', u'o', u'o', u'l', u'g', u'i', u'r', u'l', u's', u' ', u'w', u'e', u'r', u'e', u' ', u's', u'w', u'a', u'p', u'p', u'e', u'd', u' ', u'f'

On the second attempt, I did something like this:

tweets = tweets.split(" ")
usernameFile = sc.parallelize(tweets)
username = usernameFile.flatMap(lambda line: line.split()).filter(lambda x: x.startswith('@')).collect()
print(username)
print("username done")

The second attempt worked absolutely fine, but my question is: why did I have to split the string before parallelizing the dataset?

Can I achieve the same thing without doing this first?

tweets = tweets.split(" ")

Thank you.


1 Answer


Just flatMap directly, like this:

import re

tweets = sc.parallelize([
    "RT @foo abc @bar"
])

# Use a raw string for the regex and collect to see the result
usernames = tweets.flatMap(lambda s: re.findall(r"@\w+", s)).collect()
print(usernames)  # ['@foo', '@bar']

It doesn't get simpler than that :)
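
As for why your first attempt produced single characters: sc.parallelize accepts any Python sequence, and iterating over a string yields its characters, so each element of your RDD was one character rather than one word. Splitting first hands parallelize a list of words instead. A minimal sketch with a toy tweet string (the variable names here are just for illustration):

tweet = "RT @foo hello @bar"

# Parallelizing a bare string slices it character by character,
# so every RDD element is a single one-character string.
chars = sc.parallelize(tweet)
print(chars.take(4))   # ['R', 'T', ' ', '@']

# Parallelizing the split string gives one word per element,
# which is what startswith-based filtering expects.
words = sc.parallelize(tweet.split())
print(words.filter(lambda w: w.startswith('@')).collect())
# ['@foo', '@bar']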
