Spark parallelize on an iterator with a function

Question

I have an iterator which operates on a sequence of WARC documents and yields a modified list of tokens for each document:

import re
from bs4 import BeautifulSoup

class MyCorpus(object):
    def __init__(self, warc_file_instance):
        self.warc_file = warc_file_instance

    def clean_text(self, html):
        soup = BeautifulSoup(html) # create a new bs4 object from the html data loaded
        for script in soup(["script", "style"]): # remove all javascript and stylesheet code
            script.extract()
        # get text
        text = soup.get_text()
        # break into lines and remove leading and trailing space on each
        lines = (line.strip() for line in text.splitlines())
        # break multi-headlines into a line each
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        # drop blank lines
        text = '\n'.join(chunk for chunk in chunks if chunk)
        return text

    def __iter__(self):
        for r in self.warc_file:
            try:
                w_trec_id = r['WARC-TREC-ID']
                print(w_trec_id)
            except KeyError:
                pass
            try:
                # everything after the Content-Length header is treated as the HTML body
                text = self.clean_text(re.compile(r'Content-Length: \d+').split(r.payload)[1])
                alnum_text = re.sub('[^A-Za-z0-9 ]+', ' ', text)
                # yield the unique lower-cased tokens of this document
                yield list(set(alnum_text.encode('utf-8').lower().split()))
            except:
                print('An error occurred')

Now I use Apache Spark's parallelize so that I can apply the desired map functions:

import warc
from operator import add

# sc is an existing SparkContext
warc_file = warc.open('/Users/akshanshgupta/Workspace/00.warc')
documents = MyCorpus(warc_file)
x = sc.parallelize(documents, 20)
data_flat_map = x.flatMap(lambda xs: [(x, 1) for x in xs])
sorted_map = data_flat_map.sortByKey()
counts = sorted_map.reduceByKey(add)
print(counts.max(lambda x: x[1]))
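
To make explicit what the job above computes, here is a minimal local sketch without Spark. Because each document is reduced to a set of tokens, the pipeline counts how many documents contain each token and then picks the token with the highest count. The names `document_frequencies` and `most_common_token` and the sample documents are illustrative and not part of the original code.

```python
from collections import Counter

def document_frequencies(documents):
    """Count, for each token, the number of documents that contain it."""
    counts = Counter()
    for tokens in documents:
        # Mirrors the flatMap to (token, 1) pairs followed by reduceByKey(add);
        # set() guards against duplicates, which MyCorpus already removes.
        counts.update(set(tokens))
    return counts

def most_common_token(counts):
    """Mirror counts.max(lambda x: x[1]): the (token, count) pair with the largest count."""
    return max(counts.items(), key=lambda pair: pair[1])

# Illustrative input: one token list per document.
docs = [["spark", "rdd"], ["spark", "warc"], ["spark", "rdd", "python"]]
freqs = document_frequencies(docs)
print(most_common_token(freqs))
```

The result does not depend on key order, so the `sortByKey` step in the Spark version is probably unnecessary for this particular output.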

I have the following questions:

Is this the best way to achieve this, or is there a simpler way?
When I parallelise the iterator, does the actual processing happen in parallel, or is it still sequential?
What if I have multiple files? How can I scale this to a very large corpus, say terabytes of data?
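
On the last two questions, one common pattern is to parallelise the list of file paths rather than the parsed documents, so that each worker opens and parses its own files. As far as I know, `sc.parallelize` turns an iterable without `__len__` into a list on the driver first, which would mean the `MyCorpus` iteration itself runs sequentially. A minimal sketch follows; `tokens_per_document`, `build_rdd`, and the one-document-per-line file format are assumptions made for brevity (a real version would open each WARC file inside the function).

```python
import os
import re
import tempfile

def tokens_per_document(path):
    """Runs on a worker: read one file and yield a list of unique tokens per document.

    Assumption: one plain-text document per line. A real version would open
    the WARC file here and clean each record's HTML instead.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            alnum_text = re.sub(r"[^A-Za-z0-9 ]+", " ", line)
            tokens = sorted(set(alnum_text.lower().split()))
            if tokens:
                yield tokens

def build_rdd(sc, paths):
    """Distribute file paths (cheap to ship) and parse the files on the workers.

    `sc` is an existing SparkContext; one partition per file is an arbitrary choice.
    """
    return sc.parallelize(paths, max(1, len(paths))).flatMap(tokens_per_document)

# Local check of the per-file function, no Spark needed.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write("Hello, hello world!\n\nSpark & WARC\n")
    parsed = list(tokens_per_document(sample))
print(parsed)
```

With this layout, adding more files only means passing a longer list of paths to `build_rdd`, provided every worker can read those paths.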

x = sc.parallelize(documents, 20): Make sure the number of partitions in your RDD equals the number of cores in the cluster, so that all partitions are processed in parallel and resources are used evenly. Also, if you want to set global parameters that affect every row, you can use a broadcast variable. — devesh, Aug 25 '18 at 16:27
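
The two suggestions in this comment could look roughly like the sketch below. It takes an existing SparkContext as `sc`; the stop-word set is a hypothetical example of a "global parameter", and `keep_token` and `count_with_broadcast` are illustrative names. `sc.defaultParallelism` stands in for the number of cores, which is only an approximation that depends on the cluster configuration.

```python
from operator import add

def keep_token(token, stopwords):
    """Per-token filter that depends on a shared, read-only set."""
    return token not in stopwords

def count_with_broadcast(sc, documents, stopwords):
    """Count documents per token, skipping stop words.

    `sc` is an existing SparkContext and `documents` a list of token lists.
    """
    # Use the context's default parallelism as the partition count.
    num_partitions = sc.defaultParallelism
    # Ship the set to each executor once instead of with every task.
    shared = sc.broadcast(frozenset(stopwords))
    rdd = sc.parallelize(documents, num_partitions)
    pairs = rdd.flatMap(
        lambda tokens: [(t, 1) for t in set(tokens) if keep_token(t, shared.value)]
    )
    return pairs.reduceByKey(add)

# Local check of the filter itself, no Spark needed.
kept = [t for t in ["the", "spark", "rdd"] if keep_token(t, {"the"})]
print(kept)
```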

thebluephantom · Answer 1 · 2018-08-25T20:44:07.680
