
I have an iterator that operates on a sequence of WARC documents and yields a modified list of tokens for each document:

import re

from bs4 import BeautifulSoup


class MyCorpus(object):
    def __init__(self, warc_file_instance):
        self.warc_file = warc_file_instance

    def clean_text(self, html):
        soup = BeautifulSoup(html)  # create a new bs4 object from the html data loaded
        for script in soup(["script", "style"]):  # remove all javascript and stylesheet code
            script.extract()
        # get text
        text = soup.get_text()
        # break into lines and remove leading and trailing space on each
        lines = (line.strip() for line in text.splitlines())
        # break multi-headlines into a line each
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        # drop blank lines
        text = '\n'.join(chunk for chunk in chunks if chunk)
        return text

    def __iter__(self):
        for r in self.warc_file:
            try:
                w_trec_id = r['WARC-TREC-ID']
                print w_trec_id
            except KeyError:
                pass
            try:
                # strip the HTTP header up to Content-Length, then clean and tokenise the payload
                text = self.clean_text(re.compile(r'Content-Length: \d+').split(r.payload)[1])
                alnum_text = re.sub('[^A-Za-z0-9 ]+', ' ', text)
                yield list(set(alnum_text.encode('utf-8').lower().split()))
            except Exception:
                print 'An error occurred'

Now I apply Apache Spark's parallelize to further apply the desired map functions:

import warc
from operator import add

# sc is the SparkContext from the pyspark shell / SparkSession
warc_file = warc.open('/Users/akshanshgupta/Workspace/00.warc')
documents = MyCorpus(warc_file)
x = sc.parallelize(documents, 20)
data_flat_map = x.flatMap(lambda xs: [(x, 1) for x in xs])
sorted_map = data_flat_map.sortByKey()
counts = sorted_map.reduceByKey(add)
print(counts.max(lambda x: x[1]))

I have the following doubts:

  1. Is this the best way to achieve this, or is there a simpler way?
  2. When I parallelise the iterator, does the actual processing happen in parallel, or is it still sequential?
  3. What if I have multiple files? How can I scale this to a very large corpus, say terabytes?
  • x = sc.parallelize(documents, 20): make sure the RDD's number of partitions equals the number of cores in the cluster; that way all partitions are processed in parallel and the resources are used evenly. Also, if you are looking to set global parameters that affect every row, you can use a broadcast variable (sketched below). – devesh Aug 25 '18 at 16:27
  • Any benefit from the answer, out of interest? – thebluephantom Aug 30 '18 at 09:59
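
A minimal sketch of what the first comment suggests, assuming the sc and documents objects from the question are available; the stop-word set is only an illustrative "global parameter" and is not part of the original code:

from operator import add

# Match the number of partitions to the cores Spark actually has, rather than hard-coding 20.
num_partitions = sc.defaultParallelism
x = sc.parallelize(documents, num_partitions)

# A broadcast variable for a read-only parameter that every record needs,
# e.g. an (illustrative) stop-word set.
stop_words = sc.broadcast({'the', 'a', 'an'})

pairs = x.flatMap(lambda tokens: [(t, 1) for t in tokens if t not in stop_words.value])
counts = pairs.reduceByKey(add)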

1 Answer


More from a Scala context, but:

  1. One doubt I have is the sortByKey before reduceByKey: reduceByKey does not need sorted input, so the sort is unnecessary work there (see the sketch after this list).
  2. Processing is in parallel if you use map, foreachPartition, the DataFrame writer, etc., or read via sc and the SparkSession, and the Spark paradigm is generally suited to algorithms without sequential dependencies. mapPartitions and other such APIs are generally used for improving performance. That cleaning/tokenising function should be part of mapPartitions, I would think, or be used in conjunction with map or within a map closure (see the sketch after this list). Note that serialization issues can arise when such functions are shipped to the executors.

  3. More compute resources allow more scaling, with better performance and throughput.
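
A hedged PySpark sketch of points 1 and 2 (and of scaling to many files, per question 3): do the cleaning/tokenising on the executors via mapPartitions and drop the sortByKey before the reduce. MyCorpus, clean_text and warc.open are from the question; the list of paths and the tokenise_partition helper are assumptions for illustration, and on a real cluster the WARC files would have to sit on storage every executor can read (HDFS, S3, etc.):

from operator import add

import warc


def tokenise_partition(paths):
    # Runs on the executors, once per partition (point 2): each task parses and
    # cleans its own WARC files instead of the driver materialising everything.
    # MyCorpus must be importable on the executors (the serialization point above).
    for path in paths:
        for tokens in MyCorpus(warc.open(path)):
            yield tokens


warc_paths = ['/Users/akshanshgupta/Workspace/00.warc']  # add more files here to scale out

tokens_rdd = sc.parallelize(warc_paths, len(warc_paths)).mapPartitions(tokenise_partition)

pairs = tokens_rdd.flatMap(lambda tokens: [(t, 1) for t in tokens])
counts = pairs.reduceByKey(add)  # no sortByKey needed before the reduce (point 1)

print(counts.max(key=lambda kv: kv[1]))  # the most frequent token

The point of this shape: sc.parallelize has to consume the whole iterator on the driver before it can slice it into partitions, so in the original code the WARC parsing itself is not parallel; shipping only the file paths keeps the heavy work inside the Spark tasks.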

thebluephantom