Need to count the number of documents in a particular directory using python - MapReduce

Question

Below is the program I'm using. It runs without errors but does not produce any output. Please help me find the problem.

import gzip
import warc
import os
from mrjob.job import MRJob


class DocumentCounter(MRJob):
    def mapper(self, _, line):
        entries = os.listdir("C://Users//HP//WARCDataset")
        for entry in entries:
            yield 1,1

    def reducer(self, key, values):

        yield key, sum(values)

if __name__ == '__main__':
     DocumentCounter.run()

[Screenshot of the IDE and the output window.] The result is not displayed even though the program runs successfully.

I'm not sure what is wrong with your code, but do you really need a class to count number of documents? You can do it easily by: `b = len([x for x in os.listdir(folder) if x.endswith(file_extension)])`. — NotAName, Jan 14 '20 at 04:44
@pavel: Like you said, it is possible to do it with Python's built-in functions, but she wants to use the *MapReduce* algorithm, so she needs a class. — codrelphi, Jan 14 '20 at 08:22
@NachiketDeo: Your code seems to be right. Can you show how you run the code from your terminal? — codrelphi, Jan 14 '20 at 08:47
@codrelphi Thanks for the review. I'm running the code on my local machine in the Enthought Canopy IDE by pressing the 'Run' button. I'm not using any command to run the file. Please let me know if there's a command that can be used to run it. — Nachiket Deo, Jan 14 '20 at 09:11
It is possible to run the script from your Terminal and to specify where the outputs should be located. You can check the documentation here https://mrjob.readthedocs.io/en/latest/guides.html — codrelphi, Jan 14 '20 at 10:30
@codrelphi I have attached a screenshot of my IDE in the body of the question. The program runs but doesn't produce any output, so I feel that something must be wrong in my code. It is probably a minor mistake, as I'm a novice in Python MapReduce. — Nachiket Deo, Jan 15 '20 at 07:02
@NachiketDeo Your output should be in one of the folders listed beside the line *job output is in* or *streaming final output from*. You can check those folders. — codrelphi, Jan 16 '20 at 07:29
@codrelphi The output folders don't exist. They get removed at run time. I researched the issue and it seems that mrjob requires input from STDIN. — Nachiket Deo, Jan 17 '20 at 18:08
Try checking the *outputs folder* from your terminal (*command line interface*). The *documentation* also explains how to specify the *outputs folder* from the terminal. — codrelphi, Jan 17 '20 at 19:13
Your code doesn't work if multiple mappers are started, because each one lists the directory again and the entries are recounted. You're therefore limited to a single mapper, which means you're not really taking advantage of MapReduce parallelism. — OneCricketeer, Jan 19 '20 at 16:12

score 0 · Accepted Answer · answered Jul 13 '20 at 06:23

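import os

from mrjob.job import MRJob

# Directory to scan. This is the path from the question; adjust as needed.
WARC_PATH = "C://Users//HP//WARCDataset"


class DocumentCounter(MRJob):

    def mapper_raw(self, input_path, input_uri):
        # mapper_raw is called with the input file's path and URI instead of
        # one line at a time, so the directory is listed once per input file.
        for fname in os.listdir(WARC_PATH):
            yield "total_documents", 1

    def combiner(self, key, values):
        """Sums up the counts emitted by each mapper."""
        yield key, sum(values)

    def reducer(self, key, values):
        # Add the partial sums from the combiners into the final total.
        number_of_documents = sum(values)
        yield key, number_of_documents


# The entry point must sit at module level, after the class is defined.
if __name__ == '__main__':
    DocumentCounter.run()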

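The code above uses os.listdir to iterate over all entries at the given path and emits a count of 1 for each one; the combiner and reducer then sum these counts into the total. The job still needs an input file to be passed to it, otherwise the mapper is never called.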
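The counting logic can be checked without mrjob. The following is a minimal sketch in plain Python, not the mrjob API: `map_entries`, `reduce_counts`, `count_once`, `count_per_line`, and the placeholder file names are hypothetical names introduced for illustration, and a temporary directory with three files stands in for the dataset. It illustrates two points raised in the comments: with no input lines a per-line mapper emits nothing, and with several input lines it recounts the directory once per line.

```python
import os
import tempfile
from collections import defaultdict


def map_entries(directory):
    """Emit ("total_documents", 1) for every entry in the directory."""
    for _name in os.listdir(directory):
        yield "total_documents", 1


def reduce_counts(pairs):
    """Group pairs by key and sum the values, like a combiner/reducer."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_once(directory):
    """Run the mapper a single time over the directory."""
    return reduce_counts(map_entries(directory))


def count_per_line(directory, input_lines):
    """Run the mapper once for every input line, as a per-line mapper would."""
    pairs = []
    for _line in input_lines:
        pairs.extend(map_entries(directory))
    return reduce_counts(pairs)


def demo():
    # Build a throwaway directory with three placeholder files.
    with tempfile.TemporaryDirectory() as directory:
        for name in ("a.warc.gz", "b.warc.gz", "c.warc.gz"):
            with open(os.path.join(directory, name), "w") as handle:
                handle.write("placeholder")
        once = count_once(directory)
        no_input = count_per_line(directory, [])
        two_lines = count_per_line(directory, ["line 1", "line 2"])
    return once, no_input, two_lines


if __name__ == "__main__":
    once, no_input, two_lines = demo()
    print(once)       # one pass over the directory
    print(no_input)   # no input lines: the mapper never runs
    print(two_lines)  # two input lines: every entry is counted twice
```

In this sketch, the single pass gives the expected total, the empty input gives an empty result, and the two-line input doubles the count, which is why the listing should happen once per job rather than once per line.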

Nachiket Deo

Is this a solution to your problem, or is it supposed to be part of the question? It seems to me like the latter. – shuberman Jul 13 '20 at 06:45
This is a solution to the problem. Given the correct directory path, this code returns the number of documents in that directory. – Nachiket Deo Jul 13 '20 at 17:55
