Multiple file reading loop and distinguishing between .pdf and .doc files

Question

I am writing a Java program in Eclipse that scans resumes for keywords, shows the keywords found in each resume, and picks the most suitable resume among them. The resumes can be in doc or pdf format.

I've successfully implemented code that reads pdf files and doc files separately (using Apache's PDFBox and POI jars and importing the required classes), displays the keywords, and reports resume strength as the number of keywords found.

Now there are two issues I am stuck on:

(1) I need to distinguish between a pdf file and a doc file within the program. An if statement would do it, but I am not sure how to write the code that detects whether a file has a .pdf or .doc extension. (I intend to build an application for selecting the resumes, and the program then has to decide whether to run the doc reading block or the pdf reading block.)

(2) I intend to run the program on a list of resumes, so I need a loop that runs the keyword scan for each resume. I can't think of a way to do this: even if the files were named 'resume1', 'resume2', etc., I can't put the loop variable into the file location, as in 'C:/Resumes_Folder/Resume[i]', because that is the path.

Any help would be appreciated!

duffymo · Accepted Answer · 2019-09-13T13:49:38.217

You can use a FileFilter to pick out only one file type or the other, then respond accordingly. Passing it to File.listFiles gives you an array containing only files of the desired type.
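
A minimal sketch of the FileFilter idea, assuming the resumes sit directly in one folder. The class name ResumeFileLister and its members are made up for illustration; the folder path is the one from the question.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.Locale;

// Hypothetical helper, not part of any library.
class ResumeFileLister {
    // Accepts regular files whose names end in .pdf or .doc, ignoring case.
    static final FileFilter RESUME_FILTER = file -> {
        String name = file.getName().toLowerCase(Locale.ROOT);
        return file.isFile() && (name.endsWith(".pdf") || name.endsWith(".doc"));
    };

    // Returns the matching files, or an empty array if the folder cannot be listed.
    static File[] listResumes(File folder) {
        File[] matches = folder.listFiles(RESUME_FILTER);
        return matches != null ? matches : new File[0];
    }

    public static void main(String[] args) {
        // Folder name taken from the question; adjust as needed.
        for (File resume : listResumes(new File("C:/Resumes_Folder"))) {
            System.out.println(resume.getName());
        }
    }
}
```

Because the loop runs over the files that actually exist in the folder, the file names never have to follow a 'resume1', 'resume2' pattern.
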
The second requirement is confusing to me. I think you would be well served by creating a class that encapsulates the data and behavior that you want for a parsed Resume. Write a factory class that takes in an InputStream and produces a Resume with the data you need inside.

You are making a classic mistake: You are embedding all the logic in a main method. This will make it harder to test your code.

All problem solving consists of breaking big problems into smaller ones, solving the small problems, and assembling them to finally solve the big problem.

I would recommend that you decompose this problem into smaller classes. For example, don't worry about looping over a directory's worth of files until you can read and parse an individual PDF and DOC file.

Create an interface:

public interface ResumeParser {
    Resume parse(InputStream is) throws IOException;
}

Write separate implementations for PDF and Word documents.
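
As a rough sketch of what sits behind that interface: the Resume class below and its matchedKeywords method are my own guess at the shape, and PlainTextResumeParser is only a stand-in that reads the stream as UTF-8 text. A real PdfResumeParser or WordResumeParser would do the text extraction with PDFBox or POI at that point instead.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Same interface as above, repeated so the sketch compiles on its own.
interface ResumeParser {
    Resume parse(InputStream is) throws IOException;
}

// Hypothetical value class: holds the extracted text and answers keyword queries.
class Resume {
    private final String text;

    Resume(String text) {
        this.text = text.toLowerCase(Locale.ROOT);
    }

    // Returns the keywords that occur in the text, in the order they were given.
    // This is a plain substring match, so "java" also matches "javascript".
    Set<String> matchedKeywords(List<String> keywords) {
        Set<String> found = new LinkedHashSet<>();
        for (String keyword : keywords) {
            if (text.contains(keyword.toLowerCase(Locale.ROOT))) {
                found.add(keyword);
            }
        }
        return found;
    }
}

// Stand-in implementation that treats the stream as UTF-8 text (needs Java 9+ for readAllBytes).
class PlainTextResumeParser implements ResumeParser {
    @Override
    public Resume parse(InputStream is) throws IOException {
        return new Resume(new String(is.readAllBytes(), StandardCharsets.UTF_8));
    }
}
```

The size of the returned set would then serve as the "resume strength" mentioned in the question.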

Create a factory to give you the appropriate ResumeParser based on file type:

public class ResumeParserFactory {
    public ResumeParser create(String fileType) {
        if (fileType.contains(".pdf")) {
            return new PdfResumeParser();
        } else if (fileType.contains(".doc")) {
            return new WordResumeParser();
        } else {
            throw new IllegalArgumentException("Unknown document type: " + fileType);
        }
    }
}
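
Once single files parse correctly, the loop over a folder could look roughly like this. It is a sketch under assumptions: ResumeBatch and parseAll are made-up names, the two parsers are stubs that only record which type they saw, and the factory here checks the end of the file name in lower case, which is a slightly stricter test than contains.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Minimal stand-ins so the sketch compiles alone; real versions would hold parsed data.
class Resume {
    final String source;
    Resume(String source) { this.source = source; }
}

interface ResumeParser {
    Resume parse(InputStream is) throws IOException;
}

class PdfResumeParser implements ResumeParser {
    public Resume parse(InputStream is) { return new Resume("pdf"); } // PDFBox extraction would go here
}

class WordResumeParser implements ResumeParser {
    public Resume parse(InputStream is) { return new Resume("doc"); } // POI extraction would go here
}

class ResumeParserFactory {
    ResumeParser create(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".pdf")) return new PdfResumeParser();
        if (name.endsWith(".doc")) return new WordResumeParser();
        throw new IllegalArgumentException("Unknown document type: " + fileName);
    }
}

class ResumeBatch {
    // Parses every .pdf and .doc file directly inside the folder; other files are skipped.
    static List<Resume> parseAll(Path folder) throws IOException {
        ResumeParserFactory factory = new ResumeParserFactory();
        List<Path> files;
        try (Stream<Path> entries = Files.list(folder)) {
            files = entries.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        List<Resume> resumes = new ArrayList<>();
        for (Path file : files) {
            String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
            if (!name.endsWith(".pdf") && !name.endsWith(".doc")) {
                continue; // not a format this sketch handles
            }
            try (InputStream in = Files.newInputStream(file)) {
                resumes.add(factory.create(name).parse(in));
            }
        }
        return resumes;
    }
}
```

Keeping the loop in its own small class means it can be tested against a temporary folder without touching the PDF or Word parsing code.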

Be sure to write unit tests as you go. You should know how to use JUnit.

score 1 · Answer 2 · answered Sep 13 '19 at 13:31

Another alternative to using a FileFilter is a DirectoryStream, because Files::newDirectoryStream lets you specify the relevant file endings with a glob pattern:

try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.{doc,pdf}")) {
    for (Path entry : stream) {
        // process files here
    }
} catch (DirectoryIteratorException ex) {
    // I/O error encountered during the iteration; the cause is an IOException
    throw ex.getCause();
}

answered Sep 13 '19 at 13:31

lema

A nice use of a newer JDK feature. – duffymo Sep 13 '19 at 13:33
Thanks, some of Java's stream features are really elegant. – lema Sep 13 '19 at 13:40
I don't think to investigate these enough. I tend to fall back on the older idioms that I know well. Good incentive to look beyond lambdas. – duffymo Sep 13 '19 at 13:48

score 0 · Answer 3 · answered Sep 13 '19 at 13:26

You can do something basic like:

import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Objects;

// Put the path to the folder containing all the resumes here
File f = new File("C:\\");
ArrayList<String> names = new ArrayList<>(Arrays.asList(Objects.requireNonNull(f.list())));

for (String fileName : names) {
    // Compare the lower-cased name, dot included, so "mydoc" is not mistaken for a .doc file
    String lower = fileName.toLowerCase();
    if (lower.endsWith(".doc")) {
        // doc file logic here
    } else if (lower.endsWith(".pdf")) {
        // pdf file logic here
    }
}

But as duffymo's answer says, you can also use a FileFilter (it's definitely a better option than my quick code).

Hope it helps.
