I've been struggling with this for a few days and was wondering whether someone could help me with it.

What I am trying to accomplish is to process a Word document (.doc or .docx) that contains a set of questions and answers. The contents of the file look like this:

Document Name
1. Question one:
a. Answer one to question one
b. Answer two to question one
c. Answer three to question one
2. Question two:
a. Answer one to question two
c. Answer two to question two
e. Answer three to question two

What I have tried so far is:

Reading the contents of the document via Apache POI like this:

fis = new FileInputStream(new File(FilePath));
XWPFDocument doc = new XWPFDocument(fis);
XWPFWordExtractor extract = new XWPFWordExtractor(doc);
String extractorText = extract.getText();

So far, I have the contents of the document as a string. Next, I tried to create a regex pattern that matches the number and the dot at the start of a question (1., 12.) and continues until it reaches the colon:

Pattern regexPattern = Pattern.compile("^(\\d|\\d\\d)+\\.[^:]+:\\s*$", Pattern.MULTILINE);
Matcher regexMatcher = regexPattern.matcher(extractorText);

However, when I loop through the matches, I cannot get any question text:

while (regexMatcher.find()) {
    System.out.println("Found");
    for (int i = 0; i < regexMatcher.groupCount() - 2; i += 2) {
        map.put(regexMatcher.group(i + 1), regexMatcher.group(i + 2));
        System.out.println("#" + regexMatcher.group(i + 1) + " >> " + regexMatcher.group(i + 2));
    }
}

I am not sure where I am going wrong, since I am new to Java, and I was hoping someone could help me out.

Also, if anyone has a better approach to building a map of the questions and their related answers, it would be very much appreciated.

Thank you in advance.

Edit: I am trying to obtain a Map whose keys are the question texts and whose values are lists of strings holding the answers to each question, something like:

Map<String, List<String>> desiredResult = new HashMap<>();
desiredResult.entrySet().forEach((entry) -> {
    String       questionText = entry.getKey();
    List<String> answersList  = entry.getValue();

    System.out.println("Now at question: " + questionText);

    answersList.forEach((answerText) -> {
        System.out.println("Now at answer: " + answerText);
    });
});

Which would generate the following output:

Now at question: 1. Question one:
Now at answer: a. Answer one to question one
Now at answer: b. Answer two to question one
Now at answer: c. Answer three to question one
FuTwo10

1 Answer

After some thinking I've come up with an answer. Splitting the document at each line break gives an array containing all of its lines.

While iterating over that array, we only need to decide whether a line is a question or an answer. I've done that with two different regexes:

For the questions:

\d{1,2}\..+

For the answers:

[a-z]\..+
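
As a quick sanity check, the two patterns can be tried against a few sample lines. This is only a minimal sketch: the class name LineClassifier and the classify helper are made up for illustration and are not part of the original code. Note that matches() has to match the whole line, so no ^ or $ anchors are needed, and that the answer pattern only accepts lowercase letters.

```java
import java.util.regex.Pattern;

class LineClassifier {
    // One or two digits, a dot, then at least one more character.
    static final Pattern QUESTION = Pattern.compile("\\d{1,2}\\..+");
    // A single lowercase letter, a dot, then at least one more character.
    static final Pattern ANSWER = Pattern.compile("[a-z]\\..+");

    // Hypothetical helper: labels a single line of the document.
    static String classify(String line) {
        if (QUESTION.matcher(line).matches()) {
            return "question";
        }
        if (ANSWER.matcher(line).matches()) {
            return "answer";
        }
        return "other";
    }

    public static void main(String[] args) {
        String[] samples = {
            "Document Name",
            "1. Question one:",
            "a. Answer one to question one"
        };
        for (String line : samples) {
            System.out.println(classify(line) + " <- " + line);
        }
    }
}
```

A line such as "A. Something" or "123. Something" falls into the "other" bucket with these patterns, so they may need widening for real documents.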

Based on which pattern matches, we can then decide whether a new question has begun or whether the line is an answer that needs to be added to the current question.

The code can be found below:

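import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// the document text as read from the file
String document = "Document Name\n" +
    "1. Question one:\n" +
    "a. Answer one to question one\n" +
    "b. Answer two to question one\n" +
    "c. Answer three to question one\n" +
    "2. Question two:\n" +
    "a. Answer one to question two\n" +
    "c. Answer two to question two\n" +
    "e. Answer three to question two";

// splitting by lines (handles both \n and \r\n)
String[] lines = document.split("\r?\n");

// the regex patterns; matches() must match the whole line
Pattern questionPattern = Pattern.compile("\\d{1,2}\\..+");
Pattern answerPattern = Pattern.compile("[a-z]\\..+");

// the most recently seen question; answers are attached to it
String lastQuestion = null;

// the result; LinkedHashMap keeps the questions in document order
Map<String, List<String>> result = new LinkedHashMap<>();

for (int lineNumber = 0; lineNumber < lines.length; lineNumber++) {
    String line = lines[lineNumber];

    if (questionPattern.matcher(line).matches()) {
        // a new question begins
        result.put(line, new ArrayList<>());
        lastQuestion = line;
    } else if (lastQuestion != null && answerPattern.matcher(line).matches()) {
        // an answer to the current question; the null check avoids a
        // NullPointerException if an answer appears before any question
        result.get(lastQuestion).add(line);
    } else {
        System.out.printf("Line %d is not a question or an answer to a known question!%n", lineNumber);
    }
}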
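
As a side note on the loop in the question: the original pattern contains only one pair of parentheses, so groupCount() returns 1, the loop bound groupCount() - 2 is -1, and the loop body never runs even when find() succeeds. The sketch below shows one possible way to capture the number and the text as two separate groups instead. The class name QuestionHeaderDemo, the parseHeaders helper and the adjusted pattern are assumptions made for illustration, not the asker's code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuestionHeaderDemo {
    // The pattern from the question: a single capturing group.
    static final Pattern ORIGINAL =
        Pattern.compile("^(\\d|\\d\\d)+\\.[^:]+:\\s*$", Pattern.MULTILINE);

    // Assumed alternative with two capturing groups:
    // group 1 = question number, group 2 = question text before the colon.
    static final Pattern HEADER =
        Pattern.compile("^(\\d{1,2})\\.\\s*([^:\\r\\n]+):\\s*$", Pattern.MULTILINE);

    // Hypothetical helper: maps question number to question text.
    static Map<String, String> parseHeaders(String text) {
        Map<String, String> headers = new LinkedHashMap<>();
        Matcher matcher = HEADER.matcher(text);
        while (matcher.find()) {
            headers.put(matcher.group(1), matcher.group(2));
        }
        return headers;
    }

    public static void main(String[] args) {
        String text = "Document Name\n"
            + "1. Question one:\n"
            + "a. Answer one to question one\n"
            + "2. Question two:\n";

        System.out.println("groups in original pattern: " + ORIGINAL.matcher(text).groupCount());
        parseHeaders(text).forEach((number, question) ->
            System.out.println("#" + number + " >> " + question));
    }
}
```

This only finds the question headers. Collecting the answers still needs the line-by-line pass shown above.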
Lino
  • Thank you very much for your answer! This seems to be what I am looking for. Edit: I made a few tests and it works perfectly except for one thing: if the question or the answer text is longer than one line, the extra lines are not taken into account. How can I make it ignore new lines until it finds the colon in question texts or answers? Thank you very much once again! – FuTwo10 Jul 18 '18 at 15:58
  • @FuTwo10 you'd have to check what the last line was, answer or question (boolean flag). And then if the line doesn't match any of the patterns, you'd have to append it, to the answer / question which was put in the map beforehand. This is rather inefficient, because you have lots of over- and rewrites when many multiline questions / answers are present – Lino Jul 19 '18 at 06:38
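
The multi-line handling described in the comment above could be sketched roughly as follows. This is an untested-against-real-documents sketch under assumptions: the class name MultiLineParser and the parse and flush helpers are hypothetical, continuation lines are joined with a single space, and text is collected in StringBuilders and only written to the map once a question is complete, which sidesteps re-keying the map entry. A continuation line that itself looks like "x. ..." (for example one starting with "i.e. ") would be mistaken for a new answer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class MultiLineParser {
    static final Pattern QUESTION = Pattern.compile("\\d{1,2}\\..+");
    static final Pattern ANSWER = Pattern.compile("[a-z]\\..+");

    static Map<String, List<String>> parse(String document) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        StringBuilder question = null;       // question currently being assembled
        List<StringBuilder> answers = null;  // answers collected for that question
        StringBuilder current = null;        // entry that continuation lines extend

        for (String rawLine : document.split("\\r?\\n")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (QUESTION.matcher(line).matches()) {
                flush(result, question, answers);   // store the previous question
                question = new StringBuilder(line);
                answers = new ArrayList<>();
                current = question;
            } else if (question != null && ANSWER.matcher(line).matches()) {
                current = new StringBuilder(line);
                answers.add(current);
            } else if (current != null) {
                // neither pattern matched: treat as a continuation line
                current.append(' ').append(line);
            }
            // anything before the first question (e.g. the title) is skipped
        }
        flush(result, question, answers);           // store the last question
        return result;
    }

    private static void flush(Map<String, List<String>> result,
                              StringBuilder question,
                              List<StringBuilder> answers) {
        if (question == null) {
            return;
        }
        List<String> texts = new ArrayList<>();
        for (StringBuilder answer : answers) {
            texts.add(answer.toString());
        }
        result.put(question.toString(), texts);
    }

    public static void main(String[] args) {
        String document = "Document Name\n"
            + "1. A question that is\n"
            + "split over two lines:\n"
            + "a. First answer, also\n"
            + "split over two lines\n"
            + "b. Second answer\n";
        parse(document).forEach((q, a) -> System.out.println(q + " -> " + a));
    }
}
```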