I'm currently developing a text parser in Java, and I was asked to extend it so that it can also parse HTML. The parser splits the input file into three output files: one with all the words contained in the file, one with all the sentences, and one with all the questions.
The *.txt part works perfectly, but I have a problem when parsing HTML.
I create a temporary file with a *.txt extension and pass it to my text parser, but if I give it a URL pointing to an HTML file structured like this:
<!DOCTYPE html>
<html>
<head>
... some HTML here ...
</head>
<body>
<ul class="some_menu">
<li class="some_menu_item">n1</li>
<li class="some_menu_item">n2</li>
<li class="some_menu_item">n3</li>
</ul>
<div>
This is a question ?
This is a sentence .
... some other text ...
</div>
</body>
</html>
the questions file ends up containing: n1 n2 n3 This is a question
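Nothing separates the menu items from the text that follows, so they are treated as part of the first question. Is there a way to walk through the document tag by tag with JSoup, so that I can add a line feed each time a block element is closed?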
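To illustrate what I mean, here is a rough sketch of the kind of traversal I have in mind. It assumes that jsoup's `NodeTraversor` and `NodeVisitor` can be used this way and that `Element.isBlock()` is a reasonable test for "a block is closed"; the class name `BlockText` and the method `toLines` are made up for this example and are not part of my parser.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

// Hypothetical helper: flattens HTML to text, one line per closed block element.
public class BlockText {

    public static String toLines(String html) {
        Document doc = Jsoup.parse(html);
        StringBuilder out = new StringBuilder();

        NodeTraversor.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                // Called when a node is entered: collect its text, if any.
                if (node instanceof TextNode) {
                    String text = ((TextNode) node).text().trim();
                    if (!text.isEmpty()) {
                        out.append(text).append(' ');
                    }
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Called when a node is left: end the line after a block element.
                if (node instanceof Element && ((Element) node).isBlock()) {
                    int len = out.length();
                    if (len > 0 && out.charAt(len - 1) == ' ') {
                        out.setLength(--len); // drop the trailing separator
                    }
                    if (len > 0 && out.charAt(len - 1) != '\n') {
                        out.append('\n'); // avoid blank lines for nested blocks
                    }
                }
            }
        }, doc.body());

        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<ul><li>n1</li><li>n2</li></ul><div>This is a question ?</div>";
        System.out.print(toLines(html));
    }
}
```

The idea would be to write the result of `toLines` into the temporary *.txt file, so that each menu item sits on its own line instead of being merged into the next sentence.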
If you need any more information, don't hesitate to ask!
Edit: for this example, I should get three output files:
One with all the words:
n1 n2 n3 This is a question sentence ... some other words ...
One with all the sentences:
This is a sentence
One with all the questions:
This is a question
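To make the expected split concrete, here is a simplified sketch of how the three outputs relate once every block of text is on its own line. This is not my actual parser: the class `LineClassifier` is hypothetical, and it assumes that a line ending with `?` is a question, a line ending with `.` is a sentence, any other line only contributes words, and each word is listed once.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified stand-in for the text parser.
public class LineClassifier {
    public final Set<String> words = new LinkedHashSet<>(); // unique, in order of appearance
    public final List<String> sentences = new ArrayList<>();
    public final List<String> questions = new ArrayList<>();

    public void accept(String text) {
        for (String line : text.split("\\R")) { // \R matches any line break
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            boolean isQuestion = trimmed.endsWith("?");
            boolean isSentence = trimmed.endsWith(".");
            // Strip the terminating punctuation mark before storing the line.
            String body = (isQuestion || isSentence)
                    ? trimmed.substring(0, trimmed.length() - 1).trim()
                    : trimmed;
            if (body.isEmpty()) {
                continue;
            }
            for (String word : body.split("\\s+")) {
                words.add(word);
            }
            if (isQuestion) {
                questions.add(body);
            } else if (isSentence) {
                sentences.add(body);
            }
        }
    }

    public static void main(String[] args) {
        LineClassifier c = new LineClassifier();
        c.accept("n1\nn2\nn3\nThis is a question ?\nThis is a sentence .");
        System.out.println(String.join(" ", c.words));
        System.out.println(c.sentences);
        System.out.println(c.questions);
    }
}
```

With the menu items on separate lines, `n1`, `n2` and `n3` only appear in the words output and no longer leak into the questions.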
TimmyM