JSoup - Parse HTML tag by tag

Question

I'm currently developing a text parser in Java, and I was asked to extend it to parse HTML. The parser splits the input file into three output files: one with all the words in the file, one with all the sentences, and one with all the questions.

The *.txt part works perfectly, but I have a problem when parsing HTML.

I create a temporary file with a *.txt extension and pass it to my text parser. But if I pass a URL that points to an HTML file structured like this:

<!DOCTYPE html>
<html>
    <head>
        ... some HTML here ...
    </head>
    <body>
        <ul class="some_menu">
            <li class="some_menu_item">n1</li>
            <li class="some_menu_item">n2</li>
            <li class="some_menu_item">n3</li>
        </ul>
        <div>
            This is a question ?
            This is a sentence .
            ... some other text ...
        </div>
    </body>
</html>

the question file will be filled with: n1 n2 n3 This is a question

So I was wondering: is there a way to parse tag by tag with JSoup, so that I can add a line feed each time a block is closed?

If you need more information, don't hesitate to ask!

Edit: I should have 3 output files, which are, for this example:

One with all the words

n1
n2
n3
This
is
a
question
sentence
... some other words ...

One with all the sentences
```
This is a sentence
```
One with all the questions
```
This is a question
```

TimmyM

Yes you can iterate through the tags one by one and get the text separately. However, I don't really understand what you are trying to do here. Can you give an example of what you want out of this HTML? — mbbce, Jan 28 '16 at 10:07

score 0 · Accepted Answer · edited May 23 '17 at 11:52

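To get all the text in an HTML body, you can use:

// Fetch and parse the page, then read the text of the first <body> match.
Document doc = Jsoup.connect(url).get();
Elements body = doc.select("body");
String allText = body.first().text(); // Elements is a List, so [0] does not compile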

You can then split the text to get each word separately. To get the text in the div tag, you can use:

// first() returns the first matching <div>, or null if there is none.
Elements div = doc.select("div");
String divText = div.first().text();

You can then split the divText to get each sentence.
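
The splitting step can be sketched with the standard library alone. The question's own parser is not shown, so everything here is an assumption: the class name `TextSplitter`, the rule that a sentence ends with `.` or `!` and a question with `?`, and the choice to keep each distinct word once, in order of first appearance. Lines without a terminator (menu items, for instance) only contribute words.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup or of the question's parser.
public class TextSplitter {

    // Group 1: text on a single line; group 2: the terminator that follows it.
    private static final Pattern SENTENCE = Pattern.compile("([^.!?\\n]+)([.!?])");

    // Chunks that end with '?'.
    public static List<String> questions(String text) {
        return collect(text, true);
    }

    // Chunks that end with '.' or '!'.
    public static List<String> sentences(String text) {
        return collect(text, false);
    }

    private static List<String> collect(String text, boolean wantQuestions) {
        List<String> result = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            String body = m.group(1).trim();
            boolean isQuestion = m.group(2).equals("?");
            if (!body.isEmpty() && isQuestion == wantQuestions) {
                result.add(body);
            }
        }
        return result;
    }

    // Distinct words, in order of first appearance.
    public static Set<String> words(String text) {
        Set<String> result = new LinkedHashSet<>();
        // Split on anything that is neither a letter nor a digit.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }
}
```

Because the sentence pattern never crosses a line break, text that sits on its own line without a terminator stays out of the sentence and question lists.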

Note that select returns Elements, which is a list of Element objects, because more than one element can match your select query. Elements is a list rather than an array, so the first match is read with first() (or get(0)), not with [0]. In this case each query matches only one element, so taking the first match is enough.

Edit: To iterate through all elements, check this answer. Basically:

Elements elements = doc.body().select("*");

for (Element element : elements) {
    System.out.println(element.text());
}

Some elements have no text, so you may want to add a check for that.
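
Note that element.text() also includes the text of all descendants, so looping over select("*") prints the same text once per ancestor. To get a line feed each time a block is closed, one option is to walk the node tree with a NodeVisitor instead. The following is a minimal sketch under some assumptions: the class name `BlockTextExtractor` is made up for illustration, "block" means whatever Jsoup's `Element.isBlock()` reports, and a line break is also added when a block opens so that text before a nested block is not glued to it.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

// Hypothetical helper: one output line per block-level chunk of text.
public class BlockTextExtractor {

    public static String extract(String html) {
        Document doc = Jsoup.parse(html);
        StringBuilder out = new StringBuilder();

        doc.body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof Element && ((Element) node).isBlock()) {
                    // A block is opening: finish the current line, if any.
                    breakLine(out);
                } else if (node instanceof TextNode) {
                    // text() returns the node's text with whitespace normalized.
                    String text = ((TextNode) node).text().trim();
                    if (!text.isEmpty()) {
                        if (out.length() > 0 && out.charAt(out.length() - 1) != '\n') {
                            out.append(' ');
                        }
                        out.append(text);
                    }
                }
            }

            @Override
            public void tail(Node node, int depth) {
                if (node instanceof Element && ((Element) node).isBlock()) {
                    // A block is closing: this is where the line feed goes.
                    breakLine(out);
                }
            }
        });
        return out.toString();
    }

    // Appends '\n' unless the buffer is empty or already ends with one.
    private static void breakLine(StringBuilder out) {
        if (out.length() > 0 && out.charAt(out.length() - 1) != '\n') {
            out.append('\n');
        }
    }
}
```

For the HTML in the question, this should put n1, n2 and n3 on separate lines and keep the div text on a line of its own, which a text parser can then process line by line.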

answered Jan 28 '16 at 10:36

mbbce

That's actually what I'm doing right now, but the real goal is to add a `\n` to my generated file each time I hit a closing tag. That would let my parser separate menu text from the actual text, for example. So I was wondering whether there is a generic way to iterate over every tag on the page – TimmyMdfck Jan 28 '16 at 10:44
Check my edit that points to another answer that might help you in this case. – mbbce Jan 28 '16 at 10:48
Thanks a lot! Gonna check that! Cheers – TimmyMdfck Jan 28 '16 at 12:50

score -2 · Answer 2 · answered Jan 28 '16 at 10:54

There are quite a lot of HTML parsers available, such as:

HTMLUnit
HTMLCleaner
Jericho
JSoup

https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers

Thanks, Vineet

Vineet kaushik

The question was not about available parsers. It's about how to do it in Jsoup. This answer is completely unrelated. – mbbce Jan 28 '16 at 11:14
unrelated answer – Himanshu Punetha Jul 18 '19 at 12:11
