I would recommend that you first convert your PDF to normalized HTML by using this setting:
"conversion_target": "normalized_html"
and inspect the generated HTML. Look for the places where headings (<h1>, <h2>, ..., <h6>) are detected. Those are the tags that will be used to split the document into answer units when you switch back to answer_units.
You are currently seeing each chapter come out as a single answer unit most likely because each chapter starts with a heading, but no headings are detected within the chapters.
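To make the inspection step less manual, a small script can list every heading found in the normalized HTML. This is only a sketch using Python's standard html.parser module; the sample string stands in for the HTML returned by the service, and the names HeadingLister and list_headings are made up for illustration.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingLister(HTMLParser):
    """Collects (tag, text) pairs for every <h1>..<h6> element."""

    def __init__(self):
        super().__init__()
        self.headings = []    # finished (tag, text) pairs
        self._current = None  # heading tag currently being read, if any
        self._text = []       # text fragments of the current heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._current = tag
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self.headings.append((tag, "".join(self._text).strip()))
            self._current = None


def list_headings(html):
    """Return the headings in document order as (tag, text) pairs."""
    parser = HeadingLister()
    parser.feed(html)
    parser.close()
    return parser.headings


# Stand-in for the converted document (assumption: not real service output).
sample = (
    "<html><body><h1>Chapter 1</h1><p>Intro</p>"
    "<p><b>1.1 Setup</b></p><p>Text</p>"
    "<h1>Chapter 2</h1></body></html>"
)
for tag, text in list_headings(sample):
    print(tag, text)
```

In this sample the sub-section title "1.1 Setup" is only bold text inside a paragraph, so it does not show up in the list. That is the situation described above: only the chapter titles are headings, so only they would start an answer unit.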
To generate more answer units, you will need to tweak the PDF input configuration as described here, so that the PDF-to-HTML conversion step detects more headings and therefore produces more answer units.
For example, the following configuration will detect headings at six different levels, based on certain font characteristics for each level:
{
  "conversion_target": "normalized_html",
  "pdf": {
    "heading": {
      "fonts": [
        {"level": 1, "min_size": 24},
        {"level": 2, "min_size": 18, "max_size": 23, "bold": true},
        {"level": 3, "min_size": 14, "max_size": 17, "italic": false},
        {"level": 4, "min_size": 12, "max_size": 13, "name": "Times New Roman"},
        {"level": 5, "min_size": 10, "max_size": 12, "bold": true},
        {"level": 6, "min_size": 9, "max_size": 10, "bold": true}
      ]
    }
  }
}
You can start with a configuration like this and keep tweaking it until the normalized HTML contains headings at the places where you expect the answer units to begin. Then take the tweaked configuration, switch to answer_units, and put it all together:
{
  "conversion_target": "answer_units",
  "answer_units": {
    "selector_tags": ["h1", "h2", "h3", "h4", "h5", "h6"]
  },
  "pdf": {
    "heading": {
      "fonts": [
        {"level": 1, "min_size": 24},
        {"level": 2, "min_size": 18, "max_size": 23, "bold": true},
        {"level": 3, "min_size": 14, "max_size": 17, "italic": false},
        {"level": 4, "min_size": 12, "max_size": 13, "name": "Times New Roman"},
        {"level": 5, "min_size": 10, "max_size": 12, "bold": true},
        {"level": 6, "min_size": 9, "max_size": 10, "bold": true}
      ]
    }
  }
}
Regarding your second question about tables: unfortunately, there is no way to convert table content into separate answer units. As explained above, answer unit generation is based on heading detection. That said, if a table sits between two detected headings, it will be part of the answer unit that starts at the first of those headings, just like any other content between them.
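To illustrate that last point, here is a rough sketch of heading-based splitting. It is not the service's actual implementation, just a simplified model of the behavior described above; the function name split_by_headings and the sample HTML are invented for the example.

```python
import re

# Matches a whole heading element, <h1>...</h1> through <h6>...</h6>.
HEADING = re.compile(r"<h[1-6]\b[^>]*>.*?</h[1-6]>", re.IGNORECASE | re.DOTALL)


def split_by_headings(html):
    """Return (heading_html, body_html) pairs, one per detected heading.

    Each body runs from the end of its heading to the start of the next
    heading, or to the end of the document for the last heading.
    """
    matches = list(HEADING.finditer(html))
    units = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(html)
        units.append((m.group(0), html[m.end():end].strip()))
    return units


# Invented sample: a table sitting between two headings.
sample = (
    "<h1>Pricing</h1><p>Rates:</p>"
    "<table><tr><td>Basic</td><td>5</td></tr></table>"
    "<h1>Support</h1><p>Email us.</p>"
)
units = split_by_headings(sample)
for heading, body in units:
    print(heading, "->", body)
```

The table ends up inside the "Pricing" unit together with the paragraph before it. It never becomes a unit of its own, because nothing in it is a heading.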