Transformation of tabular data into natural language for indexing for a search engine

Question

How to transform tabular data that has various columns / rows as shown below into a more readable (natural language) so that it can be indexed for the downstream tasks of a search engine. I am aware that we have TAPAS (TAPAS: Weakly Supervised Table Parsing via Pre-training), a variant of BERT (Google) that is specifically designed for tabular data QnA (Question answering). But, the problem is we have an existing search service hosted in cloud that is capable of reading natural language and answer text based on that. Therefore, while indexing whole data (text, tables), we are losing valuable information in tables as the inherent relationships between rows & columns is lost. Result is poor quality answers for the information inside the table or no answer at all.

Following is an example: Which transformation is better for the tabular data into a readable (natural language) format for the semantic search without losing context. Currently, we do have a working solution, but the context is lost as the relationship inherent within the elements of columns / rows is lost. Therefore, producing poor quality / no answers. If we could somehow, preserve this inherent relationship while feeding as a natural language to semantic search, it will improve the answer quality.

Please refer to the below table example.

Sample 1:

Question: How much of a feature 2 is allowed at PREMIUM_COMPANY for Name 4

Answer: Integer value

Sample 2:

Question: Is feature 2 allowed at PREMIUM_COMPANY for Name 7 / Name 8

Answer: Allowed in a list 1 / Not allowed at Name 8

While answering manually, we are able to preserve the relationship between two parameters within a column/row whereas it is lost when we convert these html tables into normal text for indexing. Our problem here is to address that. There is a considerable amount of tabular data that is valuable.

Possible idea, but tough to integrate in existing service is to create a separate data structure (index) for the tabular data and apply TAPAS on it to retrieve the answers. We still need to know how to flag tabular data to trigger it when there is a possible answer exist for a question.

Could you please answer if you have any expertise in this area.

Almost yes. TAPAS works better with pipelines built from hugging face. I also tried with alternatives TableQA (converts qs. to SQL queries) and found that trade-offs exists between the both. TAPAS is weakly supervised & has great accuracy if memory & computation issues aren't constraints. TableQA has lower accuracy on semantics, but tend to be faster over large corpus. Currently, there's no SOTA tech. that satisfies our problem, so we adopted knowledge graphs & managed to get decent results. Hope this helps. — Sai_Vyas, Sep 06 '22 at 23:31

Transformation of tabular data into natural language for indexing for a search engine

0 Answers0