I am trying to extract quantitative data from companies' balance sheets and income statements. They are Excel (.xls) files.
Unfortunately, the content structures vary from one company to another.
For example, to extract the revenue value:
In company A, the value is in the cell next to the "Revenue" label.
In company B, it is in the cell next to the "Income from goods and sales" label, and it is stated in thousands of dollars.
In company C, it is even worse. You need to find the row that contains the wording "Revenue from sales of goods and", move down to the next row, which contains the wording "rendering of services", and take the value next to that label. Then you add it to the values in the rows that contain the wording "Scrap sales", "Gain on exchange rates" and "Gain on equipment disposal".
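To make the three cases above concrete, here is a minimal sketch in Java of what I imagine a hand-written, per-layout rule could look like. It only illustrates the idea and is not a solution to the scaling problem. It assumes the rows have already been read from the .xls file (for example with Apache POI, not shown here) into label/value pairs. The names `RevenueRules`, `Row`, `Rule`, `mergeWrappedLabels` and `revenue` are hypothetical, and the numbers in `main` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class RevenueRules {

    /** One sheet row reduced to its label and the number next to it (null if there is none). */
    static final class Row {
        final String label;
        final Double value;
        Row(String label, Double value) { this.label = label; this.value = value; }
    }

    /** Hypothetical rule: label fragments whose values are summed, then scaled by a unit multiplier. */
    static final class Rule {
        final double multiplier;
        final List<String> fragments;
        Rule(double multiplier, String... fragments) {
            this.multiplier = multiplier;
            this.fragments = Arrays.asList(fragments);
        }
    }

    /** Lower-cases and collapses whitespace so small formatting differences do not break matching. */
    static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    /** Joins a label-only row onto the next row that has a value (labels wrapped over two rows). */
    static List<Row> mergeWrappedLabels(List<Row> rows) {
        List<Row> merged = new ArrayList<>();
        String pending = "";
        for (Row r : rows) {
            if (r.value == null) {
                pending = (pending + " " + r.label).trim();
            } else {
                merged.add(new Row((pending + " " + r.label).trim(), r.value));
                pending = "";
            }
        }
        return merged;
    }

    /** Sums the first matching row for each fragment and applies the unit multiplier. */
    static double revenue(List<Row> rows, Rule rule) {
        List<Row> merged = mergeWrappedLabels(rows);
        double sum = 0;
        for (String fragment : rule.fragments) {
            String wanted = normalize(fragment);
            boolean found = false;
            for (Row r : merged) {
                if (normalize(r.label).contains(wanted)) {
                    sum += r.value;
                    found = true;
                    break;
                }
            }
            if (!found) {
                throw new IllegalStateException("No row matches: " + fragment);
            }
        }
        return sum * rule.multiplier;
    }

    public static void main(String[] args) {
        // Company B layout: a single row, values stated in thousands (made-up number).
        List<Row> companyB = Arrays.asList(new Row("Income from goods and sales", 1250.0));
        Rule ruleB = new Rule(1000.0, "Income from goods and sales");
        System.out.println(revenue(companyB, ruleB)); // prints 1250000.0

        // Company C layout: wrapped label plus three extra rows to add (made-up numbers).
        List<Row> companyC = Arrays.asList(
                new Row("Revenue from sales of goods and", null),
                new Row("rendering of services", 900.0),
                new Row("Scrap sales", 10.0),
                new Row("Gain on exchange rates", 5.0),
                new Row("Gain on equipment disposal", 2.5));
        Rule ruleC = new Rule(1.0,
                "Revenue from sales of goods and rendering of services",
                "Scrap sales", "Gain on exchange rates", "Gain on equipment disposal");
        System.out.println(revenue(companyC, ruleC)); // prints 917.5
    }
}
```

The catch is that every new layout needs its own `Rule`, and a section heading without a value would also be merged into the next label, so this approach gets harder to maintain as the number of layouts grows.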
With more than 500 companies and more than 20 years of past data to extract (and the layout can vary from year to year, too), this becomes a real problem. I don't know how to handle each case individually because the data is so unstructured.
So my question is: is there any library or API (preferably in Java) that can extract this kind of fuzzy information? I don't want to reinvent the wheel if someone has already done this. Is there a ready-to-use machine learning API for this kind of task? Also, these companies are not listed in the US or on other well-known stock exchanges, so no data provider covers them.
Thank you in advance for your replies.