-1

I am trying to extract financial figures from companies' balance sheets and income statements. The files are Excel (.xls) files.

Unfortunately, the content structure varies from one company to another.

For example, to extract the revenue value:

In company A, it is located next to the "Revenue" column.

In company B, it is located next to the "Income from goods and sales" column, and it is reported in thousands of dollars.

Company C is even worse. You need to find the row containing the text "Revenue from sales of goods and", move down to the next row, which contains "rendering of services", take the value next to it, and add the values from the rows labeled "Scrap sales", "Gain on exchange rates", and "Gain on equipment disposal".
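To make the company C rule concrete, here is a rough sketch of it in plain Java. It assumes the sheet has already been read into a `String[][]` where column 0 holds the label and column 1 holds the value text (reading the .xls file itself would need a library such as Apache POI, which is not shown). The class name `CompanyCRevenueRule` and the two-column layout are assumptions made only for illustration.

```java
import java.util.List;
import java.util.Locale;
import java.util.OptionalDouble;

// Hypothetical helper: encodes the company C rule described above.
final class CompanyCRevenueRule {
    // Extra rows whose values are added to the main revenue figure.
    private static final List<String> EXTRA_LABELS = List.of(
            "scrap sales", "gain on exchange rates", "gain on equipment disposal");

    /** rows[i][0] is the label, rows[i][1] is the value text (assumed layout). */
    static OptionalDouble extract(String[][] rows) {
        double total = 0.0;
        boolean foundMain = false;
        for (int i = 0; i < rows.length; i++) {
            String label = normalize(rows[i][0]);
            // The main label is split over two rows; the value sits on the second one.
            if (!foundMain
                    && label.contains("revenue from sales of goods and")
                    && i + 1 < rows.length
                    && normalize(rows[i + 1][0]).contains("rendering of services")) {
                total += parse(rows[i + 1][1]);
                foundMain = true;
            }
            for (String extra : EXTRA_LABELS) {
                if (label.contains(extra)) {
                    total += parse(rows[i][1]);
                    break;
                }
            }
        }
        // Without the main revenue row, the layout was not recognized.
        return foundMain ? OptionalDouble.of(total) : OptionalDouble.empty();
    }

    // Lower-case, trim, and collapse whitespace so small formatting differences do not matter.
    private static String normalize(String s) {
        return s == null ? "" : s.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    // Parse numbers that use a comma as thousands separator, e.g. "1,000.50".
    private static double parse(String s) {
        return Double.parseDouble(s.replace(",", "").trim());
    }
}
```

Writing one such rule is manageable; the problem is that every company (and sometimes every year) seems to need its own.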

With more than 500 companies and more than 20 years of past data to extract (the layout can vary from year to year, too), this becomes a real problem. I don't know how to handle each case, because the data is so unstructured.

So my question is: is there any library or API (preferably in Java) that can extract this kind of fuzzy information? I don't want to reinvent the wheel if someone has already done this. Is there a ready-to-use machine learning API for this kind of task? Also, these companies are not listed in the US or on other well-known stock exchanges, so no data provider covers them.

Thank you for your reply.

user1560335

2 Answers

0

The bad news: I'm pretty sure there is no such library or API, because what you want is too complicated and (at least for now) cannot be done automatically, especially in cases like C: there are too many domain-specific semantics that are very hard to encode.

The good news: I suppose the 80/20 rule holds in your case. Most tables have a clear structure like A or B, and you can write simple scripts to extract values from them, while the rest have to be done manually. I'd advise developing such scripts incrementally: start with case A, then run the program on all tables. From the tables that fail, pick the simplest cases and adapt the code to them, and so on. I believe this is the fastest way, although not a very exciting one.
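The incremental approach above could be sketched roughly like this in Java. This is only an illustration under stated assumptions: each sheet is already loaded as a `String[][]` of label and value cells, and the names `IncrementalExtractor`, `labelRule`, and `needsManualReview` are made up for this example. Rules are tried in the order they were added, and any sheet that no rule recognizes is queued for manual handling (or for writing the next rule).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.OptionalDouble;
import java.util.function.Function;

// Hypothetical driver: tries simple rules in order, collects what is left over.
final class IncrementalExtractor {
    private final List<Function<String[][], OptionalDouble>> rules = new ArrayList<>();
    final Map<String, Double> extracted = new LinkedHashMap<>();
    final List<String> needsManualReview = new ArrayList<>();

    void addRule(Function<String[][], OptionalDouble> rule) {
        rules.add(rule);
    }

    void process(String sheetName, String[][] rows) {
        for (Function<String[][], OptionalDouble> rule : rules) {
            OptionalDouble value = rule.apply(rows);
            if (value.isPresent()) {
                extracted.put(sheetName, value.getAsDouble());
                return; // first matching rule wins
            }
        }
        needsManualReview.add(sheetName); // no rule recognized this layout
    }

    /** Rule for layouts like A or B: exact label in column 0, value in column 1. */
    static Function<String[][], OptionalDouble> labelRule(String label, double multiplier) {
        String wanted = label.toLowerCase(Locale.ROOT);
        return rows -> {
            for (String[] row : rows) {
                if (row.length >= 2 && row[0] != null && row[1] != null
                        && row[0].trim().toLowerCase(Locale.ROOT).equals(wanted)) {
                    try {
                        double raw = Double.parseDouble(row[1].replace(",", "").trim());
                        return OptionalDouble.of(raw * multiplier);
                    } catch (NumberFormatException e) {
                        return OptionalDouble.empty(); // label found, value unreadable
                    }
                }
            }
            return OptionalDouble.empty();
        };
    }
}
```

A first pass might register `labelRule("Revenue", 1)` for case A and `labelRule("Income from goods and sales", 1000)` for case B (which reports thousands of dollars), run everything, and then look at `needsManualReview` to decide which layout to handle next.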

A somewhat more interesting approach to semi-automatic extraction of the needed information from tables is described in this paper (sorry for the self-citation). Unfortunately, there is no working library or API, but the idea is rather simple and, I guess, can be coded easily.

Nikita Astrakhantsev
0

Companies often provide this information in a computer-readable format based on XML called XBRL. This format allows you to programmatically extract the semantic information you're talking about. Being related to XML, the spec is naturally pretty dense, but the information is there.

As a random example, ExxonMobil freely publishes their data on their investors site.
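As a rough illustration of what "programmatically extract" can look like, here is a minimal Java sketch that uses only the JDK's built-in DOM parser. It assumes the filing is an XBRL instance document in which each fact is an XML element carrying a `contextRef` attribute, and it looks facts up by local element name regardless of namespace. The class name `XbrlFacts` is made up for this example, and the concept name to search for (for example `Revenues`) depends on the taxonomy the filer uses, so treat it as an assumption. A real filing would also need its contexts and units resolved, which this sketch skips.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper: lists "contextRef=value" for every fact with the given local name.
final class XbrlFacts {
    static List<String> factsByConcept(String xml, String conceptLocalName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for namespace-based lookup
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // "*" matches any namespace, so the taxonomy prefix does not matter here.
            NodeList nodes = doc.getElementsByTagNameNS("*", conceptLocalName);
            List<String> facts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element fact = (Element) nodes.item(i);
                facts.add(fact.getAttribute("contextRef") + "=" + fact.getTextContent().trim());
            }
            return facts;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XBRL document", e);
        }
    }
}
```

Calling `XbrlFacts.factsByConcept(xml, "Revenues")` would then return one entry per reporting context in document order, which is much less fragile than matching label text in a spreadsheet.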

George Hilliard