Machine learning for extracting text from a bunch of files

Question

I have a large number of specification files and need to extract a specific kind of information (a block of text) from each of them. A regular expression is not a practical solution because the files are quite irregular: it could be done, but writing and maintaining the pattern would take a great deal of effort, which I would like to avoid. My first thought was to use information extraction (IE) from the machine learning field, since I have plenty of examples that could be used to train a model. My main language is C#, so I checked ML.NET, but the library does not appear to offer this functionality. Are there any libraries that would let me achieve this? Or does anyone have an idea for automating the task without writing a complex regular expression?

Machine learning or regex could work, but if the integrity of the data is important, you would be better off doing it manually, as some bits may be lost or modified. I would recommend using several regular expressions to strip the excess that you know for sure won't touch your useful data; it would then be easier to clean up the last bits manually. ML can and probably will miss some information, and the same goes for regex capture. – nelsontruran, May 28 '19 at 19:05
Well, I'm not worried about information loss, since the result will be verified by a user in the end. This service is being written to help end users enter the data into the system, so ML will be totally fine, and it will gradually provide more sample data. The problem is that I don't know of any framework for .NET that supports IE. – Jaume, May 28 '19 at 20:36
If you are allowed to make a service call (SOAP or REST), you should not constrain yourself to .NET libraries only. For example, you could write a RESTful service that uses the Stanford NLP library, written in Java, and consume its response. Continuing the example, you could use conditional random fields in Stanford NLP to label the text to be extracted. For this you need a labelled dataset in the first place. – berkin, May 29 '19 at 13:27
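
To make the "labelled dataset" idea concrete, here is a minimal sketch in Python of how the problem could be framed as line-by-line sequence labelling. It is an illustration under assumptions, not an implementation from the thread: the B/I/O tagging scheme, the feature names, the helper functions `line_features` and `extract_blocks`, and the sample document are all hypothetical. The feature dictionaries are the kind of input a CRF toolkit typically accepts, and the labels would come from manual annotation (for training) or from the trained model (for prediction). No model is trained here.

```python
import re

def line_features(lines, i):
    """Describe line i with simple layout cues (hypothetical feature set)."""
    line = lines[i]
    stripped = line.strip()
    return {
        "is_blank": stripped == "",
        # Matches section numbers such as "1. " or "2.3 " at the start of a line.
        "starts_with_number": bool(re.match(r"\d+(\.\d+)*\.?\s", stripped)),
        "is_upper": stripped.isupper(),
        "ends_with_colon": stripped.endswith(":"),
        "indent": len(line) - len(line.lstrip()),
        "prev_blank": i == 0 or lines[i - 1].strip() == "",
        "first_word": stripped.split()[0].lower() if stripped else "",
    }

def extract_blocks(lines, labels):
    """Join runs of lines tagged B (begin) then I (inside); O means outside."""
    blocks, current = [], []
    for line, label in zip(lines, labels):
        if label == "B":
            if current:
                blocks.append("\n".join(current))
            current = [line]
        elif label == "I" and current:
            current.append(line)
        else:
            # An O label (or a stray I with no open block) closes the block.
            if current:
                blocks.append("\n".join(current))
            current = []
    if current:
        blocks.append("\n".join(current))
    return blocks

# Made-up sample document with hand-assigned labels, for illustration only.
doc = [
    "1. Scope",
    "General remarks.",
    "",
    "REQUIREMENTS:",
    "  Voltage: 5 V",
    "  Current: 2 A",
    "",
    "2. Notes",
]
labels = ["O", "O", "O", "B", "I", "I", "O", "O"]

features = [line_features(doc, i) for i in range(len(doc))]
blocks = extract_blocks(doc, labels)
```

With labels like these collected from the existing examples, the feature/label pairs could be handed to whichever CRF implementation the service ends up using, and `extract_blocks` would turn the predicted tags back into the block of text shown to the user for verification.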


0 Answers