I am looking at taking unstructured data in the form of files, processing it, and storing it in a database for retrieval. The data will be in natural language, and the queries to get information will also be in natural language. For example, the data could be "Roses are red" and the query could be "What is the color of a rose?"

I have looked at several NLP systems, focusing on open-source information extraction and relation extraction systems, and the following seems apt and easy for a quick start: https://www.npmjs.com/package/mitie

This can give data in the form of (word, type) pairs. It also gives a relation as a result of running the processing (check the example on the site).

I want to know whether SQL is a good database for saving this information. To retrieve the information, I will also need to convert the natural language query into some kind of (word, meaning) pairs, and to use SQL I will have to write a layer that converts natural language into SQL queries.

Please suggest any open-source databases that work well in this situation. If not MITIE, I'm open to suggestions for databases that work with other open-source information extraction and relation extraction systems.

GOVINDA MAHAJAN
Swati Pardeshi
  • Sphinx would be usable as a processing/matching engine. You could store the data in a raw MySQL database if you wanted, but Sphinx can process and parse the raw files directly, so you don't have to store them in SQL. – Dave May 24 '17 at 08:09
  • Thanks @Dave . Could you share a link for the same? – Swati Pardeshi May 24 '17 at 08:15
  • http://sphinxsearch.com/ has all the docs on how to use it, plus download links. – Dave May 24 '17 at 16:00

1 Answer

SQL won't be an appropriate choice for your problem. You can use NLP or rules to extract relationships and then store those relationships in a triple store or a graph database. There are many good open-source graph databases, such as Neo4j and Titan (since continued as JanusGraph). You can also search for triple stores; I suppose Apache Jena should be a good choice. After storing your data you can query your graphs using a graph query language such as Gremlin or Cypher (or SPARQL for triple stores), much as you would use SQL against a relational database. Note that the heart of your system would be a knowledge graph.
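To make the triple idea concrete, here is a minimal sketch in plain Python. It is not Jena, Neo4j, or any real database API: the `TripleStore` class, the `has_color` predicate, and the extracted triples are all hypothetical, and the NLP step that would produce them is assumed to exist elsewhere. It only illustrates how a fact like "Roses are red" can be stored as (subject, predicate, object) and how a question becomes a pattern with one unknown.

```python
class TripleStore:
    """Tiny in-memory triple store; a stand-in for a real triple store or graph DB."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        # Each fact is one (subject, predicate, object) tuple.
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return all triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


store = TripleStore()
# Hypothetical output of an extraction step run on "Roses are red" etc.
store.add("rose", "has_color", "red")
store.add("violet", "has_color", "blue")

# "What is the color of a rose?" becomes the pattern (rose, has_color, ?).
answers = [o for _, _, o in store.match(subject="rose", predicate="has_color")]
print(answers)  # ['red']
```

In a real system the pattern match would be a SPARQL, Cypher, or Gremlin query, and the hard part remains mapping the natural language question onto the right subject and predicate.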

You may also set up a Lucene/Solr-based search system on your unstructured data, which may help you answer your queries in conjunction with graph databases. All of these (NLP, IR, graph DBs/triple stores, etc.) would coexist to solve your problem.
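As a rough illustration of what the search side contributes, the sketch below builds a toy inverted index in plain Python. It does not use Lucene or Solr; the stopword list, the crude plural stripping, and the function names are all assumptions made for the example, and a real engine does far more (proper analysis, ranking, persistence).

```python
import re
from collections import defaultdict

# Assumed, deliberately tiny stopword list for this example only.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "what"}


def tokenize(text):
    """Lowercase, keep alphabetic words, drop stopwords, strip a trailing plural 's'."""
    words = re.findall(r"[a-z]+", text.lower())
    return [
        w[:-1] if w.endswith("s") and len(w) > 3 else w
        for w in words
        if w not in STOPWORDS
    ]


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Rank documents by how many query tokens they contain."""
    scores = defaultdict(int)
    for token in tokenize(query):
        for doc_id in index.get(token, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {"doc1": "Roses are red", "doc2": "Violets are blue"}
index = build_index(docs)
print(search(index, "What is the color of a rose?"))  # ['doc1']
```

Note that the index only finds the document mentioning roses; it does not know that "red" is the color. That gap is what the extracted relations in a graph or triple store are meant to fill.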

It would be like an ensemble. No silver bullets :) However, to start with, look at graph DBs or triple stores.

Yavar