
I am trying to build a deep learning model to extract specific pieces of text from long sentences.

Suppose I have a text of 200 words and a table containing my clients' first names and surnames. I want to build a model that extracts the specific client name/surname from those 200 words using deep learning.

I've read about CNN and RNTN models, semantic parsing and word2vec, but I am clearly not an expert in this field.

My thoughts are:

  • step 1: build a first model where input = client surname, output = class "surname"
  • step 2: build a second model where input = client first name, output = class "name"
  • step 3: build a third model where input = first name + surname (and surname + first name), output = class "client"
  • step 4: build a fourth model that takes a bag of words as input and finds a way to output the client class.

In the same way we can tag nouns/adverbs/verbs/..., we should be able to create a new kind of "semantic" category such as client, address, ....
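For illustration, here is a minimal sketch (in Python, with a made-up sentence and made-up tag names) of what such a "semantic" token labelling could look like: each word gets a class label such as B-CLIENT / I-CLIENT / O, just like a part-of-speech tag but with custom categories.

# Illustrative only: token-level labelling of one invented sentence, using
# custom "semantic" classes (CLIENT, ADDRESS) instead of part-of-speech tags.
tokens = ["Please", "send", "the", "invoice", "to", "John", "Smith",
          "at", "12", "Baker", "Street", "."]
labels = ["O", "O", "O", "O", "O", "B-CLIENT", "I-CLIENT",
          "O", "B-ADDRESS", "I-ADDRESS", "I-ADDRESS", "O"]

# A sequence model (CRF, RNN, ...) is trained to predict `labels` from
# `tokens`; extracting the client is then just reading off the CLIENT spans.
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")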

Can anyone give me some advice on this way of thinking, or tell me which parts I should change/improve?

Thanks a lot.

  • What you are looking for is a system that can detect the client name and surname in a sentence. Natural language processing is a massive field. What I would advise is to build a ground truth by labeling some of the data and train a model on it; your classes should be two or three at most. First of all, try to implement some methods to clean up your data and structure the text before applying any model such as an RNN or NN. – Feras Aug 25 '16 at 12:30
  • Thank you for your reply. After some research, I am now looking to implement a CRF to solve my problem, with appropriate BILOU NER tagging. I'll be using TensorFlow's sequence-to-sequence model. I'll let you know how things go. By the way, do you have any tools to suggest so I can clean the data and tag it for training? I've got a CSV list with company names in the first column and addresses in the second column. Thank you for your help. – lovefinearts Aug 29 '16 at 15:05
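As one possible approach to the labelling question in the comment above, here is a rough Python sketch of turning such a CSV list (company name in the first column, address in the second) into BILOU-tagged training examples by matching the known names against raw sentences; the file name, the whitespace tokenizer and the example sentence are all assumptions.

# Rough sketch: build BILOU-tagged training data from a gazetteer CSV.
# "clients.csv", the whitespace tokenizer and the sample sentence are
# placeholders; real data would need proper tokenisation and normalisation.
import csv

def bilou_tags(tokens, entity_tokens, label):
    """Tag `tokens` with BILOU labels wherever `entity_tokens` occurs."""
    tags = ["O"] * len(tokens)
    n = len(entity_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == entity_tokens:
            if n == 1:
                tags[i] = f"U-{label}"
            else:
                tags[i] = f"B-{label}"
                for j in range(1, n - 1):
                    tags[i + j] = f"I-{label}"
                tags[i + n - 1] = f"L-{label}"
    return tags

with open("clients.csv", newline="", encoding="utf-8") as f:
    gazetteer = [(row[0], row[1]) for row in csv.reader(f)]  # (company, address)

sentence = "Invoice 1234 was issued to Acme Corp at 12 Baker Street last month"
tokens = sentence.split()
for company, address in gazetteer:
    tags = bilou_tags(tokens, company.split(), "CLIENT")
    if any(tag != "O" for tag in tags):
        print(list(zip(tokens, tags)))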

1 Answer


You could use Named Entity Recognition (NER), but building such a model will be tough and very time-consuming. However, if you already know the client names and surnames, there is a much quicker way to identify them in sentences: a simple SQL query with a table-valued parameter that locates the client names and surnames. I use something like this in SQL Server 2012. In this example you can pass in any number of clients as a table-valued parameter to isolate the matching sentences. I had the same issue on a project I was working on, and this was the solution. There is always an alternative, and in this case it's something you could set up in minutes instead of weeks:

-- Returns the sentences whose tokens match the client names passed in
-- through the @client_list table-valued parameter.
ALTER PROCEDURE [dbo].[Get_Sentences_Token_Table_Value_Parameter]
    @id_file     int,
    @sentiment   nvarchar(50),
    @client_list [dbo].[client_list] READONLY
AS
SELECT TOP (1000)
    sentence_id, pos_remaining_token, sentiment AS Sentiment,
    sentence AS Sentence, id_file, pos_token
FROM chat_Facets
WHERE id_file = @id_file
  AND sentiment = @sentiment
  AND pos_remaining_token IN (SELECT pos_remaining_token FROM @client_list)
GROUP BY sentence_id, pos_remaining_token, sentiment, sentence, id_file, pos_token
ORDER BY pos_remaining_token, Sentence
Rob
  • This is exactly where I am heading: using seq2seq from TensorFlow to implement a NER tagger. I know about the SQL solution... we are talking about long sentences of hundreds of words, coming from OCR output, against a database of millions of people... we have strong data centers, so time/compute consumption is not a problem :) I am just trying to get a POT/POC working within the next few weeks. And the main problem with the SQL solution is that those calls to the SQL server would saturate it, when we could build a smart deep learning model that solves everything. :) – lovefinearts Aug 30 '16 at 14:20
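For reference, a token-tagging network of the kind discussed in these comments can be sketched in a few lines of Keras; this is a plain BiLSTM tagger rather than the seq2seq setup mentioned above, and the vocabulary size, tag count and sequence length are placeholders.

# Sketch of a BiLSTM token tagger (not the poster's seq2seq setup).
# All sizes are placeholders; real training needs tokenised, BILOU-tagged data.
import numpy as np
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # assumed vocabulary size
NUM_TAGS = 5         # e.g. B/I/L/U-CLIENT plus O
MAX_LEN = 200        # the ~200-word texts mentioned in the question

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 128),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Dense(NUM_TAGS, activation="softmax"),  # one tag distribution per token
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Dummy arrays with the expected shapes, just to show the input/output format:
X = np.random.randint(1, VOCAB_SIZE, size=(32, MAX_LEN))  # token ids
y = np.random.randint(0, NUM_TAGS, size=(32, MAX_LEN))    # one tag id per token
model.fit(X, y, epochs=1, verbose=0)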