
I want to match similar strings that share the same significant word.

Problem:

I have two files: a master file and an input file. I have to iterate through the input file and, for each record, find the similar record in the master file. Currently I have indexed the master file in ElasticSearch and query it for similar records, but because the master file contains many similar records, it returns many results, and picking the appropriate one from them is the problem.

Sample Input record:

1.  H1 Bulbs Included

Sample Output From ElasticSearch:

1.  Included H1 [Correct One]
2.  H7 Bulbs Included
3.  H8 Bulbs Provided
4.  H1 not Included [Should not match this]

I have tried using a POS tagger to extract the important terms, but it does not work well.

POS Tagger Output:

1.   H1/NNP Included/NNP
2.   H8/NNP Bulbs/NNP Provided/NNP

How to proceed with this?

Edit:

In the above example, H1 is the significant term.

Sample Input Record:

1. H1 Bulbs included

Sample Output from ElasticSearch:

1.   H2 Bulbs Included
2.   H3 Bulbs Included
3.   H1 [Correct One]

First I need to identify the significant word. There is currently no fixed pattern for the significant word.

For example (significant word in brackets):

1. H1 bulbs [H1]
2. 9600 added [9600]
3. It has H8 [H8]
4. 1/2 wire for 4500 bulb [4500]
The6thSense

1 Answer

I'm not familiar with ElasticSearch, but doing this in standard Python should be straightforward. From your criteria above it's not clear which of 'H1', 'Included' and 'Bulbs' are the really significant words, or what the processing criteria are, but as a simple case:

inputstr = 'H1 Bulbs Included'
keywords = ('H1','Bulbs','Included')
result = [x for x in keywords if x in inputstr]

>>> ['H1', 'Bulbs', 'Included']

Alternatively, if you want to do some maths on it, you could do

# one True/False entry per keyword, in keyword order
result = [x in inputstr for x in keywords]
>>> [True, True, True]

sum(result)
>>> 3

Then, if some words are critical, you can multiply their entries by a weight; if you need, say, 2 out of 3 matches, you can just check the sum, and so on.
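As a rough sketch of the weighting idea, assuming 'H1' is the critical word: the weights below and the `score` helper are made up for illustration and are not part of the original snippet. Keywords are matched as whole tokens, case-insensitively, so 'H1' does not accidentally match 'H10'.

```python
# Hypothetical weights: 'H1' is treated as critical, the others as supporting words.
WEIGHTS = {"H1": 10, "Bulbs": 1, "Included": 1}


def score(text, weights=WEIGHTS):
    """Sum the weights of the keywords that occur as whole tokens in text."""
    tokens = {t.lower() for t in text.split()}
    return sum(w for word, w in weights.items() if word.lower() in tokens)


# Candidate lines taken from the sample ElasticSearch output in the question.
candidates = ["H7 Bulbs Included", "H8 Bulbs Provided", "Included H1"]

# Highest score first; the record containing the critical word wins.
ranked = sorted(candidates, key=score, reverse=True)
best = ranked[0]
```

With these weights a candidate that contains the critical word always outranks one that only shares the supporting words; a threshold on the score plays the role of the "2 out of 3" check.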

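For filtering out 'not', you can check that 'not' is not one of the words in inputstr and combine that with the keyword check, i.e.

# True only if every keyword matched and the word 'not' is absent
result = 'not' not in inputstr.lower().split() and all(result)
>>> True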
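
Putting the pieces together for the candidates returned by ElasticSearch, here is a minimal sketch. It rests on two assumptions that the question does not confirm: that the significant term is the first purely alphanumeric token containing a digit (which happens to fit the four examples in the question, with '1/2' skipped because of the slash), and that a small list of negation words is enough to reject lines like 'H1 not Included'. The names `guess_significant`, `pick_matches` and `NEGATIONS` are hypothetical.

```python
import re

# Assumed negation words; extend as needed for the real data.
NEGATIONS = {"not", "no", "without"}


def tokenize(text):
    """Lower-cased tokens made of letters, digits and '/' (so '1/2' stays whole)."""
    return re.findall(r"[a-z0-9/]+", text.lower())


def guess_significant(text):
    """Heuristic: first purely alphanumeric token that contains a digit."""
    for tok in tokenize(text):
        if tok.isalnum() and any(ch.isdigit() for ch in tok):
            return tok
    return None


def pick_matches(input_record, candidates):
    """Keep candidates that contain the significant term and no negation word."""
    key = guess_significant(input_record)
    if key is None:
        return []
    kept = []
    for cand in candidates:
        toks = tokenize(cand)
        if key in toks and not NEGATIONS.intersection(toks):
            kept.append(cand)
    return kept


candidates = [
    "Included H1",
    "H7 Bulbs Included",
    "H8 Bulbs Provided",
    "H1 not Included",
]
matches = pick_matches("H1 Bulbs Included", candidates)
```

Because the comparison is on whole tokens, word order does not matter, so 'Included H1' still matches 'H1 Bulbs Included'. The digit heuristic is only a guess from the listed examples and would need checking against the real master file.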
Marcin
  • Sorry for confusing you. Please see the edit. ElasticSearch is a query engine which returns similar records for the given query term. Let me know if you need more information. – The6thSense Nov 29 '17 at 06:50
  • OK, so this really is an ElasticSearch question. You want ElasticSearch to do the filtering. I get it now. – Marcin Nov 29 '17 at 06:53
  • Actually, I get a subset of results from ElasticSearch, and from that subset I am trying to match the correct record using Python. For that I need to identify the significant word and compare it with the output from ElasticSearch. – The6thSense Nov 29 '17 at 06:55
  • If you break the output into individual line strings, would 'H1' in line not be a solution? Or, if you only wanted to check the start of the line, you could use a regex with re.search(r'^significant_word', line). – Marcin Nov 29 '17 at 07:06
  • 'H1' in line could be a solution, but first I need to identify the significant term H1. Since the terms don't occur in the same order, I cannot identify the significant term. – The6thSense Nov 29 '17 at 07:09
  • I think I'm not understanding it, so sorry if I'm wasting your time. The way I understand the requirement now is that you have a string input with some number of terms, one of which is significant, but it may not be the first one. Q: Does the significant term follow a specific pattern (i.e. does it always start with H), or is there some other input that identifies the significant term? – Marcin Nov 29 '17 at 07:29
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/160057/discussion-between-the6thsense-and-marcin). – The6thSense Nov 29 '17 at 07:30