I am trying to build a searchable dashboard for end users, with full-text search over a CSV dataset of research topics, using Elasticsearch with Python.
The search should return the row index of the relevant CSV rows. There are multiple columns, namely _id and topic.
If I query the dataset for "cyber security", most of the results contain the words "cyber security" or "cyber-security", but other rows are also returned that deal with food security and army security.
How can I avoid this for a general search term?
Moreover, the search term "cyber" or "cyber security" does not pick up some topics containing words like "cybersecurity" or "cybernetics".
How would I go about writing a condition that captures these?
Keep in mind that this needs to work the other way too, i.e. if I search for "food security", the cyber topics should not come up.
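    from elasticsearch import Elasticsearch
    from elasticsearch_dsl import Q, Search

    def test_search():
        client = Elasticsearch()
        # operator='or' matches a row if ANY of the query terms is present
        q = Q("multi_match", query='cyber security',
              fields=['topic'],
              operator='or')
        s = Search(using=client, index="csvfile").query(q)
        # .filter('term', name="food")
        # .exclude("match", description="beta")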
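One likely reason the food and army rows come back: with operator='or', a row matches if it contains any one of the analyzed query terms, so "security" alone is enough. The sketch below imitates that in plain Python on a handful of topics. It is not Elasticsearch itself: `tokenize`, `matches` and `search` are hypothetical helpers, and `tokenize` is only a rough stand-in for the default analyzer (lowercase, split on non-alphanumeric characters).

```python
import re

def tokenize(text):
    # Rough stand-in for the default analyzer: lowercase, split on non-alphanumerics.
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def matches(query, topic, operator="or", prefix=False):
    """Return True if `topic` matches `query` under simplified match semantics."""
    doc_tokens = tokenize(topic)

    def hit(term):
        if prefix:
            # Prefix mode: a query term matches any token that starts with it.
            return any(tok.startswith(term) for tok in doc_tokens)
        # Default mode: a query term must equal a whole token.
        return term in doc_tokens

    hits = [hit(term) for term in tokenize(query)]
    return any(hits) if operator == "or" else all(hits)

# A few illustrative topics keyed by row id.
topics = {
    3: "cyber security in army",
    6: "food security in the world",
    7: "cyberSecurity in world",
    9: "cybernetics in the world",
    11: "cyber-information",
}

def search(query, **kwargs):
    return [doc_id for doc_id, topic in topics.items() if matches(query, topic, **kwargs)]

or_ids = search("cyber security", operator="or")    # [3, 6, 11]
and_ids = search("cyber security", operator="and")  # [3]
prefix_ids = search("cyber", prefix=True)           # [3, 7, 9, 11]
```

Under these simplified rules, "or" lets row 6 in through "security", "and" drops it but also drops row 11, and neither finds "cyberSecurity" or "cybernetics" because each is a single token. Only the prefix variant picks those up.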
EDIT: Adding a sample requirement as requested in the comments.
The CSV file can be as given below.
_id,topic
1,food security development in dairy
2,securing hungry people by providing food
3,cyber security in army
4,bio informatics for security
5,cyber security in the world
6,food security in the world
7,cyberSecurity in world
8,army security in asia
9,cybernetics in the world
10,cyber security in the food industry.
11,cyber-information
12,cyber security
13,secure secure army man
14,crytography for security
15,random stuff
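For reference, rows like these could be turned into bulk-index actions roughly as follows. This is a sketch only: the index name csvfile is taken from the query earlier in the post, `csv_to_actions` is a made-up helper name, and only three of the rows are repeated. The dicts are in the shape accepted by `elasticsearch.helpers.bulk`, which is not called here so that the snippet runs without a cluster.

```python
import csv
import io

# Three of the sample rows, repeated here to keep the snippet self-contained.
SAMPLE_CSV = """_id,topic
1,food security development in dairy
3,cyber security in army
7,cyberSecurity in world
"""

def csv_to_actions(csv_text, index_name="csvfile"):
    """Yield one bulk-index action per CSV row, using the _id column as the document id."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield {
            "_index": index_name,
            "_id": row["_id"],
            "_source": {"topic": row["topic"]},
        }

actions = list(csv_to_actions(SAMPLE_CSV))
# With a running cluster, these could be passed to elasticsearch.helpers.bulk(client, actions).
```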
Acceptable
Search term is cyber -> 3,5,7,9,10,11,12
Search term is security -> everything except 11,14,15
Search term is cyber security or cybersecurity -> 3,5,7,9,10,11,12 (in this case cyber needs to have a higher priority; the user won't be interested in other security types)
Search term is food security -> 1,2
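The acceptable results above can be written down as a small check, so that any candidate query can be scored against them. This is a minimal sketch: `EXPECTED` and `score` are illustrative names, and the ids for "security" are derived from "everything except 11,14,15" over rows 1 to 15.

```python
ALL_IDS = set(range(1, 16))  # row ids 1..15 from the sample CSV

# Expected row ids per search term, copied from the "Acceptable" list.
EXPECTED = {
    "cyber": {3, 5, 7, 9, 10, 11, 12},
    "security": ALL_IDS - {11, 14, 15},
    "cyber security": {3, 5, 7, 9, 10, 11, 12},
    "cybersecurity": {3, 5, 7, 9, 10, 11, 12},
    "food security": {1, 2},
}

def score(query, returned_ids):
    """Return (precision, recall) of returned_ids against the expected ids for query."""
    expected = EXPECTED[query]
    returned = set(returned_ids)
    true_pos = len(expected & returned)
    precision = true_pos / len(returned) if returned else 0.0
    recall = true_pos / len(expected) if expected else 0.0
    return precision, recall

# Example: a result list that wrongly includes row 6 (food security) and misses row 7.
precision, recall = score("cyber security", [3, 5, 6, 9, 10, 11, 12])
```

Feeding the ids returned by a real query into `score` gives a quick way to compare query variants against the stated requirement.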
Perfect Case
Search term is cyber or cyber security or cybersecurity -> 3,4,5,7,9,10,11,12,14
Considering that cryptography and bio informatics are closely related to cyber security, should I be using document clustering (ML techniques) to achieve this?