Scanning texts for specific words

Question

I want to create an algorithm that searches job descriptions for given words (like Java, Angular, Docker, etc). My algorithm works, but it is rather naive. For example, it cannot detect the word Java if it is contained in another word (such as JavaEE). When I check for substrings, I have the problem that, for example, Java is recognized in the word JavaScript, which I want to avoid. I could of course make an explicit case distinction here, but I'm more looking for a general solution.

Are there any particular techniques or approaches that try to solve this problem?

Unfortunately, I don't have the amount of data necessary for data-driven approaches like machine learning.

You want to include "JavaEE" but exclude "JavaScript", and you believe that recognising all words that contain "Java" and then excluding words that contain "JavaScript" is not enough of a "general solution"? – Stef Mar 22 '22 at 10:09
That is just one example. Another example would be something like ReactJS. In other languages, such as German, this can happen very frequently (e.g. Angularentwicklung). So yes, this is not enough of a general solution for me. – EustassX Mar 22 '22 at 10:19
I don't understand your example. You searched for the substring "Java" and you accidentally found "ReactJS"? – Stef Mar 22 '22 at 10:29
You could write an interactive script that lists the words it found containing the substrings you were searching for; you then accept or reject each word manually, and the script updates its internal lists of known good and known bad matches. So when it encounters a word containing "Java", there are three possibilities: it's a known good match, such as JavaEE; a known bad match, such as JavaScript; or it is unknown and the user needs to be asked about it. – Stef Mar 22 '22 at 10:32
Of course, the example with ReactJS does not refer to the word Java, but to the word React. I'm just trying to give examples to show that the problem is more general. The idea of known good and bad matches is a good idea. However, with my question I wanted to find out about solutions that take a smart approach to tackling this problem in general: approaches where not every case distinction has to be implemented explicitly. A general solution would be better, since the list of words for which the job descriptions are searched changes frequently in my use case. – EustassX Mar 22 '22 at 11:53
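
The known good / known bad bookkeeping suggested in the comments could be sketched roughly as follows. This is only an illustration: the function names `find_candidates` and `triage`, the tokenizing regex, and the sample sentence are assumptions made for the example, and the interactive prompt is left out. Tokens that land in the unknown list are the ones a user would be asked about.

```python
import re

def find_candidates(text, keyword):
    """Return every token in `text` that contains `keyword` (case-insensitive)."""
    # Tokens may contain letters, digits, '+', '#', and inner dots (C++, C#, Node.js).
    tokens = re.findall(r"[\w+#]+(?:\.[\w+#]+)*", text)
    return [t for t in tokens if keyword.lower() in t.lower()]

def triage(text, keyword, known_good, known_bad):
    """Split candidate tokens into accepted, rejected, and unknown lists."""
    accepted, rejected, unknown = [], [], []
    for token in find_candidates(text, keyword):
        key = token.lower()
        if key == keyword.lower() or key in known_good:
            accepted.append(token)   # exact hit or previously approved variant
        elif key in known_bad:
            rejected.append(token)   # previously rejected variant
        else:
            unknown.append(token)    # needs a manual decision
    return accepted, rejected, unknown

# Hypothetical sample text and lists, for illustration only.
text = "We use Java, JavaEE and JavaScript; Javaentwicklung is a plus."
accepted, rejected, unknown = triage(
    text, "Java", known_good={"javaee"}, known_bad={"javascript"}
)
print(accepted, rejected, unknown)
```

Once a decision is made for an unknown token, adding its lowercase form to `known_good` or `known_bad` means it is not asked about again.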

score 1 · Answer 1 · answered Mar 22 '22 at 11:41

Train a simple word2vec model on all of your job description text. Then use your own logic to find the keywords. When you find a match that is not exact, check it against the model's list of similar words.

For example, if you search for Java but also find JavaScript, use your word vectors to check whether the two are similar (in other words, whether they have ever been used in a similar context). Java and JavaEE have probably been used in the same sentence before, but Java and JavaScript, or Angular and Angularentwicklung, probably have not.

It may seem a bit like over-engineering, but it's not :).
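
To make the "used in a similar context" check concrete, here is a minimal sketch that swaps word2vec for plain co-occurrence count vectors compared by cosine similarity. It is a standard-library stand-in, not a trained word2vec model, and the toy corpus is invented for the example. With real data, the count vectors would be replaced by the vectors of a trained model.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Map each token to a Counter of the tokens seen within `window` positions."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, token in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[token][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

# Hypothetical toy corpus, already lowercased and tokenized.
sentences = [
    "backend developer with java and spring".split(),
    "backend developer with javaee and spring".split(),
    "frontend engineer using javascript and react".split(),
]
vectors = cooccurrence_vectors(sentences)
sim_javaee = cosine(vectors["java"], vectors["javaee"])
sim_javascript = cosine(vectors["java"], vectors["javascript"])
print(sim_javaee, sim_javascript)
```

In this toy corpus java and javaee share all of their neighbours while java and javascript share only one, so the first similarity is higher. Whether that also holds on real job descriptions depends on the data.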

Kemal Can Kara

That really sounds like what I was looking for. Thank you. I'm just unsure if I have enough data. How many job descriptions would be appropriate? I have a maximum of 56, of which about 14 contain relevant/important words. – EustassX Mar 22 '22 at 12:05
56 won't be enough. But it's very easy to scrape job descriptions from online job boards. You can even find a script for that. – Kemal Can Kara Mar 22 '22 at 17:43
I don't think this will solve your problem at all: I have seen many job descriptions that require both Java and JavaScript, so your model will give a good similarity score between Java and JavaScript. My suggestion would be to go with a static list, because the number of programming languages is limited and you can extend the list whenever a new one comes along. With the ML approach you have to retrain the model again and again. – dheeraj Mar 23 '22 at 05:10

score 0 · Accepted Answer · answered Mar 28 '22 at 12:29

I spent some time researching my problem, and I found that identifying certain words, even if they don't match 1:1, is not a trivial problem. You could solve the problem by listing synonyms for the words you are looking for, or you could build a rule-based named entity recognition service. But both approaches are error-prone and maintenance-intensive.
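
For illustration, the synonym-list variant might look something like the sketch below. The `ALIASES` table and the `find_skills` helper are hypothetical and would have to be maintained by hand, which is exactly the maintenance cost mentioned above. The boundary checks in the regex are one possible choice, not the only one.

```python
import re

# Hypothetical alias table: canonical skill -> spellings that count as that skill.
ALIASES = {
    "Java": ["java", "javaee", "java ee", "j2ee"],
    "JavaScript": ["javascript", "js"],
    "React": ["react", "reactjs", "react.js"],
}

def find_skills(text, alias_table=ALIASES):
    """Return the sorted canonical skills whose aliases occur as whole tokens."""
    lookup = {alias: canon for canon, aliases in alias_table.items() for alias in aliases}
    # Try longer aliases first so "react.js" is preferred over "react".
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w.+#])(?:" + "|".join(re.escape(a) for a in ordered) + r")(?![\w+#])",
        re.IGNORECASE,
    )
    return sorted({lookup[m.group(0).lower()] for m in pattern.finditer(text)})

print(find_skills("Senior JavaScript developer, ReactJS and Node.js"))
print(find_skills("JavaEE backend"))
```

Because an alias only matches when it is not glued to other word characters, "Java" is not reported for "JavaScript". For the same reason a compound such as Angularentwicklung would be missed unless it is added to the table.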

Probably the best way to solve my problem is to build a named entity recognition service using machine learning. I am currently watching a video series that looks very promising for the given problem. --> https://www.youtube.com/playlist?list=PL2VXyKi-KpYs1bSnT8bfMFyGS-wMcjesM

I will comment on this answer when I am done with my work to give feedback to those who are facing the same problem.

NER with spaCy has solved half of my problem. It is a lot of work to create the training data, but with 186 texts and 8 different labels I was able to achieve an accuracy of 0.83. For some labels it was even over 90%. But to really get synonyms for the same programming language or tool, you have to implement entity linking in addition. Possible solutions for this are again NLP and knowledge bases. – EustassX Jul 21 '22 at 08:36
