3

Re-written:

I have a corpus of computer-science-related documents and I want to extract domain-specific keywords, for example Java, C#, HTML, OOP, UML, Unity, etc. I was looking for a source similar to the Oxford dictionary for computing, but its API is not up and running yet. I have also tried Webopedia for computer science terms, but it is not as inclusive or up to date (e.g. it doesn't include some words from my documents, such as F#), and Wikipedia does not list all the terms together in one place. Is there a more inclusive source, or a more appropriate approach, for extracting those keywords?

I am using Python with NLTK. tf-idf wasn't helpful because some domain-specific words are common to almost all documents, so those words don't get a high score. I think POS tagging could help, but I'm not sure which option would be best for my application. Take the string below as an example:

“Expert level capabilities in JavaScript, JSON, and AJAX, and a deep knowledge of JavaScript frameworks such as JQuery”. Here I want to extract these words: ['JavaScript', 'JSON', 'AJAX', 'Frameworks', 'JQuery'], but when I search for nouns using NLTK's POS tagging, I also get 'level', 'capability', 'knowledge', and so on. Thanks for your help.

Mina
  • 2
    recruitment database? – Mitch Wheat Jan 27 '14 at 01:02
  • 2
    "all the concepts and skills necessary" - How are D3, three.js, or F# "necessary"? – user2357112 Jan 27 '14 at 01:02
  • 1
    I'm not sure why this question is being downvoted. @user2357112 as you might know, knowledge of a language such as F#, or of APIs and libraries, is listed as a skill in job postings all the time, so I am not sure what confuses you about my question. – Mina Jan 27 '14 at 01:34
  • 1
    @Mina After the rewrite your question is much clearer, and I have voted for a re-open. If I recall correctly there must be four other votes before it actually is reopened. – some Jan 27 '14 at 04:53
  • 1
    Voted to reopen. Dunno how common it is for questions to reach the reopen threshold or get answers after a reopen, but the new version of the question is much clearer. – user2357112 Jan 27 '14 at 06:46
  • @user2357112 awesome! Hopefully it will. – Mina Jan 27 '14 at 18:06
  • @MitchWheat: I wrote a new version of my question to clarify the problem by explaining the specific programming question I have. I hope it clarifies the question now. – Mina Jan 27 '14 at 18:09
  • 1
    I understand it was re-written, but what exactly is your question? (I read it twice, and I can't see what you are asking). – jww Feb 01 '14 at 02:48
  • @noloader: I had an example. Take this string: “Expert level capabilities in JavaScript, JSON, and AJAX, and a deep knowledge of JavaScript frameworks such as JQuery”. Here I want to extract these words: ['JavaScript', 'JSON', 'AJAX', 'Frameworks', 'JQuery']. How can I do that? I have strings similar to this example, with keywords related to software engineering and programming. How can I extract those words? Isn't it clear from my question? I am really surprised that even with the example it is not clear! It is a programming question, so how is it off topic? – Mina Feb 11 '14 at 02:38

1 Answer

7

Why don't you download the StackOverflow data dumps and write a program to filter the tags?

They have just been released on archive.org.

Of course, it would not include all terms and there would be some false positives, but I assume this is about as close as you will get.
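As a rough illustration of the idea, here is a minimal Python sketch. It assumes the dump contains a `Tags.xml` file whose `<row>` elements carry `TagName` and `Count` attributes; that layout is an assumption about the dump, so check it against the file you actually download. The inline sample XML and its counts are made-up placeholders, and `load_tags`, `extract_keywords` and the `min_count` threshold are hypothetical names chosen for this example.

```python
import io
import re
import xml.etree.ElementTree as ET

# Made-up stand-in for the dump's Tags.xml. The row/attribute layout
# (TagName, Count) is an assumption, and the counts are placeholders.
SAMPLE_TAGS_XML = """<tags>
  <row Id="1" TagName="javascript" Count="500000" />
  <row Id="2" TagName="json" Count="100000" />
  <row Id="3" TagName="ajax" Count="90000" />
  <row Id="4" TagName="jquery" Count="400000" />
  <row Id="5" TagName="frameworks" Count="3000" />
  <row Id="6" TagName="f#" Count="5000" />
  <row Id="7" TagName="level" Count="12" />
</tags>
"""

# Tokens may contain '+', '#', '.' and '-' so that names such as
# "C#", "F#", "C++" or "three.js" survive tokenisation.
TOKEN_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9+#.\-]*")


def load_tags(source, min_count=100):
    """Return the lower-cased names of tags used at least min_count times.

    `source` is a filename or a binary file object containing the tag XML.
    Dropping rarely used tags is one way to cut down on false positives.
    """
    tags = set()
    for _event, elem in ET.iterparse(source):
        if elem.tag == "row":
            name = elem.get("TagName")
            if name and int(elem.get("Count", "0")) >= min_count:
                tags.add(name.lower())
        elem.clear()  # keep memory flat on a large file
    return tags


def extract_keywords(text, tags):
    """Return tokens of `text` that match a known tag, in order, without repeats."""
    found = []
    for token in TOKEN_RE.findall(text):
        token = token.rstrip(".-")  # drop sentence-final punctuation
        if token.lower() in tags and token not in found:
            found.append(token)
    return found


tags = load_tags(io.BytesIO(SAMPLE_TAGS_XML.encode("utf-8")))
sentence = ("Expert level capabilities in JavaScript, JSON, and AJAX, and a "
            "deep knowledge of JavaScript frameworks such as JQuery")
print(extract_keywords(sentence, tags))
```

With the real file you would pass its path to `load_tags` instead of the in-memory sample. Lowering `min_count` admits more terms but also more ordinary words that happen to be tags (here, "level"), which is the false-positive trade-off mentioned above.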

Uli Köhler