
I have 4 topics and 10 keywords representing each of those 4 topics. I now want to classify all the documents in my dataset into one of these 4 topics using the keywords extracted for each topic.

topic0 = ["gene","rna","expression","mouse","assay","activity","concentration","target","ace","lung"]

topic1 = ["age","pneumonia","hospital","risk","outcome","incidence","diagnosis","strain","lung","child"]

topic2 = ["intervention","wuhan","city","contact","people","scenario","peak","confirmed_case","quarantine","daily"]

topic3 = ["sequence","genome","host","structure","gene","specie","rna","read","strain","mutation"]

These are the keywords for each topic, and I have 1200 documents in my dataset. How do I classify them now?

Maybe some sort of similarity algorithm can be used for this. Please help, I'm confused!

  • https://datascience.stackexchange.com/ is more appropriate for design questions; SO is for programming questions. "Topic modelling" and "clustering" refer specifically to unsupervised methods, i.e. not starting from keywords. However, one option would be to start with topic modelling (typically with LDA); this would return 4 topics together with the words most associated with them. Then you could match your keywords against these topic words in order to map the most similar topics to your classes. Note that it's not certain this would work; unsupervised methods don't always separate the data the... – Erwan May 08 '22 at 11:26
  • ... way that one expects. There are other options, of course, for example simple string matching and/or measuring similarity directly between the documents and the lists of keywords. – Erwan May 08 '22 at 11:28
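
A minimal sketch of the keyword-matching option from the comments: score each document by how many of a topic's keywords it contains, then assign the topic with the highest score. The two documents in `docs` are hypothetical stand-ins for the real 1200-document dataset, and this simple overlap count ignores stemming (e.g. "children" would not match "child"), so results on real text may need a smarter similarity measure (TF-IDF cosine similarity, embeddings, etc.).

```python
import re

# The four keyword lists from the question, as sets for fast overlap counting
topics = {
    "topic0": {"gene", "rna", "expression", "mouse", "assay", "activity",
               "concentration", "target", "ace", "lung"},
    "topic1": {"age", "pneumonia", "hospital", "risk", "outcome", "incidence",
               "diagnosis", "strain", "lung", "child"},
    "topic2": {"intervention", "wuhan", "city", "contact", "people", "scenario",
               "peak", "confirmed_case", "quarantine", "daily"},
    "topic3": {"sequence", "genome", "host", "structure", "gene", "specie",
               "rna", "read", "strain", "mutation"},
}

def classify(document: str) -> str:
    """Return the topic whose keyword set overlaps the document the most."""
    tokens = set(re.findall(r"[a-z_]+", document.lower()))
    return max(topics, key=lambda t: len(topics[t] & tokens))

# Hypothetical documents; replace with the real dataset
docs = [
    "RNA expression assay in mouse lung tissue",
    "Pneumonia risk and outcome in the hospital by age",
]
print([classify(d) for d in docs])  # → ['topic0', 'topic1']
```

Ties (including documents that match no keywords at all) fall back to dictionary order here; in practice you may want a minimum-overlap threshold and an "unknown" label for such documents.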

0 Answers