Extracting content from documents

Question

I want to extract the content of resumes, which have sections such as skills, certifications, and work experience, using NLP, and tag each piece of content with its category. I can write basic rules that split the text on punctuation marks, but they may not work in every case. Would automatic segmentation help here? What is the proper approach to this problem?

SKILL SET 
    Machine learning, Deep learning, Python, Julia, NLP

CERTIFICATIONS   
Coursera: R Programming, The Data Scientist Toolbox  2015
Galvanize: Data science & big data analytics 2017

PROFESSIONAL TRAINING 
    MIT Professional education program in MACHINE LEARNING and text processing

PROFESSIONAL RECOGNITIONS        
   Microsoft Cheers Award, Microsoft Excellence award

PROFESSIONAL ROLES AND RESPONSINBILITIES   
    Building scalable system architecture for distributed applications
    Training junior developers in advance ML
    Prototyping and testing data driven products

joel · Answer 1 · 2018-01-10T12:30:20.003

0

I used a dictionary to look up the common headings that appear in resumes, and then segmented the text wherever one of those headings is found. This solution needs a dictionary of heading variants for each section that is generally present in a resume.
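
A minimal sketch of the dictionary approach, not the exact code used in this answer. The `SECTION_HEADINGS` dictionary, the category names, and the helper functions are all hypothetical and would need to be extended with the heading variants found in real resumes.

```python
import re

# Hypothetical heading dictionary: category -> heading variants seen in resumes.
SECTION_HEADINGS = {
    "skills": ["skill set", "skills", "technical skills"],
    "certifications": ["certifications", "certificates"],
    "training": ["professional training", "training"],
    "recognitions": ["professional recognitions", "awards"],
    "experience": ["professional roles and responsibilities", "work experience"],
}

# Reverse lookup: normalized heading text -> category.
HEADING_TO_CATEGORY = {
    variant: category
    for category, variants in SECTION_HEADINGS.items()
    for variant in variants
}


def normalize(line):
    """Lowercase, replace non-letters with spaces, and collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^a-z ]", " ", line.lower())).strip()


def segment_resume(text):
    """Group resume lines under the most recent recognized heading."""
    sections = {}
    current = None
    for line in text.splitlines():
        category = HEADING_TO_CATEGORY.get(normalize(line))
        if category is not None:
            # The whole line is a known heading: start a new section.
            current = category
            sections.setdefault(current, [])
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return sections


resume = """SKILL SET
    Machine learning, Deep learning, Python, Julia, NLP

CERTIFICATIONS
Coursera: R Programming, The Data Scientist Toolbox  2015
Galvanize: Data science & big data analytics 2017
"""

sections = segment_resume(resume)
```

Because the lookup is an exact match on the normalized line, a misspelled heading such as "RESPONSINBILITIES" would not be recognized unless that variant is added to the dictionary.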

edited Jan 10 '18 at 12:30

answered Jan 09 '18 at 10:07

score 0 · Answer 2 · answered Jan 09 '18 at 13:30

If your use case is to segment resumes by category, you can try an unsupervised clustering algorithm, because building dictionaries and rules takes more time to prepare.
I recommend the following steps:

Create a database of resumes: developer, devops, data scientist, full stack, etc.
Train a K-means model
Upload user resume and predict user cluster, distance from centroid, etc.
Display result
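
The steps above can be sketched roughly as follows. This is a dependency-free illustration, assuming a toy bag-of-words representation; the sample resumes, `vectorize`, `nearest`, and `kmeans` are hypothetical, and in practice a library implementation of K-means would normally replace the hand-written loop. Note that it clusters whole resumes, not the sections inside one resume.

```python
import math
from collections import Counter

# Hypothetical toy "database"; real resumes would be far longer.
RESUMES = [
    "python machine learning statistics deep learning",
    "docker kubernetes jenkins terraform monitoring",
    "machine learning nlp python statistics",
    "kubernetes docker ansible jenkins",
]

VOCAB = sorted({word for doc in RESUMES for word in doc.split()})


def vectorize(text):
    """Bag-of-words counts over the training vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(vector, centroids):
    """Return (cluster index, distance) of the closest centroid."""
    distances = [distance(vector, c) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances[best]


def kmeans(vectors, k, iterations=20):
    """Plain K-means with a fixed number of iterations."""
    # Deterministic init for the sketch: the first k vectors are the centroids.
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest(v, centroids)[0]].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


# Train, then predict the cluster and centroid distance for a new resume.
centroids = kmeans([vectorize(r) for r in RESUMES], k=2)
cluster, dist = nearest(vectorize("python nlp deep learning"), centroids)
```

Words in a new resume that are not in the training vocabulary are ignored by `vectorize`, which is one reason a larger resume database matters for this approach.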

Bhuvanesh

Hi @Bhuvanesh, the problem is about extracting content from a resume, not assigning the resume a category. – joel Jan 10 '18 at 12:28
@joel you could cluster the section titles (based on some vector representation) if the types of sections (and hence their number) are fixed. You could even train a classifier if you manage to gather some labeled data. – dada Jan 16 '18 at 16:50
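
One possible reading of the last comment, as a sketch only: represent each section title as a character-trigram vector and assign it the category of the most similar labeled title. This is a nearest-neighbour classifier rather than clustering, and the labeled headings, category names, and function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical labeled headings; a real set would be gathered from many resumes.
LABELED_HEADINGS = [
    ("skill set", "skills"),
    ("technical skills", "skills"),
    ("certifications", "certifications"),
    ("professional training", "training"),
    ("professional recognitions", "recognitions"),
    ("professional roles and responsibilities", "experience"),
    ("work experience", "experience"),
]


def trigram_vector(text):
    """Sparse character-trigram counts, padded so word edges count too."""
    padded = f"  {text.lower().strip()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_heading(heading):
    """Return (category, similarity) of the most similar labeled heading."""
    vector = trigram_vector(heading)
    scored = [
        (cosine(vector, trigram_vector(text)), label)
        for text, label in LABELED_HEADINGS
    ]
    similarity, label = max(scored)
    return label, similarity


# The misspelled heading from the sample resume in the question.
label, similarity = classify_heading("PROFESSIONAL ROLES AND RESPONSINBILITIES")
```

Trigram similarity tolerates small spelling differences in headings, at the cost of occasionally matching unrelated titles that share long substrings, so a similarity threshold would likely be needed before trusting a match.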
