Using BERT to generate technical skills from a set of activities just outputs the input data

Question

I'm trying to use jobspanbert to generate technical IT skills from a column of job activities, which is free text describing what each employee does in their job.

The model ran for 2 hours straight, so you can imagine how excited I was to see it work. To my surprise, the results were terrible: it basically returned the input unchanged.

For example, one row value is as follows:

"development tailormade reporting zend framework design development project datamarts realization custom count extraction management email sm paper campaign targeting daily monitoring database tool update database interface etc design validation database data processing process team development automatic data processing process final quality control database management optimization large volume mysql database implementation loyalty program managed realtime web service segmentation rfm rf product owner retailer platform writing technical documentation culture story feasibility study proposal appropriate solution requested development acceptance technical crossvalidation development carried customer support use tool web integration psdcss3html5 integration responsive design project management followup agile scrum methodology svn git process recruitment new developer integration supervision new recruit "

The generated skills are as follows:

"[CLS] development tailormade reporting zend framework design development project datamarts custom count extraction management email sm paper campaign targeting daily monitoring database update database interface design validation database data processing process team development automatic data processing process final quality control database management optimization large volume mysql database implementation program managed realtime web service segmentation rfm rf product owner retailer platform writing technical documentation culture story feasibility study proposal appropriate solution requested development acceptance technical crossvalidation development carried customer support use tool web integration psdcss3html5 integration responsive design project managementup agile scrum methodology svn git process recruitment new developer integration supervision new recruit [SEP] "

I really don't know what to do about this. As I said, I'm trying to generate technical IT skills only.

This is what I'm currently doing:

import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the pre-trained model and tokenizer
model_name = "jjzha/jobspanbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Set the maximum sequence length
max_length = 512

# Load your dataset
df = pd.read_csv("preprocessed_dataset.csv")

# Extract technical skills using the model
skills_list = []
for activity in df["processed_activities"]:
    try:
        # Truncate the input sequence to the maximum sequence length
        tokens = tokenizer.encode(activity, add_special_tokens=True, max_length=max_length, truncation=True)
        # Make a tensor of input_ids and token_type_ids
        inputs = torch.tensor([tokens])
        token_type_ids = torch.zeros_like(inputs)
        # Feed the inputs to the model for inference (no gradients are needed)
        with torch.no_grad():
            outputs = model(inputs, token_type_ids=token_type_ids)
        # Get the predicted label id for each token
        predicted_labels = torch.argmax(outputs.logits, dim=2)[0]
        # Keep only the tokens whose predicted label id is 1 (assumed here to be the "skill" label)
        skill_tokens = [tokenizer.convert_ids_to_tokens([token])[0] for idx, token in enumerate(tokens) if predicted_labels[idx] == 1]
        # Post-process the skill tokens to extract actual skills
        skills = " ".join(skill_tokens).replace(" ##", "")
        skills_list.append(skills)
    except Exception:
        # Handle any errors that occur during encoding or inference
        skills_list.append("")

# Add the extracted skills to your dataset
df["skills"] = skills_list

# Save the updated dataset
df.to_csv("updated_dataset.csv", index=False)
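One detail of the post-processing step can be illustrated separately. The generated output above contains "managementup" where the input had "management followup", which is consistent with the tokenizer splitting "followup" into "follow" and "##up", the filter keeping only "##up", and the `" ".join(...).replace(" ##", "")` step gluing that fragment onto the previous word. The sketch below is one possible way to rebuild whole words before filtering. It is a minimal illustration, not the author's method: the function name `merge_wordpieces`, the rule "keep a word only if its first piece is selected", and the example tokenization are all assumptions.

```python
from typing import List

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def merge_wordpieces(tokens: List[str], keep: List[bool]) -> List[str]:
    """Rebuild whole words from WordPiece tokens.

    A word is kept only if its FIRST piece is flagged in `keep`
    (an assumed rule, chosen for illustration). Special tokens are dropped.
    """
    words: List[str] = []
    keep_current = False  # whether the word being assembled is kept
    inside_word = False   # True once a word's first piece has been seen
    for token, flag in zip(tokens, keep):
        if token in SPECIAL_TOKENS:
            inside_word = False
            continue
        if token.startswith("##") and inside_word:
            # Continuation piece: attach it only to a word that is being kept
            if keep_current:
                words[-1] += token[2:]
        else:
            # First piece of a new word decides whether the word is kept
            keep_current = flag
            inside_word = True
            if keep_current:
                words.append(token[2:] if token.startswith("##") else token)
    return words


# Hypothetical tokenization and predictions, for illustration only
tokens = ["[CLS]", "management", "follow", "##up", "agile", "[SEP]"]
keep = [True, True, False, True, True, True]
print(merge_wordpieces(tokens, keep))  # ['management', 'agile']
```

With the same hypothetical input, joining the selected tokens and replacing " ##" would give "[CLS] managementup agile [SEP]", which matches the pattern seen in the generated output.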

Can you clarify what expected outputs should look like? The problem is that the JobBert model is not designed specifically to only extract skills; rather, it also annotates the "knowledge" associated with a particular skill, as per the paper. This may not be what you are looking for, and it seems you also do not have labeled data available, correct? — dennlinger, Apr 04 '23 at 13:42
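Related to the point about what the model actually annotates: the code in the question keeps tokens whose predicted label id is 1 without checking what that id means. When a checkpoint is loaded into a token-classification class and its configuration does not define label names, transformers falls back to placeholder names such as LABEL_0 and LABEL_1, which can be a hint that the classification head was not fine-tuned for the task. The helper below is a hedged sketch of that check; the name `has_default_label_names` is hypothetical, and whether this applies to the checkpoint in the question is an assumption that would need to be verified.

```python
import re
from typing import Dict


def has_default_label_names(id2label: Dict[int, str]) -> bool:
    """Return True when every label name is a placeholder of the form
    LABEL_<n> rather than a task-specific tag such as B, I or O."""
    return bool(id2label) and all(
        re.fullmatch(r"LABEL_\d+", name) for name in id2label.values()
    )


# Hypothetical mappings, for illustration only
print(has_default_label_names({0: "LABEL_0", 1: "LABEL_1"}))  # True
print(has_default_label_names({0: "B", 1: "I", 2: "O"}))      # False
```

In the question's code, the mapping to inspect would be model.config.id2label. If it only contains placeholder names, a comparison such as `predicted_labels[idx] == 1` has no defined meaning.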
