
I have one set of features for building labelling functions (set A) and another set of features for training a sklearn classifier (set B).

The generative model will output a set of probabilistic labels which I can use to train my classifier.

Do I need to add the features (set A) that I used for the labelling functions into my classifier features (set B), or should I just use the generated labels to train my classifier?
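To make the setup concrete, here is a rough sketch of my current pipeline (Snorkel 0.9-style API, so import paths may differ by version; the labeling function and `df_train` are just stand-ins for my set A rules and data):

from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, HAM, SPAM = -1, 0, 1

@labeling_function()
def lf_contains_link(x):
    # Uses a set A signal (presence of a URL) to vote SPAM
    return SPAM if "http" in x.text.lower() else ABSTAIN

# Apply the labeling functions to build the label matrix L_train
applier = PandasLFApplier(lfs=[lf_contains_link])
L_train = applier.apply(df=df_train)

# The generative model combines the LF votes into probabilistic labels
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, seed=123)
probs_train = label_model.predict_proba(L=L_train)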

I was referencing the Snorkel spam tutorial, and I did not see them use the features from the labelling functions to train the new classifier.

As seen in cell 47, featurization is done entirely with a CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(df_train_filtered.text.tolist())

X_dev = vectorizer.transform(df_dev.text.tolist())
X_valid = vectorizer.transform(df_valid.text.tolist())
X_test = vectorizer.transform(df_test.text.tolist())

And then they go straight to fitting a Keras model:

# Define a vanilla logistic regression model with Keras
keras_model = get_keras_logreg(input_dim=X_train.shape[1])

keras_model.fit(
    x=X_train,
    y=probs_train_filtered,
    validation_data=(X_valid, preds_to_probs(Y_valid, 2)),
    callbacks=[get_keras_early_stopping()],
    epochs=50,
    verbose=0,
)
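For my own setup with a sklearn classifier, I assume the equivalent step would be to collapse the probabilistic labels into hard labels first, something like this (untested sketch, reusing `X_train` and `probs_train_filtered` from above):

from sklearn.linear_model import LogisticRegression
from snorkel.utils import probs_to_preds

# Collapse the probabilistic labels to hard labels, since sklearn's
# fit() expects class labels rather than label distributions.
preds_train = probs_to_preds(probs=probs_train_filtered)

sklearn_model = LogisticRegression(C=1e3, solver="liblinear")
sklearn_model.fit(X=X_train, y=preds_train)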

2 Answers


I asked the same question on the Snorkel GitHub page, and this is the response:

you do not need to add in the features (set A) that you used for LFs into the classifier features. In order to prevent the end model from simply overfitting to the labeling functions, it is better if the features for the LFs and end model (set A and set B) are as different as possible

https://github.com/snorkel-team/snorkel-tutorials/issues/193#issuecomment-576450705
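To make that recommendation concrete, here is a rough sketch of keeping the two feature sets separate (the labeling functions and thresholds are hypothetical, not from the tutorial):

from snorkel.labeling import labeling_function
from sklearn.feature_extraction.text import CountVectorizer

ABSTAIN, HAM, SPAM = -1, 0, 1

# Set A: hand-crafted signals used only inside labeling functions.
@labeling_function()
def lf_many_exclamations(x):
    return SPAM if x.text.count("!") > 3 else ABSTAIN

@labeling_function()
def lf_short_message(x):
    return HAM if len(x.text.split()) < 5 else ABSTAIN

# Set B: the end model only sees the bag-of-words representation;
# the exclamation-count and length signals are not added as columns.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(df_train.text.tolist())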


In your linked Snorkel tutorial, the labeling functions (which map an input to a label: "HAM", "SPAM", or "ABSTAIN") are used to provide labels, not features.

IIUC, the idea is to generate labels when you do not have good-quality human labels. Although these auto-generated labels are quite noisy, they can serve as a starting point for a labeled dataset. The learning process takes this dataset and trains a model that encodes the knowledge embedded in the labeling functions; ideally the model generalizes beyond them and can be applied to unseen data.

If some of these labeling functions (which you can think of as fixed rules) are very stable (in terms of prediction accuracy) under certain conditions, then given enough training data your model should be able to learn that. In a production system, though, one easy way to guard against model instability is to override the machine prediction with human labels on data you have already seen. The same idea applies if you think the labeling functions are reliable for some specific input pattern: use them to produce labels directly and override the machine prediction. This can be implemented as a pre-check that runs before your machine-learned model, as in the sketch below.
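For example, such a pre-check could look roughly like this (all names here are placeholders, not Snorkel APIs):

ABSTAIN = -1

def predict_with_override(x, trusted_lfs, model, featurize):
    """Apply trusted labeling functions first; fall back to the model."""
    for lf in trusted_lfs:
        label = lf(x)
        if label != ABSTAIN:
            # A trusted rule fired on this input: use its label directly.
            return label
    # No rule fired: defer to the machine-learned model.
    return model.predict(featurize(x))[0]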

  • I updated my topic to reflect my question. I was referring to the features rather than the labelling functions. If I have one set of features to generate labels via labelling functions and another set to train a sklearn classifier, is it better to add the features used for the labelling functions into the dataset used to train the classifier? – jxn Jan 21 '20 at 04:26
  • Yes and no. Yes for the memorization part: if you include these as features, I bet you get much better training performance but equal or lower performance on unseen (evaluation) data. No for the generalization part: it could hurt your model (or maybe not by much). You need to experiment with your own dataset. I would suggest not including them as a start. – greeness Jan 21 '20 at 04:40
  • Thanks, I got a similar answer from the Snorkel support team as well. – jxn Jan 21 '20 at 07:00