
I have a pretrained FastText model; I have loaded it into my notebook and want to fit it to my free-form text to train an ML classifier.

import pandas as pd
from sklearn.model_selection import train_test_split
from gensim.models import FastText
import pickle
import numpy as np
from numpy.linalg import norm
from gensim.utils import tokenize

model_2 = FastText.load(model_path + 'itsm_fasttext_embeddings_100_dim.model')

def get_column_vector(model, list_corpus):
    # Build one vector per document: the average of its tokens'
    # length-normalized FastText word vectors.
    vectors = []
    for doc in list_corpus:
        svec = np.zeros(100)
        count = 0
        for word in tokenize(doc):
            vec = model.wv[word]
            norm_vec = norm(vec)
            if norm_vec > 0:
                svec += vec / norm_vec
                count += 1
        if count > 0:  # documents with no usable tokens are skipped
            vectors.append(svec / count)
    return vectors

list_corpus = df["freeformtext_col"].tolist()

# lst = list of averaged vectors, one per row of free-form text
lst = get_column_vector(model_2, list_corpus)

x_text_train, x_text_test, y_train, y_test = train_test_split(lst, y, test_size=0.2, random_state=42)
model_2.fit(x_text_train, y_train, validation_split=0.1, shuffle=True)

I get the following error:

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    Input In [59], in <cell line: 1>()
    ----> 1 model_2.fit(x_text_train, y_train, validation_split=0.1, shuffle=True)

    AttributeError: 'FastText' object has no attribute 'fit'

Other documentation showing the initial training of FastText includes a fit function, but I am having trouble finding documentation from others who have taken a pre-trained Gensim FastText model and fit it to their own text data to ultimately use a classifier.

1 Answer

The Gensim FastText implementation offers no .fit() method. (I also don't see any such method in Facebook's Python wrapper of its original C++ FastText implementation. Even in its supervised-classification mode, it has its own train_supervised() method rather than a scikit-learn-style fit() method.)

If you saw some online example using such a method, it must have been using a different FastText implementation - so you should consult the full details of that other example to see which library they were using.

I don't know of any good online examples showing how to 'fine-tune' a pretrained FastText model to a smaller set of new texts, much less any demonstrating benefits, gotchas, & rules-of-thumb for performing such an operation.

If you did see an online example suggesting such an approach, & demonstrating some benefits over other less-complicated approaches, then that source-of-inspiration would also be the model to follow - or to mention/link when trying to debug their approach. Without someone's full working examples as a guide/template, you're in improvised-innovation mode.

Note you don't have to start with someone else's pre-trained model. You can train your own FastText models on your own training texts, and for many domains & tasks this could work better than a generic model trained from public sources like Wikipedia texts or large web crawls.
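As a minimal sketch of that, assuming (as in your question) that df["freeformtext_col"] holds the raw strings, and with purely illustrative parameter values:

    from gensim.models import FastText
    from gensim.utils import tokenize

    # Tokenize each document into a list of words.
    corpus = [list(tokenize(doc, lowercase=True)) for doc in df["freeformtext_col"]]

    # Train a fresh unsupervised FastText model on your own texts.
    my_model = FastText(vector_size=100, window=5, min_count=2)
    my_model.build_vocab(corpus_iterable=corpus)
    my_model.train(corpus_iterable=corpus, total_examples=len(corpus), epochs=10)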

And when you do, you have the option of simply using FastText in its base unsupervised mode, as a way to featurize text, then passing those FastText-modeled features to some other explicit classifier option (such as the many classifiers in scikit-learn with .fit() methods).
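For example, a sketch of that hand-off, reusing the lst and y from your question (LogisticRegression here is just a stand-in for any scikit-learn classifier):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Stack the per-document averaged vectors into a 2-D feature matrix.
    X = np.vstack(lst)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # The .fit() you were looking for lives on the classifier, not on FastText.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))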

FastText's own -supervised mode builds a different kind of model that combines the word-training with the classification-training. A general FastText language model you find online is unlikely to be a specific -supervised mode model, unless it is explicitly declared to be one. If it's a standard unsupervised model, there's no straightforward way to adapt it into a -supervised model. And if it is already a -supervised model, it will have already been trained for someone else's fixed set of known-labels.
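If you do want that combined mode, it's available via the official fasttext package rather than Gensim. A minimal sketch, where train.txt is a hypothetical file in fastText's expected one-example-per-line format with __label__ prefixes:

    import fasttext

    # train.txt (hypothetical) contains lines like:
    #   __label__hardware my laptop will not boot
    sup_model = fasttext.train_supervised(input="train.txt", dim=100, epoch=10)
    print(sup_model.predict("my laptop will not boot"))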

gojomo