
I have a trained, customised fastText model (fastText is a word-embedding library developed by Facebook). I managed to get the expected result from a plain function, but now I want to rewrite it as a custom transformer so I can add it to my sklearn pipeline, which only accepts transformers.

The function takes a word (or multi-word name) and returns its vector:

import numpy as np

def name2vector(name):
    # If "name" is multiple words, sum the word vectors
    vec = [np.array(model.get_word_vector(w)) for w in name.lower().split(' ')]
    return np.sum(vec, axis=0)

returned value:

array([-0.01087821,  0.01030535, -0.01402427,  0.0310982 ,  0.08786983,
       -0.00404521, -0.03286128, -0.00842709,  0.03934859, -0.02717219,
        0.01151722, -0.03253938, -0.02435859,  0.03330994, -0.03696496], dtype=float32)

I want the transformer to do the same thing as the function. I know from reading the tutorial that I can use BaseEstimator and TransformerMixin to rewrite it as a transformer, but I'm still stuck. Any suggestions would be great, thanks.

– Osca

2 Answers


Assuming you're working with a pandas DataFrame, you could do something like this:

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression


class FastTextTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, model):
        self.model = model

    def fit(self, X, y=None):
        # The fastText model is assumed to be trained already,
        # so there is nothing to fit here
        return self

    def transform(self, X):
        # X is a pandas Series of strings; map each entry to its
        # vector and expand the vectors into columns of a DataFrame
        return pd.DataFrame(X.apply(self.name2vector).tolist())

    def name2vector(self, name):
        # If "name" is multiple words, sum the word vectors
        vec = [np.array(self.model.get_word_vector(w)) for w in name.lower().split(' ')]
        return np.sum(vec, axis=0)

(Note that there is no need to override get_params here: BaseEstimator derives it from the __init__ signature, and a custom override that returns keys other than the constructor arguments would break cloning.)

To demonstrate the usage, let's load a fastText model and a sample dataset of Amazon reviews:

import fasttext as ft

ft_model = ft.load_model('amazon_review_polarity.ftz')
amz_df = pd.read_html('https://huggingface.co/datasets/amazon_polarity/viewer/amazon_polarity/test')[0]
amz_df.rename(columns={'content (string)': 'content', 'label (class label)': 'label'}, inplace=True)
amz_df


And then use it as a bona fide scikit-learn Pipeline.

pipe = Pipeline([
    ('ft', FastTextTransformer(ft_model)),
    ('clf', LogisticRegression()),
])

And now we can fit and predict:

pipe.fit(amz_df['content'], amz_df.label)
pipe.predict(pd.Series(['great', 'very cool', 'very disappointed']))

Which returns

array(['positive', 'positive', 'negative'], dtype=object)

N.B. In case you want to compute an average of the word vectors in a sentence instead of a sum, you can replace name2vector with fastText's built-in method get_sentence_vector. For a supervised model, it returns the plain average; for unsupervised ones (CBOW and skip-gram), it first divides each word vector by its L2 norm and then averages.
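
For example, a minimal sketch of that variant, assuming the FastTextTransformer defined above (the subclass name is made up for illustration; get_sentence_vector is the real fastText method):

class SentenceFastTextTransformer(FastTextTransformer):
    def transform(self, X):
        # Let fastText combine the word vectors for us
        return pd.DataFrame(X.apply(self.model.get_sentence_vector).tolist())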

See the discussion here.

Credit: Stefano Fiorucci - anakin87

– dimid

The compress-fasttext library (a wrapper around Gensim that makes fastText models more lightweight) already provides such a transformer:

import compress_fasttext
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from compress_fasttext.feature_extraction import FastTextTransformer

small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load(
    'https://github.com/avidale/compress-fasttext/releases/download/v0.0.4/cc.en.300.compressed.bin'
)

classifier = make_pipeline(
    FastTextTransformer(model=small_model), 
    LogisticRegression()
).fit(
    ['banana', 'soup', 'burger', 'car', 'tree', 'city'],
    [1, 1, 1, 0, 0, 0]
)
classifier.predict(['jet', 'train', 'cake', 'apple'])
# array([0, 0, 1, 1])

Under the hood, it finds all "words" (alphanumeric sequences) in the text and averages their fastText embeddings.
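
In rough terms, the per-text logic looks something like this sketch (not the library's actual code, and text_to_vector is a made-up name; see the source link below for the real implementation):

import re
import numpy as np

def text_to_vector(model, text):
    # Extract alphanumeric tokens and average their embeddings
    words = re.findall(r'[A-Za-z0-9]+', text.lower())
    if not words:
        return np.zeros(model.vector_size)
    return np.mean([model[w] for w in words], axis=0)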

Here is the source code.

– David Dale