I wrote a custom transformer for text preprocessing. Inside its .transform() method, I want to drop a row's target label when the document yields no tokens.

import spacy
from sklearn.base import BaseEstimator, TransformerMixin

class SpacyVectorizer(BaseEstimator, TransformerMixin):
  def __init__(
      self, 
      alpha_only: bool = True,
      lemmatize: bool = True, 
      remove_stopwords: bool = True, 
      case_fold: bool = True,
    ):
    self.alpha_only = alpha_only
    self.lemmatize = lemmatize
    self.remove_stopwords = remove_stopwords
    self.case_fold = case_fold
    self.nlp = spacy.load(
      name='en_core_web_sm', 
      disable=["parser", "ner"]
    )
  
  def fit(self, X, y=None):
    return self
  
  def transform(self, X, y):
    # Bag-of-Words matrix
    bow_matrix = []
    
    # Iterate over documents in SpaCy pipeline 
    for i, doc in enumerate(self.nlp.pipe(X)):
      # Words array
      words = []

      # Tokenize document
      for token in doc:

        # Remove non-alphabetic tokens
        if self.alpha_only and not token.is_alpha:
          continue
        
        # Stopword removal
        if self.remove_stopwords and token.is_stop:
          continue
        
        # Lemmatization
        if self.lemmatize:
          token = token.lemma_
        
        # Case folding
        if self.case_fold:
          token = str(token).casefold()

        # Append token to words array
        words.append(token)
      
      # Update the BoW representation
      if words:
        # Preprocessed document
        new_doc = ' '.join(words)

        # Document vector of the preprocessed text
        word_vec = self.nlp(new_doc).vector

        # Update the BoW matrix
        bow_matrix.append(word_vec)

      else:
        # Remove target label
        y.drop(y.index[i], inplace=True)

    # Return BoW matrix  
    return bow_matrix

Unfortunately, this does not work, because I cannot pass the y vector to the .transform() method.

How can I force the pipeline to pass both X and y to .transform()? Is there another workaround? I don't want to pass y via .fit_transform(), because the transformer should not be fitted on test data.

Filip Szczybura
  • Does this answer your question? [Custom transformer for sklearn Pipeline that alters both X and y](https://stackoverflow.com/questions/25539311/custom-transformer-for-sklearn-pipeline-that-alters-both-x-and-y) – Ben Reiniger Mar 02 '22 at 15:35
  • I've seen and analyzed this post multiple times. Unfortunately it does not help – Filip Szczybura Mar 02 '22 at 22:47
  • As suggested in the linked question, you should just do this as a post-processing step outside of transform. Why isn't that OK? – polm23 Mar 03 '22 at 06:25
  • Because then I could just pass already preprocessed data to the Pipeline instead of using a Transformer. My point is to do all preprocessing to both X and y in transformers in a pipeline – Filip Szczybura Mar 03 '22 at 06:57
  • As stated in the accepted answer of the proposed duplicate, this is not currently possible in sklearn. As stated in other answers there, you may be able to accomplish it using imblearn, or by hacking your own version of the Pipeline. See also https://stackoverflow.com/q/62819600/10495893 – Ben Reiniger Mar 03 '22 at 15:08
  • But to get to your specific use-case: what behavior do you desire on test data: should rows be removed there too (which would seem to skew your scores) or not (which leaves a sort of default response for no-token inputs, for which you probably would prefer to have learned a correct intercept by not dropping such rows in the training set)? – Ben Reiniger Mar 03 '22 at 15:10
  • I think I now see the flaw in my thinking. Sklearn transformers operate on the independent variables, which is why transform shouldn't take y. Still, sklearn has this optional y parameter, which is why I thought it might be possible. – Filip Szczybura Mar 03 '22 at 22:43
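
One way to follow the suggestion in the comments above is to filter empty documents out of X and y together before the data reaches the pipeline, so that every transform keeps its row count. A minimal sketch, assuming plain Python lists; `has_tokens` and `drop_empty_rows` are hypothetical helpers, and `str.isalpha` is only a stand-in for the spaCy token filter in the question:

```python
from typing import Callable, List, Sequence, Tuple


def has_tokens(text: str) -> bool:
    """Stand-in for the spaCy filter: True if any whitespace-separated token is alphabetic."""
    return any(token.isalpha() for token in text.split())


def drop_empty_rows(
    X: Sequence[str],
    y: Sequence[int],
    keep: Callable[[str], bool] = has_tokens,
) -> Tuple[List[str], List[int]]:
    """Return the documents and labels whose document passes `keep`, in order."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same length")
    # Filter documents and labels as pairs so they cannot get out of sync.
    kept = [(doc, label) for doc, label in zip(X, y) if keep(doc)]
    X_kept = [doc for doc, _ in kept]
    y_kept = [label for _, label in kept]
    return X_kept, y_kept


# Hypothetical toy data: the second document has no alphabetic tokens.
X_raw = ["good movie", "123 !!!", "bad plot"]
y_raw = [1, 0, 0]
X_clean, y_clean = drop_empty_rows(X_raw, y_raw)
```

The filtered `X_clean` and `y_clean` could then be handed to the pipeline's fit. Whether the same filter should be applied to test data is the open question raised in the comments.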

1 Answer

def transform(self, X, y=None):

Here you have written y=None, which means that if no y value is passed, y takes the default value None.

To force the pipeline to pass a y value, you should write:

def transform(self, X, y):
     pass

If you do this, you have to pass a y value; otherwise it will raise an error.
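
To see what that error looks like, here is a minimal sketch that does not use scikit-learn at all. `NeedsY` and `call_like_a_pipeline` are hypothetical names; the helper only imitates a caller that forwards X alone, which is how the pipeline behaves according to the comments below:

```python
class NeedsY:
    """Toy transformer whose transform() requires y, as in the snippet above."""

    def transform(self, X, y):
        return X


def call_like_a_pipeline(step, X):
    """Hypothetical stand-in for a caller that forwards only X to transform()."""
    return step.transform(X)


try:
    call_like_a_pipeline(NeedsY(), ["some text"])
    error_message = None
except TypeError as exc:
    # Python reports the missing required argument by name.
    error_message = str(exc)
```

Making y required changes what the method demands. It does not change what the caller supplies.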

This is the spacing (indentation) problem I am talking about:

class SpacyVectorizer:
    def __init__(
      self, 
      alpha_only: bool = True,
      lemmatize: bool = True, 
      remove_stopwords: bool = True, 
      case_fold: bool = True,
    ):
        self.alpha_only = alpha_only
        self.lemmatize = lemmatize
        self.remove_stopwords = remove_stopwords
        self.case_fold = case_fold
        self.nlp = spacy.load(
          name='en_core_web_sm', 
          disable=["parser", "ner"]
        )
    def transform(self, X, y):
        # Bag-of-Words matrix
        bow_matrix = []

        # Iterate over documents in SpaCy pipeline 
        for i, doc in enumerate(self.nlp.pipe(X)):
          # Words array
          words = []

          # Tokenize document
          for token in doc:

            # Remove non-alphabetic tokens
            if self.alpha_only and not token.is_alpha:
              continue

            # Stopword removal
            if self.remove_stopwords and token.is_stop:
              continue

            # Lemmatization
            if self.lemmatize:
              token = token.lemma_

            # Case folding
            if self.case_fold:
              token = str(token).casefold()

            # Append token to words array
            words.append(token)

          # Update the BoW representation
          if words:
            # Preprocessed document
            new_doc = ' '.join(words)

            # Document vector of the preprocessed text
            word_vec = self.nlp(new_doc).vector

            # Update the BoW matrix
            bow_matrix.append(word_vec)

          else:
            # Remove target label
            y.drop(y.index[i], inplace=True)

        # Return BoW matrix  
        return bow_matrix

The error you are getting might be caused by this indentation problem: self might be receiving the X value, and the X parameter might be receiving the y value.

  • I've tried it. Unfortunately when I put the SpacyVectorizer to the pipeline and call pipe.transform(X, y) I get TypeError: transform() missing 1 required positional argument: 'y' – Filip Szczybura Mar 02 '22 at 22:50
  • You don't have an indentation problem, right? I ask because the code you posted seems to have one: the methods defined inside SpacyVectorizer don't appear to be indented correctly. – raghav Aggarwal Mar 03 '22 at 05:43
  • Even if I pass X and y as keyword parameters, I get the same error – Filip Szczybura Mar 03 '22 at 06:55
  • You can add any arguments you want there. If the pipeline simply does not pass them to your transformer, there is nothing for the transformer to work with. So the solution is wrong - per current sklearn-1.3.0 pipeline API. – Florin Andrei Jul 14 '23 at 21:11
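
Since transform only receives X, one alternative raised in the question's comments is to keep every row and emit a default vector for documents with no tokens, so y never has to change. A minimal sketch under that assumption; `RowPreservingVectorizer`, its `dim` parameter, and the toy `_embed` method are hypothetical stand-ins for the spaCy-based vectorizer, not part of any library:

```python
from typing import List, Sequence


class RowPreservingVectorizer:
    """Toy transformer: one output row per input document, so y stays aligned."""

    def __init__(self, dim: int = 3):
        self.dim = dim  # hypothetical vector size for this sketch

    def fit(self, X: Sequence[str], y=None):
        # Nothing to learn; y is accepted only for API compatibility.
        return self

    def _embed(self, words: List[str]) -> List[float]:
        # Stand-in for a real embedding such as a spaCy document vector:
        # every component is the mean word length.
        mean_len = sum(len(w) for w in words) / len(words)
        return [mean_len] * self.dim

    def transform(self, X: Sequence[str]) -> List[List[float]]:
        rows = []
        for text in X:
            words = [w.casefold() for w in text.split() if w.isalpha()]
            # Empty documents get a zero vector instead of being dropped.
            rows.append(self._embed(words) if words else [0.0] * self.dim)
        return rows


docs = ["ab cd", "123"]
vectors = RowPreservingVectorizer().fit(docs).transform(docs)
```

Here `vectors` has exactly one row per input document, with a zero vector for the second one. The trade-off, as noted in the comments, is that the downstream model then has to learn a sensible response for such inputs.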