
I have a multi-line JSON dataset containing fields that may or may not exist, and whose textual data can be a string, a list of strings, or a more complicated mapping (a list of dicts).

e.g.:

{"yvalue":1.0,"field1":"Some text", "field2":"More Text", "field3": ["text","items","in","list"], "field4":[{"id":3,"name":"text"},{"id":4,"name":"text"}]}
{"yvalue":2.0,"field2":"More Text2", "field3": ["text2","items2","in2","list2"], "field4":[{"id":4,"name":"text"},{"id":4,"name":"text"}], "field5":"extra text"}
...

This dataset is needed as input for an sklearn pipeline.

First of all, I'm reading the file via pandas:

df = pandas.read_json(args.input_file, lines=True)

But I'd like to use a pipeline transformer like DataframeMapper to concatenate all text fields (even the nested ones) into one huge text field, taking into account that certain fields may be missing, are part of nested structures, etc.

The output would look something like:

yvalue | text
1.0    | Some text More Text text items in list text text
2.0    | More Text2 text2 items2 in2 list2 text text extra text

Of course I can use a custom transformer, but since I'm also interested in converting the pipeline to MLeap or PMML format, I'd rather refrain from using custom transformers as much as possible.

Is there a best practice or even an easy way to do this without getting too hacky?


Update

Apparently what I want may be a bit too much, so here is something easier: is there a way to concatenate just two (or more) string-like fields using a transformer, like this in pandas:

df[['field1', 'field2']].apply(lambda x: ' '.join(x.astype(str)), axis=1)

Tom Lous
  • I don't think you will get anything ready-made for this. What you can do is keep the processing part separate from the pipeline: something which will convert the json input to a 2d array which can then be passed to the Pipeline. PMML will take care of the pipeline and you can re-code the transformer part in any language you like without much trouble I think. – Vivek Kumar Apr 16 '18 at 15:58
  • json to 2-d array will become easy if you can convert the json to a pojo type object and then just use a string representation of that object which returns the required line (containing all the data, and some default or null values for non-existent fields of that json) – Vivek Kumar Apr 16 '18 at 16:00
  • Thanks Vivek, for your reaction. I figured that would be the most pragmatic way to go about it. But I was not familiar with all the possible transformers that sklearn had to offer. – Tom Lous Apr 16 '18 at 18:14
  • I am sorry, I am not aware of any inbuilt transformer which can handle the json to 2-d array conversion, if that's what you wanted. – Vivek Kumar Apr 17 '18 at 04:32

2 Answers


Consider refactoring your data pre-processing. The Scikit-Learn pipeline is not the place to do low-level data sanitization/preparation work such as unpacking collections and (conditionally) concatenating text fields into a text document.

This is a regular programming task, not a machine learning task. Therefore, you should use regular programming tools rather than machine learning tools (e.g. Scikit-Learn transformers) to accomplish it. Neither PMML nor MLeap is suited for low-level text processing.
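A minimal sketch of that kind of plain-Python pre-processing, recursively collecting every string from a parsed JSON line (the function names and the y-value key are illustrative, not from any library):

```python
import json

def collect_text(value):
    """Recursively gather string values from nested JSON structures."""
    if isinstance(value, str):
        return [value]
    if isinstance(value, list):
        return [s for item in value for s in collect_text(item)]
    if isinstance(value, dict):
        return [s for v in value.values() for s in collect_text(v)]
    return []  # numbers, None, booleans carry no text

def flatten_record(record, y_key='yvalue'):
    """Return (yvalue, concatenated text) for one parsed JSON line."""
    text = ' '.join(
        s for key, v in record.items() if key != y_key for s in collect_text(v)
    )
    return record.get(y_key), text

line = '{"yvalue":1.0,"field1":"Some text","field3":["in","list"],"field4":[{"id":3,"name":"text"}]}'
row = flatten_record(json.loads(line))
# row == (1.0, 'Some text in list text')
```

The resulting (yvalue, text) rows can then be fed to the pipeline as an ordinary 2-d array.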

user1808924

It's reasonable to use pipelines and transformers for easier model interpretability (e.g. SHAP values) rather than doing all preprocessing in Pandas beforehand.

Assuming a dataframe X of text columns:


    from sklearn.base import BaseEstimator, TransformerMixin

    class StringConcatTransformer(TransformerMixin, BaseEstimator):
        """Concatenate multiple string fields into a single field."""

        def __init__(self, missing_indicator=''):
            self.missing_indicator = missing_indicator

        def fit(self, X, y=None, **fit_params):
            # Stateless transformer: nothing to learn
            return self

        def transform(self, X, y=None):
            # Replace missing values, then join each row's strings with spaces
            return X.fillna(self.missing_indicator).agg(' '.join, axis=1)
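A self-contained usage sketch, repeating the class so the snippet runs on its own; the example DataFrame and the downstream CountVectorizer step are illustrative, not part of the original answer:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer

class StringConcatTransformer(TransformerMixin, BaseEstimator):
    """Concatenate multiple string fields into a single field."""

    def __init__(self, missing_indicator=''):
        self.missing_indicator = missing_indicator

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None):
        return X.fillna(self.missing_indicator).agg(' '.join, axis=1)

# Hypothetical text columns standing in for the question's fields.
X = pd.DataFrame({'field1': ['Some text', None],
                  'field2': ['More Text', 'More Text2']})

out = StringConcatTransformer().fit_transform(X)
# out.iloc[0] == 'Some text More Text'

# The concatenated column can feed a vectorizer in the same pipeline.
pipe = make_pipeline(StringConcatTransformer(), CountVectorizer())
features = pipe.fit_transform(X)
```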
Brian Bien