Scikit-learn: How to extract features from the text?

Question

Assume I have an array of Strings:

['Laptop Apple Macbook Air A1465, Core i7, 8Gb, 256Gb SSD, 15"Retina, MacOS' ... 'another device description']

I'd like to extract from this description features like:

item=Laptop
brand=Apple
model=Macbook Air A1465
cpu=Core i7
...

Should I prepare the pre-defined known features first? Like

brands = ['apple', 'dell', 'hp', 'asus', 'acer', 'lenovo']
cpu = ['core i3', 'core i5', 'core i7', 'intel pdc', 'core m', 'intel pentium', 'intel core duo']

I am not sure that I need CountVectorizer or TfidfVectorizer here. DictVectorizer seems more appropriate, but how can I build dicts whose values are extracted from the whole string?

Is it possible with scikit-learn's feature extraction? Or should I write my own .fit() and .transform() methods?

UPDATE: @sergzach, please review if I understood you right:

data = ['Laptop Apple Macbook..', 'Laptop Dell Latitude...'...]

for d in data:
    for brand in brands:
        if brand in d.lower():
            pass  # ok, brand is found
    for model in models:
        if model in d.lower():
            pass  # ok, model is found

So should I write one loop per feature? This might work, but I am not sure it is the right approach or flexible enough.

You can make a list of all brands manually, then extract them from the text (probably after str.lower() and removing unnecessary characters), and check whether most of them are recognized. Then look at the lines where nothing was recognized and decide what to do with them. Finally, convert the extracted values into numeric features using DictVectorizer.fit_transform, scale them and use them as numbers. — sergzach, May 15 '16 at 14:38
@sergzach Thanks, I've updated my question, could you please review? — Novitoll, May 15 '16 at 15:02
I think you could use `CountVectorizer()` of sklearn as mentioned here: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#tokenizing-text-with-scikit-learn. But in any case you should prepare data for `fit_transform()`. — sergzach, May 15 '16 at 16:11

sergzach · Accepted Answer · 2016-05-15T15:40:17.087

Yes, something like the following.

Sorry, you will probably need to correct the code below.

import re

data = ['Laptop Apple Macbook..', 'Laptop Dell Latitude...'...]

features = {
    'brand': [r'apple', r'dell', r'hp', r'asus', r'acer', r'lenovo'],
    'cpu': [r'core\s+i3', r'core\s+i5', r'core\s+i7', r'intel\s+pdc', r'core\s+m', r'intel\s+pentium', r'intel\s+core\s+duo']
    # and other features
}

cat_data = [] # your categories which you should convert into numbers

not_found_columns = []

for line in data:
    line_cats = {}

    # patterns is the list of regexes for this column
    for col, patterns in features.items():
        found = False

        for i, pattern in enumerate(patterns):
            if re.search(pattern, line.lower(), flags=re.UNICODE):
                line_cats[col] = i + 1  # numeric category for this column, e.g. 2 for dell, 5 for acer
                found = True
                break  # the category is determined by the first match

        # the loop ended without a match: use 0 as the default "not found" category
        if not found:
            line_cats[col] = 0
            not_found_columns.append((col, line))

    cat_data.append(line_cats)  # one dict per input line

# now cat_data holds one dict per line: each column maps to its category (index + 1) if a feature was found, otherwise 0.

not_found_columns now holds the (column, line) pairs for which no feature was found. Review them; you have probably forgotten some features.

We can also store strings (instead of numbers) as the categories and then use DictVectorizer, which one-hot encodes string values. Either way you end up with numeric features.
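A minimal sketch of the string-category variant follows. The sample descriptions, the shortened pattern lists, the `extract` helper and the `'unknown'` placeholder are all assumptions made for illustration, not part of the question. Each matched value is stored as a string, and DictVectorizer turns every (column, value) pair into its own 0/1 column:

```python
import re

from sklearn.feature_extraction import DictVectorizer

# Hypothetical sample data and patterns, for illustration only.
data = [
    'Laptop Apple Macbook Air, Core i7, 8Gb',
    'Laptop Dell Latitude, Core i5, 4Gb',
    'Tablet of an unknown brand',
]
feature_patterns = {
    'brand': [r'apple', r'dell'],
    'cpu': [r'core\s+i5', r'core\s+i7'],
}


def extract(line):
    """Return {column: matched text}, using the first pattern that matches."""
    row = {}
    for col, patterns in feature_patterns.items():
        for pattern in patterns:
            match = re.search(pattern, line.lower())
            if match:
                row[col] = match.group(0)
                break
        else:
            # No pattern matched for this column (assumed placeholder value).
            row[col] = 'unknown'
    return row


rows = [extract(line) for line in data]

# String values are one-hot encoded as "column=value" features.
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(rows)
feature_names = vectorizer.feature_names_
```

Here `feature_names` lists the generated columns (for example `brand=apple`), and `X` has one row per description.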

Thanks. I thought there was already a built-in method in scikit-learn for such a solution. But now it makes more sense. Thanks again. :) — Novitoll, May 15 '16 at 15:45
@Novitoll Maybe. But many such methods require prepared data in any case. You could wait for another answer; I am also interested in what people think. — sergzach, May 15 '16 at 15:47

score 0 · Answer 2 · answered May 16 '16 at 02:15

Scikit-learn's text vectorizers convert an array of strings into a document-term matrix (a 2D array with a column for each term/word found). Each string in the original array maps to a row in the output matrix. Each cell holds a count or a weight, depending on which vectorizer you use and its parameters.

Based on your code, I am not sure this is what you need. Could you say where you intend to use the features you are looking for? Do you intend to train a classifier? For what purpose?
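To make the shape of that output concrete, here is a small sketch using CountVectorizer with its default settings. The two input strings are hypothetical and chosen only for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical descriptions, for illustration only.
docs = ['Laptop Apple Macbook Air', 'Laptop Dell Latitude']

vectorizer = CountVectorizer()      # default settings lowercase the text
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per string
vocab = vectorizer.vocabulary_      # maps each term to its column index
counts = X.toarray()                # dense view of the term counts
```

Each row of `counts` corresponds to one input string, and `counts[i, vocab[term]]` is how often `term` occurs in string `i`. Note that this gives word counts, not named fields such as brand or cpu.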
