I'm fairly new to sklearn's DictVectorizer, and am trying to write a function whose output DictVectorizer can use to produce feature names from a list of bigrams that I have formed into a feature dictionary. The input to my function is a string, and the function should return a list of feature dictionaries, one per bigram (something like this).

def features (str) -> List[Dict[Text, Union[Text, int]]]:
   
    # my feature dictionary should have 'bigram' as the key, and the values will be the bigrams themselves.
    # bigram: a form of "w[i]-w[i+1]"
    
    # This is my bigram list (as structured above)
    bigrams: List[Dict[Text, Union[Text, int]]] = []
    
    # here is my code:
    bigrams  = {'bigram':i for j in sentence for i in zip(j.split(" "). 
    [:-1], j.split(" ")[1:])}

    return bigrams

vect = DictVectorizer(sparse=False)

text = str()

feature_catalog = features(text)

vect.fit(feature_catalog)

print(sorted(vect.get_feature_names_out()))

Everything works fine until the code reaches the DictVectorizer call (the error is raised inside the class itself). This is what I get:

AttributeError                            Traceback (most recent call last)
/var/folders/pl/k80fpf9s4f9_3rp8hnpw5x0m0000gq/T/ipykernel_3804/266218402.py in <module>
     22 features = get_feature(text)
     23 
---> 24 vectorizer.fit(features)
     25 
     26 print(sorted(vectorizer.get_feature_names()))

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/feature_extraction/_dict_vectorizer.py in fit(self, X, y)
    159 
    160         for x in X:
--> 161             for f, v in x.items():
    162                 if isinstance(v, str):
    163                     feature_name = "%s%s%s" % (f, self.separator, v)

AttributeError: 'str' object has no attribute 'items'

Any ideas? This is ultimately going to be used as part of a larger processing effort on a corpus.

Stef
Ja4H3ad
  • Hi! I think something went wrong when you copy-pasted the code. There is a `'''` missing to indicate the end of the doc-string, and `bigrams = bigrams = {...` is a bit suspicious – Stef May 05 '22 at 08:29
  • @Stef I removed the orphaned docstring and the redundant bigram variable – Ja4H3ad May 05 '22 at 12:42
  • I'm going to be a bit picky but: (1) Could you please add all relevant `import`s at the top of the code? This makes it easier to copy your code and play around with it. – Stef May 05 '22 at 13:28
  • (2) As a general good practice in Python, avoid shadowing the names of builtins. See this list: [python builtin functions](https://docs.python.org/3/library/functions.html). Avoid naming your variables with a name that's on that list. For instance, you have a variable named `str`. This is confusing for people reading your code who are used to `str` being something else (it's actually the whole class for strings) **and** it can sometimes result in surprising bugs, because a name you bind in a scope hides the builtin of the same name for the rest of that scope. – Stef May 05 '22 at 13:31
  • All that being said: `AttributeError: 'str' object has no attribute 'items'` is saying that one of the scikit-learn methods expected a variable to be a `dict` (`dict` objects are the ones that have an `.items` method) but it was actually a `str` (a string). – Stef May 05 '22 at 13:34
  • The error was raised at line `for x in X: for f, v in x.items()` in `DictVectorizer.fit`, where `X` is the first parameter passed to `fit`, so apparently, `DictVectorizer.fit` expects its first argument to contain a sequence of `dict` objects, but you passed it a sequence of `str` objects instead. – Stef May 05 '22 at 13:36
  • So apparently `feature_catalog` is a sequence of `str`, but it should be a sequence of `dict`. And you defined `feature_catalog` as the return value from `features`, which is `bigrams`. So `bigrams` should be a sequence of dicts but right now it is a sequence of str. – Stef May 05 '22 at 13:37
  • Also, (3) Could you please provide a sample definition for the variable `text` at the top of your code? (If you're actually doing this with a large text, please try it with a very small text that you can include directly in your question by writing `text = "..."` at the top of your code) – Stef May 05 '22 at 13:39
  • Regarding my comment (2), perhaps `def features(str)` was supposed to be `def features(sentence)` since you later use a variable `sentence` – Stef May 05 '22 at 13:41
  • Also, the line where you define `bigrams` appears to have an extra `.` in it for no reason. I suppose it's supposed to be `bigrams = { 'bigram': i for j in sentence for i in zip(j.split(" ")[:-1], j.split(" ")[1:]) }`? – Stef May 05 '22 at 13:43
  • Hold on, I just realised you have this line: `text = str()`. This is defining `text` as the empty string. It's the same as `text = ''`. I suppose `text` should not be the empty string? Please provide a sample text instead. – Stef May 05 '22 at 13:44
  • The documentation https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html#sklearn.feature_extraction.DictVectorizer.fit says: *"Dict(s) or Mapping(s) from feature names (arbitrary Python objects) to feature values (strings or convertible to dtype)."*. So, `feature_catalog` should be either a `dict`, or a sequence of `dict`s. In your case it's definitely a `dict`. But the code for `DictVectorizer` appears to assume it should always be a sequence of `dict`s rather than one single `dict`. – Stef May 05 '22 at 13:49
  • Try with `vect.fit([feature_catalog])` instead of `vect.fit(feature_catalog)`. It gets rid of the error. I'm not sure whether it completely fixes the issue or not. – Stef May 05 '22 at 13:49
  • Similar question: [Using DictVectorizer to convert strings](https://stackoverflow.com/questions/47334524/using-dictvectorizer-to-convert-strings) (I'm not sure how helpful that will be for you, but I'm linking it anyway) – Stef May 05 '22 at 13:53
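
The comments above point at the shape of the data: iterating over a single `dict` yields its keys, which are strings, so `fit` ends up calling `.items()` on a `str`. Note also that a dict comprehension with the constant key `'bigram'` keeps only the last bigram, because each new value overwrites the previous one. A minimal sketch of one possible fix, assuming the goal is one `{"bigram": "w[i]-w[i+1]"}` dict per adjacent word pair. The names `bigram_features` and `bigram_feature_names` are hypothetical and do not come from the question:

```python
from typing import Dict, List


def bigram_features(sentence: str) -> List[Dict[str, str]]:
    """Return one {"bigram": "w[i]-w[i+1]"} dict per adjacent word pair."""
    words = sentence.split()
    # zip(words, words[1:]) pairs each word with the word that follows it.
    return [{"bigram": f"{left}-{right}"} for left, right in zip(words, words[1:])]


def bigram_feature_names(sentence: str) -> List[str]:
    """Fit a DictVectorizer on the bigram dicts and return its feature names."""
    # Imported here so bigram_features stays usable without scikit-learn.
    from sklearn.feature_extraction import DictVectorizer

    vect = DictVectorizer(sparse=False)
    # fit receives a list of dicts here, not a single dict.
    vect.fit(bigram_features(sentence))
    return sorted(vect.get_feature_names_out())
```

For the sample sentence `"the cat sat"`, `bigram_features` returns `[{"bigram": "the-cat"}, {"bigram": "cat-sat"}]`. Because the values are strings, `DictVectorizer` joins key and value with its separator (the `"%s%s%s" % (f, self.separator, v)` line visible in the traceback), so the resulting feature names look like `bigram=the-cat`.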

0 Answers