
I have a list of lists, with every inner list containing 1 to 5 tags. I have also constructed a list containing the top 50 tags. My goal is to build a new list of lists where every inner list contains only tags that are in the top 50. My approach went like this:
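
For illustration (made-up tags, not my real data), the transformation I want looks like this:

    top_50 = ["python", "java", "php", "sql", "html"]        # pretend these are the top tags
    tags = [["python", "numpy"], ["haskell"], ["java", "php", "xml"]]

    top_50_tags = [[tag for tag in tag_list if tag in top_50] for tag_list in tags]
    print(top_50_tags)   # [['python'], [], ['java', 'php']] -- note the empty inner list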

First I constructed a new list of lists with only the top 50 tags:

    top_50 = list(np.array(pd.read_csv(os.path.join(dir, "Tags.csv")))[:, 1])
    train = pd.read_csv(os.path.join(dir, "Train.csv"), iterator=True)
    top_50 = top_50[:51]
    tags = list(np.array(train.get_chunk(50000))[:, 3])

    top_50_tags = [[tag for tag in list if tag in top_50] for list in tags]

Then I tried to encode the tags:

    coder = preprocessing.LabelEncoder()  
    coder = coder.fit(top_50)
    tags = [coder.transform(tag) for tag in list for list in top_50_tags]

This however gave me this error:

    Traceback (most recent call last):
      File "C:\Users\Ano\workspace\final_submission\src\rf_test.py", line 69, in <module>
        main()
      File "C:\Users\Ano\workspace\final_submission\src\rf_test.py", line 33, in main
        labels = [coder.transform(tag) for tag in list for list in top_50_tags]
      File "C:\Python27\lib\site-packages\sklearn\preprocessing\label.py", line 120, in transform
        raise ValueError("y contains new labels: %s" % str(diff))
    ValueError: y contains new labels: ['#']

I think this error arises because some of my lists are empty, since there were no top-50 tags in them. However, the error specifically states that ['#'] is the newly seen label. Is my hypothesis correct? And what should I do about the error message?
Edit: For people wondering why I am using `list` as a variable in the list comprehensions: in my real program I actually use a different variable name.
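
A small isolated test I could run to see what actually triggers the error (toy labels, not my real data):

    from sklearn import preprocessing

    coder = preprocessing.LabelEncoder()
    coder.fit(["python", "java", "php"])

    print(coder.transform(["java", "php"]))   # fine: array([0, 1])
    print(coder.transform([]))                # an empty list transforms to an empty array, no error
    print(coder.transform(["#"]))             # raises ValueError about the unseen label '#', like in my traceback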

Update

I checked for differences between my top_50 and the tags:

    print(len(top_50.difference(tags)))

which gave me a length of 0. Does this mean that my empty lists are the problem?
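
Following the comment suggestion, I could also check the other direction, i.e. whether any tag that I actually pass to `transform` is missing from the encoder's classes (sketch, using `top_50_tags` and `coder` from above):

    unseen = set(tag for tag_list in top_50_tags for tag in tag_list) - set(coder.classes_)
    print(unseen)   # if this prints something like set(['#']), an unseen tag is the problem, not the empty lists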

  • I don't know if I can help you, but in the meantime, why are you using `list` as variable name in the list comprehension? – Roberto Dec 18 '13 at 17:24
  • I may be stating the obvious, but that error is raised in `transform` when there is some unique tag in `tag` that is not present in the unique tags in `self.classes_` (an attribute that got set with `coder = coder.fit(top_50)`): https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/preprocessing/label.py So you'd have to check the two lists `top_50` and `tags` and find out why they differ... – Roberto Dec 18 '13 at 17:27
  • In my real program I don't, but for clarity I use it here. I'll add an edit – Learner Dec 18 '13 at 17:27
  • you could check if you are right [using a debugger](http://stackoverflow.com/a/4228643/1595865). It's something extremely useful to learn – loopbackbee Dec 18 '13 at 17:27
  • Yes, I definitely have to look into that. – Learner Dec 18 '13 at 17:30
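
Regarding the debugger suggestion above, a minimal sketch of how to pause right before the failing call and inspect the data (using `pdb` from the standard library):

    import pdb

    coder = preprocessing.LabelEncoder()
    coder = coder.fit(top_50)
    pdb.set_trace()   # at the (Pdb) prompt, inspect coder.classes_ and top_50_tags before transforming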

1 Answer


Maybe you can check this issue: https://github.com/scikit-learn/scikit-learn/issues/3123. This bug has been fixed in scikit-learn version 0.17.
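
If that issue is the cause, a quick sanity check is to look at which scikit-learn version is installed (sketch):

    import sklearn
    print(sklearn.__version__)   # the fix referenced above is reported for 0.17

If it is older, upgrading (e.g. `pip install --upgrade scikit-learn`) may resolve it.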