I have a list of lists, each containing one to five tags. I have also built a list of the top 50 tags. My goal is to build a new list of lists in which every inner list contains only tags from the top 50. My approach went like this:
First I built a new list of lists containing only the top 50 tags:
import os
import numpy as np
import pandas as pd

top_50 = list(np.array(pd.read_csv(os.path.join(dir, "Tags.csv")))[:, 1])
train = pd.read_csv(os.path.join(dir, "Train.csv"), iterator=True)
top_50 = top_50[:51]
tags = list(np.array(train.get_chunk(50000))[:, 3])
top_50_tags = [[tag for tag in list if tag in top_50] for list in tags]
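To make the intent of the filtering step concrete, here is a minimal sketch on made-up data. The names `top_tags` and `tag_lists` are hypothetical, and it assumes every entry is already a list of tag strings, which may not hold for the values read from Train.csv.

```python
# Hypothetical toy data; the real values come from Tags.csv and Train.csv.
top_tags = {"c#", "java", "php"}
tag_lists = [["c#", "linq"], ["haskell"], ["java", "php", "jsp"]]

# Keep only tags that belong to the top set.
# The order inside each list is preserved, and a list with no top tags becomes empty.
filtered = [[tag for tag in tag_list if tag in top_tags] for tag_list in tag_lists]

print(filtered)
```

Using a set for the top tags keeps each membership test cheap, which may matter with 50,000 rows.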
Then I tried to encode the tags:
from sklearn import preprocessing

coder = preprocessing.LabelEncoder()
coder = coder.fit(top_50)
tags = [coder.transform(tag) for tag in list for list in top_50_tags]
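For comparison, here is a sketch of encoding a list of lists without scikit-learn. A plain dict named `encode` stands in for the fitted encoder; that substitution and the toy data are assumptions for illustration only. It also shows that in a comprehension with two `for` clauses, the outer loop is written first.

```python
# A plain dict stands in for the fitted encoder (an assumption for illustration).
top_tags = ["c#", "java", "php"]
encode = {tag: index for index, tag in enumerate(sorted(top_tags))}

# Hypothetical filtered data, including one empty list.
filtered = [["c#"], [], ["java", "php"]]

# Nested result: one list of integer codes per original list; empty lists stay empty.
encoded_nested = [[encode[tag] for tag in tag_list] for tag_list in filtered]

# Flat result: the outer loop comes first, the inner loop second.
encoded_flat = [encode[tag] for tag_list in filtered for tag in tag_list]

print(encoded_nested)
print(encoded_flat)
```

In this sketch the empty list simply yields an empty result and does not raise anything; whether the real encoder behaves the same way is not shown here.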
This however gave me this error:
Traceback (most recent call last):
  File "C:\Users\Ano\workspace\final_submission\src\rf_test.py", line 69, in <module>
    main()
  File "C:\Users\Ano\workspace\final_submission\src\rf_test.py", line 33, in main
    labels = [coder.transform(tag) for tag in list for list in top_50_tags]
  File "C:\Python27\lib\site-packages\sklearn\preprocessing\label.py", line 120, in transform
    raise ValueError("y contains new labels: %s" % str(diff))
ValueError: y contains new labels: ['#']
I think this error arises because some of my lists are empty, since they contained none of the top 50 tags. But the error specifically names ['#'] as the previously unseen label. Is my hypothesis right? And how should I deal with this error?
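One way to probe the hypothesis without the encoder is to look at what a comprehension actually iterates over for different kinds of entries. The entries and the helper `as_tag_list` below are hypothetical; whether the real data contains raw strings is an assumption, not something confirmed here.

```python
# Hypothetical entries: a proper list of tags, a raw space-separated string, and an empty list.
entries = [["c#", "java"], "c# java", []]

for entry in entries:
    # Iterating over a list yields whole tags; iterating over a str yields single characters.
    items = [item for item in entry]
    print("%s -> %r" % (type(entry).__name__, items))


def as_tag_list(entry):
    """Return a list of tags, splitting on whitespace if the entry is a string."""
    return entry.split() if isinstance(entry, str) else list(entry)


normalized = [as_tag_list(entry) for entry in entries]
print(normalized)
```

If a label such as '#' shows up on its own, checking the type of each entry like this may help tell an empty-list problem apart from a string being walked character by character.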
Edit:
For those wondering why I use list as a variable name in the list comprehensions: my real program uses a different name.
Update:
I checked for differences in my top_50 and the tags:
print(len(top_50.difference(tags)))
which printed 0. Does this mean that my empty lists are the problem?
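A set difference is directional, so a result of 0 in one direction says nothing about the other. The sketch below uses made-up data and hypothetical names (`flat`, `unseen`, `unused`) to compute both directions on a flattened list of lists and to count the empty lists separately.

```python
# Hypothetical toy data; "#" is deliberately not among the top tags.
top_tags = {"c#", "java", "php"}
filtered = [["c#"], [], ["java", "#"]]

# Flatten the nested lists into a set of distinct labels.
flat = {tag for tag_list in filtered for tag in tag_list}

# Labels present in the data but absent from the top tags.
unseen = flat - top_tags
# Top tags that never occur in the data.
unused = top_tags - flat
# Number of lists that ended up empty.
empty_count = sum(1 for tag_list in filtered if not tag_list)

print(unseen)
print(unused)
print(empty_count)
```

Here `unused` corresponds to the direction of `top_50.difference(tags)`, while `unseen` is the direction that would reveal labels the encoder was never fitted on.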