I want to label-encode a column called article_id, which holds a unique identifier for each article.
Integer values kind of implicitly have an order to them, because 3 > 2 > 1.
I wonder what the most reasonable way is to sort the values before factorizing them, so that this implicit order carries some meaning. I thought about sorting them by their number of occurrences, so that the most common article_id gets the highest label and the one that occurs least gets the lowest label.
Does this make sense and are there more reasonable ways of doing this?
This is what I am doing right now: sorting by occurrence and then factorizing.
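# Reorder the rows by how often each article_id occurs (ascending; mergesort is a stable sort).
df = df.iloc[df.groupby('article_id').article_id.transform('size').argsort(kind='mergesort')]
# factorize() labels ids in order of first appearance, so the rarest id gets 0.
df['article_id'], article_labels = df['article_id'].factorize()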
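
For comparison, here is a minimal sketch of the same frequency-ordered encoding that leaves the row order of the DataFrame untouched. It builds the mapping from value_counts() instead of sorting the rows. The toy data, the article_label column, and the label_of dictionary are assumptions made up for illustration, not part of my real data.

```python
import pandas as pd

# Toy data (hypothetical), standing in for the real DataFrame.
df = pd.DataFrame({"article_id": ["a", "b", "a", "c", "a", "b"]})

# Count occurrences per id, then sort ascending so the rarest id comes first.
# mergesort is stable, matching the sort kind used in the snippet above.
counts = df["article_id"].value_counts().sort_values(kind="mergesort")

# Map each id to its frequency rank: 0 = least frequent, n-1 = most frequent.
label_of = {article: rank for rank, article in enumerate(counts.index)}

# Write the labels to a new column so the original ids stay available.
df["article_label"] = df["article_id"].map(label_of)
```

With this variant the rows keep their original order and the dictionary can be reused to encode new data with the same labels, though ids not seen before would map to NaN.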