I am working with a big dataset and thus I only want to use the items that are most frequent.
Simple example of a dataset:
1 2 3 4 5 6 7
1 2
3 4 5
4 5
4
8 9 10 11 12 13 14
15 16 17 18 19 20
4 has 4 occurrences,
1 has 2 occurrences,
2 has 2 occurrences,
5 has 2 occurrences,
I want to be able to generate a new dataset just with the most frequent items, in this case the 4 most common:
The wanted result:
1 2 3 4 5
1 2
3 4 5
4 5
4
I am finding the first 50 most common items, but I am failing to print them out in a correct way. (my output is resulting in the same dataset)
Here is my code:
from collections import Counter
with open('dataset.dat', 'r') as f:
lines = []
for line in f:
lines.append(line.split())
c = Counter(sum(lines, []))
p = c.most_common(50);
with open('dataset-mostcommon.txt', 'w') as output:
..............
Can someone please help me on how I can achieve it?