I am working with a big dataset and thus I only want to use the items that are most frequent.

Simple example of a dataset:

1 2 3 4 5 6 7
1 2
3 4 5
4 5
4
8 9 10 11 12 13 14
15 16 17 18 19 20

4 has 4 occurrences,
5 has 3 occurrences,
1 has 2 occurrences,
2 has 2 occurrences,
3 has 2 occurrences.

I want to be able to generate a new dataset containing only the most frequent items, in this case the 5 most common:

The wanted result:

1 2 3 4 5
1 2
3 4 5
4 5
4

I can find the 50 most common items, but I cannot work out how to write out the filtered dataset correctly (my output ends up identical to the original dataset).

Here is my code:

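from collections import Counter

with open('dataset.dat', 'r') as f:
    # One list of tokens per input line
    lines = [line.split() for line in f]

# Count every item across all lines
c = Counter(item for line in lines for item in line)
# List of (item, count) tuples for the 50 most frequent items
p = c.most_common(50)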

with open('dataset-mostcommon.txt', 'w') as output:
    ..............

Can someone please help me on how I can achieve it?

the_sunbeam

2 Answers

You have to iterate over the dataset again and, for each line, output only the items that are in the set of most common items.

If the items within each line are sorted, you can just do a set intersection and print the result in sorted order. If they are not, iterate over each line and check each item:

for line in dataset:
    for element in line.split():
        # keep only the items that are among the most common ones
        if element in most_common_elements:
            print(element, end=' ')
    print()  # end the current output line

PS: For Python 2, add from __future__ import print_function at the top of your script.
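
To tie the steps together, here is a minimal sketch of the whole filter run on the sample data from the question. The helper name keep_most_common and the choice to drop lines that end up empty are assumptions made for illustration (the wanted result in the question shows no empty lines), not something the answer itself specifies.

```python
from collections import Counter


def keep_most_common(lines, n):
    """Keep only the n most frequent items in each line.

    `lines` is a list of token lists; lines left empty are dropped
    (an assumption based on the wanted result in the question).
    """
    counts = Counter(item for line in lines for item in line)
    # most_common() returns (item, count) tuples; keep only the items
    most_common_elements = {item for item, _ in counts.most_common(n)}
    filtered = (
        [item for item in line if item in most_common_elements]
        for line in lines
    )
    return [line for line in filtered if line]


sample = """1 2 3 4 5 6 7
1 2
3 4 5
4 5
4
8 9 10 11 12 13 14
15 16 17 18 19 20
"""

lines = [line.split() for line in sample.splitlines()]
result = keep_most_common(lines, 5)
for line in result:
    print(' '.join(line))
```

To write to a file instead, the same loop can pass file=output to print, where output is the handle opened for writing.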

JBernardo
  • Thank you. What data structure should most_common_elements be? I am getting a list of tuples for p, I tried to convert it to a dict and then say : `if element in p_to_dict.items():` but then I get an empty result and I don't understand why. Can you help me with it? – the_sunbeam Jan 11 '16 at 09:32
  • `with open('flickr-mostcommon.txt', 'w') as output: for line in lines: result = [] for item in line: if item in p_to_dict.items(): result.append(str(item)) output.write(' '.join(result) + '\n')` – the_sunbeam Jan 11 '16 at 09:46
  • @the_sunbeam `most_common_elements` should be just the first item from each tuple in `p` (preferably in a set): `set(x[0] for x in p)`. You can use the print function just like I showed above. If you want it to redirect to a file, just use it with the argument "file": `print(element, end=' ', file=output)` – JBernardo Jan 11 '16 at 11:59
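
The empty result discussed in the comments above can be reproduced in a few lines. This is only a sketch on made-up counts: the input string is an assumption, while p, p_to_dict and most_common_elements follow the names used in the thread.

```python
from collections import Counter

c = Counter("4 4 4 4 5 5 5 1 1 2".split())
p = c.most_common(2)      # list of (item, count) tuples
p_to_dict = dict(p)

# .items() yields (item, count) pairs, so a bare item never matches
in_items = '4' in p_to_dict.items()

# Testing against the keys, or a set of first elements, does match
in_keys = '4' in p_to_dict
most_common_elements = set(x[0] for x in p)
in_set = '4' in most_common_elements

print(in_items, in_keys, in_set)
```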

According to the documentation, c.most_common returns a list of tuples, so you can get the desired output as follows:

with open('dataset-mostcommon.txt', 'w') as output:
    for item, occurrence in p:
        # items read via split() are strings, so format them with %s
        output.write("%s has %d occurrences,\n" % (item, occurrence))
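
As a quick check of the tuple shape, here is a small sketch that builds the same report in memory. The input string is made up, and io.StringIO is used only as a stand-in for the output file; the items are assumed to be strings, as produced by split().

```python
import io
from collections import Counter

c = Counter("4 4 4 4 5 5 5 1 1".split())
p = c.most_common(3)      # list of (item, count) tuples

output = io.StringIO()    # stand-in for the file opened for writing
for item, occurrence in p:
    output.write("%s has %d occurrences,\n" % (item, occurrence))

report = output.getvalue()
print(report, end='')
```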