Aggregating table rows by values in multiple columns in a FITS file

Question

I have a FITS file with many columns. A part of a simplified example of the file looks like this:

    A      B      C
   100     1      90
   100     2      90
   100     3      90
   100     4      90
   211     40     70
   211     41     70
   211     42     70
   211     43     70
   211     44     70

If you notice here, the first four rows of column A and C are the same, but column B varies from 1 to 4. Then the next 5 values of column A and C are the same, but column B varies from 40 to 44.

What I would like to do is, write a program that creates a file like this:

   A      B     C
  100     4     90
  211     5     70

That is, column B should contain the number of values for which column A and C were the same!

I would like to know how do to this in Python. It does not necessarily need to deal with a FITS file, I can also convert it to ASCII format if there are some routines that cannot be used in FITS files.

What I have tried so far:

I bumped into the routine called Collections which has a sub-routine called Counter which counts the number of values in a list that are equal and returns them.

I tried:

import collections
counter = collections.Counter(a)
counts = counter.values()

But this only gives me those values in column A that are equal. Could someone tell me how to use this routine to compare those values with column C as well?

What have you tried? What's your initial code? What doesn't work? SO is not here to do your homework, you should try something yourself first :) — favoretti, Jun 29 '14 at 11:32
@favoretti this ain't homework!! I have tried to use enumerate, but that doesn't help. I have also tried to think with for loops, that doesn't help either — Srivatsan, Jun 29 '14 at 11:33
@favoretti since A and C are not equal, i.e does not have the same value, I am quite confused what to use. If they were equal, I could use enumerate and say if A==C, return len(B) — Srivatsan, Jun 29 '14 at 11:36
@Srivatsan show us what you have tried, perhaps you are already close. — Bart Friederichs, Jun 29 '14 at 11:37
@BartFriederichs I just bumped into a package called collections. Can I use counter from that package to count just the values of one column, say A in my case? — Srivatsan, Jun 29 '14 at 12:03

Jon Clements · Accepted Answer · 2014-06-29T13:00:30.927

5

I'd go for something like this...:

from itertools import groupby
from operator import itemgetter

with open('input') as fin, open('output', 'w') as fout:
    fout.write(next(fin, ''))
    rows = (line.split() for line in fin)
    for k, g in groupby(rows, itemgetter(0, 2)):
        fout.write('{} {} {}\n'.format(k[0], sum(1 for _ in g), k[1]))

Write header straight out...
Build a generator to produce rows split by whitespace
Use itertools.groupby to groupby first and third column
Count the number of occurrences in g to get the length of the group
Write out the lines formatted as desired...

You can also use a collections.Counter if the groups are non-contiguous and should be counted as one, you can replace the groupby instead with the following:

counts = Counter((row[0], row[2]) for row in rows)
for k, v in counts.iteritems():
    fout.write('{} {} {}\n'.format(k[0], v, k[1]) # or even...
            # print >> fout, k[0], v, k[1]

edited Jun 29 '14 at 13:00

answered Jun 29 '14 at 12:48

Jon Clements

138,671
33
247
280

I have just made an edit to my post using Collections.counter which seems easy to use. WRT to your answer, should column A and C be sorted because of the use of groupby? – Srivatsan Jun 29 '14 at 12:53
@Srivatsan if you have 6 million rows, probably prohibitive to sort. The trade off is if your file is already sorted, then there's no need to keep a `Counter` in memory. However, if it's not sorted and the order of the entries is not important, then using a `Counter` is easiest... – Jon Clements Jun 29 '14 at 12:55
I would just like to count the number of values for each A and C that are equal and print a new column with those values. WRT to your answer, what should be 'output'?? Should I create a txt file or can I just mention a filename in 'output' which when executed is created by Python – Srivatsan Jun 29 '14 at 12:58
@Srivatsan well, `open('output', 'w')` will create a file called `output` and right to that... feel free to do whatever you want regarding filename, printing to screen, just keeping in memory... etc... The values will be in `counts` - up to you what you want to do – Jon Clements Jun 29 '14 at 13:02

Ian Clark · Answer 2 · 2014-06-29T12:58:40.307

This seems to work as intended:

out = [["A", "B", "C"]]

# Safely handle opening the file
with open("demo.txt") as f:
    a_count = 1
    prev_a = None

    for i, line in enumerate(f):
        a, b, c = [int(x) for x in line.split()]

        if i != 0:
            if prev_a == a and prev_c == c:
                a_count += 1
            else:
                out.append([prev_a, a_count, prev_c])
                a_count = 1

        # Keep track of our last results
        prev_a, prev_c = a, c

    # Collect the final line counts
    if prev_a:
        out.append([prev_a, a_count, prev_c])

# Pretty-print our results
for line in out:
    print "{0: ^6}{1: ^6}{2: ^6}".format(*line)

score 0 · Answer 3 · answered Jun 29 '14 at 11:48

You should add some code to your question, to show what you hava tried. People often will not want to help much if they cannot see you have tried - myself included.

As a hint in terms of logic (basic. there are neater ways of doing this):

Try have a dictionary with a key made up of column A and C's numbers. then, whenever A and C are the same, search the dictionary for a key with that combination and if found, add 1 to the value, if not found, add a new element.

using your example (in pseudo code - so you have something to do :P )

if "100-90" in array:  // Checking if the element is in the array
  array["100-90"] += 1  // Adds one to an existing element
else
  array["100-90] = 1  // Adds a new element into array

score 0 · Answer 4 · answered Jun 29 '14 at 12:08

0

Try this:

lines = """100 1 90
100 2 90
100 3 90
100 4 90
211 40 70
211 41 70
211 42 70
211 43 70
211 44 70""".split("\n")

count = 0
oldA = oldB = oldC = None
for line in lines:

    a,b,c = line.split(" ")
    if None in [oldA, oldB, oldC]:
        oldA,oldB,oldC = a,b,c

    if oldA == a and oldC == c:
        count +=1

    else:
        print oldA,count,oldC
        count = 1
        oldA, oldB, oldC = a,b,c

print oldA,count,oldC

answered Jun 29 '14 at 12:08

Valijon

12,667
4
34
67

The problem is, my file has more than 6 million lines!! So I am searching for a package and I just now bumped into this package called collections, which has a sub routine called counter, I am trying to use this as of now – Srivatsan Jun 29 '14 at 12:09
@Srivatsan Do you need to store `x = len(b)` for each `a,c` combination? – Valijon Jun 29 '14 at 12:15
Yes, I need the values of all A and C that are equal. i.e the len(b) – Srivatsan Jun 29 '14 at 12:17
@Srivatsan OK. 1 question. What do you expect for this input 100 1 90\n 100 2 91\n 100 3 90\n 100 4 90\n 211 40 70\n 211 41 71\n 211 42 70\n 211 43 70\n 211 44 70\n – Valijon Jun 29 '14 at 12:19

Aggregating table rows by values in multiple columns in a FITS file

4 Answers4