0

I have a variable which splits the results of a column based on a condition (group by in others programming languages).

I'm trying to have a variable that counts the NR of each group. If we sum all the groups we should have the NR of the file.

When I try to use NR in the calculation for example NR[variable that splits], I get a fatal error "you tried to use scalar as matrix.

Any ideas how to use NR as a variable, but not counting all the records, only those from each group?

sex, weight

male,50
female,49
female,48
male,66
male,78
female,98
male,74
male,54
female,65

In this case the NR would be 9 BUT, in reality I want a way to get that NR of male is 5 and 4 for female.

I have the total sum of weigth column but struggle to get the avg:

sex= $(f["sex"])   
ccWeight[sex] += $(f["weight"])
avgWeight = ccWeight[sex] / ¿?

Important: I don't need to print the result as of now, just to store this number on a variable.

  • You might have to use & compute another variable than `NR`. – Manfred Apr 30 '22 at 15:39
  • I tried with length for the number of occurrences of a column but is not working. – Jose Antonio Piedehierro Arias Apr 30 '22 at 15:40
  • For context, I'm trying to get the avg of a column, but there is another column that splits the results by male/female. So I need to know how many records are in male, and then on female, for that column, to do the calculation. I cannot use the NR – Jose Antonio Piedehierro Arias Apr 30 '22 at 15:41
  • @markp-fuso there you go. I didn't put any because it was more of a general question. The input is just to give context. Once I have the code I can adapt it to my real input – Jose Antonio Piedehierro Arias Apr 30 '22 at 16:12
  • You can;t use the back part of my script without also using the front part that populates the arrays you're trying to use the the back part. Please take another look at https://stackoverflow.com/a/71976230/1745001 and think about how those arrays are being populated. Add some print statements to see how they're being filled if it's not obvious. – Ed Morton May 01 '22 at 00:25

2 Answers2

1

One awk idea:

awk -F, '
NR>1 { counts[$1]++              # keep count of each distinct sex
       counts_total++            # replace dependency on NR
       weight[$1]+=$2            # keep sum of weights by sex
     }
END  { for (i in counts) {
           printf "%s: (count) %s of %s (%.2f%)\n",i,counts[i],counts_total,(counts[i]/counts_total*100)
           printf "%s: (avg weight) %.2f ( %s / %s )\n",i,(weight[i]/counts[i]),weight[i],counts[i]
       }
     }
' sample.dat

NOTE:

  • OP can add additional code to verify total counts and weights are not zero (so as to keep from generating a 'divide by zero' error)
  • perhaps print a different message if there are no (fe)male records to process?

This generates:

female: (count) 4 of 9 (44.44%)
female: (avg weight) 65.00 ( 260 / 4 )
male: (count) 5 of 9 (55.56%)
male: (avg weight) 64.40 ( 322 / 5 )
markp-fuso
  • 28,790
  • 4
  • 16
  • 36
  • Hi, If I wanted this 4 and 5 stored in a variable? I don't know if I explain myself. Look at the variable for the sum of the weight I put in the question. I would like to have something like --> variable_that_counts_totals_by_group[sex] – Jose Antonio Piedehierro Arias Apr 30 '22 at 16:22
  • or can I store this loop in a variable? (without the printing part, just the calculation to get the 4 or 5) – Jose Antonio Piedehierro Arias Apr 30 '22 at 16:23
  • sorry, had to redesign the answer a couple times as the question/requirements were changing; renamed the `sex[]` array to a more appropriate `counts[]`; the `4` is stored in `counts["female"]` and the `5` is stored in `counts["male"]`; you can certainly copy these values into other variables, alternatively you could replace `counts[$i]++` with something like `if ($1=="male") male_count++; else female_count++`; you could also add `counts_total++` if you don't want to rely on `NR`, and then in the `END {...}` processing block you would replace `(NR-1)` with `counts_total` – markp-fuso Apr 30 '22 at 16:28
  • Sorry I didn't give proper context. See this question. https://stackoverflow.com/questions/71960266/gawk-script-sum-group-by-avg-std-by-condition-calculation I will try your code but if you see you need to change something after see the question, go ahead. – Jose Antonio Piedehierro Arias Apr 30 '22 at 16:31
  • To sum up, I give a script a parameter (country) and it should give me the weight avg. This average needs to be on one side for females and the other males. Thus the necessity to have this number of records for each sex. If you give it a go I will give you the right answer on both questions :) – Jose Antonio Piedehierro Arias Apr 30 '22 at 16:32
  • the arguments to the `printf` calls show you where all of the values are stored so, yes, you could reassign those arguments to additional variables as desired (or just use the variables that are provided as arguments to the `printf` calls), your choice ... – markp-fuso Apr 30 '22 at 16:34
  • sorry, I completely misread the requirement re: avg weight; I've updated the answer – markp-fuso Apr 30 '22 at 16:43
  • Sorry to bother you, but,how would you reassign the count[i] argument to a variable? – Jose Antonio Piedehierro Arias Apr 30 '22 at 16:57
  • `if (i=="male") count_male=count[i]; else female_count=count[i]` – markp-fuso Apr 30 '22 at 17:11
0

GNU datamash might be what you are looking for, e.g.:

<infile datamash -Hst, groupby 1 count 1 sum 2 mean 2 | column -s, -t

Output:

GroupBy(sex)  count(sex)  sum(weight)  mean(weight)
female        4           260          65
male          5           322          64.4
Thor
  • 45,082
  • 11
  • 119
  • 130