Manipulating a huge text file to fetch occurrences of a particular field

Question

I have a huge text file in the following format. I want to count how many times each value of the department field occurs. Each section has a field called department:. The result should be a CSV file like the one shown in the Expected output section. I would appreciate a solution that uses sed, head/tail, or awk. The file is really huge (50,000+ lines), so an efficient method is much appreciated.

Input format:


# Person1 Perosn2, AADDC Users, dummydata.somecompany.com
dn: CN=Person1 Perosn2,OU=AADDC Users,DC=dummydata,DC=somecompany,DC=com
objectClass: top
department: 234ABC
name: Person1 Perosn2
objectGUID:: MbCDVZpKbEWRxDUA5iN5IA==
userPrincipalName: abcdef@dummydata.somecompany.com
objectCategory: CN=Person,CN=Schema,CN=Configuration,DC=dummydata,DC=somecompany
 ,DC=com
dSCorePropagationData: 16010101000000.0Z
lastLogonTimestamp: 132173602593105876
preferredLanguage: en-US
msDS-AzureADMailNickname: abcdef


# Person1 Perosn2, AADDC Users, dummydata.somecompany.com
dn: CN=Person1 Perosn2,OU=AADDC Users,DC=dummydata,DC=somecompany,DC=com
objectClass: top
department: 234ABC
name: Person1 Perosn2
objectGUID:: MbCDVZpKbEWRxDUA5iN5IA==
userPrincipalName: abcdef@dummydata.somecompany.com
objectCategory: CN=Person,CN=Schema,CN=Configuration,DC=dummydata,DC=somecompany
 ,DC=com
dSCorePropagationData: 16010101000000.0Z
lastLogonTimestamp: 132173602593105876
preferredLanguage: en-US
msDS-AzureADMailNickname: abcdef

# Person3 Perosn4, AADDC Users, dummydata.somecompany.com
dn: CN=Person1 Perosn2,OU=AADDC Users,DC=dummydata,DC=somecompany,DC=com
objectClass: top
department: XYZ012
name: Person1 Perosn2
objectGUID:: MbCDVZpKbEWRxDUA5iN5IA==
userPrincipalName: abcdef@dummydata.somecompany.com
objectCategory: CN=Person,CN=Schema,CN=Configuration,DC=dummydata,DC=somecompany
 ,DC=com
dSCorePropagationData: 16010101000000.0Z
lastLogonTimestamp: 132173602593105876
preferredLanguage: en-US
msDS-AzureADMailNickname: abcdef

Expected output

234ABC,2
XYZ012,1

What I did:

I used this command to grep the file: grep '^department: *' file.txt

But I am not sure whether the expected output can be produced with a single command such as sed or grep.
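
One possible way to extend the grep attempt above into the expected CSV is a pipeline of standard tools. The sketch below is not from the original post. It assumes every department value is a single word separated from the label by exactly one space, and the sample file it creates is purely illustrative.

```shell
# Build a small illustrative sample file (hypothetical data in the question's format).
sample=$(mktemp)
cat > "$sample" <<'EOF'
dn: CN=Person1 Perosn2,OU=AADDC Users,DC=dummydata,DC=somecompany,DC=com
department: 234ABC
name: Person1 Perosn2

dn: CN=Person1 Perosn2,OU=AADDC Users,DC=dummydata,DC=somecompany,DC=com
department: 234ABC
name: Person1 Perosn2

dn: CN=Person1 Perosn2,OU=AADDC Users,DC=dummydata,DC=somecompany,DC=com
department: XYZ012
name: Person1 Perosn2
EOF

# 1. grep keeps only the department lines.
# 2. cut takes the second space-separated field (the value).
# 3. sort groups identical values so that uniq -c can count them.
# 4. uniq -c prints "<count> <value>"; awk swaps them into "value,count".
counts=$(grep '^department: ' "$sample" |
  cut -d' ' -f2 |
  sort |
  uniq -c |
  awk '{print $2 "," $1}')

printf '%s\n' "$counts"
rm -f "$sample"
```

Because of the sort step, the rows come out in sorted order rather than in order of first appearance. Redirecting the pipeline to a file instead of a variable would give the CSV file directly.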

On SO we encourage users to add their efforts which they have put in order to solve their own problem, so kindly do so and let us know then. — RavinderSingh13, Nov 08 '19 at 09:25
@RavinderSingh13: I have added what I tried. Hope this would suffice. — Anish Nagaraj, Nov 08 '19 at 09:41
Thanks for adding them, I have added answer now, lemme know on same in its comments section. — RavinderSingh13, Nov 08 '19 at 09:44

RavinderSingh13 · Accepted Answer · 2019-11-08T12:21:52.177

0

Could you please try the following.

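awk '
BEGIN{
  OFS=","                    # join output fields with a comma
}
{
  gsub(/\r/,"")              # strip carriage returns (Windows line endings)
}
/^department:/{
  string=$NF                 # last whitespace-separated field is the value
  if(!(string in val)){
    b[++count]=string        # remember the order of first appearance
  }
  ++val[string]              # count occurrences per department
}
END{
  for(i=1;i<=count;i++){
    print b[i],val[b[i]]
  }
}
' Input_file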
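
Since $NF is only the last whitespace-separated word, a department such as "Human Resources" would be counted as "Resources". The following is a minimal sketch of one way around that, assuming such values can occur in the data (the question's sample does not show any). The sample file and its contents are hypothetical.

```shell
# Hypothetical sample: CRLF line endings, a trailing blank, and a value with a space.
sample=$(mktemp)
printf 'department: 234ABC\r\nname: Person1\r\ndepartment: Human Resources\r\ndepartment: 234ABC \r\n' > "$sample"

counts=$(awk '
{ sub(/\r$/, "") }                           # drop a trailing carriage return
/^department:/ {
  value = $0
  sub(/^department:[ \t]*/, "", value)       # strip the label and the blanks after it
  sub(/[ \t]+$/, "", value)                  # strip trailing blanks
  if (!(value in count)) order[++n] = value  # remember the order of first appearance
  count[value]++
}
END {
  for (i = 1; i <= n; i++) print order[i] "," count[order[i]]
}
' "$sample")

printf '%s\n' "$counts"
rm -f "$sample"
```

The sketch does no CSV quoting, so a value that itself contains a comma would need extra handling.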

edited Nov 08 '19 at 12:21

answered Nov 08 '19 at 09:43


@AnishNagaraj, Could you please try my edited code and lemme know then? – RavinderSingh13 Nov 08 '19 at 12:22

score 0 · Answer 2 · answered Nov 08 '19 at 13:56

This might work for you (GNU sed):

sed -En 's/^department: //;T;G;/^(\S+\n)(\S+\n)*\1/!P;h' file

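Ignore lines that do not begin with department:. For each remaining line, strip the label, append the hold space (the values seen so far) and print the value only if it does not already appear there, then copy the pattern space back to the hold space. Note that this prints each distinct department once, in order of first appearance; it does not print the counts.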
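
The sed command above lists distinct values rather than counting them. As a rough sketch of how sed could still be used to get the counts, it can be combined with sort and uniq. This is an assumption about what would be acceptable, not part of the original answer, and the sample data is hypothetical.

```shell
# Hypothetical sample data in the question's format.
sample=$(mktemp)
printf 'department: 234ABC\nname: Person1\ndepartment: 234ABC\ndepartment: XYZ012\n' > "$sample"

# The first sed prints only the department values (label removed).
# sort and uniq -c count them as "<count> <value>".
# The second sed rewrites each line as "value,count".
counts=$(sed -n 's/^department: //p' "$sample" |
  sort |
  uniq -c |
  sed -E 's/^ *([0-9]+) (.*)$/\2,\1/')

printf '%s\n' "$counts"
rm -f "$sample"
```

Unlike the hold-space approach, this keeps memory use in sed constant, at the cost of sorting the values.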
