0

I'm working with GWAS data. Need help.

My data looks like this:

IID,rs098083,kgp794789,rs09848309,kgp8300747,.....
63,CC,AG,GA,AA,.....
54,AT,CT,TT,AG,.....
12,TT,GA,AG,AA,.....
.
.
.

As above I have a total of 512 rows and 2 Million columns.

Desired output:

SNP,Genotyping
rs098083,{
"CC" : [ 1, 63, 6, 18, 33, ...],
"CT" : [ 2, 54, 6, 7, 8, ...],
"TT" : [ 4, 9, 12, 13, ...],
"AA" : [86, 124, 4, 19, ...],
"AT" : [8, 98, 34, 74, ....],
.
.
.
}     
kgp794789,{
"CC" : [ 1, 63, 6, 18, 33, ...],
"CT" : [ 2, 5, 6, 7, 8, ...],
"TT" : [ 4, 9, 12, 13, ...],
"AA" : [86, 124, 4, 19, ...],
"AT" : [8, 98, 34, 74, ....],
.
.
.

}
rs09848309,{
"CC" : [ 1, 63, 6, 18, 3, ...],
"CT" : [ 2, 5, 6, 7, 8, ...],
"TT" : [ 4, 9, 24 13, ...],
"AA" : [86, 134, 4, 19, ...],
"AT" : [8, 48, 34, 44, ....],
.
.
.

As above after pivoting, I should have a JSON file of 2 million rows & 2 Columns. The SNP column of the row contains the ID of the SNP. The genotyping column will contain a JSON BLOB. This BLOB will be a set of key-value pairs. The key is a particular genotype (e.g., CC, CT, TT, ....) and the value is a list of the IIDs with a genotype matching the key.

Output Format would be " a CSV with embedded JSON"

  • It's not clear from your question at which point you are stuck. Maybe it helps if you sketch the algorithm you have in mind to achieve the desired result. – user1934428 Apr 26 '22 at 06:55
  • What you show and describe as desired output doesn't make sense; please see the updated **Plausible outputs** section of my answer – Fravadona Apr 26 '22 at 17:44

1 Answers1

1

Here's an approach using stedolan/jq:

jq -Rrn '
  [ inputs / "," ] | transpose | .[0][1:] as $h | .[1:][]
  | .[1:] |= [reduce ([.,$h] | transpose[]) as $t ({}; .[$t[0]] += [$t[1]]) | @text]
  | join(", ")
'
rs098083, {"CC":["63"],"AT":["54"],"TT":["12"]}
kgp794789, {"AG":["63"],"CT":["54"],"GA":["12"]}
rs09848309, {"GA":["63"],"TT":["54"],"AG":["12"]}
kgp8300747, {"AA":["63","12"],"AG":["54"]}

Demo

Add tonumber if the IDs should be encoded as JSON numbers

jq -Rrn '
  [ inputs / "," ] | transpose | (.[0][1:] | map(tonumber)) as $h | .[1:][]
  | .[1:] |= [reduce ([.,$h] | transpose[]) as $t ({}; .[$t[0]] += [$t[1]]) | @text]
  | join(", ")
'
rs098083, {"CC":[63],"AT":[54],"TT":[12]}
kgp794789, {"AG":[63],"CT":[54],"GA":[12]}
rs09848309, {"GA":[63],"TT":[54],"AG":[12]}
kgp8300747, {"AA":[63,12],"AG":[54]}

Demo


If your ultimate goal is to have a JSON representation anyways, omit formatting the raw output, and something like this might do:

jq -Rn '
  [ inputs / "," ] | transpose | .[0][1:] as $h | reduce .[1:][] as $t (
    {}; .[$t[0]] = reduce ([$t[1:],$h] | transpose[]) as $i (
      {}; .[$i[0]] += [$i[1]]
    )
  )
'
{
  "rs098083": { "CC": ["63"], "AT": ["54"], "TT": ["12"] },
  "kgp794789": { "AG": ["63"], "CT": ["54"], "GA": ["12"] },
  "rs09848309": { "GA": ["63"], "TT": ["54"], "AG": ["12"] },
  "kgp8300747": { "AA": ["63", "12"], "AG": ["54"] }
}

Demo (formatted manually for easier comaprison with previous solutions)

pmf
  • 24,478
  • 2
  • 22
  • 31
  • Above I have given sample data, actually I have 522 rows and 52 lakh columns CSV file, how to apply above command on csv file –  Apr 26 '22 at 10:51
  • @RajuNatha, that would be `jq -Rrn '...' file.csv`. BTW your sample input and your desired output are not valid CSVs, so the replies you'll get won't read and generate real CSVs. – Fravadona Apr 26 '22 at 11:02
  • In fact, I need to convert CSV file to JSON –  Apr 26 '22 at 11:10
  • @pmf, can you please help me with the command Since I have a CSV file and need output in JSON format like above your results –  Apr 26 '22 at 12:27
  • @RajuNatha My approach works for any number of rows and any number of columns, as long as it is formatted like your sample. It broke with your edit (you have changed the separators). I have updated my answer accordingly. I have also added a more convenient approach, as you mentioned to be looking for a JSON output anyways. – pmf Apr 26 '22 at 19:35
  • @pmf, can you please explain the bash command i.e Add to number if the IDs should be encoded as JSON numbers –  Apr 29 '22 at 05:05
  • @RajuNatha For later reference, the values from the first column are saved as an array into a variable `$h` using `.[0][1:] as $h`. If you `map` the `tonumber` filter onto that array before saving, i.e. by changing it to `(.[0][1:] | map(tonumber)) as $h`, the array's items will be converted to numbers (and retrieved accordingly when referenced). – pmf Apr 29 '22 at 06:07