The file is about 45 GB and has a ".gds" extension (a Genomic Data Structure (GDS) file). How can I read it from AWS S3 into RStudio so that I can run statistical analyses on RStudio Cloud?

I tried:

library(aws.s3)

gdsfile <- get_object("s3://bucketname.s3.amazonaws.com/example.gds", bucket = "bucketname")

It did not work the way I wanted.

I wanted:

Object of class "SeqVarGDSClass"
File: D:\Program Files\R\R-4.0.2\library\SAIGEgds\extdata\grm1k_10k_snp.gds (694.2K)
+    [  ] *
|--+ description   [  ] *
|--+ sample.id   { Str8 1000 LZMA_ra(12.6%), 625B } *
|--+ variant.id   { Int32 10000 LZMA_ra(9.87%), 3.9K } *
|--+ position   { Int32 10000 LZMA_ra(9.87%), 3.9K } *
|--+ chromosome   { Str8 10000 LZMA_ra(0.71%), 149B } *
|--+ allele   { Str8 10000 LZMA_ra(1.03%), 421B } *
|--+ genotype   [  ] *
|  |--+ data   { Bit2 2x1000x10000 LZMA_ra(13.8%), 675.5K } *
|  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
|  \--+ extra   { Int16 0 LZMA_ra, 18B }
|--+ phase   [  ]
|  |--+ data   { Bit1 1000x10000 LZMA_ra(0.03%), 333B } *
|  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
|  \--+ extra   { Bit1 0 LZMA_ra, 18B }
|--+ annotation   [  ]
|  |--+ id   { Str8 10000 LZMA_ra(5.47%), 3.7K } *
|  |--+ qual   { Float32 10000 LZMA_ra(0.38%), 161B } *
|  |--+ filter   { Int32,factor 10000 LZMA_ra(0.38%), 161B } *
|  |--+ info   [  ]
|  \--+ format   [  ]
\--+ sample.annotation   [  ]
   |--+ sex   { Str8 1000 LZMA_ra(9.00%), 97B } *
   \--+ phenotype   { Int32 1000 LZMA_ra(2.75%), 117B } *

So what should I do to retrieve files (in any format) from S3 and read them into RStudio?

I did some research and only found some examples for .csv files. However, my file is apparently not a .csv file.

Thanks in advance.

Edit: for the first one,

> gdsfile<-get_object("s3://bucketname.s3.amazonaws.com/grm1k_10k_snp.gds", bucket = "bucketname")
> seqOpen(gdsfile)
Error in seqOpen(gdsfile) : is.character(gds.fn) is not TRUE
> gdsfile
   [1] 43 4f 52 45 41 52 52 41 59 78 30 41 00 01 01 00 00 00
  [19] 0b 02 00 00 00 80 00 00 00 00 00 00 01 00 00 00 f5 01
  [37] 00 00 00 00 f5 01 00 00 00 00 04 00 08 c6 43 75 4e f6
  [55] 01 0a 00 00 00 01 c7 43 75 17 e5 7d 9a 01 00 00 00 00
  [73] 2a 00 00 00 00 00 03 00 09 02 f5 00 02 00 00 00 09 44
  [91] 74 31 12 02 00 00 00 15 44 c6 60 10 0b 64 65 73 63 72
 [109] 69 70 74 69 6f 6e 28 00 00 00 00 00 03 00 09 02 f5 00
 [127] 03 00 00 00 09 44 74 31 12 00 00 00 00 15 44 c6 60 10
 [145] 09 73 61 6d 70 6c 65 2e 69 64 29 00 00 00 00 00 03 00

For the second one,

library(SAIGEgds)

fn <- system.file("extdata", "grm1k_10k_snp.gds", package="SAIGEgds")
gdsfile <- seqOpen(fn)

Then you would see what I wanted.

John Rotenstein
Jason
  • Could you elaborate on what didn't work with the first command, and on how you got to the second one? The example shows that the file read from your disk is an instance of some class, but you don't show how you read or initialized it. – Andre.IDK Sep 13 '20 at 17:52
  • Based on the new info, the output of `get_object` looks like bytes, while `seqOpen` seems to require a file name. You need to either find a function in `SAIGEgds` that accepts bytes, or save these bytes in a form it can consume. – Andre.IDK Sep 13 '20 at 20:57
  • Thanks. I will try to find the function. – Jason Sep 13 '20 at 21:16
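
The mismatch described in the comments (raw bytes from `get_object` versus the file name that `seqOpen` expects) can probably be worked around by putting the object on local disk first. The sketch below is one possible way, not a tested recipe for this bucket. It assumes the `aws.s3` and `SeqArray` packages are installed, AWS credentials are already configured, the object key is simply `example.gds`, and the machine has enough free disk space for the 45 GB file. The helper names `open_gds_from_s3` and `raw_to_file` are hypothetical and are not part of either package.

```r
# Hypothetical helper: download a GDS file from S3 to local disk, then open
# it by file name, which is what seqOpen() requires.
# Assumes aws.s3 and SeqArray are installed and AWS credentials are set.
open_gds_from_s3 <- function(object, bucket,
                             dest = file.path(tempdir(), basename(object))) {
  # save_object() writes the S3 object to the local path `dest`
  # instead of returning it as a raw vector.
  aws.s3::save_object(object = object, bucket = bucket, file = dest)
  # seqOpen() takes a character file name and returns a SeqVarGDSClass object.
  SeqArray::seqOpen(dest)
}

# Hypothetical helper for small objects already fetched with get_object():
# write the raw vector to a file and return the path for seqOpen().
raw_to_file <- function(bytes, dest = tempfile(fileext = ".gds")) {
  stopifnot(is.raw(bytes))
  writeBin(bytes, dest)
  dest
}

# Usage (needs network access and credentials; names are from the question):
# gdsfile <- open_gds_from_s3("example.gds", bucket = "bucketname")
# gdsfile
# SeqArray::seqClose(gdsfile)
```

For a file of this size the `save_object` route is likely the better fit, since `get_object` holds the whole object in memory as a raw vector before anything can be written out. The raw output in the question does start with the bytes `43 4f 52 45 41 52 52 41 59`, which spell "COREARRAY", so the download itself appears to have returned the GDS file contents.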
