The file is about 45 GB and has a ".gds" extension (a Genomic Data Structure (GDS) file). How can I read it from AWS S3 into RStudio so that I can run statistical analysis on RStudio Cloud?
I tried:
library(aws.s3)
gdsfile<-get_object("s3://bucketname.s3.amazonaws.com/example.gds", bucket = "bucketname")
It did not work the way I wanted.
I wanted:
Object of class "SeqVarGDSClass"
File: D:\Program Files\R\R-4.0.2\library\SAIGEgds\extdata\grm1k_10k_snp.gds (694.2K)
+ [ ] *
|--+ description [ ] *
|--+ sample.id { Str8 1000 LZMA_ra(12.6%), 625B } *
|--+ variant.id { Int32 10000 LZMA_ra(9.87%), 3.9K } *
|--+ position { Int32 10000 LZMA_ra(9.87%), 3.9K } *
|--+ chromosome { Str8 10000 LZMA_ra(0.71%), 149B } *
|--+ allele { Str8 10000 LZMA_ra(1.03%), 421B } *
|--+ genotype [ ] *
| |--+ data { Bit2 2x1000x10000 LZMA_ra(13.8%), 675.5K } *
| |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
| \--+ extra { Int16 0 LZMA_ra, 18B }
|--+ phase [ ]
| |--+ data { Bit1 1000x10000 LZMA_ra(0.03%), 333B } *
| |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
| \--+ extra { Bit1 0 LZMA_ra, 18B }
|--+ annotation [ ]
| |--+ id { Str8 10000 LZMA_ra(5.47%), 3.7K } *
| |--+ qual { Float32 10000 LZMA_ra(0.38%), 161B } *
| |--+ filter { Int32,factor 10000 LZMA_ra(0.38%), 161B } *
| |--+ info [ ]
| \--+ format [ ]
\--+ sample.annotation [ ]
|--+ sex { Str8 1000 LZMA_ra(9.00%), 97B } *
\--+ phenotype { Int32 1000 LZMA_ra(2.75%), 117B } *
So what should I do to retrieve files (in any format) from S3 and read them into RStudio?
I did some research and only found examples for .csv files, but my file is not a .csv file.
Thanks in advance.
Edit: for the first one (what I tried),
> gdsfile<-get_object("s3://bucketname.s3.amazonaws.com/grm1k_10k_snp.gds", bucket = "bucketname")
> seqOpen(gdsfile)
Error in seqOpen(gdsfile) : is.character(gds.fn) is not TRUE
> gdsfile
[1] 43 4f 52 45 41 52 52 41 59 78 30 41 00 01 01 00 00 00
[19] 0b 02 00 00 00 80 00 00 00 00 00 00 01 00 00 00 f5 01
[37] 00 00 00 00 f5 01 00 00 00 00 04 00 08 c6 43 75 4e f6
[55] 01 0a 00 00 00 01 c7 43 75 17 e5 7d 9a 01 00 00 00 00
[73] 2a 00 00 00 00 00 03 00 09 02 f5 00 02 00 00 00 09 44
[91] 74 31 12 02 00 00 00 15 44 c6 60 10 0b 64 65 73 63 72
[109] 69 70 74 69 6f 6e 28 00 00 00 00 00 03 00 09 02 f5 00
[127] 03 00 00 00 09 44 74 31 12 00 00 00 00 15 44 c6 60 10
[145] 09 73 61 6d 70 6c 65 2e 69 64 29 00 00 00 00 00 03 00
For the second one (what I wanted),
library(SAIGEgds)
fn <- system.file("extdata", "grm1k_10k_snp.gds", package="SAIGEgds")
gdsfile <- seqOpen(fn)
Then you would see what I wanted.
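The error in the first attempt suggests that seqOpen() wants a file name (a character string), while get_object() returns the file contents as a raw vector. A minimal sketch of what might work, assuming AWS credentials are already configured and the machine has enough free disk space for the whole file, is to download the object to local disk with save_object() from aws.s3 and then open that path. The helper name open_gds_from_s3 and the temp-directory default are hypothetical, and this is a sketch under those assumptions rather than a tested solution for a 45 GB file.

```r
# Sketch: download a GDS file from S3 to local disk, then open it by path.
# Assumptions: the aws.s3 and SeqArray packages are installed, AWS credentials
# are set (e.g. through environment variables), and local_path has enough space.
# open_gds_from_s3 is a hypothetical helper name, not part of either package.
open_gds_from_s3 <- function(object, bucket,
                             local_path = file.path(tempdir(), basename(object))) {
  # save_object() writes the S3 object to a file instead of returning raw bytes
  aws.s3::save_object(object = object, bucket = bucket, file = local_path)
  # seqOpen() takes the file name as a character string
  SeqArray::seqOpen(local_path)
}

# Usage with the placeholder names from the question (not run here):
# gdsfile <- open_gds_from_s3("grm1k_10k_snp.gds", bucket = "bucketname")
# gdsfile                       # should print the "SeqVarGDSClass" summary
# SeqArray::seqClose(gdsfile)   # close the file when finished
```

For a file this large, it is probably better to pass a local_path on a volume with plenty of space than to rely on the temporary directory.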