
I'm trying to create an R API for StackOverflow. The output is gzipped. For example:

readLines("http://api.stackoverflow.com/0.9/stats/", warn=FALSE)
[1] "\037‹\b"                                                                                                                                                                                                                                                                                         
[2] "\030\002úØÛy°óé½\036„iµXäË–[<üt—Zu[\\VmÎHî=ÜÛݹ×ýz’Í.äûû÷>ý´\a\177Ýh÷\017îÝÛÙwßÚáÿþ«¼þý\027ÅrÝæÔlgüÀëA±\017›ìŽï{M¤û.\020\037�Ë\"¿’\006³ì\032„Úß9¸ÿ`¼ç÷³*~ÿKêˆð¡\006v¦ð²ýô£�ñÃ�ì+ôU�_\026滽�]êt¼·?ÞûÈ4ù%\016~S0^>àe¶ÀG\037½n³éÛôKê缬®‚\016Êê¢úý×u‰fó¶]=º{·aÎšŽ—y{·©î\026‹‹»h5^-/‚W1 |9[UŲõ^§�Ç"
[3] ":¬´¿1M\177ð\"0íö¹ñ…YÞLëbÕ*!~â\027\036§çU�®êê¢ÎˆµhòýæÅ´Zn\036S¶Z•ùv[­§óm´î�"                                                                                                                                                                                                                      
[4] "Í™t˪^d¥£·üÂ?¾ÿ\033'¿$ù\177"  

Is there a good way to gunzip this in R, short of writing the output to file, gunzip'ing it, and reading it back in?

Shane
  • I'm looking forward to the package that is bound to fall out the other end of this research! – JD Long Jun 28 '10 at 14:21
  • @JD: Absolutely. I'll post the google code page shortly and am happy to take on collaborators. But my initial feeling is that the SO API isn't very useful. – Shane Jun 28 '10 at 15:58

3 Answers


You could do:

conn <- gzcon(url("http://api.stackoverflow.com/0.9/stats/"))
data <- readLines(conn)
nico
  • Thanks! Don't forget to close the connection when you're finished. – Shane Jun 27 '10 at 20:13
  • Why is the double `readLines` needed? [mbq's answer](http://stackoverflow.com/questions/3128422/gunzip-a-file-stream-in-r/3128738#3128738) works too. – Marek Jun 28 '10 at 08:19
  • @Marek: corrected. That was just me trying different things and I must have pasted some extra command. Thanks for pointing that out. – nico Jun 28 '10 at 11:08
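
Following up on the point about closing the connection, here is a minimal sketch of the same `gzcon` pattern wrapped in a function that always closes what it opens. The helper name `read_gz_lines` is made up for illustration and is not part of any package. To keep the example runnable offline, it builds gzip bytes locally (a temp file is used only to create the test data) and decompresses them from memory via `rawConnection`; with the API you would pass `url(...)` instead.

```r
# Hypothetical helper (not from any package): read all lines from a
# gzip-compressed connection and make sure it gets closed afterwards.
read_gz_lines <- function(con) {
  gz <- gzcon(con)      # wrap the connection so reads are decompressed
  on.exit(close(gz))    # closing the wrapper also closes `con`
  readLines(gz, warn = FALSE)
}

# Offline demonstration data: write two lines as gzip, then load the
# compressed bytes into memory.
tmp <- tempfile(fileext = ".gz")
out <- gzfile(tmp, "w")
writeLines(c("first line", "second line"), out)
close(out)
bytes <- readBin(tmp, what = "raw", n = file.size(tmp))
unlink(tmp)

# Decompress straight from memory, with no intermediate file.
lines <- read_gz_lines(rawConnection(bytes))
print(lines)
```

For the original question the call would look like `read_gz_lines(url("http://api.stackoverflow.com/0.9/stats/"))`.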

Try:

p <- gzcon(url("http://api.stackoverflow.com/0.9/stats/"))
readLines(p)
hadley
mbq

Ideally we should tell the server that we can handle gzipped content, check the HTTP headers to see whether the response is actually gzip-encoded, and decompress only if it is. The RCurl package can do this:

library(RCurl)
getURL("http://api.stackoverflow.com/0.9/stats/",
       .opts=list(encoding="identity,gzip"))
Jyotirmoy Bhattacharya
  • That would be the right way to do it, but be aware that the Stack Overflow API team has [decided against obeying the HTTP protocol](http://stackapps.com/questions/729) in this regard. Slightly related: we won't see [proper HTTP/1.1 cache control](http://stackapps.com/questions/1028) for the time being either. – Steffen Opel Jul 10 '10 at 08:17
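
If the response headers can't be relied on, one fallback for "decompress only if it is gzipped" is to look at the body itself. The sketch below is a swapped-in technique, not what RCurl does: it checks for the gzip magic bytes (`1f 8b`) and only wraps the data in `gzcon` when they are present. The function names `is_gzip` and `decode_body` are hypothetical, and the input is assumed to be the raw bytes of a response body that you have already fetched.

```r
# Hypothetical helpers (names are illustrative only).

# TRUE if the raw vector starts with the gzip magic number 0x1f 0x8b.
is_gzip <- function(bytes) {
  length(bytes) >= 2 && identical(bytes[1:2], as.raw(c(0x1f, 0x8b)))
}

# Return the body as lines of text, decompressing only when needed.
decode_body <- function(bytes) {
  con <- rawConnection(bytes)
  if (is_gzip(bytes)) con <- gzcon(con)  # wrap only gzip-encoded bodies
  on.exit(close(con))                    # also closes the raw connection
  readLines(con, warn = FALSE)
}

# Offline demonstration: one plain body and one gzip-compressed body.
plain_bytes <- charToRaw("plain text\n")

tmp <- tempfile(fileext = ".gz")
out <- gzfile(tmp, "w")
writeLines("compressed text", out)
close(out)
gz_bytes <- readBin(tmp, what = "raw", n = file.size(tmp))
unlink(tmp)

plain_result <- decode_body(plain_bytes)
gz_result <- decode_body(gz_bytes)
print(plain_result)
print(gz_result)
```

Sniffing the first two bytes is only a heuristic; when the server sends a correct `Content-Encoding` header, trusting that header is the cleaner approach.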