0

This is the structure of the github stats api data for a repository. I am using dplyr and tidy_json libraries to list the number of commits ("c") ,deletes("d"), lines of code added("a") and the corresponding week("w") for every user in a repository.

      {
        "total": 5,
        "weeks": [
          {
            "w": 1428192000,
            "a": 0,
            "d": 0,
            "c": 0
          },
          {
            "w": 1428796800,
            "a": 0,
            "d": 0,
            "c": 0
          }
        ],
        "author": {
          "login": "ttuser1234",
          "id": 111111111
        }
      },
      {
        "total": 18,
        "weeks": [    
          {
            "w": 1428192000,
            "a": 212,
            "d": 79,
            "c": 5
          },
          {
            "w": 1428796800,
            "a": 146,
            "d": 67,
            "c": 1
          }
        ],
        "author": {
          "login": "coder1234",
          "id": 22222222
        }
      }
}

I am able to extract the weeks and author data separately, but then I am unable to join them together.

inp_file=read_json("The JSON file")
dat=as.tbl_json(inp_file)
dat%>%
  enter_object("weeks") %>%
  gather_array %>%
  spread_values(week=jstring("w"),add=jstring("a"),del=jstring("d"),comm=jstring("c"))


enter_object("author") %>%
  spread_values(handle=jstring("login"))

At no point am I able to jump from the author object to the weeks object to link the 2 of them. Is there any way I can do this? Appreciate any help.

mongofresher
  • 173
  • 1
  • 4
  • 10

2 Answers2

0

tidyjson is nice, but I am not sure it is necessary in this case. Here is one way to achieve what I think is the desired result.

library(jsonlite)
library(dplyr)

df1 <- fromJSON(
  '
[
{
"total": 5,
"weeks": [
{
  "w": 1428192000,
  "a": 0,
  "d": 0,
  "c": 0
},
  {
  "w": 1428796800,
  "a": 0,
  "d": 0,
  "c": 0
  }
],
  "author": {
  "login": "ttuser1234",
  "id": 111111111
  }
  },
  {
  "total": 18,
  "weeks": [    
  {
  "w": 1428192000,
  "a": 212,
  "d": 79,
  "c": 5
  },
  {
  "w": 1428796800,
  "a": 146,
  "d": 67,
  "c": 1
  }
  ],
  "author": {
  "login": "coder1234",
  "id": 22222222
  }
  }
]
'
)

# now the weeks column will actually be nested data.frames
#  we can sort of join the weeks with the author information
#  like this

df_joined <- df1 %>%
  do(
    data.frame(
      .[["author"]],
      bind_rows(.[["weeks"]])
    )
  )
timelyportfolio
  • 6,479
  • 30
  • 33
0

Solution with tidyjson. It looks like your JSON has a bit of trouble in it, and that it perhaps should be an array? Fixed version below.

Using the development version from devtools::install_github('jeremystan/tidyjson')

In any case, it is not necessary to enter_object for both objects. Rather, you can use a more complex path to grab the handle for author before entering the weeks object.

    json <- '[
    {
        "total": 5,
        "weeks": [
            {
                "w": 1428192000,
                "a": 0,
                "d": 0,
                "c": 0
            },
            {
                "w": 1428796800,
                "a": 0,
                "d": 0,
                "c": 0
            }
        ],
        "author": {
            "login": "ttuser1234",
            "id": 111111111
        }
    },
    {
        "total": 18,
        "weeks": [
            {
                "w": 1428192000,
                "a": 212,
                "d": 79,
                "c": 5
            },
            {
                "w": 1428796800,
                "a": 146,
                "d": 67,
                "c": 1
            }
        ],
        "author": {
            "login": "coder1234",
            "id": 22222222
        }
    }
]'

  json %>% as.tbl_json %>%
    gather_array() %>%
    spread_values(handle=jstring('author','login')) %>% ## useful tip
    enter_object("weeks") %>%
    gather_array %>%
    spread_values(week=jstring("w"),add=jstring("a")
    ,del=jstring("d"),comm=jstring("c"))

# A tbl_json: 4 x 8 tibble with a "JSON" attribute
#   `attr(., "JSON")` document.id array.index     handle array.index.2       week   add   del  comm
#               <chr>       <int>       <int>      <chr>         <int>      <chr> <chr> <chr> <chr>
#1 {"w":1428192000...           1           1 ttuser1234             1 1428192000     0     0     0
#2 {"w":1428796800...           1           1 ttuser1234             2 1428796800     0     0     0
#3 {"w":1428192000...           1           2  coder1234             1 1428192000   212    79     5
#4 {"w":1428796800...           1           2  coder1234             2 1428796800   146    67     1

Of course, you can always split the data into two separate pipelines, but this seems like a nicer solution for this example.

cole
  • 1,737
  • 2
  • 15
  • 21