The googledrive
package includes some low-level API functions that can be used for this. For example, here's how we can get a list of revisions for one Google doc:
library(googledrive)
library(tidyverse)
# replace this with the ID of your google doc
# this doc is private, it wont work for you
fileId <- "1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps"
# Get the name of the file and some other metadata
file <- build_request(
path = "drive/v3/files/{fileId}",
method = "GET",
params = list(
fileId = fileId,
fields = "*"
),
token = drive_token()
)
file_ret <- process_response(make_request(file))
# Now for this doc, query the Drive API to get get URLs and other meta-data for all the revisions available to us
req2 <- build_request(
path = "drive/v2/files/{fileId}/revisions",
method = "GET",
params = list(
fileId = fileId
),
token = drive_token()
)
revs2 <- process_response(make_request(req2))
# See
# https://developers.google.com/drive/api/v2/reference/revisions#resource
# for an explanation of each variable that we have here
# tidy revisions into a dataframe
revs2_df <-
map_df(
revs2$items,
`[`,
c(
"kind",
"etag" ,
"id",
"selfLink" ,
"mimeType" ,
"modifiedDate",
"published" ,
"lastModifyingUserName"
)
)
# get exportLinks URLs out of its nest
revs2_export_url <- map_df(revs2$items, "exportLinks")
# bind together
revs2_df_bind <- bind_cols(revs2_df, revs2_export_url)
The result includes, for each revision, the date, time, name of the user that did the revision, and URLs to export that revision into a download file:
# A tibble: 140 x 16
kind etag id selfLink mimeType modifiedDate published lastModifyingUs… `application/rt…
<chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr>
1 drive… "\"H… 28367 https:/… applica… 2017-09-12T… FALSE Gayoung Park https://docs.go…
2 drive… "\"H… 28487 https:/… applica… 2017-09-12T… FALSE Gayoung Park https://docs.go…
3 drive… "\"H… 28862 https:/… applica… 2017-09-13T… FALSE Gayoung Park https://docs.go…
4 drive… "\"H… 29221 https:/… applica… 2017-09-13T… FALSE Gayoung Park https://docs.go…
5 drive… "\"H… 29258 https:/… applica… 2017-09-13T… FALSE Gayoung Park https://docs.go…
6 drive… "\"H… 29434 https:/… applica… 2017-09-13T… FALSE Gayoung Park https://docs.go…
7 drive… "\"H… 29454 https:/… applica… 2017-09-18T… FALSE Gayoung Park https://docs.go…
8 drive… "\"H… 29603 https:/… applica… 2017-09-18T… FALSE Gayoung Park https://docs.go…
9 drive… "\"H… 30108 https:/… applica… 2017-09-18T… FALSE Gayoung Park https://docs.go…
10 drive… "\"H… 30115 https:/… applica… 2017-09-21T… FALSE Gayoung Park https://docs.go…
# ... with 130 more rows, and 7 more variables: `application/vnd.oasis.opendocument.text` <chr>,
# `text/html` <chr>, `application/pdf` <chr>, `application/epub+zip` <chr>,
# `application/zip` <chr>,
# `application/vnd.openxmlformats-officedocument.wordprocessingml.document` <chr>,
# `text/plain` <chr>
We can then loop over the export URLs to download all the revisions, and compare the size or word count or whatever, and eventually get some plots like this:


Full code for those plots is here: https://gist.github.com/benmarwick/1feaa2b2f0d7bc5f7e97903b8ff92aed
Be aware that there are some severe limitations to the Google Drive revision history that is available via the API. For example,
- When many users edit at the same time, we only get the name of the first editor active in that session. The others are not captured.
- When many edits happen in a short period of time, these are combined by Google into one revision, and we cannot see them individually. We do not have good temporal resolution.
- Older edits are deleted by Google to save space. We do not know what their rules are for this.