library("tibble")
library("jsonlite")
library("purrr")
#>
#> Attaching package: 'purrr'
#> The following object is masked from 'package:jsonlite':
#>
#> flatten
The audience for this article is the developers of boxr, who may let many weeks or months pass without actively thinking about how the functions in this package:
At its heart, the goal of this package is to abstract away the complexities of using the Box API. We assume that a new user starts using this package with some familiarity with the Tidyverse, and r-lib packages like fs, so we aim to provide them with a familiar way of doing things.
Providing familiarity, particularly to emulate an opinionated framework like Tidyverse, requires us (as boxr developers) to introduce opinions. Thus, we also wish to provide an “escape hatch”, which could be used by those who want to work outside of the Tidyverse, or outside of our opinions.
In Tidyverse, the base unit of analysis is the data frame. Among boxr’s developers, it is uncontroversial that we should use data frames as much as possible. However, data frames come in different flavors:
I (Ian) am a firm believer that following Postel’s Law helps us (and our users) avoid hard-to-diagnose trouble. As you may know, Postel’s law says to be “flexible in what you accept; strict in what you return”. In other words, we should strive to accept and interpret users’ input so long as the intent is clear, but we should specify very clearly what a function returns and adhere strictly to that specification.
A famous Tidyverse example is how subsetting a data.frame will, by default, return a vector rather than a data.frame if only one column is specified:
str(mtcars[, c("wt", "mpg")])
#> 'data.frame': 32 obs. of 2 variables:
#> $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
#> $ mpg: num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
str(mtcars[, "mpg"])
#> num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
To avoid this behavior you can specify drop = FALSE, but this is sometimes forgotten – even by experienced R users:
str(mtcars[, "mpg", drop = FALSE])
#> 'data.frame': 32 obs. of 1 variable:
#> $ mpg: num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
The tibble designs this problem away. Following Postel’s law, subsetting a tibble always returns a tibble; if you want a vector, you have to call another function. It is strict with its output.
str(as_tibble(mtcars)[, "mpg"])
#> tibble [32 × 1] (S3: tbl_df/tbl/data.frame)
#> $ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
As we figure out what our functions return, I want to keep Postel’s Law in mind.
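As a minimal sketch of Postel’s Law applied to our own inputs and outputs, consider a helper that accepts an id as either a number or a string, but always returns a character vector of digits. The name as_box_id() is hypothetical and is not part of boxr; it only illustrates the "flexible input, strict output" idea.

```r
# Hypothetical helper (not part of boxr): flexible input, strict output.
as_box_id <- function(x) {
  if (is.numeric(x)) {
    # avoid scientific notation and padding when converting numbers
    x <- format(x, scientific = FALSE, trim = TRUE)
  }
  x <- as.character(x)
  # strict: the return value is always a character vector of digits
  if (!all(grepl("^[0-9]+$", x))) {
    stop("`x` must contain only digits.", call. = FALSE)
  }
  x
}

id_from_number <- as_box_id(123053109701)
id_from_string <- as_box_id("123053109701")
```

Both calls return the same character string. Note that very large numeric ids (beyond 2^53) cannot be represented exactly as doubles, which is one more reason to prefer strings at the boundary.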
The boxr package is an exercise in abstracting away the Box API; sometimes this abstraction helps developers like me forget that it is actually there. It’s there.
The API is classified according to endpoints and resources; I think of these as analogous to R functions and objects. The Box API is comprehensive; we cannot possibly aspire to cover it all. Instead, our goal is to provide easy access to as many day-to-day endpoints as we can, and provide a way to help you to access others if you need to.
Some of our functions call only one endpoint, e.g. box_ls() calls only the list items in folder endpoint. Others call multiple endpoints, e.g. box_fetch() calls the list-items endpoint as well as the download file endpoint.
If a function calls a single endpoint (perhaps even repeatedly), it should return the response (or collection of responses) that the API returns. Consider the content of a sample response from the list-items endpoint:
content <-
fromJSON(
'{
"entries": [
{
"id": "12345",
"etag": "1",
"type": "file",
"sequence_id": "3",
"name": "Contract.pdf",
"sha1": "85136C79CBF9FE36BB9D05D0639C70C265C18D37",
"file_version": {
"id": "12345",
"type": "file_version",
"sha1": "134b65991ed521fcfe4724b7d814ab8ded5185dc"
}
}
],
"limit": 1000,
"offset": 2000,
"order": [
{
"by": "type",
"direction": "ASC"
}
],
"total_count": 5000
}',
simplifyVector = FALSE
)
The sample response shown on the Box web-page is different from the response that I actually get. The example JSON, in the "entries" element, does not quote numeric values, e.g. {"id": 0}, whereas the actual response does quote numeric values, e.g. {"id": "0"}.
While this may seem inconvenient, it may help us out: although elements like file id are nominally integers, they are often larger than R’s integer-maximum. For this reason, I think that from boxr’s perspective, id should remain a character string. That said, I think we can parse other things:
"etag", "sequence_id".
names ending in "_at".
names starting with "is_", "can_", or "has_".
Here’s the parsed content.
str(content)
#> List of 5
#> $ entries :List of 1
#> ..$ :List of 7
#> .. ..$ id : chr "12345"
#> .. ..$ etag : chr "1"
#> .. ..$ type : chr "file"
#> .. ..$ sequence_id : chr "3"
#> .. ..$ name : chr "Contract.pdf"
#> .. ..$ sha1 : chr "85136C79CBF9FE36BB9D05D0639C70C265C18D37"
#> .. ..$ file_version:List of 3
#> .. .. ..$ id : chr "12345"
#> .. .. ..$ type: chr "file_version"
#> .. .. ..$ sha1: chr "134b65991ed521fcfe4724b7d814ab8ded5185dc"
#> $ limit : int 1000
#> $ offset : int 2000
#> $ order :List of 1
#> ..$ :List of 2
#> .. ..$ by : chr "type"
#> .. ..$ direction: chr "ASC"
#> $ total_count: int 5000
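To see why id is better left as a character string, here is a small base-R check. The id value is one of the file ids that appears later in this article; everything else is standard R.

```r
# R's largest representable integer
max_int <- .Machine$integer.max

# a file id taken from a listing shown later in this article
file_id <- "721629732867"

# the id is larger than the integer maximum ...
exceeds_integer <- as.numeric(file_id) > max_int

# ... so coercing it to integer yields NA (with a warning)
id_as_int <- suppressWarnings(as.integer(file_id))
```

Keeping the id as a string avoids both the NA and any loss of precision.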
In the content list, only the entries element has lasting information; the other elements deal with the pagination.
# we could imagine this as a function that would contain all our parsing rules
parse_entry <- function(entry) {
# if we import tidyselect, we can use functions like `ends_with()`
entry <- purrr::map_at(entry, c("etag", "sequence_id"), as.numeric)
entry <- purrr::map_if(entry, is.list, parse_entry)
entry
}
entries <-
content$entries %>%
map(parse_entry)
str(entries)
#> List of 1
#> $ :List of 7
#> ..$ id : chr "12345"
#> ..$ etag : num 1
#> ..$ type : chr "file"
#> ..$ sequence_id : num 3
#> ..$ name : chr "Contract.pdf"
#> ..$ sha1 : chr "85136C79CBF9FE36BB9D05D0639C70C265C18D37"
#> ..$ file_version:List of 3
#> .. ..$ id : chr "12345"
#> .. ..$ type: chr "file_version"
#> .. ..$ sha1: chr "134b65991ed521fcfe4724b7d814ab8ded5185dc"
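A name-based rule, such as parsing elements whose names end in "_at", could be sketched in base R as follows. This is only an illustration: parse_box_datetime() and parse_by_name() are hypothetical names, the sample timestamp is made up, and the assumption is that datetimes arrive as ISO 8601 strings with a numeric UTC offset.

```r
# Hypothetical helper: parse an ISO 8601 string with an offset like "-08:00".
parse_box_datetime <- function(x) {
  # "%z" expects "-0800", so drop the colon from the offset first
  x <- sub("([+-][0-9]{2}):([0-9]{2})$", "\\1\\2", x)
  as.POSIXct(x, format = "%Y-%m-%dT%H:%M:%S%z", tz = "UTC")
}

# Hypothetical rule: convert every element whose name ends in "_at".
# Assumes `entry` is a named list.
parse_by_name <- function(entry) {
  is_at <- endsWith(names(entry), "_at")
  entry[is_at] <- lapply(entry[is_at], parse_box_datetime)
  entry
}

# made-up sample entry
entry <- list(id = "12345", created_at = "2012-12-12T10:53:43-08:00")
parsed_entry <- parse_by_name(entry)
```

The id is left untouched, while created_at becomes a POSIXct value in UTC. A rule like this could sit alongside the "etag" and "sequence_id" conversions in parse_entry().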
Here’s where things get interesting.
As it stands, many of boxr’s functions, e.g. box_ls(), will return the entries as a list of lists, attaching the S3 class boxr_object_list. It is minimally processed, allowing you to do with it as you please.
This S3 class has an as.data.frame() method which will convert the element into a data frame. (If you want a data frame 99% of the time, it is inconvenient to call as.data.frame() 99% of the time.)
It behaves much like the internal function we have, stack_rows_df():
boxr:::stack_rows_df(entries)
#> id etag type sequence_id name
#> 1 12345 1 file 3 Contract.pdf
#> sha1 file_version.id file_version.type
#> 1 85136C79CBF9FE36BB9D05D0639C70C265C18D37 12345 file_version
#> file_version.sha1
#> 1 134b65991ed521fcfe4724b7d814ab8ded5185dc
For those who prefer tibbles, we have another function, stack_rows_tbl():
boxr:::stack_rows_tbl(entries)
#> # A tibble: 1 × 7
#> id etag type sequence_id name sha1 file_version
#> <chr> <dbl> <chr> <dbl> <chr> <chr> <list>
#> 1 12345 1 file 3 Contract.pdf 85136C79CBF9FE36BB9D0… <named list>
A couple of things you might notice:
stack_rows_df() returns a data.frame. List items are unnested; the nested item names are delimited with a ".", e.g. file_version.id.
stack_rows_tbl() returns a tibble. List items remain nested.
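The difference between the two shapes can be approximated with base R alone. This sketch does not use the boxr internals; it only shows what "unnested with dot-delimited names" and "nested in a list column" look like for a single made-up entry.

```r
entry <- list(
  id = "12345",
  name = "Contract.pdf",
  file_version = list(id = "12345", type = "file_version")
)

# unnested: names of nested items are joined to the parent name with "."
flat <- as.data.frame(entry, stringsAsFactors = FALSE)
flat_names <- names(flat)

# nested: keep the sub-list as a single cell of a list column
nested <- as.data.frame(entry[c("id", "name")], stringsAsFactors = FALSE)
nested$file_version <- list(entry$file_version)
```

Here flat has four columns, including file_version.id and file_version.type, whereas nested has three columns, the last of which holds the whole file_version list.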
Right now, we have a few different ways to deal with return objects:
box_version_history(): calls a single endpoint, returns a data frame, but we modify the columns: combining type and id into version_id.
box_collab_create(): calls a single endpoint, returns a list with an S3 class "boxr_collab". This S3 class has an as.data.frame() method, and an as_tibble() method.
box_ls(): calls a single endpoint, returns a list with an S3 class "boxr_object_list". This S3 class has an as.data.frame() method.
box_fetch(): calls multiple endpoints, returns a list with an S3 class "boxr_dir_wide_operation_result". This S3 class does not have an as.data.frame() method.
The goal is to find a way to harmonize this, without causing too many backward incompatibilities.
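For reference, the "list with an S3 class, plus an as.data.frame() method" pattern used by box_ls() and box_collab_create() can be sketched in a few lines. The class name demo_object_list and its constructor are hypothetical, and the method below is far simpler than what boxr actually does; it assumes every entry is a flat list with the same fields.

```r
# Hypothetical constructor: tag a list of entries with an S3 class.
new_demo_object_list <- function(entries) {
  structure(entries, class = "demo_object_list")
}

# as.data.frame() method: one row per entry.
# Assumes every entry is a flat list with the same fields.
as.data.frame.demo_object_list <- function(x, ...) {
  rows <- lapply(
    unclass(x),
    function(entry) as.data.frame(entry, stringsAsFactors = FALSE)
  )
  do.call(rbind, rows)
}

listing <- new_demo_object_list(
  list(
    list(id = "1", name = "a.txt"),
    list(id = "2", name = "b.txt")
  )
)

# the list is still available as-is; the data frame is one call away
df <- as.data.frame(listing)
```

The design trade-off is visible here: the returned object stays close to the API response, but anyone who wants a table has to ask for it every time.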
I’m thinking out loud here to sketch out ways to proceed so that we provide a consistent return object:
We will walk through a simplified reimagining of the box_ls() function.
Using `BOX_CLIENT_ID` from environment
Using `BOX_CLIENT_SECRET` from environment
boxr: Authenticated using OAuth2 as Ian LYTTLE ([email protected], id: 196942982)
Let’s imagine a single function in the package that calls the API. It will be more involved than this, but it will give you an idea.
# this works for Ian's Box account - no-one else
dir_id <- "123053109701"
# returns a httr response object
box_api_response <- function(verb, endpoint) {
response <-
httr::RETRY(
verb,
glue::glue("https://api.box.com/2.0/{endpoint}"),
boxr:::get_token(),
terminate_on = boxr:::box_terminal_http_codes()
)
response
}
response <- box_api_response("GET", glue::glue("folders/{dir_id}/items/"))
response
Response [https://api.box.com/2.0/folders/123053109701/items/]
Date: 2020-10-17 01:04
Status: 200
Content-Type: application/json
Size: 640 B
At this point, we have no idea if the response is any good or not, nor have we extracted the content.
box_content <- function(response, task = NULL) {
httr::stop_for_status(response, task = task)
text <- httr::content(response, as = "text", encoding = "UTF-8")
# we may want to deviate from the defaults
content <- jsonlite::fromJSON(text, simplifyDataFrame = FALSE)
content
}
This lets someone get a JSON list, or an error message if the response is bad.
List of 5
$ total_count: int 2
$ entries :List of 2
..$ :List of 7
.. ..$ type : chr "file"
.. ..$ id : chr "721629732867"
.. ..$ file_version:List of 3
.. .. ..$ type: chr "file_version"
.. .. ..$ id : chr "767453805267"
.. .. ..$ sha1: chr "c66f70f6c65f8cd381434a56165640d50fb3a9c2"
.. ..$ sequence_id : chr "0"
.. ..$ etag : chr "0"
.. ..$ sha1 : chr "c66f70f6c65f8cd381434a56165640d50fb3a9c2"
.. ..$ name : chr "another-attempt-at-dark-mode.pdf"
..$ :List of 7
.. ..$ type : chr "file"
.. ..$ id : chr "721628453889"
.. ..$ file_version:List of 3
.. .. ..$ type: chr "file_version"
.. .. ..$ id : chr "767454763288"
.. .. ..$ sha1: chr "69ad086c3f8d96b991b8f8bcce67b95708397b1a"
.. ..$ sequence_id : chr "2"
.. ..$ etag : chr "2"
.. ..$ sha1 : chr "69ad086c3f8d96b991b8f8bcce67b95708397b1a"
.. ..$ name : chr "ctz-widget.txt"
$ offset : int 0
$ limit : int 100
$ order :List of 2
..$ :List of 2
.. ..$ by : chr "type"
.. ..$ direction: chr "ASC"
..$ :List of 2
.. ..$ by : chr "name"
.. ..$ direction: chr "ASC"
Now, it may be interesting to parse the content into a list. We can use the parse_entry() function from above. Note that some endpoints return an entries element, others don’t. This one does.
box_parse_entries <- function(entries) {
purrr::map(entries, parse_entry)
}
parsed <- box_parse_entries(content$entries)
str(parsed)
List of 2
$ :List of 7
..$ type : chr "file"
..$ id : chr "721629732867"
..$ file_version:List of 3
.. ..$ type: chr "file_version"
.. ..$ id : chr "767453805267"
.. ..$ sha1: chr "c66f70f6c65f8cd381434a56165640d50fb3a9c2"
..$ sequence_id : num 0
..$ etag : num 0
..$ sha1 : chr "c66f70f6c65f8cd381434a56165640d50fb3a9c2"
..$ name : chr "another-attempt-at-dark-mode.pdf"
$ :List of 7
..$ type : chr "file"
..$ id : chr "721628453889"
..$ file_version:List of 3
.. ..$ type: chr "file_version"
.. ..$ id : chr "767454763288"
.. ..$ sha1: chr "69ad086c3f8d96b991b8f8bcce67b95708397b1a"
..$ sequence_id : num 2
..$ etag : num 2
..$ sha1 : chr "69ad086c3f8d96b991b8f8bcce67b95708397b1a"
..$ name : chr "ctz-widget.txt"
The parsed content (here at least) is a list of lists. We can stack this into a tibble from the parsed info:
# A tibble: 2 x 7
type id file_version sequence_id etag sha1 name
<chr> <chr> <list> <dbl> <dbl> <chr> <chr>
1 file 72162973… <named list [… 0 0 c66f70f6c65f8cd381434a… another-attempt-at…
2 file 72162845… <named list [… 2 2 69ad086c3f8d96b991b8f8… ctz-widget.txt
For this function, we do not propose any post-processing of the stacked content. However, box_version_history() does this: combining type and id into version_id.
We now have the building blocks for our reimagined box_ls() function:
box_dir_info <- function(dir_id) {
response <- box_api_response("GET", glue::glue("folders/{dir_id}/items/"))
entries <- box_content(response, task = "get directory listing")[["entries"]]
# The above is an oversimplification. In actuality, these two functions
# would be combined into one function that would take care of the pagination,
# something like:
#
# entries <-
# box_api_entries(
# "GET",
# endpoint = glue::glue("folders/{dir_id}/items/"),
# task = "get directory listing"
# )
#
# box_api_entries() would call box_api_response() and box_content()
parsed <- box_parse_entries(entries)
stacked <- boxr:::stack_rows_tbl(parsed)
# not doing anything here, but box_version_history() changes some columns
wrangled <- stacked
wrangled
}
box_dir_info(dir_id)
# A tibble: 2 x 7
type id file_version sequence_id etag sha1 name
<chr> <chr> <list> <dbl> <dbl> <chr> <chr>
1 file 72162973… <named list [… 0 0 c66f70f6c65f8cd381434a… another-attempt-at…
2 file 72162845… <named list [… 2 2 69ad086c3f8d96b991b8f8… ctz-widget.txt
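The pagination that a function like box_api_entries() would need to handle can be sketched offline. In this sketch, fetch_page() is a made-up stand-in for an API call that returns entries along with offset, limit and total_count (the element names seen in the sample response earlier); fetch_all_entries() is likewise hypothetical. It assumes offset-based paging.

```r
# Fake data standing in for a folder with five items.
fake_items <- lapply(1:5, function(i) list(id = as.character(i)))

# Stand-in for a single API call: returns one page of entries.
fetch_page <- function(offset, limit) {
  idx <- seq_along(fake_items)
  idx <- idx[idx > offset & idx <= offset + limit]
  list(
    entries = fake_items[idx],
    offset = offset,
    limit = limit,
    total_count = length(fake_items)
  )
}

# Keep requesting pages until all entries have been collected.
fetch_all_entries <- function(limit = 2) {
  entries <- list()
  offset <- 0
  repeat {
    page <- fetch_page(offset, limit)
    entries <- c(entries, page$entries)
    offset <- offset + length(page$entries)
    # stop on an empty page, or once we have reached total_count
    if (length(page$entries) == 0 || offset >= page$total_count) break
  }
  entries
}

all_entries <- fetch_all_entries(limit = 2)
all_ids <- vapply(all_entries, function(e) e$id, character(1))
```

With a page size of two, this makes three requests and returns all five entries in order. A real implementation would replace fetch_page() with a call to the API and would also need to handle errors.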
There are five distinct steps, each of which could be adapted to particular circumstances, each of which could be exposed to the user so they can “roll their own”:
get the response from the Box API.
check the response and extract the content.
parse the content (convert strings to datetimes, etc.).
stack the parsed content into a canonical tabular form (data frame or tibble).
wrangle the stacked content (rename columns, etc.).
Also, there would be three “families” of functions:
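functions that call a single endpoint that returns entries (implying pagination), e.g. box_ls().
functions that call a single endpoint that does not return entries, e.g. box_collab_create().
functions that call multiple endpoints, e.g. box_fetch().
The point of this vignette, in its current form, is to sketch out how the first two families might work. The third family will require more consideration and considerably more coffee.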
This could simplify the creation of new box functions, and perhaps let us simplify some existing ones. We could export box_api_response(), box_content(), box_parse_entries() (and box_parse_entry()), and stack_rows_tbl(); this would allow someone to access the Box API themselves much more easily.
Of course, the functions would have better-thought-out names, and would be more complicated themselves. However, the areas of responsibility for each function would be the same.
What should be the canonical form of data that we return?
box_tibble(), box_nest(), box_unnest(), box_data_frame(). These functions could be used to translate among the formats.
One way that we can avoid “breaking changes” is to create a new function with a new name for the new functionality. We can then “supersede” or “deprecate” the old function.
The problem comes when an old function has a really good name.
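The "new name, deprecate the old one" approach can be sketched with base R’s .Deprecated(). Both function names below are hypothetical placeholders, and the function bodies are stubs; the point is only the delegation pattern.

```r
# Hypothetical new function, standing in for the new behaviour.
dir_listing_new <- function(dir_id) {
  data.frame(dir_id = dir_id, stringsAsFactors = FALSE)
}

# The old name keeps working, but warns and delegates to the new function.
dir_listing_old <- function(dir_id) {
  .Deprecated("dir_listing_new")
  dir_listing_new(dir_id)
}

# existing code still gets a result, along with a deprecation warning
result <- suppressWarnings(dir_listing_old("0"))
```

This keeps old code running while steering users to the new function. It does not solve the "really good name" problem, since the new behaviour still has to live under a different name.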
Another thing we would like to do is to make the documentation simpler for us to maintain. With this release, we take two steps in that direction:
an internal function string_side_effects():
This is useful to specify a return value:
canonical parameter-definitions:
box_browse(): file_id, dir_id
box_dl(): local_dir, file_name, overwrite, version_id, version_no (file_id also available)
box_ul(): description (dir_id also available)
This cuts down on the possibilities for invoking different functions when we need only invoke one or two:
#' @inheritParams box_browse
As we notice more duplication, we can add to this section.