
How does one add metadata to a tibble?

I would like to attach a one-sentence description to each variable, so that the descriptions print along with the tibble and someone who has never seen the data before can make some sense of it.

as_tibble(iris)

# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
# ... with 140 more rows

# Sepal.Length. Measured from sepal attachment to stem
# Sepal.Width. Measured at the widest point
# Petal.Length. Measured from petal attachment to stem
# Petal.Width. Measured at the widest point
# Species. Nomenclature based on Integrated Taxonomic Information System (ITIS), January 2018.

thanks!

AdrieStC

3 Answers


This seems tricky. In principle, @hrbrmstr's comment is the way to go (i.e. use ?comment or ?attr to add attributes to any object), but these attributes are not always printed by default. They do seem to be printed automatically for atomic objects:

> z <- 1:6
> attr(z,"hello") <- "goodbye"
> z
[1] 1 2 3 4 5 6
attr(,"hello")
[1] "goodbye"

... but not, alas, for data frames or tibbles:

> dd <- tibble::tibble(x=1:4,y=2:5)
> attr(dd,"metadata") <- c("some stuff","some more stuff")
> dd
# A tibble: 4 x 2
      x     y
  <int> <int>
1     1     2
2     2     3
3     3     4
4     4     5
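
Even though printing does not show them, the attributes are still attached to the object and can be retrieved explicitly. A minimal sketch, assuming the tibble package is installed; the attribute name "metadata" and the note texts are arbitrary choices made for illustration:

```r
library(tibble)

dd <- tibble(x = 1:4, y = 2:5)

# Two ways of attaching free-text notes to the whole object
comment(dd) <- c("x. Some stuff", "y. Some more stuff")
attr(dd, "metadata") <- "Collected for illustration only"

# Neither is shown by print(dd), but both can be queried directly
comment(dd)
attr(dd, "metadata")
```
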

You can wrap the object with its own S3 class to get this stuff printed:

> class(dd) <- c("my_tbl",class(dd))
> print.my_tbl <- function(x, ...) {
+    NextMethod()
+    print(attr(x,"metadata"))
+    invisible(x)
+ }
> dd
# A tibble: 4 x 2
      x     y
  <int> <int>
1     1     2
2     2     3
3     3     4
4     4     5
[1] "some stuff"      "some more stuff"

You could make the printing more elaborate or pretty, e.g.

cat("\nMETADATA:\n")
cat(sprintf("# %s",attr(x,"metadata")),sep="\n")

Nothing bad will happen if the other user hasn't defined print.my_tbl (printing will fall back to the print method for tibbles), but the metadata will only be printed if they have your print.my_tbl definition.
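
Putting the pieces above together, a complete version might look like the sketch below. It assumes the tibble package is installed; `add_metadata()` is a hypothetical helper introduced here for illustration, not part of any package.

```r
library(tibble)

# Hypothetical helper: attach the metadata and tag the object with the
# custom class (without adding the class twice)
add_metadata <- function(x, metadata) {
  attr(x, "metadata") <- metadata
  class(x) <- c("my_tbl", setdiff(class(x), "my_tbl"))
  x
}

# Print method: defer to the tibble method, then append the metadata lines
print.my_tbl <- function(x, ...) {
  NextMethod()
  cat("\nMETADATA:\n")
  cat(sprintf("# %s", attr(x, "metadata")), sep = "\n")
  invisible(x)
}

dd <- add_metadata(
  tibble(x = 1:4, y = 2:5),
  c("x. Some stuff", "y. Some more stuff")
)
print(dd)
```

Each metadata entry is printed on its own line with a leading `#`, after the usual tibble output.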

Jessica Burnett
Ben Bolker
  • This is very helpful Ben. Is there any way to do something similar but would only print the attributes at the time a table is loaded from an RDS file? – jzadra Aug 26 '20 at 21:00
  • That seems very tricky indeed. It's hard for the print method for an object to know whether the object has just been loaded or has been in the session for a while. The simplest way I can think of to manage this is to (1) add `print_attributes` as an argument to `print.my_tbl` (with default FALSE); (2) write a dedicated `read_my_tbl()` function that calls `x <- readRDS(...)` and `print(x, print_attributes=TRUE)`, then returns `invisible(x)` (so the object doesn't get printed again). – Ben Bolker Aug 26 '20 at 22:08
  • Thanks Ben. I had the same thought and am using a custom read function currently. Glad to know there isn't a simpler solution at least! – jzadra Aug 27 '20 at 16:46
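
The custom read function discussed in the comments above could look roughly like this. It is only a sketch of that idea, assuming the tibble package is installed. Instead of `NextMethod()`, the print method here strips the custom class and calls `print()` again, which is a deliberate substitution so that the extra `print_attributes` flag is not forwarded to the tibble print method.

```r
library(tibble)

# Print method with an opt-in flag for the metadata (off by default)
print.my_tbl <- function(x, ..., print_attributes = FALSE) {
  y <- x
  class(y) <- setdiff(class(y), "my_tbl")  # fall back to the tibble method
  print(y, ...)
  if (print_attributes) {
    cat(sprintf("# %s", attr(x, "metadata")), sep = "\n")
  }
  invisible(x)
}

# Reader: load the object, print it once with metadata, return it invisibly
read_my_tbl <- function(file) {
  x <- readRDS(file)
  print(x, print_attributes = TRUE)
  invisible(x)
}

# Round trip through a temporary RDS file
dd <- tibble(x = 1:4, y = 2:5)
attr(dd, "metadata") <- c("some stuff", "some more stuff")
class(dd) <- c("my_tbl", class(dd))

path <- tempfile(fileext = ".rds")
saveRDS(dd, path)
dd2 <- read_my_tbl(path)  # metadata is shown here
print(dd2)                # but not on later prints
```
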

Sorry for the delayed response, but this topic has been bugging me since I first started learning R. In my work, assigning metadata to columns is not just common, it is required. It really bothered me that R didn't seem to have a nice way to do it, so much so that I wrote some packages to do it.

The fmtr package has a function to assign the descriptions (plus other stuff). And the libr package has a dictionary function, so you can look at all the metadata you assign.

Here is how it works:

First, assign the descriptions to the columns. You just send a named list into the descriptions() function.

library(fmtr)
library(libr)

# Create data frame
df <- iris

# Assign descriptions
descriptions(df) <- list(Sepal.Length = "Measured from sepal attachment to stem", 
                         Sepal.Width = "Measured at the widest point",
                         Petal.Length = "Measured from petal attachment to stem", 
                         Petal.Width = "Measured at the widest point",
                         Species = paste("Nomenclature based on Integrated Taxonomic", 
                                         "Information System (ITIS), January 2018."))


Then you can see all the metadata by calling the dictionary() function, like so:

dictionary(df)
# # A tibble: 5 x 10
#  Name  Column      Class  Label Description                                                 
#  <chr> <chr>       <chr>  <chr> <chr>                                                      
# 1 df    Sepal.Leng~ numer~ NA    Measured from sepal attachment to stem                     
# 2 df    Sepal.Width numer~ NA    Measured at the widest point                                
# 3 df    Petal.Leng~ numer~ NA    Measured from petal attachment to stem                      
# 4 df    Petal.Width numer~ NA    Measured at the widest point                                 
# 5 df    Species     factor NA    Nomenclature based on Integrated Taxonomic Information Syst~

If you like, you can return the dictionary as its own data frame, then save it or print it or whatever.

d <- dictionary(df)

Here is the dictionary data frame:

[Image: dictionary data frame]

David J. Bosak

This is not all that different from Ben Bolker's suggestions, but conceptually, if I want information to be related to the vectors in my data frame, I would prefer that it be tied directly to the vectors. In other words, I'd prefer to add the attributes to the vectors themselves rather than to the data frame object.
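
To illustrate the idea of tying the information to the vectors themselves, here is a minimal base R sketch that uses no extra packages. The attribute name "label" is an arbitrary choice for this example, and the label texts are taken from the question.

```r
df <- iris

# Attach a description directly to individual columns
attr(df$Sepal.Length, "label") <- "Measured from sepal attachment to stem"
attr(df$Species, "label") <- "Nomenclature based on ITIS, January 2018"

# Collect the labels into a named character vector (NA where none was set)
labels <- vapply(df, function(col) {
  lab <- attr(col, "label", exact = TRUE)
  if (is.null(lab)) NA_character_ else lab
}, character(1))

# Print one "# name. label" line for each labelled column
has_label <- !is.na(labels)
cat(sprintf("# %s. %s", names(labels)[has_label], labels[has_label]), sep = "\n")
```

One caveat with plain attributes: subsetting a vector with `[` drops them, so `attr(df$Sepal.Length[1:5], "label")` is NULL.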

I don't know that I would go so far as to add a custom class to the object, but perhaps a separate function you can call up for a data frame-like object would be adequate:

library(tibble)
library(magrittr)     # provides %>%
library(labelVector)  # provides set_label() and get_label()

# Print a data frame, followed by one "name: label" line per column
print_with_label <- function(dframe){
  stopifnot(inherits(dframe, "data.frame"))
  labs <- labelVector::get_label(dframe, names(dframe))
  labs <- sprintf("%s: %s", names(dframe), labs)
  print(dframe)
  cat("\n")
  cat(labs, sep = "\n")
  invisible(dframe)
}

iris <- 
  as_tibble(iris) %>%  
  set_label(Sepal.Length = "This is a user friendly label",
            Petal.Length = "I much prefer reading human over computer")

print_with_label(iris)

mtcars <-
  set_label(mtcars,
            mpg = "Miles per Gallon",
            qsec = "Quarter mile time",
            hp = "Horsepower",
            cyl = "Cylinders",
            disp = "Engine displacement")

print_with_label(mtcars)
Benjamin