I have a data frame with multiple variables, some of which are very long strings of up to 4500 characters. I would like to export it as a .dta file.

I tried to save it using haven's write_dta() function, but I get the following error message:

Error in write_dta_(data, normalizePath(path, mustWork = FALSE), version = stata_file_format(version), : Writing failure: A provided string value was longer than the available storage size of the specified column.

Here is an example of the issue:

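library(haven)

# Build n random strings of 5000 upper-case letters each
longFun <- function(n) {
  do.call(paste0, replicate(5000, sample(LETTERS, n, TRUE), FALSE))
}

# One column with a very long name and one 5000-character value
longString <- data.frame(VeryveryveryveryveryveryveryveryveryveryVeryveryveryveryveryveryveryveryveryverylongname = longFun(1), stringsAsFactors = FALSE)
write_dta(longString, "tst.dta")  # fails with the error quoted above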

I am aware that write_dta has issues handling long strings (https://github.com/tidyverse/haven/issues/437) and that one possibility is to trim the strings (Error in write_dta : A provided string value was longer than the available storage size of the specified column). But it is essential for me to keep the full strings.

Is there any way to save variables with long strings as .dta files using R?

Edit: I have tried the readstata13::save.dta13 option suggested by @jay.sf, but it has two issues: 1) It cannot handle variable names longer than 32 characters (it truncates them), whereas write_dta() handles them well. 2) It is significantly slower than write_dta(). Since I have to save a very large dataset, this is a relevant concern.

In sum, is there any other approach that would allow me to

a) save a df with very long strings as a .dta file

b) retain the original variable names (longer than 32 characters)

c) do this relatively quickly?
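To see which columns actually run into these two limits, a check along the following lines can help. This is only a minimal sketch in base R: the helper name `dta_limits` is hypothetical, and it assumes that Stata's fixed-width string types hold at most 2045 bytes (longer values need the strL type) and that 32 characters is the name length at which save.dta13 starts truncating.

```r
# Hypothetical helper: flag columns that exceed the assumed .dta limits.
# Assumptions (not taken from any package): 2045 bytes is the largest
# fixed-width Stata string, and 32 characters is the name limit that
# save.dta13 enforces.
dta_limits <- function(df, max_str_bytes = 2045L, max_name_chars = 32L) {
  is_chr <- vapply(df, is.character, logical(1))

  # Longest non-missing value per character column, in bytes (0 otherwise)
  max_bytes <- vapply(names(df), function(nm) {
    if (!is_chr[[nm]]) return(0L)
    x <- df[[nm]]
    x <- x[!is.na(x)]
    if (length(x) == 0L) 0L else max(nchar(x, type = "bytes"))
  }, integer(1))

  data.frame(
    variable      = names(df),
    name_chars    = nchar(names(df)),
    max_bytes     = unname(max_bytes),
    needs_strL    = unname(max_bytes) > max_str_bytes,
    name_too_long = nchar(names(df)) > max_name_chars,
    stringsAsFactors = FALSE
  )
}

# Usage: one ordinary column and one with a long name and a long value
example_df <- data.frame(id = 1L, note = strrep("A", 5000),
                         stringsAsFactors = FALSE)
names(example_df)[2] <- paste0(strrep("very", 10), "longname")

res <- dta_limits(example_df)
print(res)
```

Rows with `needs_strL` set are the ones that trigger the write_dta() error; rows with `name_too_long` set are the ones save.dta13 shortens.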

Alex

1 Answer

Use save.dta13 from the readstata13 package.

R:

readstata13::save.dta13(longString, "tst.dta")

Stata:

. use "V:\tst.dta" 
. list

     +------------------------------------------------------------------------------------------------------+
     | V1                                                                                                   |
     |------------------------------------------------------------------------------------------------------|
  1. | GZSPZGLLKOQHETKURLPKQDTZWTNHLDJDUSAFAXHFMPKUDIZURKIFLWQSXSFBLTPBGBLJKTDYJCHZOPZCFYKIMLGTQGDKRNBGUI.. |
     +------------------------------------------------------------------------------------------------------+
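
If Stata is not at hand, the same check can be done from R by reading the file back. This is a minimal sketch, assuming readstata13 is installed and that save.dta13 stores the long value as strL at its default file version; the column name `txt` and the temporary path are only for illustration, and the name is kept short on purpose so that only the string length is tested.

```r
library(readstata13)

# Short column name on purpose: only the 5000-character value is under test
df <- data.frame(txt = strrep("A", 5000), stringsAsFactors = FALSE)

# Write to a temporary .dta file and read it back
path <- tempfile(fileext = ".dta")
save.dta13(df, path)
back <- read.dta13(path)

# Compare the length and content of the value before and after the round trip
nchar(df$txt)
nchar(back$txt)
identical(back$txt, df$txt)
```

If the last line returns TRUE, the string survived the round trip untruncated.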
jay.sf
  • Unfortunately, `save.dta13` does not work because the variable names are too long. Is there a way to either make `save.dta13` accept longer names or to use another package? – Alex Jan 19 '21 at 13:14
  • @Alex Please elaborate on _"does not work"_. – jay.sf Jan 19 '21 at 13:20
  • It produces the following error: `Varname to long. Resizing. Max size is 32.` It cuts my variable names after 32 characters. I have edited the question so that the name reflects this length. The issue is that I have hundreds of variables with very long names, and I would like to keep them with their original name – Alex Jan 19 '21 at 13:38
  • @Alex **1.** That's no error, rather a message that they're getting cut, **2.** Your requirements don't seem to be possible according to [this Stata list entry](https://www.statalist.org/forums/forum/general-stata-discussion/general/1452366-number-of-characters-in-variable-names). – jay.sf Jan 19 '21 at 13:47
  • Dear @jay.sf, thanks a lot for your replies. They are very useful and get me very close to what I would like to achieve. On point 1), you are right, it is a warning rather than an error. Sorry for the imprecision. On point 2), this is no longer true: recent Stata versions handle long names. One of the advantages of `haven` is precisely that it allows saving long names. Indeed, the following df `longString <- data.frame(VeryveryveryveryveryveryveryveryveryveryVeryveryveryveryveryveryveryveryveryverylongname = 1, stringsAsFactors = F)` can be saved using `write_dta(longString,"tst.dta")` – Alex Jan 19 '21 at 13:58
  • @Alex I see, I didn't know Stata now supports long names. I'm not working with _haven_, though, because its `read_dta` returns tibbles. Regarding `save.dta13` and long names, I have had very good experiences with the _readstata13_ authors replying to user issues, so I suggest you [open a ticket at their github](https://github.com/sjewo/readstata13/issues). – jay.sf Jan 19 '21 at 15:26
  • Thanks a lot for your help. I will file the issue, but unfortunately I think the variable name problem relates to Stata 13. Indeed, longer variable names were only added from Stata 14 onwards. – Alex Jan 19 '21 at 15:39
  • @Alex I believe the package name refers to the fact that the Stata file format changed significantly in v13. As far as I know, v14 and v15 features are already implemented as well. – jay.sf Jan 19 '21 at 15:42
  • Excellent, thanks! I have filed the issue! – Alex Jan 19 '21 at 15:43
  • @Alex Great. I would just mention that recent Stata versions handle long names, in case they aren't aware of it yet. – jay.sf Jan 19 '21 at 15:45