3

I downloaded the Gwern Branwen dataset here: https://www.gwern.net/DNM-archives

I'm trying to read the dataset in R and I'm having a lot of trouble. I tried to open one of the files in the dataset called "1776.tar.xz" and I think I "unzipped" it with untar() but I'm not getting anything past that.

untar("C:/User/user/Downloads/dnmarchives/1776.tar.xz",
  files = NULL,
  list = FALSE, exdir = ".",
  compressed = "xz", extras = NULL, verbose = FALSE, restore_times = TRUE,
  tar = Sys.getenv("TAR"))

Edit: Thanks for all of the comments so far! The code is in base R. I have multiple datasets that I downloaded from Gwern's website. I'm just trying to open one to explore.

bob
  • Not familiar with the code you posted... is it R or PowerShell or what? Could you specify? Python has a really simple tar library that reads the data directly without unpacking it on disk. Or, as others have pointed out, unpack with another app and then load it in R. PeaZip is the best Windows utility for this, in my opinion. – d.j.yotta Feb 07 '20 at 06:59
  • Did you try to assign that value? Did you read “?untar”? – IRTFM Feb 07 '20 at 07:12
  • 1
    @d.j.yotta It's R. – Rui Barradas Feb 07 '20 at 07:15
  • 1
    @d.j.yotta, thanks for the advice! do you mind sharing the library name in python? – bob Feb 07 '20 at 16:29
  • @42- I just added "x <-" before untar and tried to load it up as a table and it didn't work. I read ?untar and generally followed the format there. not sure if i did it right – bob Feb 07 '20 at 16:32
  • "It didn't work"? What does that mean? You should have gotten a value. What was it? It should have placed a file "somewhere" on your filesystem. If you set list=TRUE the value should have been the location of the file. – IRTFM Feb 07 '20 at 17:08
  • @bob https://docs.python.org/3/library/tarfile.html - tarfile python library is built in I think. But it sounds like understanding how to use R to do it is the most straightforward way. I'm guessing you will be processing the dataset also in R so you ought to load it also from there – d.j.yotta Feb 08 '20 at 09:19
  • Posted a solution using archive_extract from library(archive), which also works well on Windows... – Tom Wenseleers Jul 11 '22 at 13:59

4 Answers

4

Base R includes the function untar(). On my Ubuntu 19.10 machine running a default installation of R 3.6.2, the following was enough.

# List the .xz archives in the working directory, then extract the first one
fls <- list.files(pattern = "\\.xz$")
untar(fls[1], verbose = TRUE)
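
To check what untar() actually did, it can help to list the archive's contents first and extract into a known folder. The following is only a minimal sketch, not part of the original answer: it builds a small throwaway archive in a temporary directory so that it runs anywhere, and the names demo, listing.csv and extracted are made up for illustration. It passes tar = "internal" to force R's built-in implementation, on the assumption that no external tar program is available.

```r
# Build a tiny .tar.xz in a temporary directory so the sketch is self-contained.
# All file and folder names here are hypothetical.
work <- tempfile("untar_demo_")
dir.create(work)
old <- setwd(work)
dir.create("demo")
write.csv(data.frame(id = 1:3, price = c(1.5, 2, 3.25)),
          file.path("demo", "listing.csv"), row.names = FALSE)
tar("demo.tar.xz", files = "demo", compression = "xz", tar = "internal")
setwd(old)

archive <- file.path(work, "demo.tar.xz")

# list = TRUE returns the names stored in the archive without extracting anything
contents <- untar(archive, list = TRUE, tar = "internal")
print(contents)

# Extract into an explicit directory instead of the current working directory
out_dir <- file.path(work, "extracted")
dir.create(out_dir)
untar(archive, exdir = out_dir, tar = "internal")

# Read the first CSV that the archive contained
csv_entry <- grep("\\.csv$", contents, value = TRUE)[1]
listing <- read.csv(file.path(out_dir, csv_entry))
print(listing)
```

For a real download, replace archive with the path to the .tar.xz file and inspect contents before deciding what to read.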

Note.
The question says "dataset" (singular), but that website hosts several datasets. To download two of them I used

# Fetch two of the archives with rsync into ~/tmp
args <- "--verbose rsync://78.46.86.149:873/dnmarchives/grams.tar.xz rsync://78.46.86.149:873/dnmarchives/grams-20150714-20160417.tar.xz ./"
cmd <- "rsync"

od <- getwd()
setwd('~/tmp')

system2(cmd, args)
setwd(od)  # restore the original working directory
Rui Barradas
0

Thanks everyone! Not sure what was wrong with R for a bit, but I reinstalled it. I ended up unzipping manually and loading the files.

bob
  • I don't think this should be accepted as the best answer, as it required unzipping manually as opposed to doing this from R. Below I added a solution using archive_extract() from library(archive), which worked well for me also in Windows. (untar for me was very slow) – Tom Wenseleers Jul 11 '22 at 15:12
0

I find that base R's untar() is a bit unreliable and/or slow on Windows.

What worked very well for me (on all platforms) was

library(archive)
archive_extract("C:/User/user/Downloads/dnmarchives/1776.tar.xz",
                dir="C:/User/user/Downloads/dnmarchives")

It supports 'tar', 'ZIP', '7-zip', 'RAR', 'CAB', 'gzip', 'bzip2', 'compress', 'lzma' and 'xz' formats.

One can also use it to read a CSV file inside an archive directly, without having to extract it first (read_csv() and cols() come from the readr package):

library(readr)
read_csv(archive_read("C:/User/user/Downloads/dnmarchives/1776.tar.xz", file = 1), col_types = cols())
Tom Wenseleers
-1
  1. On Debian or Ubuntu, first install the package xz-utils:
$ sudo apt-get install xz-utils
  2. Extract a .tar.xz the same way you would extract any tar.__ file:
$ tar -xf file.tar.xz

Done.

samuel161
  • xz can also be installed on macOS. I’m pretty sure that the Rtools package can do this for windoze. – IRTFM Feb 07 '20 at 06:56
  • 1
    The OP is asking about doing this in R I think. Also, clearly is using windows so this answer is unhelpful. – d.j.yotta Feb 07 '20 at 06:57
  • 1
    R will access system utilities. As I said, Rtools.exe should provide the needed resources. The OP probably succeeded and doesn’t know it. – IRTFM Feb 07 '20 at 07:12
  • I was able to resolve this issue with 7zip. This was the easiest method and it worked! Thanks everyone for your help. – bob Apr 03 '20 at 15:03