7

I have a tar archive (17 GB) which consists of many small files (all files < 1 MB). How do I use this archive?

  1. Do I extract it? 7-Zip on my laptop says it will take 20 hours (and I think it will take even more).
  2. Can I read/browse the contents of the file without extracting it? If yes, then how?
  3. Is there any other option?

It is actually a processed Wikipedia dataset on which I am supposed to perform some Natural Language Processing.

Platform (Windows or Linux) is not an issue; anything will do, as long as it gets the job done as quickly as possible.

Vulcan
  • So it is a `.tgz` file which contains many `.zip` files? Or just a `.tgz` file which contains many text files? – vlp Sep 27 '15 at 09:59
  • a `.tgz` with many text files – Vulcan Sep 27 '15 at 10:00
  • How many files are in there? It sounds strange that such a small file would take so much time... – Matteo Italia Sep 27 '15 at 10:28
  • @MatteoItalia I don't know how many, but have a look: http://imgur.com/fOiSHLq – Vulcan Sep 27 '15 at 10:33
  • I have a feeling I am doing something Completely Wrong here – Vulcan Sep 27 '15 at 10:34
  • If you want all the files, then decompress the tarball before you go to sleep. Small files are a pain in the ass, especially on a mechanical disk; if you have an SSD, that would be better. – Jason Hu Sep 27 '15 at 13:50
  • IMHO using Windows is completely wrong. See my answer. – Basile Starynkevitch Sep 27 '15 at 13:50
  • BTW, I think this question is actually quite valuable, since it's quite common to have too many small files to move around, which ends up taking too much time to compress and decompress. This question doesn't deserve a downvote. – Jason Hu Sep 27 '15 at 13:53
  • `put on hold offtopic` So should I delete this question and copy it to Super User ASAP, or should I wait for the moderators to do it? – Vulcan Sep 27 '15 at 14:42

2 Answers

8

I suppose you have a Linux laptop or desktop with your hugearchive.tgz file on some local disk (not a remote network filesystem, which could be too slow). If possible, put that hugearchive.tgz file on a fast disk (preferably an SSD, not a magnetic rotating hard disk) and a fast Linux-native file system (Ext4, XFS, BTRFS; not FAT32 or NTFS).

Notice that a .tgz file is a gzip-compressed .tar file.

Next time you get a huge archive, consider asking for it in the afio archive format, which has the big advantage of compressing not-too-small files individually (or perhaps ask for an SQL dump - e.g. for PostgreSQL or SQLite or MariaDB - in compressed form).

First, you should make a list of the file names in that hugearchive.tgz gzipped tar archive and ask for the total count of bytes:

 tar -tzv --totals -f hugearchive.tgz > /tmp/hugearchive-list.txt

That command will run gunzip to uncompress the .tgz file through a pipe (so it won't consume a lot of disk space), write the table of contents into /tmp/hugearchive-list.txt, and print on stderr something like:

  Total bytes read: 340048000 (331MiB, 169MiB/s)

Of course the figures are made up; you'll get much bigger ones. But you'll know the total cumulative size of the archive, and you'll have its table of contents. Use wc -l /tmp/hugearchive-list.txt to get the number of lines in that table of contents, which is the number of entries in the archive, unless some files are weirdly and maliciously named (e.g. with a newline in their filename, which is possible but unusual).
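
For example, here is a minimal sketch (assuming GNU tar's usual verbose listing format, where the third column is the size in bytes) to count the entries and sum their sizes from that listing:

wc -l /tmp/hugearchive-list.txt
# sum the size column of the verbose listing and report it in GiB
awk '{ total += $3 } END { printf "%d entries, %.1f GiB\n", NR, total/1024/1024/1024 }' /tmp/hugearchive-list.txt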

My guess is that you'll process your huge archive in less than one hour. Details depend on the computer, notably the hardware (if you can afford it, use some SSD, and get at least 8Gbytes of RAM).

Then you can decide if you are able to extract all the files or not, since you know how much total size they need. Since you have the table-of-contents in /tmp/hugearchive-list.txt you can easily extract the useful files only, if so needed.
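
For instance, once the listing tells you the path of an entry you care about (the paths below are purely hypothetical examples), you can extract just that entry, or a wildcard pattern of entries, without unpacking the whole archive:

tar -xzvf hugearchive.tgz enwiki/articles/part-0001.txt
tar -xzvf hugearchive.tgz --wildcards 'enwiki/articles/*.txt'

Note that member names must be given exactly as they appear in /tmp/hugearchive-list.txt (including any leading ./), and --wildcards is a GNU tar option.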


For what it is worth, on my i3770K desktop with 16 GB of RAM and both SSD and hard-disk storage, I made (as an experiment) a useless huge archive (created specifically to answer this question, since I don't have your hugearchive.tgz file...) with:

sudo time tar czf /tmp/hugefile.tgz /bin /usr/bin /usr/local/bin /var 

and it took this time to create that archive (with all these file systems on SSD):

 719.63s user 60.44s system 102% cpu 12:40.87 total

and the produced /tmp/hugefile.tgz is 5.4 gigabytes (notice that it probably sits in the page cache).

I then tried:

time tar -tzv --totals -f /tmp/hugefile.tgz > /tmp/hugefile-list.txt

and got:

Total bytes read: 116505825280 (109GiB, 277MiB/s)
tar -tzv --totals -f /tmp/hugefile.tgz > /tmp/hugefile-list.txt
    395.77s user 26.06s system 104% cpu 6:42.43 total

and the produced /tmp/hugefile-list.txt is 2.3 MB (for about 23,000 files), not a big deal.

Don't use z in your tar commands if your archive is not gzip-compressed.

Read the documentation of tar(1) (and also of time(1) if you use it, and more generally of every command you use!), and of course use the command line (not some GUI), and learn some shell scripting.

BTW, you could later segregate the very small files (less than 64 KB) and e.g. put them inside some database (perhaps an SQLite, Redis, PostgreSQL, or MongoDB database, filled with a small script) or maybe some GDBM indexed file. Notice that most file systems have significant overhead for a large number of small files.
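
As an illustration only, here is a minimal sketch of that idea (the extracted/ directory, the files.db name, and the docs table are made-up examples, not anything from your dataset); it relies on the readfile() function built into the sqlite3 command-line shell:

# gather every extracted file smaller than 64 KB into one SQLite database
sqlite3 files.db 'CREATE TABLE IF NOT EXISTS docs (path TEXT PRIMARY KEY, body BLOB);'
find extracted/ -type f -size -64k -print0 |
while IFS= read -r -d '' f; do
    # slow but simple: one sqlite3 process per file; also assumes no quote characters in $f
    sqlite3 files.db "INSERT OR REPLACE INTO docs VALUES ('$f', readfile('$f'));"
done

For millions of files you would want a single script doing batched inserts instead, but the idea is the same.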

Learning shell scripting, some scripting language (Python, Lua, Guile, OCaml, Common Lisp), and basic database techniques is not a waste of time. If, e.g., you are starting a PhD, it is almost a required skill set.

I don't know, don't use, and dislike Windows, so I am obviously biased (my first Linux was some Slackware with a 0.99.12 kernel circa 1993 or early 1994), but I strongly recommend doing all your NLP work on Linux (and keeping Windows only for playing video games, when you have time for that), because scripting and combining many useful existing free software tools is so much easier on Linux.

Basile Starynkevitch
  • I especially love paragraphs after BTW :) – Jason Hu Sep 27 '15 at 13:51
  • `sudo time tar czf /tmp/hugefile.tgz /bin /usr/bin /usr/local/bin /var` I tried my best but could not figure out what these extra paths (`/bin /usr/bin /usr/local/bin /var`) specify. – Vulcan Sep 27 '15 at 14:08
  • And yes, I have Windows only for playing games... dual boot with Lubuntu for everything else. And I am not doing a PhD; it's a college project :P – Vulcan Sep 27 '15 at 14:09
  • Don't repeat that exact command!!! It is just an example to create a big `.tgz` archive. I don't have your `hugefile.tgz` on *my* machine, so I created a stupid example for you... – Basile Starynkevitch Sep 27 '15 at 14:10
  • But you should learn basic shell and scripting skills, and read the documentation of `tar` *before* using it. BTW, do you know that *college* has a very different meaning in various countries? In France (where I live), it is a kind of *junior high school* for pupils around 13 years old! – Basile Starynkevitch Sep 27 '15 at 14:10
  • Yes, yes, I know basic shell scripting (an amateur though!). But after the part where you give the filename of the tgz, you also have paths to other folders, specifically this part: `/bin /usr/bin /usr/local/bin /var`. What does this specify? Is it part of `time` or `tar`? – Vulcan Sep 27 '15 at 14:21
  • It is just a *stupid example* to make a huge archive file from my system files under `/bin`, `/usr/bin/`, etc. I don't have your archive, and I won't download it. Read more about `tar` and shell scripting. – Basile Starynkevitch Sep 27 '15 at 14:26
  • You absolutely need to RTFM. You won't understand my answer if you don't follow the links. – Basile Starynkevitch Sep 27 '15 at 14:29
  • Ohhh boy! Yes, I got it now. I overlooked the fact that you can combine files from *multiple* directories... a really, really silly mistake, sorry :( – Vulcan Sep 27 '15 at 14:35
-1

EDIT: The idea behind this answer is to process the contents of the archive on the fly and thus avoid the expensive (slow) I/O which necessarily happens when the archive content is written to disk.

It is difficult to answer without knowing how this data is supposed to be processed.

If your "Natural Language Processing" software can process input from a pipe (stream), then you can process the contents of the archive without extracting it, using some variant of the following:

tar -xf hugeFile.tar -O | yourSoftware

This will pipe the combined contents of all files in the archive to yourSoftware (under Linux or Cygwin).

E.g., to count the total number of words, use the following:

tar -xf hugeFile.tar -O | wc -w

Since you will probably need to test your algorithm, it might be wise to try it on a smaller subset first, e.g. the first 10,000 lines:

tar -xf hugeFile.tar -O | head -n10000 | yourSoftware
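
If yourSoftware needs to see each file separately rather than one concatenated stream, GNU tar can also run a command once per archive member, feeding that member's content on stdin (yourSoftware is still just a placeholder for your own program):

# GNU tar runs the command once per member; TAR_FILENAME holds the member's name
tar -xf hugeFile.tar --to-command='yourSoftware'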

If your processing software needs to have the files on disk, then you need to extract this archive (beware that some filesystems do not handle many small files very well -- it might consume a lot more free space than expected, and access times might be long as well).
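
In that case, a minimal sketch (the target directory below is just an example; pick a fast local disk with enough free space) would be:

mkdir -p /path/to/fast-disk/wikidata
tar -xf hugeFile.tar -C /path/to/fast-disk/wikidata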

vlp