Sunday, January 9, 2011

Archiving with tar & compressing with pbzip2 -- a great combination

Creating a tar archive is nothing new or fancy. Heck, it's not even glamorous or exciting. But if you need a ubiquitous, all-around good tool to create an archive of files, tar should be toward the top of your list.

First, let me say that this quick intro sticks to tar and a couple of compression options. LVM or disk snapshots, rsync, librsync and other sometimes-similar choices will not be compared here, although there are many situations where you will want to incorporate them into a bigger strategy. The aim is to shed some light on a few issues, some direct and some indirect, that you may need to think about when using tar, depending on your situation.

Generally, when you just need a quick archive, the standard tar command is:
tar cf /path/to/archive.tar /path/to/source

That's good, and it will work, but many people compress the archive with tar's gzip (z) or bzip2 (j) option:
tar czf /path/to/archive.tar.gz /path/to/source
tar cjf /path/to/archive.tar.bz2 /path/to/source

The space required for the archive is normally reduced, unless the files are already compressed or of a compressed type such as AVI or MP3. That's often better, but there are issues when dealing with larger files or directories:
  1. tar compression extends the time required to hold open file(s)
  2. tar compression extends the time required to complete the archive
  3. tar compression tends to create slightly larger files than compressing the archive afterwards
  4. tar compression is limited to single processor utilization
All of these issues are overcome by post-compression using "pbzip2". pbzip2 is "a parallel implementation of the bzip2... and achieves near-linear speedup on SMP machines". With pbzip2, all of the system's processors can (optionally) be put to use at the same time. An archive that takes 2 hours to bzip2 can take as little as 30 minutes on an idle system with a single quad-core CPU.
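As a rough sketch of the two-step approach described above, the archive is written uncompressed first and then compressed in a separate pass. The function name archive_then_compress and the fallback to plain bzip2 when pbzip2 is not installed are my own assumptions for illustration, not something from the test further down.

```shell
#!/bin/sh
# Hypothetical helper: archive_then_compress SOURCE ARCHIVE.tar
# Step 1 writes a plain tar; step 2 compresses it in a separate pass.
archive_then_compress() {
    src=$1
    archive=$2

    # Step 1: uncompressed archive, so tar is not slowed down by compression
    # while it is reading the source files.
    tar cf "$archive" "$src" || return 1

    # Step 2: post-compression. pbzip2 autodetects the processor count by
    # default; -p# can be used to limit it. Falling back to bzip2 is an
    # assumption made so the sketch still runs where pbzip2 is missing.
    if command -v pbzip2 >/dev/null 2>&1; then
        pbzip2 "$archive"
    else
        bzip2 "$archive"
    fi
}
```

Calling archive_then_compress testdir test.tar should leave test.tar.bz2 in place of test.tar, since both compressors replace the input file by default.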

Naturally, the trade-off is an increase in the free disk space required to complete the process. As a general minimum, you will need free space of at least 1.5 times the size of the files to be archived.
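The 1.5 times rule of thumb can be checked before starting. This is a minimal sketch, assuming the archive is written to a directory whose free space is what matters; the function names needed_kb and has_room_for_archive are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: 1.5 times a size given in KB (integer arithmetic).
needed_kb() {
    echo $(( $1 * 3 / 2 ))
}

# Hypothetical helper: has_room_for_archive SOURCE_DIR DEST_DIR
# Succeeds when DEST_DIR has at least 1.5x the size of SOURCE_DIR free.
has_room_for_archive() {
    # du -sk: total size of the source tree in 1024-byte units
    src_kb=$(du -sk "$1" | awk '{print $1}')
    # df -Pk: POSIX output format; available space is field 4 of line 2
    free_kb=$(df -Pk "$2" | awk 'NR == 2 {print $4}')
    need_kb=$(needed_kb "$src_kb")
    echo "need ${need_kb} KB, have ${free_kb} KB free"
    [ "$free_kb" -ge "$need_kb" ]
}
```

Something like has_room_for_archive /path/to/source /path/to && tar cf /path/to/archive.tar /path/to/source would then skip the archive step when space is short.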

This is a small-scale example with a modest 1.5GB directory of database SQL unload files. The system has 2 older quad-core CPUs, a fairly fast disk subsystem and 16GB of RAM.

testdir = 1603076072 bytes or 1.5GB

time tar cf test.tar testdir
real    0m12.039s
size 1603164160 bytes or about 1.5GB

time tar cjf test.tar.bz2 testdir
real    9m37.415s
size 216820944 bytes = 207M

time tar czf test.tar.gz testdir
real    2m44.550s
size 282025065 bytes = 269M

time pbzip2 test.tar
real    2m17.014s
size 217235869 bytes = 208M

time bzip2 test.tar
real    9m13.197s
size 216820491 bytes = 207M

Combining a plain tar file with pbzip2 produced an archive about 23% smaller than gzip did, in less time, for this test. For some situations, tar + pbzip2 is a great combination. I just wanted to share ;)
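The size comparison can be rechecked from the byte counts in the test. This is only a small sketch; percent_smaller is a made-up helper name, and awk is used because the shell itself only does integer arithmetic.

```shell
#!/bin/sh
# Hypothetical helper: percent_smaller A B
# Prints how much smaller A is than B, as a percentage with one decimal.
percent_smaller() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (1 - a / b) * 100 }'
}

# tar + pbzip2 (217235869 bytes) vs. tar czf (282025065 bytes)
percent_smaller 217235869 282025065     # prints 23.0

# tar + pbzip2 (217235869 bytes) vs. the uncompressed tar (1603164160 bytes)
percent_smaller 217235869 1603164160    # prints 86.4
```

On time, the plain tar (about 12 seconds) plus pbzip2 (about 2m17s) comes to roughly 2m29s, against 2m44s for tar czf.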
