Parallel Archiving Techniques

The .tar.gz and .zip archive formats are ubiquitous, and with good reason: for decades they have served as the backbone of our data archiving and transfer needs. Unfortunately, with the advent of multi-core and multi-socket CPU architectures, little has been done to take advantage of the larger number of processors. While archiving a directory and then compressing it may seem like the intuitive sequence, we will show how compressing files before adding them to a .tar can provide massive performance gains.

Compression Benchmarks: .tar.gz vs. .gz.tar vs. .zip


Consider the following three directories:

1. A large number of tiny CSV files containing stock data.
2. A relatively small number of genome sequence files in nested folders.
3. A small set of large PCAP files containing bid prices.

Below are timed archive compression results for each scenario and archive type.

[Figure: Compression Times]

A .gz.tar is NOT a real file extension. It refers to a directory whose files are first individually compressed with gzip and then archived together into a single .tar.
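As a rough sketch (assuming a directory named dir/ in the current working directory), the two orders of operations look like this:

# .tar.gz: archive the directory, then compress the whole archive with a single gzip process
tar -czf archive.tar.gz dir/

# .gz.tar: compress every file in parallel first, then archive the already-compressed files
find dir/ -type f | parallel gzip && tar -cf archive.tar dir/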

Is .gz.tar actually more than 15x faster than .tar.gz?

Yup, you are reading that right. Not 2x faster, not 5x faster: at its peak, .gz.tar is 17x faster than the standard approach, cutting compression time from nearly an hour to just 3 minutes. How did I achieve such a massive time reduction?

parallel gzip ::: * && cd .. && tar -cf archive.tar dir/

These results are from an environment with virtually no I/O bottleneck. Your results will scale in proportion to your thread count and drive speed.

Using GNU Parallel to Create Archives Faster:


GNU Parallel is easily one of my favorite packages and a daily staple when scripting. Parallel makes it extremely simple to multiplex terminal “jobs”. A job can be a single command or a small script that has to be run for each of the lines in the input. The typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables. A job can also be a command that reads from a pipe. GNU parallel can then split the input and pipe it into commands in parallel.
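As a minimal illustration of both input styles, the ::: argument syntax and piped input (the CSV file names here are hypothetical):

# One gzip job per argument listed after :::
parallel gzip ::: trades_2019.csv trades_2020.csv trades_2021.csv

# The same job queue, but fed from stdin
printf '%s\n' *.csv | parallel gzip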

In the above benchmarks, we are seeing massive time reductions by leveraging all cores during the compression process. In the command, I am using parallel to create a queue of gzip /dir/file commands that are then run asynchronously across all available cores. This prevents bottlenecking and improves throughput compared to compressing everything with the standard single-threaded tar -czf command.

Consider the following diagram to visualize why .gz.tar allows for faster compression:

[Figure: Multi-Threading Visualization]

GNU Parallel Examples:


To recursively compress or decompress a directory:

find . -type f | parallel gzip
find . -type f | parallel gzip -d
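By default, Parallel starts one job per CPU core. If you want to leave headroom on a shared machine, the concurrency can be capped with -j, which accepts either a fixed job count or (in recent versions of GNU Parallel) a percentage of the available cores. A sketch; the exact numbers are up to you:

find . -type f | parallel -j 8 gzip
find . -type f | parallel -j 75% gzip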

To compress your current directory into a .gz.tar:

parallel gzip ::: * && cd .. && tar -cvf archive.tar dir/to/compress
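To reverse the process, untar the archive and then decompress the files in parallel. A sketch, assuming the archive was created with the command above:

tar -xf archive.tar && find dir/to/compress -type f -name '*.gz' | parallel gzip -d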

[Figure: Multi-Threading GNU Parallel HTOP]

Below are my personal terminal aliases:

alias gz-="parallel gzip ::: *"
alias gz+="parallel gzip -d ::: *"
alias gzall-="find . -type f | parallel gzip"
alias gzall+="find . -name '*.gz' -type f | parallel gzip -d"

Scripting GNU Parallel with Python:


The following Python script builds shell commands that recursively compress or decompress a given path.

To compress all files in a directory into a tar named after the folder:
./gztar.py -c /dir/to/compress

To decompress all files from a tar into a folder named after the tar:
./gztar.py -d /tar/to/decompress

#! /usr/bin/python
# This script builds bash commands that compress files in parallel
import os, argparse

def compress(dir):
    # Gzip every file in parallel, then archive the compressed files into a tar
    os.system('find ' + dir + ' -type f | parallel gzip -q && tar -cf '
              + os.path.basename(dir) + '.tar -C ' + dir + ' .')

def decompress(tar):
    # Extract the tar into a folder named after it, then decompress its files in parallel
    d = os.path.splitext(tar)[0]
    os.system('mkdir ' + d + ' && tar -xf ' + tar + ' -C ' + d +
              ' && find ' + d + ' -name "*.gz" -type f | parallel gzip -qd')

p = argparse.ArgumentParser()
p.add_argument('-c', '--compress', metavar='/DIR/TO/COMPRESS', nargs=1)
p.add_argument('-d', '--decompress', metavar='/TAR/TO/DECOMPRESS.tar', nargs=1)
args = p.parse_args()

if args.compress:
    compress(args.compress[0])
if args.decompress:
    decompress(args.decompress[0])

Multi-Threaded Compression Using Pure Python:


If for some reason you don't want to use GNU Parallel to queue commands, I wrote a small script that uses exclusively Python (no Bash calls) to multi-thread compression. Since the Python GIL is notorious for bottlenecking concurrency, the script sidesteps it by using the multiprocessing module to run compression in separate worker processes. This implementation also offers a CPU throttle flag, a remove-after-compression/decompression flag, and a progress bar during the compression process.

  1. First, make sure you have all the necessary pip modules:
    pip install tqdm
  2. Second, link the gztar.py file to /usr/bin:
    sudo ln -s /path/to/gztar.py /usr/bin/gztar
  3. Now compress or decompress a directory with the new gztar command:
    gztar -c /dir/to/compress -r -t

[Figure: Multi-Threading Pure Python HTOP]

#! /usr/bin/python
## A pure python implementation of parallel gzip compression using multiprocessing
import os, gzip, tarfile, shutil, argparse, tqdm
import multiprocessing as mp

#######################
### Base Functions
###################
def search_fs(path):
    file_list = [os.path.join(dp, f) for dp, dn, fn in os.walk(os.path.expanduser(path)) for f in fn]
    return file_list

def gzip_compress_file(path):
    # Compress a single file in place, then remove the original
    with open(path, 'rb') as f:
        with gzip.open(path + '.gz', 'wb') as gz:
            shutil.copyfileobj(f, gz)
    os.remove(path)

def gzip_decompress_file(path):
    # Decompress a single .gz file in place, then remove the compressed copy
    with gzip.open(path, 'rb') as gz:
        with open(path[:-3], 'wb') as f:
            shutil.copyfileobj(gz, f)
    os.remove(path)

def tar_dir(path):
    with tarfile.open(path + '.tar', 'w') as tar:
        for f in search_fs(path):
            tar.add(f, f[len(path):])

def untar_dir(path):
    with tarfile.open(path, 'r:') as tar:
        tar.extractall(path[:-4])

#######################
### Core gztar commands
###################
def gztar_c(dir, queue_depth, rmbool):
    files = search_fs(dir)
    with mp.Pool(queue_depth) as pool:
        r = list(tqdm.tqdm(pool.imap(gzip_compress_file, files),
                           total=len(files), desc='Compressing Files'))
    print('Adding Compressed Files to TAR....')
    tar_dir(dir)
    if rmbool:
        shutil.rmtree(dir)

def gztar_d(tar, queue_depth, rmbool):
    print('Extracting Files From TAR....')
    untar_dir(tar)
    if rmbool:
        os.remove(tar)
    files = search_fs(tar[:-4])
    with mp.Pool(queue_depth) as pool:
        r = list(tqdm.tqdm(pool.imap(gzip_decompress_file, files),
                           total=len(files), desc='Decompressing Files'))

#######################
### Parse Args
###################
p = argparse.ArgumentParser(description='A pure python implementation of parallel gzip compression archives.')
p.add_argument('-c', '--compress', metavar='/DIR/TO/COMPRESS', nargs=1, help='Recursively gzip files in a dir then place in tar.')
p.add_argument('-d', '--decompress', metavar='/TAR/TO/DECOMPRESS.tar', nargs=1, help='Untar archive then recursively decompress gzip\'ed files')
p.add_argument('-t', '--throttle', action='store_true', help='Throttle compression to only 75%% of the available cores.')
p.add_argument('-r', '--remove', action='store_true', help='Remove TAR/Folder after process.')
arg = p.parse_args()
### Flags
if arg.throttle:
    qd = round(mp.cpu_count()*.75)
else:
    qd = mp.cpu_count()
### Main Args
if arg.compress:
    gztar_c(arg.compress[0], qd, arg.remove)
if arg.decompress:
    gztar_d(arg.decompress[0], qd, arg.remove)

Conclusion:


When dealing with large archives, use GNU Parallel to reduce your compression times! While there will always be a place for .tar.gz (especially for small directories like build packages), .gz.tar provides scalable performance on modern multi-core machines.

Happy Archiving!

Justin Timperio
Researcher and Freelance Dev

Justin is a freelance developer and private researcher.
