Helsinki University of Technology
Laboratory of Information Processing Science

Hannu Peltola and Jorma Tarhio:

PPMZ for Linux

We have ported PPMZ to Linux: ppmz.tar.gz. See more details below.

General

PPM (Prediction by Partial Matching) is a classic compression algorithm. PPM predicts the probability of a given character based on the characters that immediately precede it. PPMZ is an efficient version of PPM. PPMZ was developed and implemented in C by Charles Bloom. The source code of PPMZ consists of two parts: ppmz and crblib.
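The idea of predicting a character from its predecessors can be sketched with a toy order-1 model, which conditions on just one preceding byte. This is only an illustration and not PPMZ's method: the names Order1Model, order1_train and order1_probability are hypothetical, and the sketch leaves out everything that makes PPM practical (longer contexts, escapes to shorter contexts, arithmetic coding).

```c
#include <stddef.h>

/* Hypothetical order-1 model, not taken from the PPMZ sources.
   counts[p][c] is how often byte c followed byte p. */
typedef struct {
    unsigned counts[256][256];
    unsigned totals[256];      /* totals[p] = sum of counts[p][*] */
} Order1Model;

/* Record every (previous byte, current byte) pair in the data.
   The model must be zero-initialized before the first call. */
static void order1_train(Order1Model *m, const unsigned char *data, size_t len)
{
    size_t i;

    for (i = 1; i < len; i++) {
        m->counts[data[i - 1]][data[i]]++;
        m->totals[data[i - 1]]++;
    }
}

/* Estimate P(c | prev) from the counts. Returns 0.0 for a context that
   has never been seen; a real PPM coder would instead escape to a
   shorter context at this point. */
static double order1_probability(const Order1Model *m,
                                 unsigned char prev, unsigned char c)
{
    if (m->totals[prev] == 0)
        return 0.0;
    return (double)m->counts[prev][c] / (double)m->totals[prev];
}
```

After training on the 11 bytes of "abracadabra", the byte 'a' is followed by 'b' in two of its four occurrences as a context, so order1_probability(&m, 'a', 'b') gives 0.5. The model occupies about 256 KB, so it is best declared static rather than on the stack.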

License

The source code is covered by the Bloom Public License.

Port of PPMZ to gcc

We have ported PPMZ v9.1 to Linux using gcc 2.95.2. Some routines could not be compiled without changes. We kept the changes minimal and made them with conditional compilation, using the identifier 'unix' to mark changes related to the operating system. Details are presented in the files ppmz/gcc.compile and crblib/gcc.compile.
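The conditional-compilation pattern can be sketched as follows. This is an illustration only: PATH_SEPARATOR and base_name are hypothetical names that do not come from the PPMZ or crblib sources, and the actual changes are the ones listed in the gcc.compile files. The sketch assumes that the compiler predefines 'unix' on Unix-like systems, which gcc does in its default (non-strict) mode as far as we know.

```c
/* Illustrative only: these names do not come from the PPMZ sources. */
#ifdef unix
#define PATH_SEPARATOR '/'    /* Unix-like systems */
#else
#define PATH_SEPARATOR '\\'   /* assumed default for other systems */
#endif

/* Return a pointer to the file-name part of a path, i.e. the text
   after the last separator, or the whole path if there is none. */
static const char *base_name(const char *path)
{
    const char *base = path;
    const char *p;

    for (p = path; *p != '\0'; p++) {
        if (*p == PATH_SEPARATOR)
            base = p + 1;
    }
    return base;
}
```

Keeping the original code in the #else branch means that the file still compiles unchanged on the original platform.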

Old versions are preserved with the extension '.old'. Unmodified versions are preserved with the extension '.original'.

The port is available as ppmz.tar.gz (0.6 MB). There is also an executable for processors using the i386 instruction set, as well as a statically linked executable. For safety reasons we recommend recompiling.

A port to Sparc is under construction.

Notes about PPMZ

PPMZ uses a lot of memory, especially with large input files. Any test that records running time should also report the available main memory.

"HeaderLen not included in report of results because all info in the header is not necessary for compression decompression. see ppmzhead.c for details":

Header consists of:

  • 4-char signature, for convenient decoding
  • ulong CRC, for convenient error-checking
  • ulong rawlen, so that the buffer size can be known (i.e. I can use fread instead of fgetc)
  • 3-ulong RunTransform info, again for array allocation (the fact that these are not needed is proved by the fact that they are not passed to UnRunTransform)
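The fields listed above could be pictured as a C structure along the following lines. This is a sketch under stated assumptions, not the definition used by PPMZ: the type name, the field names and the choice of a 32-bit integer for ulong are ours, and the authoritative layout is the one in ppmzhead.c.

```c
#include <stdint.h>

/* Hypothetical layout; the real definition is in ppmzhead.c.
   Assumes that ulong is a 32-bit unsigned integer. */
typedef struct {
    char     signature[4];      /* 4-char signature */
    uint32_t crc;               /* CRC, for error checking */
    uint32_t raw_len;           /* length of the uncompressed data */
    uint32_t run_transform[3];  /* RunTransform info, for array allocation */
} PpmzHeaderSketch;

/* Header length in bytes under the 32-bit assumption: 4 + 4 + 4 + 12. */
enum { PPMZ_HEADER_SKETCH_LEN = 4 + 4 + 4 + 3 * 4 };
```

Under these assumptions the header would take 24 bytes, which is negligible compared with the compressed sizes reported below.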

Test results

The PPMZ home page gives some test results for the files of the Calgary Corpus, which is also available from the Canterbury Corpus site. We repeated the tests with the ported version of PPMZ and got the following results:

PPMZ v9.1 results on the Calgary Corpus
file      raw size (bytes)      compressed (bytes)
          by Bloom     by us    by Bloom     by us
bib         111261    111261       24256     24256
book1       768771    768771      212733    212733
book2       610856    610856      143075    143074
geo         102404    102400       51635     59446
news        377109    377109      105725    105722
obj1         21504     21504        9854      9853
obj2        246814    246814       68804     68801
paper1       53161     53161       14772     14772
paper2       82199     82199       22749     22748
pic         513216    513216       50685     50685
progc        39611     39611       11180     11179
progl        71646     71646       13185     13185
progp        49379     49379        9122      9124
trans        93695     93695       14508     14508

Currently the Canterbury Corpus is the most popular compression benchmark. Below are the results for PPMZ and the best results reported on the results page of the Canterbury Corpus.

PPMZ v9.1 results on the Canterbury Corpus
file    raw size    compressed    bpc
                                  PPMZ     best reported
text      152089         39576    2.081    2.20
fax       513216         50685    0.790    0.77
Csrc       11150          2603    1.867    2.08
Excl     1029744        139748    1.085    0.83
SPRC       38240         11715    2.450    2.58
tech      426754         97460    1.827    1.95
poem      481861        133508    2.216    2.36
html       24603          6744    2.192    2.32
list        3721          1048    2.253    2.40
man         4227          1514    2.865    2.98
play      125179         36547    2.335    2.49
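
The bpc (bits per character) figures for PPMZ in the table above are consistent with 8 times the compressed size divided by the raw size. For the fax file, for example, 8 × 50685 / 513216 is about 0.790. A minimal sketch of that calculation, with a hypothetical function name of our own:

```c
/* Bits per character: each compressed byte carries 8 bits, and the
   total is divided by the number of characters (bytes) in the raw file. */
static double bits_per_char(unsigned long compressed_bytes,
                            unsigned long raw_bytes)
{
    if (raw_bytes == 0)
        return 0.0;   /* avoid division by zero for an empty file */
    return 8.0 * (double)compressed_bytes / (double)raw_bytes;
}
```

A lower bpc value means better compression; uncompressed data corresponds to 8 bpc.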


This page is maintained by Jorma Tarhio and Hannu Peltola, E-mail: tarhio at cs.hut.fi.
This page has been updated on April 10, 2002
URL: http://www.cs.hut.fi/u/tarhio/ppmz/