GitHub - league/rngzip: Type-based XML compression

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 91 Commits
doc		doc
libs		libs
net/contrapunctus		net/contrapunctus
tests		tests
.boring		.boring
CAVEATS		CAVEATS
COPYING		COPYING
INSTALL		INSTALL
LICENSE		LICENSE
Makefile		Makefile
Makefile.local		Makefile.local
README		README
USAGE		USAGE

Repository files navigation

Introduction
============

RNGzip is a schema-based compressor for XML.  This file documents
version 0.6 (July 2007) by Christopher League.  See also:

> <http://contrapunctus.net/league/haques/rngzip/>

RNGzip uses information from the schema of an XML document to achieve
better compression (usually) than is possible with a general-purpose
tool such as gzip.  We use the Relax NG schema language, which is
flexible, yet formally well-defined.  A separate tool called
[trang](http://www.thaiopensource.com/relaxng/trang.html) can convert
other schema formats to RNG.

To compress an XML stream with RNGzip, it *must* be valid according to
some schema.  Documents that don’t match the schema are literally not
compressible with this technique.  More importantly, to decompress,
you must provide a copy of _precisely the same_ schema that was used
to compress the document.  This is one of the major caveats for using
RNGzip.  When your schema evolves, be sure to give it a new name or
version number; otherwise, your compressed data may become
inaccessible.

Here is a simple use case.  The [National Center for Biotechnology
Information](http://www.ncbi.nih.gov/) at the NIH provides lots of
data, optionally available in various XML formats.  Here is an example
of a search result from PubMed:

    % head -20 pubmed01.xml 
    <?xml version="1.0"?>
    <!DOCTYPE PubmedArticleSet PUBLIC 
        "-//NLM//DTD PubMedArticle, 1st November 2004//EN" 
        "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/pubmed_041101.dtd">
    <PubmedArticleSet>
    <PubmedArticle>
        <MedlineCitation Owner="NLM" Status="In-Data-Review">
            <PMID>16102004</PMID>
            <DateCreated>
                <Year>2005</Year>
                <Month>08</Month>
                <Day>16</Day>
            </DateCreated>
            <Article PubModel="Print">
                <Journal>
                    <ISSN>0950-382X</ISSN>
                    <JournalIssue>
                        <Volume>57</Volume>
                        <Issue>5</Issue>
                        <PubDate>
                            <Year>2005</Year>
                            <Month>Sep</Month>

The file is 96,264 bytes long.  If we compress it using gzip, the
result is just 9,215 bytes, or about 90% smaller.  This is pretty good
already, but RNGzip can do better.  Here is how it works.

First, we must convert the PubMed DTD into the Relax NG format:

    % trang http://www.ncbi.nlm.nih.gov/entrez/query/DTD/pubmed_041101.dtd \
            pubmed.rng

Next, we zip it:

    % java -jar rngzip.jar -v -s pubmed.rng pubmed01.xml 
    loading file:/Users/league/r/xml-group/src/rngzip/tests/ncbi/pubmed.rng
    building automaton... done
    pubmed01.xml              92.28% -- replaced with pubmed01.xml.rnz

The size of the result is 7,431, which is about 92% smaller than the
original and 19% smaller than the gzipped version.  

We can ask RNGzip to provide some information about this file,
including the URL of the schema, which was embedded for convenience.

    % java -jar rngzip.jar -iv pubmed01.xml.rnz 
    pubmed01.xml.rnz: 7431 bytes HUFFMAN/GZ/GZ 
      file:/Users/league/r/xml-group/src/rngzip/tests/ncbi/pubmed.rng(e6133345)

To decompress, we do not need to specify the schema again unless its
location has changed.

    % java -jar rngzip.jar -dvp2 pubmed01.xml.rnz
    loading file:/Users/league/r/xml-group/src/rngzip/tests/ncbi/pubmed.rng
    building automaton... done
    pubmed01.xml.rnz          91.28% -- replaced with pubmed01.xml

    % head pubmed01.xml
    <?xml version="1.1" encoding="UTF-8"?>
    <PubmedArticleSet>
      <PubmedArticle>
        <MedlineCitation Owner="NLM" Status="In-Data-Review">
          <PMID>16102004</PMID>
          <DateCreated>
            <Year>2005</Year>
            <Month>08</Month>
            <Day>16</Day>
          </DateCreated>

Notice that the spacing has changed slightly, and the DOCTYPE has been
dropped.  Unlike a generic compressor, RNGzip does not store the
ignorable spaces at all.  Instead, it can optionally reformat the XML
as it decompresses; the option `-p2` means to pretty-print with an
indentation of two spaces.  RNGzip also drops any DOCTYPE declarations
or processing instructions.

You may think that it is unfair, because gzip must preserve this
information that RNGzip discards.  But if we remove all the ignorable
space from the XML first, it doesn't make much difference in the size
of gzip's output anyway.

    % xmllint --noblanks pubmed01.xml > pubmed01.nb.xml
    % gzip -v pubmed01.nb.xml
    pubmed01.nb.xml:   86.5% -- replaced with pubmed01.nb.xml.gz

The XML with ignorable spaces dropped is 62,377 bytes, and compressed
by gzip it is 8,432.

In general, the more ‘taggy’ an XML document is — and the more
precisely those tags are constrained by the schema — the better the
compression will be.  Less taggy XML documents will essentially
degenerate into the compression ratio of gzip, plus a small overhead.

For example, the plays of Shakespeare have been marked up in XML by
Jon Bosak.  These represent mostly text, with light markup to indicate
who is speaking, and where lines begin and end.  In these examples,
RNGzip does roughly the same as gzip, on average.

    % ls -l taming.xml*
    -rw-r--r-- league  200,918  Jul 15  1999  taming.xml
    -rw-r--r-- league   53,479  Jul  7 12:31  taming.xml.gz
    -rw-r--r-- league   52,820  Jul  7 12:31  taming.xml.rnz

The compressed forms are both roughly 73% smaller than the original,
with the RNGzip version just 1.2% smaller than the gzipped one.

You may be aware that gzip does not yield the best compression ratio
of any generic compressor; rather, it is a good trade-off between
speed and size.  The block-sorting compressor bzip2 tends to yield
much smaller files, although it is slower than gzip.  For example,
taming.xml.bz2 is just 39,502 bytes, beating RNGzip by 25%!

This is because RNGzip actually uses a generic compressor internally,
and by default, it uses gzip (zlib).  To make RNGzip more competitive
against bzip2, a Java implementation of bzip2 could be substituted
within RNGzip.  Incidentally, on the PubMed example above, RNGzip
(with zlib) still beats bzip2 by 9%.

----
» [Installation](INSTALL.html)