Amar Mukherjee, Principal Investigator
School of Electrical
Engineering and Computer ScienceUniversity of Central Florida
Orlando, FL
32816
Contact Information
Contact Name of PI:
Amar Mukherjee
School of Electrical Engineering and Computer Science
University of Central Florida
Orlando, FL 32816
Phone: (407)
823-2763
Fax: (407) 823-5419
Email: amar@cs.ucf.edu
WWW PAGE
http://vlsi.cs.ucf.edu/datacomp/nfs/report/nfsreport.html
List of Supported Students
Nan Zhang,
Graduate Research Assistant
Raja Iqbal, Graduate Research
Assistant
Nitin Motgi, Graduate Research Assistant
Research Collaborators
Dr. Robert
Franceschini
Mr. Holger Kruse
Project Award Information
Keywords
Data Compression, Lossless
Compression, Text Compression, Wide-area network, Caching.
Project Summary
The goal of this research
project is to develop new lossless text compression algorithms and software
tools to incorporate compression in MIME/HTML standards. The approach consists
of encoding the text to exploit the natural redundancy of a language via the use
of a dictionary and then compressing it using a pre-existing compression
algorithm. The encoding scheme depends on the specific characteristics of the
compression algorithm and the Bzip2 algorithm based on Burrows-Wheeler transform
is used as the main backend compression algorithm. A basic understanding of the
interaction of the encoding schemes and the compression algorithms is being
developed. The performance of the algorithms is being measured taking into
account both compression and communication metrics. Infrastructure tools are
being developed using dynamic caching of dictionaries to embed compression into
MIME/HTML standards. The impact of the research on the future of information
technology is to develop data delivery systems where communication bandwidth is
at a premium and archival storage is an exponentially costly endeavor. It is
expected that the new lossless text compression algorithms will have 3 to 30%
improved compression ratio over the best known pre-existing compression
algorithms which might translate into a reduction of more than 70% of the text
traffic on the Internet. The experimental research is linked to educational
goals via rapid dissemination of results via reports, conference and journal
papers, doctoral dissertation and master’s thesis, and transferring the research
knowledge into the graduate curriculum. Software tools developed under this
grant will be shared via a web site.
Goals and Objectives
The major goal of the
project is to develop data delivery systems where communication bandwidth is at
a premium and archival storage is an exponentially costly endeavor. Specific
objectives for this period were:
Method of Approach
The basic philosophy
of our compression algorithm is to transform the text into some intermediate
form, which can be compressed with better efficiency. The transformation is
designed to exploit the natural redundancy of the language. We have developed a
class of such transformations each giving better compression performance over
the previous one and all of them giving better compression over most of the
current and classical compression algorithms (viz. Bzip2 (based on Burrows
–Wheeler Transform), the class of PPM (Partial Predicate Match) algorithms (such
as PPMA, PPMC, PPMD, PPMZ and PPM*), Gzip (based on LZW algorithms), Arithmetic
and Huffman algorithms). We also measured the execution times needed to produce
the pre-processing and its impact on the total execution time is
negligible. The algorithm has a fixed amount of storage overhead in the
form of a word dictionary for the particular corpus of interest and must be
shared by all the users. Typical size of dictionary for the English language is
about 1 MB and can be downloaded using caching and memory management techniques
that are being developed for use in the context of the Internet technologies. If
the compression algorithms are going to be used over and over again, which is
true in all practical applications, the amortized storage overhead is negligibly
small and is no more than the backend compression algorithm used after the
transformation. Our recommendation is to use the Bzip2 algorithm for backend
compression along with our proposed transformation. The combined effect is to
produce the best average compression ratio of 2.00 BPC (bits per character) over
the Calgary and Canterbury corpus with no appreciable degradation over the
execution time or storage overhead. We also have measured the impact of using
these algorithms on the transmission time and bandwidth over the Internet for
text file transfer using both http and ftp protocols. On the average we
established a 500% improvement of transmission time and 70% reduction of
bandwidth requirements for streamed text data.
The basic idea underlying the
transformations that we invented is that one can replace letters in a word by a
special placeholder (*) and retain at most two characters of the original
word. Given an encoding, the original word can be retrieved from a
dictionary that contains a one-to-one mapping between encoded words and original
words. The encoding produces an abundance of * characters in the
transformed text making it the most frequently occurring character. The
transformed text can be compressed better when all the available compression
algorithms are applied except in one case: the Bzip2 algorithm. For
Bzip2, the run-length encoding step at the beginning destroys the benefit of
*-encoding. For Bzip2, we propose three new transformations called Fixed Context
Length Preserving Transformation (LPT), Fixed Context Reverse LPT (RLPT) and
shortened-context LPT (SCLPT) all of which improve the compression performance
and uniformly beat the best of the available compression algorithms over an
extensive text corpus.
Targeted Activities, Summary of Results and Indication
of Success
Our activities have been directed to achieve results
in all three areas of goals and objectives stated earlier. The major results of
our achievements can be summarized as follows.
1. Performance Measurements of the Compression
Algorithms
We have implemented the four lossless reversible text compression algorithms
(*-encoded, LPT, RLPT and SCLPT)
as discussed above
and measured their compression performance using a combination of the Canterbury
and Calgary
corpus
(See Table 1) to do our testing and compared them with the existing algorithms.
The results are summarized as
follows.
1. When
compared to the standard compression algorithms, the *-encoded compression
algorithms produce a
compression improvement of as high as 33% over Huffman and arithmetic
algorithms, about 10% over Unix
compress and 9% to 19% over Gzip algorithms and average 1.7% over Bzip2 and 0.3%
over PPMD algorithms.
The last two algorithms are known to perform the best compression so far in the
literature. The improvement over
Bzip2 algorithms is not always guaranteed because of the effect that we
explained earlier. Fig. 1 illustrates typical
performance results.

Figure 1. Comparison of BPC on plrabn12.txt
(Paradise Lost) with original file and star converted file using
different compression algorithms.
2. The three new algorithms
that we proposed (LPT, RLPT and SCLPT) produce further improvement uniformly
over
the
corpus. LPT has an average improvement of 4.4% on Bzip2 and 1.5% over PPMD; RLPT
has an average
improvement of 4.9% on Bzip2 and 3.4% over PPMD. The SCLPT has an average
improvement of 7.1% on Bzip2
and 3.8%
over PPMD. The compression ratios are given in terms of average BPC (bits per
character) over our test
corpus.
The results of comparison with Bzip2 and PPMD are shown only. As can be seen
from this table, our results
indicate
the best average BPC of 2.147 which beats the best results available in the
literature. See Table 2, Fig.2 and
Fig.3.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Table 2. Average BPC over the corpus with Original, Star-encoded, LPT, RLPT, and
SCLPT using
Bzip2 and PPMD.
2. Execution Time Measurements
In making
average timing measurements, we chose to compare with the Bzip2 algorithm, which
so far has given one of the best compression ratios with lowest execution time.
We also compare with the family of PPM algorithms, which gives the best
compression rations but are very slow.
Note the above times are only for encoding and one can afford to spend
more time off line encoding files, particularly if the difference in execution
time becomes negligibly small with increasing file size which is true in our
case. Our initial measurements on decoding times show no significant
differences.
When we compared the average execution times with PPMD, the results are as follows:
Figure 2. Compression BPC over the corpus with different encoding scheme
using Bzip2.
Figure 3.
Compression BPC over the corpus with different encoding schemes using
PPMD.
Original:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Star-transform:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
LPT
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
RPT
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
SCLPT
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Table 3. Statistics of compressions in different context order for alice.txt
with PPMD
2. Explanation of Observed Compression
Performance
Recent research [A3][A4] shows that both the BWT and
PPM algorithms explore the representation of context in the text either
implicitly or explicitly. The longer and more deterministic the context is,
i.e., higher order context, the higher is the probability estimation of the next
predicted character leading to better compression ratio. All our transforms aim
to create such contexts. Table 3 gives the statistics of the context for the
sample text “alice.txt” using PPMD, as well as our four transforms along with
PPMD. The first column is the order of the context. The second is the number of
input bytes encoded in that order. The last row is the summation of all the
bytes compressed at different context levels, namely, the file size. The third
column is the cumulative percentage of column two. The fourth is the final
number of bits used to encode all the input in that order. The fifth column
shows the cumulative frequency of column four. The sixth is the corresponding
bytes for bits statistics. The seventh is the compression ration of that order
and the last column shows the BPC in that order.
Table 3 shows that length
preserved transforms (*, LPT and RLPT) result in a higher percentage for high
order contexts than the original PPMD algorithm. The advantage is accumulated in
the different orders to gain an overall improvement. Although SCLPT has a
smaller value of high order context compressions, it is compensated by
compression on the deterministic contexts beforehand that could be in a higher
order with long word length. We are continuing to research the precise
relationship between the statistics of different order context and the overall
compression ratio.
3. Compressed Data Delivery via the Internet
Due to the increased growth in the traffic (refer to Table 4 from the
NSF Website) there is a need to compress data transferred over the network. If
not, growth would lead to congestion, drowning the Internet with the amount of
data in transition. Web browsing, FTP transfers and electronic mail
generate most of the traffic on internet, which transfer basic text code as
payload (viz., HTML, plain/text part in mails) with additional payloads of
different formats (like attachments, inline images, word processing documents,
etc.). One of the objectives of our project is to minimize the part of the basic
payload that is plain text. Reduction in text data transfer over the
Internet will help in using the bandwidth for other types of data (like video
and audio) transferred on the Internet. In our research, we incorporated new
lossless text compression algorithms into file transfers that are made over some
of the standard network protocols e.g., FTP, HTTP, POP3 and SMTP. These
algorithms will be implemented as a part of the standard protocols defined at
Session-Application layer over the transport layer in the OSI Model. The
compressed file is transferred over the TCP/IP Network (official protocol for
the Internet).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Table 4. Current and Projected transfer over Internet in Gigabits/sec
(The data transfer include normal text, image, video and audio)
We are in the process of testing the above method for different compression/decompression methods, which have performed 3% to 30% better than standard compression algorithms (viz. Huffman, arithmetic, LZW, DMC, PPM*, BWT) under different network conditions and server loads. We will be evaluating several performance metrics like payload transform time (includes transformation and compression), payload transfer time and payload reverse transform time (includes decompression and reverse transformation). During the transfers there are different other parameters that will be evaluated, such as server performance, bandwidth use over high and low bandwidths. These tests will reflect the overall performance in the transfer of files over the Internet. We will be also looking into designing a protocol for managing a dictionary on the clients. This management should incorporate some of the features like dictionary update and propagation, different dictionaries for different languages and should also have versioning capabilities.
As for the initial stages in this approach we have procured all the material needed and have built the infrastructure that is necessary to start. We are developing a real-time compression test facility, which will be capable of testing encoding/compression algorithms for the above-mentioned parameters and for different standard protocols like FTP, HTTP, POP3 and SMTP. This facility will log different parameters mentioned above plus some miscellaneous information, which can be later used to evaluate the performance of the compression algorithms on text transfers over Internet.
The basic test facility has been developed to perform Internet transfers using FTP. The results obtained from this are very promising. The transfers were initiated by normal FTP clients from people on the same network as well as from different networks. These networks are at different distances from the location where the server runs. The compression method is employed on the server and we used the Bzip2 algorithm for compression as a starting point without any of the transforms RPT, LPT or SCLPT. Results of the test runs are shown in Table 5.
As we can see in Table 5, the size of the files transferred is in the range
of 0.3KB to 500KB. Our method always shows improvement in the transfer even if
we include time involved in compressing the data. In one example, the size of
the uncompressed data transferred is 3420.72 Kbytes and time taken to transfer
that amount of data is 458.057 sec. When the same data is compressed and
transferred, total data transferred is 743.097 Kbytes and time taken is
50.17sec. The time taken to compress 3420.72 Kbytes of data is 5.971sec (server
was assumed to slightly loaded). Hence, 2677.63Kbytes of data transfer and
(458.057 – 50.17 – 5.971) sec = 401.916sec was saved. If we used any of
the transforms mentioned earlier we will spend some amount of time in applying
the transform but will gain in reducing the transmission time since data is much
more compressed
|
|
|
|
|
(Transformation Time, Compression Time) |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Table 5: Payload transformation and transmission using Bzip2 compression algorithm at the server. Three stations were used: LAB located at UCF, FSC stands for Florida SolarEnergy Center, and IND is a place located in India. N= not compressed, C=Compressed.
Project Impact
We expect that our research
will impact the future status of information technology by developing data
delivery systems where communication bandwidths is at a premium and archival
storage is an exponentially costly endeavor. We have developed new lossless text
compression algorithms that have 3 to 30% improved compression ratio over the
best known pre-existing compression algorithms which might translate into a
reduction of more than 70% text traffic on the Internet. We are developing
software tools to include compression in standard Internet protocols, and we are
planning to automate dynamic caching interaction with web browsers, e-mail
programs and authoring tools.
We have taught (in the spring 2000 semester) a
new course entitled "Multimedia Compression on the Internet":
http://www.cs.ucf.edu/courses/cap5937-01/. This is a graduate level course with
14 students enrolled in the Spring 2000 semester. This particular topic
has grown directly out of the research that we have been conducting for the last
couple of years on data compression. Lecture topics have included both
text and image compression, including topics from the research on the current
NSF grant.
So far three Ph.D. students and five Masters students have
participated and contributed in this research project, but not all of them
received direct support from the grant. Dr. Robert Franceschini and Mr. Holger
Kruse made significant contributions in the project before it was officially
funded by NSF. Currently, one Ph. D. student (Nan Zhang ) and two Masters
Students (Nitin Motgi and Raja Iqbal) are working on the project. Another Master
student (Fauzia Awan) did a term project in the graduate course “Multimedia
Compression on the Internet” directly related to this research project. Other
members of the M5 Research Group at the School of Electrical Engineering and
Computer Science, Dr. Kunal Mukherjee, Tao Tao, and Piyush Jamkhandi, made
critical comments and observation during the course of this work. The
overall effect of these activities is to train graduate students with the
current research on the forefront of technology.
Project References
This work was started
several years ago; two early papers that established this work are the
following:
1. R. Franceschini and A. Mukherjee. "Data
Compression Using Encrypted Text", Proceedings of the Third Forum on Research
and Technology, Advances in
Digital
Libraries, ADL96, May 13-15 1996, pp 130-138.
2. H. Kruse and A.
Mukherjee. "Data Compression Using Text Encryption", Proceedings of
the Data Compression Conference, 1997, IEEE Computer Society
Press, pp. 447.
We have submitted a journal version of a paper [1] and two other papers [2,3]
are under preparation for submissions to conferences in the near future. Copies
of these papers will be made available via our project website and also by hard
copy
1. R. Franceschini, H. Kruse, N. Zhang, R. Iqbal and A. Mukherjee,
“Lossless, Reversible Transformations that Improve Text Compression Ratios”,
submitted to
IEEE Transactions on Multimedia Systems
(June 2000).
2. F. Awan, R. Iqbal and A. Mukherjee, “ A New Text
Preprocessing Algorithm for Bzip2 and PPM*” (in preparation).
3. N. Motgi
and A. Mukherjee, “ High Speed Text Data Transmission over Internet Using
Compression Algorithm” (in preparation)
Area Background
In the last decade of the
20th century we have experienced an unprecedented increase in the amount of
digital data transmitted across public networks, in particular across the
Internet. Recent standards such as the World Wide Web, HTTP, HTML and image
standards have provided free access to a vast amount of information available
from public servers. With the continuously increasing use of the Internet the
efficient use of available resources, in particular hardware resources, has
become a primary concern. One way of improving performance is by compressing and
caching data, i.e. by ensuring that data is not sent more often than necessary,
that as little data is sent in response to each request as possible, and that
all data transmitted across networks is compressed during transmission, to
increase the efficiency by which the existing infrastructure is utilized.
This proposal is concerned with integrating new data compression and caching
techniques for improving the efficiency of bandwidth-limited Internet. We
propose new algorithms to preprocess text in order to improve the compression
ratio of textual documents, in particular online documents such as web pages on
the World Wide Web. We also develop similar techniques for images that do
not change with time. We describe how our algorithm can be efficiently used to
aid in the compression of documents consisting of large texts and static images,
for transmission across public data networks such as the Internet, using
existing standards such as HTTP and HTML. The compression algorithms also can be
used to utilize archival storage effectively. Our approach integrates the
two basic ideas of compression and caching into a single mechanism, i.e., data
is automatically cached whenever possible, avoiding retransmission altogether,
and if data has to be transmitted, it is compressed in a very efficient way,
which surpasses the performance of compression algorithms that are widely used
at this time.
Area References
A1. I.H. Witten, A. Moffat,
and T.C. Bell. Managing Gigabytes. 2nd Ed, Morgan Kaufmann,
1999. Managing Gigabytes Web Site
A2. K. Sayood.
Introduction to Data Compression. 2nd Ed, Morgan Kaufmann, 2000.
Introduction to Data Compression Web Site
A3. J. Cleary, W. Teahan,
Unbounded Length Contexts for PPM, The Computer Journal, Vol. 36, No. 5, 1993.
(Also see Proc. Data Compression Conference,
Snowbird, Utah, 1997)
A4. Michelle Effros, PPM Performance with BWT
Complexity: A New Method for Lossless Data Compression, Proc. Data Compression
Conference, Snowbird,
Utah, March, 2000
Potential Related Projects
There is an
ongoing research project under the supervision of the Principal Investigator on
Wavelet Based Image Compression and Transmission supported by Intel Corporation.
A research proposal on this subject is under preparation for submission to NSF
in the near future.