DATA COMPRESSION
By Lance Jensen
Executive Software Technical Support Director
Compression is a very valuable tool. It can dramatically increase the
amount of data you can store on a disk, which saves thousands of dollars for
many sites. But it is not intended to be used indiscriminately. The
information in this article will help you decide when and when not to
compress files.
Benefits and Costs of Compression
Compression has one invariable benefit: you can fit up to twice the data on
a disk. In addition, on smaller, simple partitions (under 2GB, and not part
of a RAID array, stripe set or mirror set), you may even get faster I/O.
But set against these benefits are several costs:
- CPU Utilization. Accessing a compressed file is a CPU-intensive action.
The mere act of reading such files can use over 60% of your CPU. This is no
problem if reading the file is all the system is doing, but it can impact
other operations that are running at the same time.
- Access Time. On volume sets and on partitions over 2GB, reading and
writing generally take longer if compression is used. In my experience,
reading and writing on 4.3GB or larger partitions always take longer if the
file is compressed; on partitions over 8GB, they take at least twice as long
because of the fragmentation inherent in compressed files. This is explained
in detail later in this article.
- Fragmentation. When you compress a file, it is written to a different
part of the disk; it may be written contiguously, and it may not. When you
compress an entire partition, it always fragments badly. You can see the
results on your own system by running the analysis tool in our Diskeeper
defragmenter before and after running compression.
- MFT (Master File Table) Fragmentation. File compression is achieved by
taking the first 16 clusters of the file, packing the data into as small a
space as possible and writing it to the disk. The Logical Cluster Number
(LCN) where it is written and the number of clusters the compressed data
uses are stored in the MFT. This is repeated for each remaining 16-cluster
increment. There is also a last entry which stores -1 instead of an LCN,
along with how many clusters are needed to decompress the last increment.
(For more information on the MFT, see "The Master File Table: What It Is
and What It's For", eLetter Volume 2, Issue 5).
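The per-increment bookkeeping described above can be sketched in a few lines
of Python. This is only an illustration under stated assumptions: the 4KB
cluster size, the starting LCN of 1000 and the function name build_run_list
are all hypothetical, zlib stands in for the compressor NTFS actually uses,
and the closing -1 entry is left out.

```python
import math
import zlib

CLUSTER_SIZE = 4096      # assumed cluster size in bytes (not given in the article)
CLUSTERS_PER_UNIT = 16   # the 16-cluster increment described in the article


def build_run_list(data: bytes, first_free_lcn: int = 1000):
    """Return one (lcn, cluster_count) pair per 16-cluster increment.

    LCNs are handed out sequentially from a made-up starting point; a real
    allocator would place each increment wherever free space happens to be.
    """
    unit_bytes = CLUSTER_SIZE * CLUSTERS_PER_UNIT
    runs = []
    next_lcn = first_free_lcn
    for offset in range(0, len(data), unit_bytes):
        unit = data[offset:offset + unit_bytes]
        raw_clusters = math.ceil(len(unit) / CLUSTER_SIZE)
        packed = zlib.compress(unit)  # stand-in compressor, not NTFS's own
        packed_clusters = math.ceil(len(packed) / CLUSTER_SIZE)
        # Never use more clusters than the uncompressed increment would need.
        used = min(packed_clusters, raw_clusters)
        runs.append((next_lcn, used))
        next_lcn += used
    return runs


if __name__ == "__main__":
    sample = bytes(200_000)  # 200,000 zero bytes: highly compressible
    for lcn, count in build_run_list(sample):
        print(f"LCN {lcn}: {count} cluster(s)")
```

For this sample the sketch produces four (LCN, cluster count) pairs, one per
increment, where a contiguous uncompressed file would need only one.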
Now, the actual compressed file may or may not be fragmented; it doesn't
really matter, because NTFS must always access a compressed file as if it
were fragmented. You see, the MFT entry of an uncompressed file contains
the LCN and size in clusters of the first fragment. If the file is
contiguous, that is all the data needed. But if the file is fragmented,
there is another set of LCN and cluster count data required for each
fragment. When the file is accessed, the system has to do an I/O for each
LCN and cluster count. Since that is what the MFT entry of a compressed
file looks like, the system must do an I/O for each 16-cluster increment.
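To get a feel for the numbers, the following Python sketch counts the LCN and
cluster count records (and so, by the reasoning above, the I/Os) needed for a
file. The cluster size is a parameter because this article does not state
one, and the function name and arguments are illustrative only.

```python
import math

CLUSTERS_PER_UNIT = 16  # the 16-cluster increment used by compression


def lcn_records(file_size_bytes: int, cluster_size_bytes: int,
                compressed: bool, fragments: int = 1) -> int:
    """Count LCN/cluster-count records for a file.

    An uncompressed file needs one record per fragment (one if contiguous).
    A compressed file is treated as needing one per 16-cluster increment,
    regardless of how it is actually laid out on disk.
    """
    if not compressed:
        return fragments
    clusters = math.ceil(file_size_bytes / cluster_size_bytes)
    return math.ceil(clusters / CLUSTERS_PER_UNIT)


if __name__ == "__main__":
    ten_mb = 10 * 1024 * 1024
    print(lcn_records(ten_mb, 4096, compressed=False))  # contiguous file
    print(lcn_records(ten_mb, 4096, compressed=True))
```

With an assumed 4KB cluster, a contiguous 10MB file needs 1 record when
uncompressed and 160 when compressed.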
But the real drawback of compression as regards the MFT is the size of the
MFT entries. If the file is large enough, there will be so many LCN and
cluster count records that the MFT entry will overflow, requiring at least
one additional MFT entry. If you compress enough files, and almost
certainly if you compress the entire partition, the MFT will fill the MFT
zone (the pre-allocated space, usually at the beginning of the disk). Any
new MFT entries will be written wherever the Next Free Space pointer happens
to point. These tiny fragments of MFT take longer to read because the
read/write head has to move to access them, and the fragments break up the
free space permanently. At this time there is no way to get rid of them or
to defragment the MFT short of reformatting the partition.
This may not seem serious, but it can get out of hand very quickly. In one
of our tests, we compressed a 271MB file; it resulted in 467 extra MFT
entries!
All of these points result in slower performance, and therefore less
production. Adding more hard disk space lets you avoid compressing files,
and makes your system run faster. If spending $1,000 for new disks allows
you to bring in another $100 per week, then the disks will pay for
themselves in ten weeks, and the rest is pure profit.
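The payback arithmetic above is simple enough to check for your own figures.
A minimal Python sketch, with a hypothetical function name and using the
article's $1,000 and $100 per week as sample inputs:

```python
def payback_weeks(disk_cost: float, extra_income_per_week: float) -> float:
    """Weeks until new disks pay for themselves from the added income."""
    if extra_income_per_week <= 0:
        raise ValueError("extra income per week must be positive")
    return disk_cost / extra_income_per_week


if __name__ == "__main__":
    print(payback_weeks(1000, 100))  # the article's example figures
```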
Stated simply, you can weigh the value of compression against the
performance hit you will take in using it. If you lose little to no
performance, fine. But if saving the disk space is going to seriously impact
performance, you won't save money on disk space; you will lose it in
performance and man-hours.
When is Compression Worth Doing?
The basic underlying reason for using compression has always been to reduce
the cost of data storage in those cases where there would be little or no
price to pay in terms of system performance. These are the most common
cases where compression is worthwhile:
- If a partition is used for archive, and you don't access it frequently,
compression may be worthwhile.
- If a partition is under 2GB and is not a volume set of any kind, and if
you never exceed 40% CPU utilization, compression should be worthwhile.
- If the performance hit can be balanced against the disk space saved, then
compression is worthwhile.
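The first two rules of thumb above can be written down as a simple check.
This Python sketch is only one reading of them; the function and parameter
names are hypothetical, and the third rule (balancing the performance hit
against the space saved) is a judgment call that is left to you.

```python
def compression_worthwhile(is_archive: bool,
                           accessed_frequently: bool,
                           partition_gb: float,
                           is_volume_set: bool,
                           peak_cpu_percent: float) -> bool:
    """Apply the article's first two rules of thumb to one partition."""
    # Rule 1: an archive partition that is not accessed frequently.
    if is_archive and not accessed_frequently:
        return True
    # Rule 2: under 2GB, not a volume set of any kind, and CPU utilization
    # that never exceeds 40%.
    if partition_gb < 2 and not is_volume_set and peak_cpu_percent <= 40:
        return True
    return False


if __name__ == "__main__":
    # A 9GB archive partition that is rarely touched.
    print(compression_worthwhile(True, False, 9.0, False, 75))
    # A busy 4.3GB data partition.
    print(compression_worthwhile(False, True, 4.3, False, 35))
```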
-------------------------------------------------------------
Lance Jensen is Executive Software's ace Tech Support Director and has extensive experience with both Windows NT and Digital's OpenVMS operating systems. He can be reached at dknt_support@executive.com. Please feel free to write to him with questions or comments about this article.