|
Tuning Garbage Collection with the 1.3.1 Java Virtual
Machine |
See also Performance
Docs |
Introduction
The Java 2 Platform is increasingly used for large server
applications such as web services. These applications demand scalability,
and directly benefit from large numbers of threads, processors, sockets and
memory. Yet 'big iron' performance has a reputation as an art form,
requiring special expertise beyond what is needed for performance on smaller
systems. Fortunately, the Java Virtual Machine (JVM)* and Solaris
operating environment provide effective implementations of threads, I/O and
memory management. This document addresses a common speed bump on the road
to scalable high performance: poorly tuned garbage collection
(GC).
Amdahl observed that most workloads cannot be perfectly
parallelized; some portion is always sequential and does not benefit from
parallelism. This is also true for the Java 2 Platform. In
particular, JVMs up to and including version 1.3.1 do not have parallel garbage
collection, so the impact of GC on a multiprocessor system grows relative
to an otherwise parallel application.
The graph below models an ideal system that is perfectly scalable with the
exception of GC. The top line (red) is an application spending only 1% of
the time in GC on a uniprocessor; this translates into more than 20% loss in
throughput at 32 processors. At 10%, not considered an outrageous amount
of time in GC in uniprocessor applications, more than 75% of throughput is lost
when scaling up.
This demonstrates that issues that appear lost in the noise when developing
on small systems may become principal bottlenecks when scaling up. The
silver lining is that small improvements in such a bottleneck can produce large
gains in performance. For a sufficiently large system it becomes well
worthwhile to tune garbage collection.
This document is written from the
perspective of 1.3.1 JVM on the Solaris (SPARC Platform Edition) operating
environment, because that platform provides the most scalable hardware/software
Java 2 platform today. However, the descriptive text applies to other
supported platforms, including Linux, Microsoft Windows, and the Solaris (Intel
Architecture) operating environment, to the extent that scalable hardware is
available. Although command line options are consistent across platforms,
some platforms may have different defaults than described here.
Generations
One of Java 2 Platform's great strengths is that it shields
the substantial complexity of memory allocation and garbage collection from the
developer. However, once GC has become the principal bottleneck, it
becomes worth understanding aspects of this hidden implementation. Garbage
collectors make assumptions about the way applications use objects, and these
are reflected in tunable parameters that can be adjusted for improved
performance without sacrificing the power of the abstraction.
An object
is garbage when it can no longer be reached from any pointer in the running
program. The most straightforward garbage collection algorithms simply
iterate over every reachable object; any objects left over are then known to be
garbage. This approach takes time proportional to the number of living
objects, which is prohibitive for large applications maintaining lots of living
data.
The 1.3 JVM incorporates a number of different garbage collection
algorithms that are combined using generational collection. While
naive garbage collection examines every living object in the heap, generational
collection exploits several empirically observed properties of most applications
to avoid extra work.
The most important of these properties is infant
mortality. The blue area in the diagram below is a typical
distribution for the lifetimes of objects. The sharp peak at the left
represents objects that can be reclaimed shortly after being allocated.
Iterator objects, for example, are often alive for the duration of a single
loop.
Some objects do live longer, and so the distribution stretches out to the the
right. For instance, there are typically some objects allocated at
initialization that live until the process exits. Between these two
extremes are objects that live for the duration of some intermediate
computation, seen here as the lump to the right of the infant mortality
peak. Some applications have very different looking distributions, but a
surprisingly large number possess this general shape. Efficient collection
is made possible by focusing on the fact that a majority of objects die
young.
To do this, memory is managed in generations: memory pools
holding objects of different ages. Garbage collection occurs in each
generation when it fills up; these collections are represented on the diagram
above with vertical bars. Objects are allocated in eden, and
because of infant mortality most objects die there. When Eden fills up it
causes a minor collection, in which some surviving objects are moved to
an older generation. When older generations need to be collected there is
a major collection that is often much slower because it involves all
living objects.
The diagram shows a well-tuned system in which
most objects die before they survive to the first garbage collection. The
longer an object survives, the more collections it will endure and the slower GC
becomes. By arranging for most objects to survive less than one
collection, garbage collection can be very efficient. This happy situation
can be upset by applications with unusual lifetime distributions, or by poorly
sized generations that cause collections to be too frequent.
The
default garbage collection parameters were designed to be effective for most
small applications. They aren't optimal for many server
applications. This leads to the central tenet of this document:
If GC has become a bottleneck, you may wish to customize the generation
sizes. Check the verbose GC output, and then explore the sensitivity
of your individual performance metric to the GC
parameters. |
Types of collection
Each generation has an associated type of
garbage collection that can be configured to make different algorithmic
time, space and pause tradeoffs. In 1.3, the JVM implements three very
different garbage collectors:
- Copying (sometimes called scavenge): this collector very
efficiently moves objects between two or more generations. The source
generations are left empty, allowing remaining dead objects to be reclaimed
quickly. However, since it requires empty space to operate, copying
requires more footprint. In 1.3.1 copying collection is used for all
minor collections.
- Mark-compact: this collector allows generations to be collected in
place without reserving extra memory; however, compaction is significantly
slower than copying. In 1.3.1 mark-compact is used for major
collections.
- Incremental (sometimes called train): this collector is used
only if -Xincgc is passed on
the command line. By careful bookkeeping, incremental GC collects just a
portion of the old generation at a time, trying to spread the large pause of a
major collection over many minor collections. However, it is even slower
than mark-compact when considering overall throughput.
Since copying
is very fast, a tuning goal is to collect as many objects as possible by copying
rather than by compaction or incremental collection.
The default
arrangement of generations looks something like this.
At initialization, a maximum address space is virtually reserved but not
allocated physical memory unless it is needed. The complete address space
reserved for object memory can be divided into the young and old
generations.
The young generation consists of eden plus two
survivor spaces . Objects are initially allocated in eden.
One survivor space is empty at any time, and serves as the destination of the
next copying collection of any living objects in eden and the other survivor
space. Objects are copied between survivor spaces in this way until they
age enough to be tenured (copied to the old generation.)
(Other
virtual machines, including the production JVM version 1.2 for the Solaris
operating environment, used two equally sized spaces for copying rather than one
large eden plus two small spaces. This means the options for sizing the
young generation are not directly comparable; see the Performance FAQ
for an example.)
The old generation is collected in place by
mark-compact. One portion called the permanent generation is
special because it holds all the reflective data of the JVM itself, such as
class and method objects.
Performance considerations
There are two primary measures of garbage
collection performance. Throughput is the percentage of total time
not spent in garbage collection, considered over long periods of time.
Throughput includes time spent in allocation (but tuning for speed of allocation
is generally not needed.) Pauses are the times when an application
appears unresponsive because garbage collection is going on.
Users have
different requirements of garbage collection. For example, some consider
the right metric for a web server to be throughput, since pauses during garbage
collection may be tolerable, or simply obscured by network latencies. But
for an interactive graphical program, even short pauses may upset the user
experience.
Some users are sensitive to other considerations.
Footprint is the working set of a process, measured in pages and cache
lines. On systems with limited physical memory or many processes,
footprint may dictate scalability. Promptness is the time between
when an object becomes dead and when the memory becomes available, an important
consideration for distributed systems including RMI.
In general, a
particular generation sizing chooses a trade-off between these
considerations. For example, a very large young generation may maximize
throughput, but does so at the expense of footprint and promptness. Pauses
can be minimized by using a small young generation and incremental collection,
at the expense of throughput.
There is no one right way to size
generations; the best choice is determined by the way the application uses
memory as well as user requirements. For this reason the JVM's default GC
choices may not be optimal, and may be overridden by the user in the form of
command line options below.
Measurement
Throughput and footprint are best measured using metrics particular to
the application. For example, throughput of a web server may be
tested using a client load generator, while footprint of the server might be
measured on the Solaris operating environment using the pmap command. On the other hand,
pauses due to GC are easily estimated by inspecting the diagnostic output
of the JVM itself.
The command line argument -verbose:gc prints information at every
collection. For example, here is output from a large server
application:
[GC
325407K->83000K(776768K), 0.2300771 secs]
[GC
325816K->83372K(776768K), 0.2454258 secs]
[Full GC 267628K->83769K(776768K),
1.8479984 secs]
Here we see two minor collections and one major
one. The numbers before and after the arrow indicate the combined size of
live objects before and after the GC. (After minor collections the count
includes objects that aren't necessarily alive but can't be reclaimed, either
because they are directly alive, or because they are within or referenced from
the old generation.) The number in parenthesis is the total available
space, which is the total heap minus one of the survivor spaces.
Sizing the generations
A number of parameters affect generation size. This
diagram illustrates the ones most important to tuning the 1.3.1 JVM.
Many parameters are actually ratios x:y, and these are
depicted with black (representing x) and grey (representing
y) size bars: 
Total heap
Since collections occur when generations fill up, throughput
is inversely proprotional to the amount of memory available. Total
available memory is the most important knob affecting GC performance.
By default, the JVM grows or shrinks the heap at each collection to try
to keep the proportion of free space to living objects at each collection within
a specific range. This target range is set as a percentage by the
parameters -XX:MinHeapFreeRatio=<minimum>
and -XX:MaxHeapFreeRatio=<maximum>,
and the total size is bounded below by -Xms and above by -Xmx . The default parameters for
the Solaris (SPARC Platform Edition) operating environment are shown in this
table:
-XX:MinFreeHeapRatio=
|
40
|
-XX:MaxHeapFreeRatio=
|
70
|
-Xms
|
3584k
|
-Xmx
|
64m
|
Large server
apps often experience two problems with these defaults. One is slow
startup, because the initial heap is small and must be resized over many major
collections. A more pressing problem is that the default maximum heap size
is unreasonably small for most server applications. The rules of thumb for
server applications are:
Unless you have
problems with pauses, try granting as much memory as possible to the
JVM. The default size (64MB) is often too small.
Setting
-Xms and -Xmx to the same value increases
predictability by removing the most important sizing decision from the
JVM. On the other hand, the JVM can't compensate if you make a poor
choice.
Be sure to increase the memory as you increase the number
of processors, since allocation can be parallelized, but GC is not
parallel.
|
The young generation
The second most influential knob is the proportion
of the heap dedicated to the young generation. The bigger the young
generation, the less often minor collections occur. However, for a bounded
heap size a larger young generation implies a smaller old generation, which will
increase the frequency of major collections. The optimal choice depends on
the lifetime distribution of the application.
By default, the young
generation size is controlled by NewRatio. For example, setting -XX:NewRatio=3 means that the ratio
between the young and old generation is 1:3; in other words, the combined size
of eden and the survivor spaces will be one fourth of the heap.
The
parameters NewSize and MaxNewSize bound the young generation
size below and above. Setting these equal to one another fixes the young
generation, just as setting -Xms
and -Xmx equal fixes the total
heap size. This is useful for tuning the young generation at a finer
granularity than the integral multiples allowed by NewRatio.
Because the
young generation uses copying collection, enough free memory must be reserved in
the old generation to ensure that a minor collection can complete. In the
worst case, this reserved memory is equal to the size of eden plus the objects
in non-empty survivor space. When there isn't enough memory available in
the old generation for this worst case, a major collection will occur
instead. This policy is fine for small applications, because the memory
reserved in the old generation is typically only virtually committed but not
actually used. But for applications needing the largest possible heap, an
eden bigger than half the virtually committed size of the heap is useless: only
major collections would occur.
If desired, the parameter SurvivorRatio can be used to tune the
size of the survivor spaces, but this is often not as important to
performance. For example, -XX:SurvivorRatio=6 sets the ratio
between each survivor space and eden to be 1:6; in other words, each survivor
space will be one eighth of the young generation (not one seventh,
because there are two survivor spaces).
If survivor spaces are too
small, copying collection overflows directly into the old generation. If
survivor spaces are too large, they will be uselessly empty. At each
garbage collection the JVM chooses a threshold number of times an object can be
copied before it is tenured. This threshold is chosen to keep the
survivors half full. (For the intrepid, a 1.3.1 option -XX:+PrintTenuringDistribution can be
used to show this threshold and the ages of objects in the new generation.
It is also useful for observing the lifetime distribution of an
application.)
Here are the default values for the Solaris (SPARC Platform
Edition) operating environment:
NewRatio
|
2 (client JVM:
8)
|
NewSize
|
2172k
|
MaxNewSize
|
32m
|
SurvivorRatio
|
25
|
The rules of
thumb for server applications are:
First decide the
total amount of memory you can afford to give the JVM. Then graph
your own performance metric against young generation sizes to find the
best setting.
Unless you find problems with excessive major
collection or pause times, grant plenty of memory to the young
generation. The default MaxNewSize (32MB) is generally
too small.
Increasing the young generation becomes
counterproductive at half the total heap or less.
Be sure to
increase the young generation as you increase the number of processors,
since allocation can be parallelized, but GC is not
parallel.
|
Other considerations
For most applications the permanent generation is
not relevant to GC performance. However, some applications dynamically
generate and load many classes. For instance, some implementations of JSPs
do this. If necessary, the maximum permanent generation size can be
increased with MaxPermSize.
Some applications
interact with garbage collection by using finalization and weak/soft/phantom
references. These features can create performance artifacts at a
Java-programming-language level; an example is relying on finalization to close
file descriptors, which makes an external resource (descriptors) dependent on GC
promptness. Relying on GC to manage resources other than memory is almost
always a bad idea.
Another way apps can interact with garbage
collection is by invoking GCs explicitly, such as through the System.gc() call. These calls force
major collection, and inhibit scalability on large systems. The
performance impact of explicit GCs can be measured using the unsupported flag
-XX:+DisableExplicitGC.
One of
the most commonly encountered uses of explicit GC occurs with RMI's distributed
garbage collection (DGC). Applications using RMI refer to objects in other
JVMs. Garbage can't be collected in these distributed applications without
occasional local collection, so RMI forces periodic full collection. The
frequency of these collections can be controlled with properties. For
example,
java
-Dsun.rmi.dgc.client.gcInterval=3600000
-Dsun.rmi.dgc.server.gcInterval=3600000 ...
specifies explicit collection once per hour instead of the default rate of
once per minute. However, this may also cause some objects to take much longer
to be reclaimed. These properties can be set as high as Long.MAX_VALUE to
make the time between explicit collections effectively infinite, if there is no
desire for an upper bound on the timeliness of DGC activity.
The Solaris 8
operating environment supports an alternate version of libthread that binds
threads to LWPs directly; this may help avoid starvation of the finalization
thread. To try this, set the environment variable LD_LIBRARY_PATH to include /usr/lib/lwp before launching the
JVM.
Soft references are cleared less aggressively in the server JVM than
the client. The rate of clearing can be slowed by increasing a parameter
in this way: -XX:SoftRefLRUPolicyMSPerMB=10000. The
default is value 1000, or one second per megabyte.
For large dedicated
systems, there are other special options available
to boost performance.
Conclusion
Garbage collection can become a bottleneck in highly parallel
systems. By understanding how GC works, it is possible to use a variety of
command line options to minimize that impact.
The demands of large
servers are being met with larger hardware configurations that ever
before. For these systems, the 1.4 line of JVMs will provide additional
solutions, including a 64 bit address space for even larger generations and
concurrent collection to hide the pauses associated with major collection.
As used on the web site, the terms "Java Virtual Machine and
"JVM" mean a virtual machine for the Java platform.