Garbage Collection in Java
Understanding Java garbage collection
Understanding Java garbage collection begins with the heap and the stack. If you aren't familiar with Java Heap vs Stack be sure to read this first.
What is garbage collection in Java?
Garbage collection is the automatic process for handling memory management. Memory management is the dynamic process of allocating/deallocating memory to/from your application.
Memory management is not unique to Java. Every programming language must provide a means of requesting memory from the underlying OS and releasing memory when it's no longer needed. This allows programs to run without consuming too much memory or causing applications to crash.
Some languages like C and C++ require a manual process for allocating/deallocating memory. Programmers must explicitly allocate memory when it's needed (for example with malloc or new) and release it (with free or delete, often inside a destructor) when objects are no longer being referenced by the application.
While this manual process gives developers more control over memory management, it's ultimately more tedious, repetitive, and vulnerable to memory leaks and other errors. For these reasons, programming languages like Java implement an automatic garbage collection mechanism for managing memory. This mechanism operates independently from the application code allowing developers to write code without worrying about memory management.
Why is garbage collection important in Java?
Garbage collection (GC) is important because it automatically handles memory management in Java. This makes Java memory-efficient without the developer's intervention. As a developer, you can write Java code without worrying about memory allocation. GC handles all of this for you.
While GC "magically" handles memory management for the developer, it's still important to understand what's going on under the hood. If you've ever experienced an OutOfMemoryError then you know GC won't save you every time. Furthermore, there are different implementations of GC and various configurations for optimizing this process based on the nature of your application.
Having a good grasp on the GC mechanism will help you write better code and avoid performance issues related to memory management.
How garbage collection works in Java
Allocating memory to the JVM
Java applications run inside the JVM. The OS running your Java program allocates a configurable amount of heap space memory to the JVM for running your applications.
-Xms2G -Xmx5G
The above are two sample arguments for configuring min and max heap space for the JVM.
Since the JVM has a predetermined allotment of memory for running applications, it doesn't need to synchronize with the underlying OS for memory resources. Instead, it can utilize this same memory space and free up unused memory without deleting or requesting more from the OS. This is fundamental to what makes Java memory-efficient as a programming language.
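To see how these heap settings look from inside a running program, here is a minimal sketch using the standard java.lang.Runtime API. The class name HeapInfo and its helper methods are made up for illustration. maxMemory() roughly corresponds to -Xmx, though the exact number reported can vary a little between collectors.

```java
final class HeapInfo {
    /** Upper bound the JVM will try to use for the heap (roughly -Xmx). */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Heap memory currently reserved by the JVM (starts near -Xms and may grow). */
    static long committedHeapBytes() {
        return Runtime.getRuntime().totalMemory();
    }

    /** Part of the committed heap occupied by objects, live or not yet collected. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long mb = 1024 * 1024;
        System.out.println("max heap (MB):       " + maxHeapBytes() / mb);
        System.out.println("committed heap (MB): " + committedHeapBytes() / mb);
        System.out.println("used heap (MB):      " + usedHeapBytes() / mb);
    }
}
```

Running it with something like java -Xms2G -Xmx5G HeapInfo.java (Java 11 or later) should show the max and committed values moving with the flags.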
Application execution: Java Heap vs Stack
When your application starts up, classes and methods are loaded into the heap space. As your program executes, objects are created in the heap space and referenced by the stack.
While the heap represents a global space of memory that exists throughout the duration of your application, the stack only exists during method or thread execution. Once a method finishes execution, the memory used is released from the stack via a relatively simple LIFO (last in, first out) approach.
This simple approach is sufficient for automatically handling memory living in the stack (local variables and references to objects stored in the heap) but what about actual objects referenced in the heap?
Since the heap space is much larger and more permanent, it must be managed through more sophisticated means. This is where GC comes into play. GC automatically handles the memory used by the heap space.
Confused? I highly recommend checking out Java Heap vs Stack before continuing any further...
How garbage collection works
With the understanding that memory management for the heap is automatically handled via GC, it's time to understand exactly how GC works in Java.
Identifying unused objects
GC works by identifying or "marking" objects that are no longer used by the application. An unused object is anything that can't be reached from a GC root. GC roots include:
1) Objects referenced in the stack
2) Objects referenced by class static attributes
3) Objects referenced by constants
4) Objects referenced by JNI
These special "root" objects give GC a starting point to search for "living" objects. Any object that isn't referenced by one of these root objects is considered "unused". Take the following code snippet as an example:
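public class MyClass {
    // Static field: reachable from a GC root for as long as the class stays loaded
    public static final MyClass myC = new MyClass("final");

    public MyClass(String name){}

    public static void testMyClass(){
        MyClass c = new MyClass("hello");
        c = null; // the "hello" instance is no longer referenced by anything
    }
}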
While setting c = null drops the local reference c, the static member myC won't be garbage collected after testMyClass() executes.
Since the instance created by new MyClass("hello") lives in the heap and was referenced only by the local variable c on the stack, it becomes eligible for garbage collection.
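If you want to watch reachability in action, the following is a small sketch using java.lang.ref.WeakReference, which lets you check on an object without keeping it alive. The class ReachabilityDemo and its members are hypothetical names. Keep in mind that System.gc() is only a request, so the second printed line is typical rather than guaranteed.

```java
import java.lang.ref.WeakReference;

final class ReachabilityDemo {
    // Static field: reachable from a GC root while the class is loaded.
    static final Object ROOTED = new Object();

    /** Creates an object, tracks it weakly, then drops the only strong reference. */
    static WeakReference<Object> createAndDrop() {
        Object local = new Object();
        WeakReference<Object> tracker = new WeakReference<>(local);
        local = null; // no strong reference remains, so the object is eligible for GC
        return tracker;
    }

    public static void main(String[] args) {
        WeakReference<Object> rooted = new WeakReference<>(ROOTED);
        WeakReference<Object> dropped = createAndDrop();

        System.gc(); // a request only; the JVM decides whether and when to collect

        System.out.println("rooted still reachable:  " + (rooted.get() != null));
        System.out.println("dropped already cleared: " + (dropped.get() == null));
    }
}
```

The rooted object can never be cleared here because the static field keeps a strong reference to it, which mirrors how myC behaves in the example above.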
Garbage Collection Algorithms
A combination of the following GC algorithms is used by modern garbage collectors in Java:
The Mark-Sweep Algorithm
GC algorithms follow a high-level "mark-sweep" phase concept. In the "mark phase", GC identifies objects that are no longer being referenced by the application. In the "sweep phase", the memory addresses used for these unreferenced objects are freed up by the JVM. This essentially "deletes" the unreferenced objects and makes memory available to the heap.
While the "mark-sweep" algorithm is fairly simple, it presents problems related to fragmentation. When memory is released, it becomes available in many fragments or segments. Since memory is allocated in contiguous blocks, this makes it harder to assign address space and wastes unused memory.
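The two phases can be sketched with a toy object graph. This is not how the JVM implements it internally; Node, mark, sweep and demo are invented names, and the "heap" is just a list. It does show one useful property: objects that only reference each other (a cycle) are still collected, because nothing reaches them from a root.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

final class MarkSweepSketch {
    /** A toy heap object that may reference other toy objects. */
    static final class Node {
        final String name;
        final List<Node> refs = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    /** Mark phase: collect everything reachable from the roots. */
    static Set<Node> mark(List<Node> roots) {
        Set<Node> marked = new HashSet<>();
        Deque<Node> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            Node n = pending.pop();
            if (marked.add(n)) {        // first visit: follow its references
                pending.addAll(n.refs);
            }
        }
        return marked;
    }

    /** Sweep phase: remove every unmarked entry from the heap; returns the freed names. */
    static List<String> sweep(List<Node> heap, Set<Node> marked) {
        List<String> freed = new ArrayList<>();
        for (Iterator<Node> it = heap.iterator(); it.hasNext(); ) {
            Node n = it.next();
            if (!marked.contains(n)) {
                freed.add(n.name);
                it.remove();
            }
        }
        return freed;
    }

    static List<String> demo() {
        Node a = new Node("a"), b = new Node("b"), c = new Node("c"), d = new Node("d");
        a.refs.add(b); // a -> b
        c.refs.add(d); // c -> d
        d.refs.add(c); // d -> c: a cycle that no root can reach
        List<Node> heap = new ArrayList<>(List.of(a, b, c, d));
        return sweep(heap, mark(List.of(a))); // a is the only root
    }

    public static void main(String[] args) {
        System.out.println("freed: " + demo()); // freed: [c, d]
    }
}
```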
The Copy Algorithm
Unlike the mark-sweep algorithm, the copy algorithm avoids memory fragmentation by copying all of the living objects to a new space. This is more efficient as living memory can exist in a contiguous block. The downside is that memory is used less efficiently, as space must be reserved as a "destination" for copied data.
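Here is a rough sketch of the copy idea, with arrays standing in for memory spaces. The names CopySketch and evacuate are hypothetical, and the live flags are assumed to come from an earlier marking step.

```java
import java.util.Arrays;

final class CopySketch {
    /**
     * Copies live entries from fromSpace into a fresh to-space of the same size.
     * Live objects end up contiguous at the start; the remainder stays free (null).
     */
    static String[] evacuate(String[] fromSpace, boolean[] live) {
        String[] toSpace = new String[fromSpace.length]; // the reserved destination
        int next = 0; // next free slot in to-space
        for (int i = 0; i < fromSpace.length; i++) {
            if (live[i]) {
                toSpace[next++] = fromSpace[i];
            }
        }
        return toSpace;
    }

    public static void main(String[] args) {
        String[] from = {"a", "b", "c", "d", "e"};
        boolean[] live = {true, false, true, false, true};
        System.out.println(Arrays.toString(evacuate(from, live)));
        // [a, c, e, null, null]
    }
}
```

Notice that the destination array has to exist before any copying starts. That reserved space is the memory cost described above.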
The Mark-Compact Algorithm
This is a hybrid of the traditional mark-sweep and copy algorithms. Mark-compact identifies unused objects the same way mark-sweep does. Unlike mark-sweep, mark-compact moves living objects to one end of the memory space and reclaims the leftover space.
This is similar to the copy algorithm in that living memory is isolated from unused memory. Unlike the copy algorithm, mark-compact moves memory instead of copying it.
While mark-compact provides a nice hybrid approach to mark-sweep and copy, it takes a performance hit with frequent manipulation and sorting of the memory space. For these reasons, generations and regions are used with GC.
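For comparison, a mark-compact pass can be sketched as sliding live entries toward one end of the same array, with no second space involved. As before, this is only an illustration: CompactSketch and compact are invented names and the live flags are assumed to come from a prior mark phase.

```java
import java.util.Arrays;

final class CompactSketch {
    /** Slides live entries toward index 0 in place; returns the number of live entries. */
    static int compact(String[] heap, boolean[] live) {
        int free = 0; // boundary: everything before it is live and contiguous
        for (int i = 0; i < heap.length; i++) {
            if (live[i]) {
                heap[free++] = heap[i];
            }
        }
        Arrays.fill(heap, free, heap.length, null); // reclaim the leftover space
        return free;
    }

    public static void main(String[] args) {
        String[] heap = {"a", "b", "c", "d", "e"};
        boolean[] live = {false, true, false, true, true};
        int liveCount = compact(heap, live);
        System.out.println(liveCount + " live: " + Arrays.toString(heap));
        // 3 live: [b, d, e, null, null]
    }
}
```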
Generations
GC algorithms have their pros and cons. While "mark sweep" is simple, it introduces fragmentation. While copy eliminates fragmentation, it wastes memory. While mark-compact serves as a nice hybrid, it can be inefficient.
To leverage these tradeoffs, heap space is divided into blocks called "generations". These blocks categorize objects based on their age or how long they've been in the heap.
By categorizing objects as "young" vs "old", GC can run different algorithms best suited for the lifespan of memory...
The Young Generation
The young generation of the heap holds all newly created objects. This space is further divided into the Eden and Survivor space:
The Eden Space
Empirical data and research show that the vast majority of objects (a figure of around 98% is often cited) are short-lived. For this reason, most newly created objects are allocated to the Eden space. When the Eden space becomes full, a minor GC takes place. During a minor GC event, objects surviving in the Eden space are moved to the survivor space.
The Survivor Space (S0, S1)
The survivor space itself exists in two spaces (S0, S1). This is sometimes referred to as the "to" and "from" space but the important thing is that two separate spaces exist within the survivor space.
When a minor GC occurs, surviving objects from the eden space are moved to the S0 space. The next time GC occurs, surviving objects from eden and S0 are copied into S1. The next time GC occurs, surviving objects now living in S1 are copied (along with Eden objects) to the S0 space.
Why the survivor space?
Why is this so complicated? Why can't the Eden space graduate objects to the old generation if they survive GC? It turns out the survivor spaces act as a buffer between the young and old generations. While many objects may survive a single minor GC event, they likely won't last more than a few cycles. The survivor space gives objects the "opportunity" to die, as it can hold anything surviving up to 15 minor GC cycles (the maximum tenuring threshold in HotSpot) before promotion.
Why two spaces (S0,S1)?
Remember the copy algorithm? By having two spaces in the survivor space, one space can be used as a destination for everything copied over from the eden and other survivor space. This eliminates the memory fragmentation issue and works well for more short-lived objects found in the younger generation.
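The back and forth between Eden, the two survivor spaces and the old generation can be simulated in a few lines. This is a toy model under stated assumptions, not HotSpot's implementation: YoungGenSketch, Obj and the tenuringThreshold parameter are hypothetical, spaces are plain lists, and liveness is supplied by the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class YoungGenSketch {
    /** Toy object that remembers how many minor GCs it has survived. */
    static final class Obj {
        final String name;
        int age;
        Obj(String name) { this.name = name; }
        @Override public String toString() { return name; }
    }

    final List<Obj> eden = new ArrayList<>();
    final List<Obj> old = new ArrayList<>();
    List<Obj> from = new ArrayList<>(); // survivor space currently holding objects
    List<Obj> to = new ArrayList<>();   // empty survivor space, the copy destination
    final int tenuringThreshold;        // promotion age assumed for this sketch

    YoungGenSketch(int tenuringThreshold) { this.tenuringThreshold = tenuringThreshold; }

    void allocate(String name) { eden.add(new Obj(name)); }

    /** One minor GC: copy survivors of Eden and "from" into "to", promote, then swap roles. */
    void minorGc(Predicate<Obj> isLive) {
        List<Obj> candidates = new ArrayList<>(eden);
        candidates.addAll(from);
        for (Obj o : candidates) {
            if (!isLive.test(o)) continue; // dead objects are simply never copied
            o.age++;
            if (o.age >= tenuringThreshold) old.add(o); else to.add(o);
        }
        eden.clear();
        from.clear();
        List<Obj> emptied = from; // the space just evacuated becomes the next destination
        from = to;
        to = emptied;
    }

    static String demo() {
        YoungGenSketch heap = new YoungGenSketch(2);
        heap.allocate("long");
        heap.allocate("short");
        heap.minorGc(o -> true);                  // both survive their first GC
        heap.minorGc(o -> o.name.equals("long")); // "short" dies, "long" reaches age 2
        return "old=" + heap.old + ", survivor=" + heap.from;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // old=[long], survivor=[]
    }
}
```

One survivor space is always empty after a collection, so it is ready to act as the copy destination next time.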
The Old Generation
If an object survives enough rounds of minor GC, it will be promoted to the old generation. Major garbage collection events are performed on the old generation.
Major GC happens less often and adopts a "mark-sweep" algorithm for cleanup. This is better for the older generation space as most objects are long lived. This minimizes the number of objects needing to be "marked" and resorted as per the "mark-sweep" approach.
Surviving enough GC events isn't the only path to the old generation. Very large objects, and objects that can't fit within the Eden/survivor spaces, are also placed in the old generation.
Regions
Older implementations of GC represent the heap space as one contiguous block of memory. More modern implementations, available in Java 8 and the default since Java 9, take a "garbage first" (G1) approach to GC.
With the G1 approach, the heap space is divided into equally sized heap regions. Each region can represent a different generation at any given time. While a region may sometimes represent a young generation (eden, survivor space), it may represent an old generation after being reallocated.
The advantage of regions
By subdividing the heap space into regions, multiple threads can scan regions in parallel. Generations aren't confined to a predetermined size and can dynamically change (young space vs old space) based on the algorithm's needs.
Regions also work to identify spaces with the most garbage (fewest live objects) during global "mark" phases. These regions are "evacuated" by copying any surviving objects to a new single region. This compacts and frees up memory in parallel ultimately leading to faster throughput and smaller pause times.
All GCs are not created equal...
Through the use of regions, you can see how the G1 collector performs differently than alternatives. So what are these alternatives?
Serial GC: This is the most basic implementation of GC. Utilizing a single thread, the application is paused and GC is performed. Once GC has run and compaction is complete, the application thread resumes.
Serial GC is the most basic form of GC and is not recommended for use in environments where low latency is emphasized. You can use serial GC by specifying -XX:+UseSerialGC.
Parallel GC: Also known as "throughput collector", parallel GC works by utilizing multiple threads to perform young generation collection. A single thread is used for major GC in the old generation.
You can specify parallel GC with the -XX:+UseParallelGC option.
Parallel Old GC: Very similar to Parallel GC except that multiple threads are used across both the young and old generations. You can specify Parallel Old GC with the -XX:+UseParallelOldGC option.
Concurrent Mark Sweep (CMS) GC: Unlike the already mentioned alternatives, CMS works concurrently with the application execution. Multiple threads work alongside application threads to minimize the pause time in between GC cycles.
While CMS sounds great, it only pays off if you have the CPU power to throw at it. CMS requires more CPU and no compaction is performed with this option.
You can enable CMS with the -XX:+UseConcMarkSweepGC option. Note that CMS was deprecated in JDK 9 and removed in JDK 14.
G1 (Garbage first): This is the default GC used in modern Java programs. Since G1 prioritizes GC on regions having the most garbage (fewest live objects), it's considered "garbage first".
Epsilon: This GC was released as an experimental feature of JDK 11. The reason for using Epsilon is essentially to not use a memory reclamation mechanism at all. Since memory reclamation can be costly to performance, the Epsilon GC is meant for ultra-low-latency applications where memory is being handled well by the developer.
You can enable Epsilon with the -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC options.
Shenandoah: Shenandoah is relatively new (released as part of JDK 12). It's basically the newer/more improved version of today's G1 collector and can evacuate its heap concurrently with the application. While more CPU intensive than G1, Shenandoah can be thought of as a "better" version of G1.
You can enable Shenandoah with the -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC options.
ZGC: Released as an experimental feature with JDK 11 and improved with JDK 12, ZGC allows the application to continue running while performing almost all GC operations. ZGC prides itself on the lowest pause times and points to the future of GC in Java.
You can enable ZGC via the -XX:+UnlockExperimentalVMOptions -XX:+UseZGC option.
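If you're not sure which collector your JVM actually picked, you can ask it at runtime. The sketch below uses the standard java.lang.management API; the class name ActiveCollectors is made up. The exact bean names are JVM-specific, so treat any names you see (for example "G1 Young Generation" under G1) as informational rather than something to depend on.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

final class ActiveCollectors {
    /** Returns the names of the collectors the running JVM has registered. */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("collectors in use: " + collectorNames());
    }
}
```

Try running it with different flags from the list above, such as -XX:+UseSerialGC or -XX:+UseParallelGC, and compare the output.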
When Garbage Collection is Done in Java
GC is performed automatically in Java whenever the Eden space becomes full. If there isn't enough free space around to allocate an object, GC will also be performed.
Minor GC events occur every time the Eden space reaches capacity. A major GC event occurs whenever younger generations are full or when the older generation becomes full.
Major GC also occurs whenever you force garbage collection programmatically...
Java Force Garbage Collection
While it's highly recommended that you don't manually perform GC, you can programmatically call GC with:
System.gc();
Runtime.getRuntime().gc();
Remember that GC is non-deterministic. This means there is really no way of telling when GC will run. While you can manually trigger GC with the above commands, it's more of a "request" for GC to run and doesn't guarantee execution.
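To get a feel for the "request" nature of these calls, here is a small sketch that creates some garbage, asks for a collection and compares heap usage before and after. GcRequest and its methods are illustrative names. The numbers will differ from run to run, and if the JVM is started with -XX:+DisableExplicitGC the call does nothing at all.

```java
final class GcRequest {
    /** Bytes of the committed heap currently occupied by objects. */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Creates short-lived garbage, requests a GC, and returns {before, after} used bytes. */
    static long[] allocateThenRequestGc() {
        for (int i = 0; i < 1_000; i++) {
            byte[] garbage = new byte[64 * 1024]; // unreachable once the iteration ends
        }
        long before = usedBytes();
        System.gc(); // a hint: the JVM may collect now, later, or not at all
        long after = usedBytes();
        return new long[] {before, after};
    }

    public static void main(String[] args) {
        long[] usage = allocateThenRequestGc();
        System.out.println("used before request (KB): " + usage[0] / 1024);
        System.out.println("used after request (KB):  " + usage[1] / 1024);
    }
}
```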
Monitoring Garbage Collection
The HotSpot JVM comes with some out-of-the-box tools for monitoring the JVM (GC included). Using popular tools like jstat and jconsole you can easily monitor details behind GC...
Using jstat
jstat is simply a CLI tool for monitoring the JVM. It will give you real time information regarding the performance of your application.
jstat can be rather overwhelming. It includes lots of information outside of GC. To use jstat solely for GC monitoring you can run something like this:
jstat -gc 46562 1000
Running jstat with the -gc option gives us only information related to GC. 46562 is the process id for your running application. If your app is up and running, you can easily find this by running the jps command. The 1000 argument specifies how frequently to report, in milliseconds. In this case, data will be displayed every second and look something like this:
S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT CGC CGCT GCT
0.0 8192.0 0.0 8192.0 190464.0 12288.0 116736.0 66168.6 76124.0 73926.6 9548.0 8751.2 42 0.339 0 0.000 6 0.008 0.346
0.0 8192.0 0.0 8192.0 190464.0 12288.0 116736.0 66168.6 76124.0 73926.6 9548.0 8751.2 42 0.339 0 0.000 6 0.008 0.346
0.0 8192.0 0.0 8192.0 190464.0 13312.0 116736.0 66168.6 76124.0 73926.6 9548.0 8751.2 42 0.339 0 0.000 6 0.008 0.346
0.0 8192.0 0.0 8192.0 190464.0 13312.0 116736.0 66168.6 76124.0 73926.6 9548.0 8751.2 42 0.339 0 0.000 6 0.008 0.346
0.0 8192.0 0.0 8192.0 190464.0 13312.0 116736.0 66168.6 76124.0 73926.6 9548.0 8751.2 42 0.339 0 0.000 6 0.008 0.346
0.0 8192.0 0.0 8192.0 190464.0 12288.0 116736.0 66168.6 76124.0 73926.6 9548.0 8751.2 42 0.339 0 0.000 6 0.008 0.346
0.0 8192.0 0.0 8192.0 190464.0 12288.0 116736.0 66168.6 76124.0 73926.6 9548.0 8751.2 42 0.339 0 0.000 6 0.008 0.346
0.0 8192.0 0.0 8192.0 190464.0 13312.0 116736.0 66168.6 76124.0 73926.6 9548.0 8751.2 42 0.339 0 0.000 6 0.008 0.346
0.0 8192.0 0.0 8192.0 190464.0 13312.0 116736.0 66168.6 76124.0 73926.6 9548.0 8751.2 42 0.339 0 0.000 6 0.008 0.346
0.0 8192.0 0.0 8192.0 190464.0 13312.0 116736.0 66168.6 76124.0 73926.6 9548.0 8751.2 42 0.339 0 0.000 6 0.008 0.346
EC: Displays current size of Eden area (KB)
S0C: Displays current size of S0 area (KB)
S1C: Displays current size of S1 area (KB)
OC: Displays current size of old area (KB)
EU: Displays current usage of Eden area (KB)
S0U: Displays current usage of S0 area (KB)
S1U: Displays current usage of S1 area (KB)
OU: Displays current usage of old area (KB)
YGC: The number of GC events that have occurred in the young area
FGC: The number of full GC events that have occurred for the total application
GCT: The total accumulated time for GC operations
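Raw jstat output is easier to reason about once each value is paired with its column name. The following sketch does that for a single header/row pair and computes how full a space is. JstatLine, parse and percentUsed are hypothetical helpers, not part of jstat, and the sketch assumes every column is numeric (some JVM configurations print "-" for columns that don't apply, which this code would reject).

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class JstatLine {
    /** Pairs each column name in the header with the value at the same position in the row. */
    static Map<String, Double> parse(String header, String row) {
        String[] names = header.trim().split("\\s+");
        String[] values = row.trim().split("\\s+");
        if (names.length != values.length) {
            throw new IllegalArgumentException("header and row have different column counts");
        }
        Map<String, Double> columns = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            columns.put(names[i], Double.parseDouble(values[i]));
        }
        return columns;
    }

    /** Percentage of a space in use, e.g. used = "EU" and capacity = "EC" for Eden. */
    static double percentUsed(Map<String, Double> columns, String used, String capacity) {
        return 100.0 * columns.get(used) / columns.get(capacity);
    }

    public static void main(String[] args) {
        // Header and first row copied from the sample output above.
        String header = "S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT CGC CGCT GCT";
        String row = "0.0 8192.0 0.0 8192.0 190464.0 12288.0 116736.0 66168.6 76124.0 73926.6 "
                   + "9548.0 8751.2 42 0.339 0 0.000 6 0.008 0.346";
        Map<String, Double> columns = parse(header, row);
        System.out.println("Eden % used: " + percentUsed(columns, "EU", "EC"));
        System.out.println("Old % used:  " + percentUsed(columns, "OU", "OC"));
        System.out.println("Young GCs:   " + columns.get("YGC").intValue());
    }
}
```

For the sample row, Eden works out to roughly 6.5% full (12288.0 of 190464.0 KB), which fits with the young GC count staying at 42 across all ten samples.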
Using jconsole
jconsole is a built-in GUI tool you can use with the JVM. It provides basic visualizations for CPU usage, thread count, memory etc. To launch jconsole, simply run:
jconsole
This will launch a UI with a list of running Java processes you can connect to. After selecting your process, you should see an overview of heap memory usage, threads, classes and CPU usage.
There is even a Memory tab where you can manually perform GC to see the performance implications.
Garbage Collection Interview Questions in Java
Garbage collection is a complicated subject to cover in a job interview. While it probably won't be expected that you memorize every implementation etc, higher level questions like the following are things you should be able to answer...
Q: Explain garbage collection mechanism in Java
A: A configurable amount of memory is allocated to the JVM's heap space. This is a global memory space used to store objects, classes and methods over the duration of the program execution. Heap space is divided into generations. Minor GC is performed on the young generation, where most objects live for a short amount of time. This GC process is triggered when the Eden space is filled or when there isn't enough free space to allocate a given object. Major GC is performed on the old generation, which holds objects that have survived enough minor GC cycles. GC implementations do vary. Some collectors do most of their work concurrently with the application, while others pause application threads while GC runs.
Q: Can we call the garbage collector forcefully in Java?
A: Yes, however, this is not recommended. The reason is that GC is non-deterministic. Calling System.gc(); is one way to "request" GC in Java.
Conclusion
Garbage collection automates memory management in Java. Memory management isn't unique to Java. Every programming language must manage the allocation/deallocation of memory.
The Java Garbage collector utilizes multiple GC algorithms to efficiently manage memory in Java. By dividing the heap into different regions based on object lifespan, GC is able to apply the most efficient algorithms to memory management based on the time an object has existed in the heap.
Not all GC implementations are created equal. Different implementations exist across different JVMs. You can specify which GC to use as a JVM argument.
Monitoring GC in Java can be done using built in tools like jstat and jconsole.
Your thoughts?
wow. amazing summary. this really helps ..thx
super helpful!
amazing read. thanks stackchief!
so the whole concept is that GC works with different "generations" to apply the best algo for that generation.
Since most objects are short lived, copy based approach can be adopted in young generation and "mark-compact" approach can be taken in old generation.
Most objects in old generation got there for a reason. So they will still be there after each collection cycle. So old generation can use "mark-sweep" since it doesn't have to "mark" that many objects at the end of the day. Similarly, in younger generation since most objects die..the few ones that are left over can be copied without jeopardizing performance at all