Definitions & Algorithms

Back to Contents
  • Definitions
  • Algorithms

  • Definitions

    I use the follow definitions to simplify the language in HeapRoots and the documentation. I have redefined terms where appropriate or made them specific to this context. (e.g. see 'Object' and 'Type')

    This line of heapdump below defines an object with name 'java/lang/String' at address 0x1018f030 with size 32 bytes. It references one object at address 0x1018f050.

    0x1018f030 [32] java/lang/String
         0x1018f050 
    Heap
    See Intro - I use 'heap' to mean a 'JVM garbage collected heap'

    Graph
    A graph is a term commonly used in computer science to descibe a series of points ('nodes') connected by lines ('edges'). A road network is a form of graph. In this case, the nodes are the Objects and the edges are the references between Objects.

    Object
    I define an 'Object' to be any item in the heap. This could actually be a Java object, Java class (e.g. class java/lang/String), array, compiled method, symbol etc. depending on the virtual machine. I call every object either a 'root' or a 'non-root'. Each object has a name which is the string which represents it in the heapdump and I (very loosely) say the objects of the same name are of the same type. e.g. two primitive arrays are of the same 'type'.

    Reference
    If an object A has a field (say defined as 'Object obj; or illustrated above' in the definition of A's class and the field is set to B (i.e. 'obj = B;' ) then A has a reference to B.

    Child
    If an object A has a reference to object B then B is a child of A.

    Parent
    If an object A has a reference to object B then A is a parent of B. Note that with this definition objects can have many parents. e.g. B above also has K as a parent. In HeapRoots, every object has one (if any) chosen 'parent-owner'. This choice is made (fairly arbitrarily) for calculating statistics about each object during processing. See roots, root-owner below.

    Cycle
    A cycle is a chain of references starting and ending at the same object (e.g. K -> L -> K above). A reference from an object to itself (e.g. D) above could be considered as a trivial cycle. I don't call this a cycle since its of no practical use to HeapRoots.

    Reachability
    I define the reachability of an object as either the set of all objects reachable from the object, or the size of this set depending on the context. e.g. object I has a reachability of {J} but J can't reach anything so its reachability is zero. Purely for convention, I don't include an object in its own reachability set (even if it has a reference to itself or is in a cycle).

    Root
    Generally roots are object's which are interesting to explore from.

    In JVM terms, this could mean a reference on a Java thread stack - e.g. during garbage collection mark. In heap dumps, the obvious choices are objects which have no parents (source nodes in the graph). i.e. no other object references it, object A above for example. For graphs without cycles, every object is reachable (indirectly at least) from one or more roots.

    However heap's contain a vast number of cycles, leading to cyclic structures without any true roots. Since these structures aren't reachable from any root, it helps to identify some objects as 'root's even though they do have parents. For example, object K can reach many objects (all of those that A can, plus L) and so I call this a 'root' also. So I make these definitions:-

    Root - an object which is a good starting point for graph traversal. In HeapRoots, this will be either one of the following:-

    Pure Root - an object which has no parents. (e.g. object A)

    Artificial Root - an object chosen as a good starting point for graph traversal. This is usually (e.g. object K) because it is part of a cyclic structure which isn't reachable from a 'Pure Root'. Artificially chosen roots are needed to provide a formatted textual representation of parts of the graph not reachable from 'Pure Roots' in the 'Dump from Roots' feature.

    Strong Reference,Soft Link
    In addition to normal references (called 'Strong' references), Java has Soft, Weak, and Phantom references. These are all weaker types of reference than the normal type and have different behaviour with respect to Garbage Collection. e.g. a an object which is 'softly reachable' (i.e. not reachable by following only normal references but reachable when allowed to follow Soft references - see Java API documentation on java.lang.ref package) can still be garbage collected. These references can be used to implement memory friendly caches.

    I don't distinguish between these 3 special Reference classes and call them all 'Soft' references/links. So from HeapRoot's point of view, references are either 'Strong' or 'Soft'.

    Keeping Alive
    An object's keeping alive set is the subset of its reachability set which can not be reached from any object outside of the reachability set. In English, an object A is keeping alive an object B if it can only be reached by a path through object A. e.g. above, B is keeping alive most of the graph (C,D,..,J) but A is not keeping alive anything since K can reach everything that A can.

    Size
    An object's size (measured in bytes) is the amount of memory required in the heap to actually store the data that forms its instance. Java Objects also have corresponding class 'Objects' in the heap storing class data such as static field data.

    Root-Owner
    HeapRoots (see algorithms) reduces the complexity of graph analysis by assigning objects a unique 'root-owner/parent' which is then the 'owner' of that object. This 'root-owner' will probably not be a direct parent, but will be able to reach the object indirectly. It may be that another object is more significant in keeping it from being garbage collected (e.g. a straight-forward example being when the root 'soft' references the object but another holds a 'strong' reference). In the graph above, L's owner is K, but every other object's owner is root A (not readable from graph).

    Descendants & Total/Subtree Size & Max Depth
    Each root object owns (see algorithms) some subset of its reachability set. These are the objects whose 'root-owner' is the root in question. A root doesn't necessarily own all of its reachability set because another root may own some of it e.g. K above only owns L, but can reach every object that A can.

    I define the number of descendants for a root to be the number of objects uniquely owned by it. The total or subtree size is then the total size of all of these objects plus the root itself. e.g. For a non-root object, the number of descendants and subtree size are based on the objects 'below/deeper/further down' from them in the tree.

    The Maximum Depth of a root is the depth of the deepest branch of the tree ( as formed while processing the graph) which it owns. For example above, Object A has max-depth of 6 - the branch thru' the 6 objects: B->C->D->E->F->G

    Top


    Algorithms

    A 'Depth First Search' (referred to as DFS from now on) is a particular type of recursive graph traversal.

    A 'root' starting node is picked and then it's children are explored (in an arbitrary order). These children are then processed in a similar manner until every reachable object has been processed. It is called 'depth-first' because the first child of a node is fully explored before moving onto the next sibling (so going deep first) rather than exploring equally amongst the children in a breadth-first way. Care has to be taken not to get stuck in livelock processing cyclic structures, but this achieved quite easily by marking objects which have already been processed.

    Since a DFS from a node will easily calculate the node's reachability, a DFS is a good way to see how important an object is. If we did a DFS from every object then it would be quite easily to see which objects have the greatest reachability and hence total size.

    Unfortunately doing a DFS from every object would take a very very long time. Considering that heaps have huge numbers of objects (implying we have to be space efficient), doing this calculation would take a very long (quadratic in N) time.

    There is a more time efficient algorithm which I haven't implemented. This involves creating (using logN matrix multiplications) a N*N reachability matrix where each cell (i,j) is 1 if i can reach j or 0 otherwise. You can then sum each row i to find object i's total reachability. The problem is storing a N*N matrix when N is millions!

    Not to worry because in actual fact, heaps are quite well connected generally and most of the 'pure' roots can actually reach most of the entire graph! I could do more with algorithms in future given a bit more spare time.

    The current HeapRoots algorithm is to do a DFS from the 'pure' roots (in arbitrarily address order) marking which objects are reached (and by which root - setting the 'root-owner') in the process. This then leaves objects not reachable from 'pure' roots (in cycles) and some of these are chosen as 'artificial' roots which are then traversed from until every single object has been reached.

    Here, the objects are labelled alphabetically in the order in which they are reached in the Depth First traversal, the numbers outside the bubbles are the numbers of descendants for each object.

    Object A is the only 'pure' root and is processed first. Its only child B is then processed, followed by its first child C and so on. When an object's children are finished with, its total size and descendants are calculated based on its children's sizes and descendants.

    Take object E as an example. Its first child F has one child G (ignoring C which has already been processed by object B - to avoid cycles), so F's total size is the size of F+G and it has 1 descendant (namely G). E's second child H has no children and so its total size is just the of G and has 0 descendants. Once F and H are processed, the total size of E is now size(E)+totalsize(F)+totalsize(H) = E + F+G + H. The number of descendants of E is number of immediate descendants + descendants(F) + descendants(G) = 2 + 1 + 0 = 3 (as shown next to E). This process now unwinds all the way back up to A.

    Next, since objects K and L are unreached from the single 'pure' root, K is marked arbitrarily as an 'artificial' root. Its child L is processed, but finishes immediately since child B has already been owned by root A and child K is already being processed.

    This completes the processing of the graph. Note that the descendant numbers are skewed (almost arbitrarily) and do not necessarily reflect the true reachability numbers (here K and L in fact can both reach 10 other objects, and A can only reach 9). See Dumping from Roots for the formatted textual output representing this graph.

    Top