Compressing callstacks: a bitpacked DAG powered by a keyless hashmap
One of the challenges we face in Superluminal is dealing with large amounts of data. When we capture performance data, the data is dominated by callstack information. On Windows, we sample at 8 kHz and capture a callstack for every active CPU. For a 16-core machine with an average stack depth of 40 frames, the raw data rate becomes:
16 (cores) * 40 (frames) * 8 (bytes per frame) * 8192 (Hz) = 40 MiB/s
As a real example, we use a 2.64 GiB ETL file that was captured on Windows while running the Unity engine. It contains 21,096,952 callstacks totalling 4.2 GiB of raw data. The callstack data is larger than the total file size because the ETL stream is compressed.
Even in a moderately sized capture such as this, there is more callstack data than we are willing to keep in RAM, and multiple systems reference these stacks, so we need an efficient way to store them. Since one of the cornerstones of our profiler is to load and process capture files of any size with a predictable upper memory bound, we need a representation that:
- Greatly reduces memory size
- Supports partial streaming, allowing us to load and evict pages on demand
An analysis
During capture, most events retrieve a callstack and append it directly to the event stream. There are a lot of duplicate stacks in that event stream. For example:
void Foo()
{
    for (…)
    {
        // do work here
    }
}

int main()
{
    Foo();
    // ...
}
Sampling this at 8 kHz may produce a series of stacks such as:
[main ] [main ] [main ] [main ] [main ]
[Foo  ] [Foo  ] [Foo  ] [Foo  ] [Foo  ]
[RIP X] [RIP Y] [RIP Z] [RIP X] [RIP Z]
Here RIP represents instruction pointers (samples) in the scope of Foo. The parent functions (Foo, main) will repeat, and RIPs will be scattered across Foo’s address space. Over time, most samples fall on hotspots, so stacks repeat heavily. In this example, our five stacks could be deduplicated to three stacks. In our real capture:
| | Count | Size |
| Raw stacks | 21,096,952 | 4,468,602,976 (4.2 GiB) |
| Deduplicated stacks | 570,866 | 171,811,480 (163.9 MiB) |
Deduplication alone removes 96% of the data. However, even after deduplication there is still a lot of redundancy: callstacks share many common frames, like main and Foo in our example. If we store the frames in a tree, that remaining redundancy is eliminated:
main
└─ Foo
   ├─ RIP X
   ├─ RIP Y
   └─ RIP Z
Let’s take a very simple Node structure:
struct Node
{
    std::vector<std::unique_ptr<Node>> mNodes;
    uint64_t mFrame;
};
The std::vector contains 3 pointers, which together with the frame adds up to 32 bytes per node. The std::vector also performs a heap allocation once the first child is added to the list. The minimum size for a heap allocation differs per allocator, but let’s assume a minimum of 16 bytes. As vectors grow, they tend to allocate more memory than needed to avoid reallocating for each new element. Even ignoring that internal fragmentation, each Node will take at least 48 bytes. That’s a lot of overhead, and doing millions of heap allocations is not great.
At millions of nodes, 48 bytes per node is too expensive, and the heap traffic alone becomes a bottleneck. So we need a representation with very little per-node overhead that performs only a few allocations.
Divide and conquer
In our profiler, we have a capturing phase and a runtime phase:
- The capturing phase includes both the capturing and importing of the capture
- The runtime phase is where a file is fully imported and displayed in the UI
The capturing phase needs to gather and organize the data based on the event stream, but as soon as the runtime phase starts, the data is immutable. We used to share data structures between both phases, but we learned that building dedicated data structures for each phase limits the scope of each data structure, which in turn opens up a world of opportunities for optimization. Our strategy here is the same: we are going to design separate data structures that fit the purpose of each phase.
The static Directed Acyclic Graph (DAG)
For the runtime phase, we don’t need to insert or remove stacks. There is also no need to traverse the structure from parents to children. The only operation needed is: given an ID, return the callstack.
As there is no need for parent-child traversal, the key observation is that we can reconstruct a callstack by reversing the operation: we start at the leaf and follow parent pointers up until the root. Our Node structure then looks like this:
struct Node
{
    Node* mParent;
    uint64_t mFrame;
};
This reduces the node size from ~48 to 16 bytes.
Instead of performing heap allocations per node, we can allocate nodes from a pre-allocated page, and refer to parents by index instead:
struct Node
{
    uint64_t mFrame;
    uint64_t mParentIndex;
};
If we’re inserting stacks { A, B, C } and { A, B, D }, we can allocate a page of nodes and insert them:
| Index | Frame | ParentIndex |
| 0 | A | -1 |
| 1 | B | 0 |
| 2 | C | 1 |
| 3 | D | 1 |
The ID of the first stack { A, B, C } is 2, which is the leaf frame of the stack, and for the stack { A, B, D }, the ID is 3. We can resolve the full stack by tracking parent indices.
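To make that concrete, here is a minimal sketch of how a stack could be resolved from its ID using the flat node array above. The function name and the -1 root sentinel follow the tables in this section; the actual implementation differs (and, as shown below, uses pages rather than one flat array):

#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative only: resolve a callstack from its ID (the index of its leaf node)
std::vector<uint64_t> ResolveStack(const std::vector<Node>& inNodes, uint64_t inID)
{
    std::vector<uint64_t> frames;
    for (uint64_t index = inID; index != (uint64_t)-1; index = inNodes[index].mParentIndex)
        frames.push_back(inNodes[index].mFrame);        // walk leaf -> root
    std::reverse(frames.begin(), frames.end());         // return the stack root -> leaf
    return frames;
}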
Bitpacking parent indices
Storing 64-bit parent indices is wasteful. Even with a million stacks, the parentIndex could fit in 4 bytes. However, because we build the structure incrementally, the final size of the callstack database is unknown at insertion time, so we can’t pick a global bit-depth up front. What we can rely on is that every inserted frame always references a parent that was created earlier. That means the bit-depth required to encode any mParentIndex is always less than or equal to the bit-depth needed to encode the current element index.
In fact, we can determine the bit-depth for an entire page using this property: the worst-case bit-depth for a page is simply the number of bits needed to represent the largest index stored in that page.
For example, suppose we allocate pages of 64 elements. In the first four pages, all parent indices still fit in 8 bits. Below are the first and last elements of page 3. As before, each element can only reference a parent with an index less than or equal to its own:
| Index | Frame | ParentIndex |
| 192 | H | 108 |
| … | … | … |
| 253 | I | 150 |
| 254 | J | 253 |
| 255 | K | 254 |
The next page (4) could look like this:
| Index | Frame | ParentIndex |
| 256 | X | 255 |
| 257 | Y | 256 |
| 258 | Z | 257 |
Frame X can still reference an 8-bit index, but frame Y cannot, because it references an element within the same page. Page 4 covers the index range 256–319, which no longer fits in 8 bits. Therefore, we choose a 16-bit parentIndex for this page.
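As a sketch, picking the bit-depth for a page can be as simple as looking at the largest index the page can contain (the function name and page bookkeeping are illustrative, not the actual implementation):

#include <cstdint>

// Illustrative: the largest index stored in a page bounds the bit-depth
// needed for every parent index in that page
uint8_t BitDepthForPage(uint64_t inFirstIndexInPage, uint64_t inNumElements)
{
    uint64_t maxIndex = inFirstIndexInPage + inNumElements - 1;
    if (maxIndex <= 0xFF)       return 8;
    if (maxIndex <= 0xFFFF)     return 16;
    if (maxIndex <= 0xFFFFFFFF) return 32;
    return 64;
}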
Now that we have multiple pages, we can find the page and the offset within the page by a simple division and modulo operation:
pageIndex = index / NumElementsInPage
offset    = index % NumElementsInPage
The page structure
Our Node structure now needs to be replaced, because it cannot efficiently store parent indices with varying bit-depths. Instead, we split the data into two parallel arrays: one for the frames and one for the parent indices. This gives us a classic Structure-of-Arrays (SoA) layout:
struct Page
{
    uint64_t* mFrames;
    void* mParentIndices;
    uint16_t mNumElements;
    uint8_t mBitDepth;
};
We now allocate a single memory block to hold both the frames and the parent indices. The mFrames array points to the start of this block, while mParentIndices points to an offset within it. When traversing mParentIndices, we cast the void* to the appropriate type (for example, uint8_t* or uint16_t*) depending on the bit-depth. With this layout, the memory footprint per node is further reduced from 16 bytes to:
| Page bitdepth | Bytes per node |
| 8 | 9 |
| 16 | 10 |
| 32 | 12 |
| 64 | 16 |
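As a sketch of how a node could be read back from this layout (the page array, the fixed page size of 64 elements, and the function name are assumptions for illustration):

#include <cstdint>
#include <vector>

static constexpr uint64_t kNumElementsInPage = 64;

// Illustrative: resolve an element index to its frame and parent index
void GetNode(const std::vector<Page>& inPages, uint64_t inIndex, uint64_t& outFrame, uint64_t& outParentIndex)
{
    const Page& page = inPages[inIndex / kNumElementsInPage];
    uint64_t offset = inIndex % kNumElementsInPage;
    outFrame = page.mFrames[offset];
    switch (page.mBitDepth)   // widen the packed parent index to 64 bits
    {
    case 8:  outParentIndex = static_cast<const uint8_t*> (page.mParentIndices)[offset]; break;
    case 16: outParentIndex = static_cast<const uint16_t*>(page.mParentIndices)[offset]; break;
    case 32: outParentIndex = static_cast<const uint32_t*>(page.mParentIndices)[offset]; break;
    default: outParentIndex = static_cast<const uint64_t*>(page.mParentIndices)[offset]; break;
    }
}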
For our reference capture file, we have a total byte size of:
| Raw stacks | 4,468,602,976 (4.2 GiB) |
| Deduplicated stacks | 171,811,480 (163.9 MiB) |
| Compressed stacks | 25,820,790 (24.6 MiB) |
The compressed stacks occupy only 33 pages. Because we perform a single allocation per page, that translates to just 33 malloc calls, greatly reducing pressure on the allocator. Allocating by pages also gives us precise control over how many pages are loaded at once. Using a simple LRU algorithm, we can evict the least recently used pages whenever a memory threshold is exceeded.
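As an illustration of the eviction side, a small LRU over resident pages could look roughly like this (the cache structure and names are hypothetical, not the actual implementation):

#include <cstdint>
#include <list>
#include <unordered_map>

// Illustrative LRU: touching a page moves it to the front, and the least
// recently used pages are evicted once a threshold is exceeded
struct PageCache
{
    size_t mMaxResidentPages = 16;
    std::list<uint32_t> mLru;  // front = most recently used page index
    std::unordered_map<uint32_t, std::list<uint32_t>::iterator> mWhere;

    void Touch(uint32_t inPageIndex)
    {
        auto it = mWhere.find(inPageIndex);
        if (it != mWhere.end())
            mLru.erase(it->second);
        mLru.push_front(inPageIndex);
        mWhere[inPageIndex] = mLru.begin();

        while (mLru.size() > mMaxResidentPages)
        {
            uint32_t victim = mLru.back();
            mLru.pop_back();
            mWhere.erase(victim);
            // unload the victim page's buffer here; it can be re-read from disk on demand
        }
    }
};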
Building the DAG
Having a compact DAG is great, but how do we construct it during the capturing phase? For this, we need a data structure that supports fast insertions.
Our input is a stack represented as an array of frames, and for each frame we need to determine whether a corresponding element already exists in the DAG:
for each frame in callstack
    if (!frame exists in data structure)
        createFrame();
Here we can’t just check whether the frame is in the data structure; we need to check whether the frame is present in the context of its parent. At first glance, it seems like we do need an actual tree-like structure here where we encode the parent-child relations.
However, we can express the same question in a different way: given a node with a frame A and a parent frame X, does it already exist in the static data structure?
This query can be more efficiently implemented with a map than a tree. The map uses a key {frame, parent frame} that maps to a value {element index}.
Because we only ever append to the map and never erase entries, the interface we need is simple: GetOrInsertFrame.
for each frame in callstack
    uint64_t index = map.GetOrInsertFrame(frame, parentFrame);
GetOrInsertFrame checks whether the key exists in the map. If it does, it returns the corresponding DAG index. If not, it creates a new element in the DAG and inserts the key/value pair into the map. The returned value isn’t used immediately here, but it will be in the next step.
Designing a custom map
While we could use an off-the-shelf map like std::unordered_map to store this data, that would still result in a lot of memory usage and heap allocations. We can do better by writing something highly tailored to our use case.
To minimize memory usage and heap allocations, we use an open-addressing hash map. In this approach, all elements are stored in a single contiguous array, in contrast to separate-chaining hash maps, which allocate memory for each bucket. Since the number of nodes is very large, avoiding per-node allocations is crucial. Additionally, because we never erase entries, we don’t need the complexity of tombstones (a mechanism used to mark deleted items). This makes our custom map simpler.
To insert or look up an element, we hash the key, compute a position in the array from the hash, and then probe forward from that position. Here’s a simple example:
// A function that takes a parent & child frame, and returns an index in the DAG
uint64_t CompressedCallstackDatabase::GetOrInsertFrame(uint64_t inChildFrame, uint64_t inParentFrame)
{
    uint64_t h = GetHash(inChildFrame, inParentFrame); // Calculate hash
    size_t index = h & (mHashMapCapacity - 1);         // Calculate index in the vector based on the hash
    while (true)
    {
        Entry& entry = mEntries[index];
        if (!IsUsed(entry))
        {
            entry.mParentFrame = inParentFrame;
            entry.mFrame = inChildFrame;
            // Create a frame in the DAG
            entry.mValue = CreateFrame(inChildFrame, inParentFrame);
            return entry.mValue;
        }
        if (inParentFrame == entry.mParentFrame && inChildFrame == entry.mFrame)
            return entry.mValue;
        if (++index == mHashMapCapacity)
            index = 0;
    }
}
If we exceed a certain size threshold, we grow and rehash the entire array. More on that later.
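GetHash (and sHashFrame below) can be any good 64-bit mix over the two key values. As a sketch, a splitmix64-style finalizer is one reasonable choice; the body below is an assumption, not the actual hash we use:

#include <cstdint>

// Illustrative hash: combine the two 64-bit key parts and run a splitmix64-style finalizer
static uint64_t sHashFrame(uint64_t inFrame, uint64_t inParent)
{
    uint64_t x = inFrame + 0x9E3779B97F4A7C15ull * inParent;
    x ^= x >> 30; x *= 0xBF58476D1CE4E5B9ull;
    x ^= x >> 27; x *= 0x94D049BB133111EBull;
    x ^= x >> 31;
    return x;
}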
Here is the naïve version of storing the key and value in the map:
struct Entry
{
    uint64_t mFrame = -1;        // the key
    uint64_t mParentFrame = -1;  // the key
    uint64_t mElementIndex = -1; // the value
    bool mIsUsed = false;        // whether this entry is occupied
};
That is 25 bytes per entry, and with alignment, it rounds up to 32 bytes.
Initially, the vector contains default-constructed Entry objects. When probing, we need to check whether an entry is occupied or not. The most obvious first step is to eliminate the mIsUsed flag and encode it as a sentinel value in one of the other members instead:
struct Entry
{
    uint64_t mFrame = -1;        // the key
    uint64_t mParentFrame = -1;  // the key
    uint64_t mElementIndex = -1; // the value. -1 means not used.
};
That brings the entry size down from 32 to 24 bytes.
When adding a stack to our data structure, we always start with a root frame that has no parent. We insert this root as a hardcoded element at index 0 in the DAG. On page zero, the layout looks like this:
| Index | Frame | ParentIndex |
| 0 | root | -1 |
| 1 | A | 0 |
| 2 | B | 1 |
| 3 | C | 2 |
Once the root is established, we can change the map’s key from {frame, parent frame} to {frame, parentIndex}. Since the root index is 0, we start by querying {frame, 0}. The map lookup returns the element index, which also provides the parent index for the next frame. This simplifies our callstack insertion code to:
int parentIndex = 0;
for each frame
    parentIndex = map.Find(frame, parentIndex);
Our Entry structure now looks like this:
struct Entry
{
    uint64_t mFrame = -1;        // the key
    uint64_t mParentIndex = -1;  // the key
    uint64_t mElementIndex = -1; // the value. -1 means not used.
};
We didn’t save any memory in this step, but now that our key is {frame, parentIndex}, we can recognize that this information is already stored in the DAG:
The DAG:
| Index | Frame | ParentIndex |
| 0 | root | -1 |
| 1 | A | 0 |
| 2 | B | 1 |
| 3 | C | 2 |
What the map layout could look like (including possible empty entries):
| ElementIndex (value) | Frame (key) | ParentIndex (key) |
| -1 | – | – |
| 3 | C | 2 |
| 1 | A | 0 |
| -1 | – | – |
| 2 | B | 1 |
Here is the core insight: once a node is inserted into the DAG, its key (frame, parentIndex) can always be recovered from the DAG itself. So the map never needs to store keys at all. For example, if we have just the value [2] in the map for frame B, we can look up mFrame and mParentIndex in the DAG. This means we can drop the key from the map entirely and replace its storage with a simple vector of indices:
std::vector<uint64_t> mEntries;
That brings the entry size down from 24 bytes to just 8 bytes. The entire hash map function now looks like this:
uint64_t CompressedCallstackDatabase::GetOrInsertFrame(uint64_t inChildFrame, uint64_t inParentIndex)
{
    if (mHashMapSize * 2 >= mHashMapCapacity)   // keep the load factor below 50%
        Rehash(mHashMapCapacity * 2);

    uint64_t h = sHashFrame(inChildFrame, inParentIndex);
    size_t index = h & (mHashMapCapacity - 1);

    uint64_t* __restrict base = mHashMapEntries.data();
    uint64_t* __restrict cur = base + index;
    uint64_t* __restrict end = base + mHashMapCapacity;

    while (true)
    {
        uint64_t elementIndex = *cur;
        if (elementIndex == (uint64_t)-1)       // empty slot: create the node in the DAG
        {
            ++mHashMapSize;
            *cur = CreateFrame(inChildFrame, inParentIndex);
            return *cur;
        }

        // Occupied slot: recover the key from the DAG and compare
        uint64_t frame;
        uint64_t parentIndex;
        GetNode(elementIndex, frame, parentIndex);
        if (inParentIndex == parentIndex && inChildFrame == frame)
            return elementIndex;

        if (++cur == end)                       // wrap around to the start of the table
            cur = base;
    }
}
We managed to create a map that operates on keys, yet it doesn’t store any keys at all; it stores only values. We’re utilizing the highly compressed DAG in our hash map to accomplish this.
Rehashing without duplicating
As our map grows, it occasionally needs to be rehashed. Normally, rehashing involves:
- Storing the old contents in a temporary location
- Clearing and resizing the map
- Reinserting the old contents
This process temporarily keeps two copies of the map in memory.
In our case, we can avoid this entirely by using the DAG as the source of truth for rehashing. The DAG already contains all three data members required for efficient reinsertion:
- Element index
- Parent index
- Frame
By looping over the nodes in pages (a very cache friendly operation as they’re all allocated contiguously in memory), we can perform a custom insert where all three data members are already known:
void CompressedCallstackDatabase::InsertFrame(uint64_t inChildFrame, uint64_t inParentIndex, uint64_t inElementIndex)
{
    uint64_t h = sHashFrame(inChildFrame, inParentIndex);
    size_t index = h & (mHashMapCapacity - 1);

    uint64_t* __restrict base = mHashMapEntries.data();
    uint64_t* __restrict cur = base + index;
    uint64_t* __restrict end = base + mHashMapCapacity;

    while (true)
    {
        uint64_t elementIndex = *cur;
        if (elementIndex == (uint64_t)-1)   // empty slot: store the already known element index
        {
            ++mHashMapSize;
            *cur = inElementIndex;
            return;
        }
        if (++cur == end)                   // wrap around to the start of the table
            cur = base;
    }
}
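Putting it together, the rehash itself could look roughly like this, assuming an mPages array holding the DAG pages (names and structure are illustrative):

void CompressedCallstackDatabase::Rehash(uint64_t inNewCapacity)
{
    mHashMapCapacity = inNewCapacity;                       // must stay a power of two
    mHashMapEntries.assign(mHashMapCapacity, (uint64_t)-1); // -1 marks an empty slot
    mHashMapSize = 0;

    // Walk the DAG pages in order; every node's key can be recovered from the DAG itself
    uint64_t elementIndex = 0;
    for (const Page& page : mPages)
        for (uint16_t i = 0; i < page.mNumElements; ++i, ++elementIndex)
        {
            uint64_t frame, parentIndex;
            GetNode(elementIndex, frame, parentIndex);
            InsertFrame(frame, parentIndex, elementIndex);
        }
}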
One of the great advantages of this map is that it is entirely separate from the DAG. For our reference capture, it occupies only 32.0 MiB and is allocated solely during the capturing phase. By storing everything in a single heap-allocated buffer, we put minimal pressure on the allocator.
Reading the static DAG from disk is also highly efficient: it requires only a few large disk reads, with no C++ data structures, complex heap allocations, or pointer fixups. This approach significantly speeds up the loading of session files.
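As a sketch of what loading a single page might look like under these constraints (the file layout, offsets, and names are assumptions for illustration):

#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Illustrative: one allocation and one read per page; the two pointers
// simply point into the freshly read block
bool LoadPage(FILE* inFile, long inFileOffset, uint16_t inNumElements, uint8_t inBitDepth, Page& outPage)
{
    size_t frameBytes  = inNumElements * sizeof(uint64_t);
    size_t parentBytes = inNumElements * (inBitDepth / 8);
    char* block = static_cast<char*>(malloc(frameBytes + parentBytes));

    if (fseek(inFile, inFileOffset, SEEK_SET) != 0 ||
        fread(block, 1, frameBytes + parentBytes, inFile) != frameBytes + parentBytes)
    {
        free(block);
        return false;
    }

    outPage.mFrames = reinterpret_cast<uint64_t*>(block);  // frames at the start of the block
    outPage.mParentIndices = block + frameBytes;            // packed parent indices right after
    outPage.mNumElements = inNumElements;
    outPage.mBitDepth = inBitDepth;
    return true;
}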
A final performance optimization
Inserting elements into the map is not free. With our hash function, we have an average probe length of 2.2, which is really good. However, because the initial access into the vector is a random memory access in a large array, we get a cache miss there. Once that value is loaded, locality of reference is great, as a single cache line holds 8 of our values. But that very first access remains expensive and dominates the cost of the GetOrInsertFrame function.
If we look at stacks on a timeline, there is a lot of coherency between stacks for a single thread. Code does not jump around erratically – it behaves in predictable manners: some frames will drop, some frames will be added.
An optimization we often use in our profiler is to take advantage of this coherency. Here, we add a cache layer in the form of a very simple per-thread vector that stores { frame, element index } for the previous stack. By walking this array first, we can skip many map lookups:
CallstackKey CompressedCallstackDatabase::AddStack(int inThreadID, const Callstack& inCallstack)
{
    ThreadState& threadState = mThreadStateMap[inThreadID];

    // Walk the shared prefix with the previous stack on this thread; these frames
    // are already in the DAG, so no map lookups are needed for them
    int frameIndex = 0;
    uint64_t parentIndex = 0;
    int maxFrameIndex = std::min((int)inCallstack.size(), (int)threadState.mStack.size());
    for (; frameIndex != maxFrameIndex; ++frameIndex)
    {
        if (inCallstack[frameIndex] != threadState.mStack[frameIndex].mFrame)
            break;
        parentIndex = threadState.mStack[frameIndex].mElementIndex;
    }

    // Insert the remaining frames through the map and update the per-thread cache
    threadState.mStack.resize(inCallstack.size());
    for (; frameIndex != (int)inCallstack.size(); ++frameIndex)
    {
        uint64_t frame = inCallstack[frameIndex];
        uint64_t elementIndex = GetOrInsertFrame(frame, parentIndex);
        threadState.mStack[frameIndex].mFrame = frame;
        threadState.mStack[frameIndex].mElementIndex = elementIndex;
        parentIndex = elementIndex;
    }
    return parentIndex;
}
This simple optimization is very effective: our statistics show that it saves between 60% and 90% of the map lookups in our captures. It can probably be pushed even further with a caching layer that keeps data around for longer and evicts entries based on age. That is something to look into in the future.
Final thoughts
Looking back on the process of designing these data structures, two things stand out.
First, dividing a problem into smaller subdomains helps constrain the problem space. The more constraints you can apply, the easier it becomes to optimize.
Second, as programmers we all have invisible barriers — things we tend to accept as-is. Existing code, third-party libraries, and sometimes even the kernel. In this case the barrier was a hash map: by stepping inside such a tried-and-tested data structure, new optimization opportunities appear right in front of you. And with enough constraints, the solution doesn’t even have to be complicated.
In this case, by looking at the problem in this way we were able to go from 4.2 GiB of raw data to 24.6 MiB of data for the static data structure, a 99.4% reduction. The hash map for building this data is separate and takes only 32.0 MiB – which is all temporary memory. All these data structures perform almost no heap allocations and can be streamed in and out of simple buffers directly.
