Raw Source Code

Friendly MB2 source code reference

Pointers to the MallocBinned2 source code to be used in tandem with the illustrated overview

Introduction:

This document highly relies on you having access to the Unreal source code, please request access if you still don’t have it, here we map relevant places for the MallocBinned2 allocator a.k.a. MB2, in a not very introductory way, just links, descriptions, raw info and human readable pointers to C++ pointers (joke), for an illustrated overview of the inner working please check my other post “An illustrated overview of the Unreal Engine MallocBinned2 allocator” and use both documents in tandem.

Hardware Abstraction Layer:

GenericPlatformMemory

File: Engine/Source/Runtime/Core/Public/GenericPlatform/GenericPlatformMemory.h

Any supported platform (Windows, Linux, iOS, etc..) implements a static non polymorphic interface defined by struct FGenericPlatformMemory, following a bridge pattern.

MallocBinned2 will request/free memory from/to the operating system using specific platform-specific implementations of the static functions:

Different supported allocators are enumerated at FGenericPlatformMemory::EMemoryAllocatorToUse, each platform will select the active runtime allocator by implementing the function FGenericPlatformMemory::BaseAllocator.

WindowsPlatformMemory

File: Engine/Source/Runtime/Core/Private/Windows/WindowsPlatformMemory.cpp

Windows-specific implementation of the static interface defined by GenericPlatformMemory, FWindowsPlatformMemory::BaseAllocator selects the active allocator, this function is not thread safe, however thread safety is handled later by FMemory::GCreateMalloc which relies on C++ local static thread safety, by initializing a local static variable with the value returned by the wrapper function FMemory_GCreateMalloc_ThreadUnsafe.

Win32 VirtualAlloc and VirtualFree functions are used at FWindowsPlatformMemory::BinnedAllocFromOS and FWindowsPlatformMemory::BinnedFreeToOS respectively to request/free memory from the operating system, these Win32 functions are the lowest-level and fastest call available in user-mode land for memory allocations/deallocations, for a wonderful overview please check ahmd0 answer in this Stack Overflow question.

The important thing to note here is that the start address of allocations done using VirtualAlloc are aligned to 64KiB (Exposed at SYSTEM_INFO::dwAllocationGranularity) and memory is committed in blocks that are multiple of the virtual page (Usually 4KiB which is exposed at SYSTEM_INFO::dwPageSize), so if you request 8 bytes using VirtualAlloc, the start address of your allocation will be a multiple of 0x10000 and the committed memory will be the first block of 4KiB beginning from the start address, in this case you’d be generating an internal fragmentation in the memory manager of the Windows kernel of 15 virtual pages (4KiB * 15 = 61440 bytes), the reason behind this alignment seems to be RISC processors supported by Windows NT as per the legendary Raymond Chen.

Unreal is aware of previously described lore at FWindowsPlatformMemory::GetConstants which defines the MB2PageSize to be SYSTEM_INFO::dwAllocationGranularity, MB2 expects that FWindowsPlatformMemory::BinnedAllocFromOS will always return pointers aligned to a 64KiB boundary, as we’ll see next this implies extra juggling on other platforms like Linux, iOS and others that doesn’t have the 64KiB alignment imposition in their operating system memory functions.

UnixPlatformMemory

File: Engine/Source/Runtime/Core/Private/Unix/UnixPlatformMemory.cpp

Unix-specific implementation of the static interface defined by GenericPlatformMemory, the runtime allocator is selected in the non-thread-safe function FUnixPlatformMemory::BaseAllocator.

The MB2PageSize is defined to be

MemoryConstants.BinnedPageSize = FMath::Max((SIZE_T)65536, MemoryConstants.PageSize);

Unix functions mmap and munmap are used respectively at FUnixPlatformMemory::BinnedAllocFromOS and FUnixPlatformMemory::BinnedFreeToOS, these functions need to do a bunch of juggling to guarantee the 64KiB alignment expected by MB2, for any given allocation size, Unreal requests an extra MB2Page just to guarantee that a 64KiB boundary is contained in the allocation then un-maps memory that precedes the aligned address and memory that succeeds the spanning block from the aligned address to the requested size, this traduces in extra kernel calls (potentially two extra unmaps during an allocation), fragmentation on the linux kernel and runtime overhead (however unmapping during an allocation can be mitigated with UE4_PLATFORM_REDUCE_NUMBER_OF_MAPS, but it has never been enabled by default).

ApplePlatformMemory

File: Engine/Source/Runtime/Core/Private/Apple/ApplePlatformMemory.cpp

Lore for the apple implementation is the same described for Unix, with a minor distinction, at FApplePlatformMemory::BinnedAllocFromOS and FApplePlatformMemory::BinnedFreeToOS when the macro constant USE_MALLOC_BINNED2 is undefined all the extra juggling described previously is ripped out and Unreal simply does a usual and simple mmap which effectively removes all the extra unmaps described previously for Unix, this seems irrelevant though because BinnedAllocFromOS and BinnedFreeToOS are shared only by MB2 and MallocBinned (the binned allocator created before MB2), and MB seems to be deprecated (more research required for confirmation).

This implementation is just common code for iOS and Mac platforms.

iOSPlatformMemory

File: Engine/Source/Runtime/Core/Public/IOS/IOSPlatformMemory.h

iOS extends the apple implementation class

MacPlatformMemory

File: Engine/Source/Runtime/Core/Public/Mac/MacPlatformMemory.h

Mac extends the apple implementation class

AndroidPlatformMemory

File: Engine/Source/Runtime/Core/Private/Android/AndroidPlatformMemory.cpp

Overall, very similar to the Unix implementation, FAndroidPlatformMemory::BaseAllocator shows that only MB, MB2, MB3 and the ANSI allocators are supported and they cannot be selected using cmd arguments, apart from that everything is very similar to Unix and Apple.

Active PlatformMemory

File: Engine/Source/Runtime/Core/Public/HAL/PlatformMemory.h

The active platform code is exposed for the engine at PlatformMemory.h, each platform implementation should expose the typedef for FPlatformMemory.

Allocator is selected statically at the build scripts for each respective platform target.

  • UEBuildWindows.cs uses MallocBinned2 by default
  • UEBuildLinux.cs uses MallocBinned2 by default, has a comment “default to FMallocBinned2 as it uses less Virtual Memory Areas”
  • UEBuildIOS.cs uses MallocBinned2 by default, has a comment “Prefer MB2 as MB3 requires extended virtual address space entitlement”
  • UEBuildMac.cs uses MallocBinned3 by default
  • UEBuildAndroid.cs uses MallocBinned3 by default

Memory functions

File: Engine/Source/Runtime/Core/Public/HAL/UnrealMemory.h

Application code should access memory functions through the static FMemory interface, this class wraps functions from the FPlatformMemory interface, instances the active allocator (thread safe) and also provides wrap functions to use the default allocator from the active C runtime (which is intended only for some low level subsystems).

Unreal overrides the global C++ allocator using the macro REPLACEMENT_OPERATOR_NEW_AND_DELETE which is referenced by the macro PER_MODULE_BOILERPLATE and the file HAL/PerModuleInline.inl, for non monolithic builds, the inline header is added into the final source by UnrealBuildTool as an intermediate gen.cpp file per each module, see Engine/Source/Programs/UnrealBuildTool/Configuration/UEBuildModuleCPP.cs, for an example open the “Intermediate” folder in your project and look for “PerModuleInline.gen.cpp”.

BaseAllocator

File: Engine/Source/Runtime/Core/Public/HAL/MemoryBase.h

FMalloc is the base class for a polymorphic interface from which all the supported allocators should extend, this class extends FUseSystemMallocForNew which allows the allocator to be instanced using the C runtime allocator, for this study we will focus only in the MallocBinned2 implementation.

MallocBinned2 source code overview

MallocBinnedCommon.h

File: Engine/Source/Runtime/Core/Public/HAL/MallocBinnedCommon.h

Contains code that is shared between MallocBinned2, MallocBinned3 and MallocBinnedGPU, we’ll outline only what is relevant for MB2 but ignoring a few exceptions, this file mostly contains data structures relevant for the cache subsystems and the hashing mechanism used by MB2 to make O(1) allocations possible.

Configurable knobs

Macro constants to customize the allocator behavior, we’ll discuss this later.

FMallocBinnedCommonBase

Class that extends the FMalloc interface but doesn’t implement it, declares a bunch of nested types relevant for the cache subsystem.

TMallocBinnedCommon

Templated class that extends FMallocBinnedCommonBase and from which all binned allocators extend, it still doesn’t fully implement the FMalloc interface.

The class provides a hash table where the keys are pointers to bins and the values are arrays of FPoolInfo objects. Each FPoolInfo is responsible for tracking the availability status of a given block of memory, for small allocations it tracks availability of bins in a MB2Page (64KiB block), for large allocations it holds the amount of memory in a single large allocation.

FPtrToPoolMapping

Holds all info necessary to query the FPoolInfo object that “owns” a given pointer. The pointer can represent a bin in a small allocation or a large allocation block.

FPoolHashBucket

Represents a bucket node in the hash table of pointers to FPoolInfo objects, the number of buckets depends on the amount of physically installed memory, when a collision happens a new node is linked into the collided bucket.

This structure makes it possible for allocations to have a time complexity of O(1) on average, we’ll talk more about this later.

FBundleNode

Represents bin memory that is owned by the cache subsystem (FFreeBlockList or GlobalRecycler), this object is a bin in a MB2Page but reinterpreted.

FBundle

A linked list of FBundleNode objects, the cache subsystem holds multiple bins in packs known as bundles.

FFreeBlockList

Main interface of the cache subsystem, for each supported bin size and each thread that enables local caches, there is a FFreeBlockList object that holds two bundles, a PartialBundle and a FullBundle, each bundle can link up to 64 nodes or point to no more than 65536 bytes of bin memory, if one of the condition is not meet, the bundle is said to be full, when both bundles are full, the FullBundle will be evicted and passed to a thread shared cache known as the GlobalRecycler, similarly when both bundles are empty, MB2 will try to get a previously recycled bundle from the GlobalRecycler.

FPerThreadFreeBlockLists

Contains a table of FFreeBlockList vs BinSize, this table is a singleton per each thread that opted in to activate the cache mechanism (thread local), the caching for a given thread is activated by calling SetTLS on the thread.

The cache subsystem is formed by this class and the related bundle and BundleNode structures mentioned previously.

MallocBinned2.h

File: Engine/Source/Runtime/Core/Public/HAL/MallocBinned2.h

Contains the class that extends TMallocBinnedCommon and provides implementations for the FMalloc interface, here is where the magic starts.

FPoolInfo

An asymmetric linked list that tracks either a linked list of non-cached bins that belong to the same MB2 page and that can be used to fulfill a future allocation (FirstFreeBlock) or the size of a large allocation.

FFreeBlock

A linear allocator of a block of free bins, during the runtime you can find these in two flavors:

  • When a new MB2Page is requested from the operating system the first 16 bytes of the page contain a special FFreeBlock known as the pool-header (which is never returned to the user), when a user request arrives this FFreeBlock is responsible of allocating for the first time bins in a given page. Hitting a bunch of allocations from the same pool-header guarantees spatial locality for similar sized allocations (however the caches and single-bin FFreeBlocks will be used first).
  • When a pointer to a bin is freed by the user and evicted from all the caches, the bin is reinterpreted as a FFreeBlock of a single bin, this block gets linked into the FPoolInfo owning the bin address.

FPoolList

Holds a linked list of FPoolInfo objects that are bound to small allocations of the same size (the structure is just a linked list container, in reality the pools can be any pool, the bit about small allocations is inferred later during runtime and usage).

FPoolTable

An entry for a table that tracks all the Exhausted and Active FPoolInfo objects that are bound to a given bin size, a pool is considered exhausted or active depending on if it has FreeBlocks.

MallocBinnedCommonUtils.h

File: Engine/Source/Runtime/Core/Public/HAL/MallocBinnedCommonUtils.h

Contains code for the GlobalRecycler (thread shared cache), and most of the logic related to trimming memory (which translates to flushing thread local caches).

TGlobalRecycler

A non locking and cache friendly cache of bundles, it has one entry per each supported bin size and each entry has 8 slots for storing bundles (the number of slots can be changed but 8 is the sweet spot for 64 bit systems with a cache line of 64 bytes).

This is the thread shared part of the cache subsystem.

FMallocBinnedCommonUtils

Handles the mechanisms to free as much from the thread caches as possible, this is an asynchronous operation that will complete on a given thread when the thread is going to sleep. The operation consists on linking the Partial and Full bundle of all the FFreeBlockList objects, and popping them out.

CachedOSPageAllocator.h

File: Engine/Source/Runtime/Core/Public/HAL/Allocators/CachedOSPageAllocator.h

A cache between the operating system and MB2 to reduce kernel calls, when a cache miss happens the request will be forwarded to the operating system.

Pointers will be inserted into the cache when MB2 tries to free a previously allocated OS memory block.

Hashing mechanism explained

Memory managed by the allocator is globally tracked by a hash table, the number of buckets in the table is calculated based on the amount of physical memory installed in the system and other factors that are platform dependent (licensed platforms like Switch, PlayStation. Xbox might be subtly different, however what we list here is true for openly available platforms like Windows, Android, Unix, iOS as per Unreal 5.6).

Installed RamNum of Buckets
1GiB8
4GiB32
8GiB64
16GB128
32GB256
64GB512
128GB1024

The number of buckets is calculated from subtracting the constants AddressStart (usually zero) from AddressLimit at FPtrToPoolMapping::Init.

Each bucket in the hash table lazily contains an array of 2048 FPoolInfo objects that are backed in a continuous block of 64KiB (each pool object has a size of 32 bytes), a given bin is hashed into the hash table by masking different parts of the bin address as a sequence of bits CHPX such that:

NumOfBits(CHPX)==64 NumOfBits(CHPX) == 64

or

NumOfBits(CHPX)==32NumOfBits(CHPX) == 32

Depending on the platform word size

X:

Sequence of 16 bits, encodes the address of a byte in a given MB2Page of 65536 bytes (16 bits can encode 65536 values).

P:

Sequence of 11 bits, represents an index in an array of 2048 pool objects (11 bits can encode 2048 values), this array is contained in a hash bucket entry, the number of pools is calculated as:

MB2Pagesizeof(FPoolInfo)=6553632=2048\frac{MB2Page}{sizeof(FPoolInfo)} = \frac{65536}{32} = 2048

H:

Variable sequence of bits, calculated depending on the available physical memory, represents an index in the array of buckets of the hash table, the number of buckets is calculated by first measuring how many MB2Pages can theoretically fit into physical memory (rounded up to the next power of two for easy binary arithmetic):

Q=RoundToNextPowOf2(TotalPhysicalMemoryInBytes)65536Q = \frac{RoundToNextPowOf2(TotalPhysicalMemoryInBytes)}{65536}

Then we calculate how many hash buckets we would need in order to map the total number of MB2Pages Q:

Q2048\frac{Q}{2048}

Divided by 2048 because a hash bucket contains 2048 pools (lazily initialized) and each pool is assumed to track a block 65536 bytes which is a MB2Page.

C:

Variable sequence of bits that identifies a collision in a hash table bucket, for example, 0HPB and 1HPB are two managed addresses that collide into the same bucket of the hash table, which would only happen if the virtual memory address assigned by the operating system is not in the same block [AddressStart, AddressLimit], the number of bits of C is calculated as

NumOfBits(CHPX)NumOfBits(HPX)NumOfBits(CHPX) - NumOfBits(HPX)

Customizable knobs and other constants

Relevant constants to customize the behavior of MallocBinned2.

UE_MB2_MAX_CACHED_OS_FREES_BYTE_LIMIT

The maximum amount of memory that can be cached by the paged allocator (64MiB), this cache receives elements when MB2 tries to free a pointer to a big block allocated from the OS.

If the memory being allocated/freed is bigger than a quarter of UE_MB2_MAX_CACHED_OS_FREES_BYTE_LIMIT (16MiB), the request will be forwarded directly to the operating system.

UE_MB2_MAX_CACHED_OS_FREES

The maximum number of entries (pointers to big blocks allocated from the OS) that the paged allocator will have, for performance critical threads, this value will be doubled, a performance critical thread is considered to be a thread that enabled the thread caches.

SmallBinSizesInternal

Static array that defines the supported bin sizes, they must be ordered from small to big and be a multiple of 16.

UE_MB2_MAX_SMALL_POOL_SIZE

The biggest supported bin size. Should be updated if you change the biggest bin in the SmallBinSizesInternal array.

UE_MB2_SMALL_POOL_COUNT

Number of bins defined in the SmallBinSizesInternal array.

UE_MBC_MIN_SMALL_POOL_ALIGNMENT

The minimum alignment of a bin, this constant cannot be changed as much of the internal math of the allocator relies on this. When the user requests an alignment smaller than this, the requested alignment will be silently ignored and treated like a 16 byte alignment.

UE_MBC_MIN_SMALL_POOL_ALIGNMENT_SHIFT

Same as above, but expressed as a shift 1 << UE_MBC_MIN_SMALL_POOL_ALIGNMENT_SHIFT.

UE_MBC_MAX_SMALL_POOL_ALIGNMENT

A power of two representing the maximum alignment supported by a small allocation, if the small allocation has an alignment bigger than this, it will be treated as a large allocation and a full block of 4KiB will be requested from the operating system (or the paged allocator).

You can modify this value but would need to guarantee that there exists a matching bin size that is a power of two otherwise you would just be wasting extra computations, also note that setting up a bin size with a big power of two like 8192 would create MB2 pages where the first bin of 8192 is always wasted to hold the 16 bytes of a pool-header FFreeBlock (a page without slack memory), so this has tradeoffs and should be done very carefully. For more details check my illustrated notes on alignment.

UE_DEFAULT_GMallocBinnedBundleSize

The maximum amount of memory that can be cached in a bundle, if all the BundleNode objects in the bundle point to more memory than this number, the bundle is considered to be full and no more nodes can be added into it.

UE_DEFAULT_GMallocBinnedBundleCount

The maximum number of BundleNodes that can be held in a bundle object, if this number is reached, the bundle is considered to be full and no more nodes can be added into it.

UE_DEFAULT_GMallocBinnedPerThreadCaches

If true, the cache subsystem is enabled, this refers to the thread local and thread shared mechanisms, false fully disabled the caching.

UE_DEFAULT_GMallocBinnedAllocExtra

This number represents an extra amount of allocations that will be immediately freed in order to put elements in the thread local cache. Happens for any thread that enables the local cache, whenever a cache miss occurs.

This mechanism allows future allocations to be fulfilled directly from the cache which is a lock-free and the fastest path possible in terms of assembly instructions count.

UE_DEFAULT_GMallocBinnedMaxBundlesBeforeRecycle

The default maximum number of slots in an entry of the TGlobalRecycler, this number should be selected carefully to match the cache line of the running system, e.g. for a 64-bit platform with CPU L1 cache lines of 64 bytes a pointer has a size of 8 bytes and 8 pointers would fit exactly in one cache line.

Credits

Written by Romualdo Villalobos

All rights reserved.