Memory Locality in the Age of Virtualization: Optimizing Database Performance in Hidden NUMA Topologies
1. Introduction
Modern cloud infrastructure presents a fascinating paradox: as we advance toward more sophisticated hardware architectures, virtualization increasingly obscures these same architectures from the software running within virtual machines. This phenomenon is particularly evident in Non-Uniform Memory Access (NUMA) systems, where memory access times depend on the memory location relative to the processor. When a hypervisor presents a virtual machine with a simplified, often single-socket view of the underlying hardware—despite physical cores being distributed across multiple NUMA nodes—it creates a challenging environment for performance-critical software like database engines.
This essay explores the complex landscape of hidden NUMA topologies in virtualized environments, offering both theoretical insights and practical strategies for database engine optimization. We'll examine why even sophisticated general-purpose allocators like jemalloc are insufficient for peak performance, and how specialized memory pooling techniques can help bridge the gap between virtualized abstractions and physical reality.
2. The Mathematics of NUMA Systems
2.1 Formal NUMA Access Model
Let's begin by formalizing the NUMA access model. In a physical NUMA system with $n$ nodes, we can represent memory access latency as a function:
$$L(i, j) = \begin{cases} L_{local} & \text{if } i = j \\ L_{remote} + D(i, j) & \text{if } i \neq j \end{cases}$$
Where:
- $L(i, j)$ is the latency for a core in node $i$ accessing memory in node $j$
- $L_{local}$ is the local access latency within a node
- $L_{remote}$ is the base remote access latency
- $D(i, j)$ is the additional distance-dependent latency between nodes $i$ and $j$
In typical hardware, remote accesses cost roughly 1.5× to 3× the local latency $L_{local}$, with significant implications for performance-sensitive applications.
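As a quick illustration with assumed (not measured) figures: take $L_{local} = 80$ ns, a base remote latency $L_{remote} = 140$ ns, and an extra hop penalty $D(i, j) = 20$ ns between non-adjacent nodes. A worst-case cross-node access then costs
$$L(i, j) = L_{remote} + D(i, j) = 140 + 20 = 160 \text{ ns} = 2 \times L_{local},$$
squarely within the range above.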
2.2 Virtualization Effect on Memory Access Model
However, in a virtualized environment, this reality is obscured. From the VM's perspective, the system appears to have a uniform memory access model:
$$L_{VM}(i, j) \approx L_{uniform} \quad \forall i, j$$
This discrepancy between the physical reality and the VM's perception creates a fundamental challenge for memory-intensive applications.
3. Modeling Memory Access in Virtualized Environments
To illustrate this challenge, consider a C++23 implementation of a memory access benchmark:
// Headers needed by this probe (C++23)
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <numeric>
#include <random>
#include <ranges>
#include <vector>
// A simple structure to measure memory access times across a memory region
class MemoryAccessProbe {
public:
explicit MemoryAccessProbe(size_t size_mb)
: buffer_(std::make_unique<std::byte[]>(size_mb * 1024 * 1024)),
size_(size_mb * 1024 * 1024) {
// Initialize with sequential values to prevent optimization
std::ranges::for_each_n(buffer_.get(), size_,
[i = 0](auto& b) mutable {
b = std::byte {static_cast<uint8_t>(i++ % 256)};
});
}
auto measure_random_access(size_t iterations) -> std::vector<double> {
auto latencies = std::vector<double> {};
latencies.reserve(iterations);
// Create pseudo-random access pattern
auto indices = std::vector<size_t>(iterations);
std::iota(indices.begin(), indices.end(), 0);
std::ranges::shuffle(indices, std::mt19937 {std::random_device {}()});
for (auto idx : indices | std::views::take(iterations)) {
auto address = (idx * 64) % (size_ - 64); // Ensure cache line alignment
auto start = std::chrono::high_resolution_clock::now();
// Force memory access with volatile to prevent optimization
volatile auto value = buffer_[address];
(void)value; // Prevent unused variable warning
auto end = std::chrono::high_resolution_clock::now();
auto elapsed = std::chrono::duration<double, std::nano> {end - start};
latencies.emplace_back(elapsed.count());
}
return latencies;
}
private:
std::unique_ptr<std::byte[]> buffer_;
size_t size_;
};
// Example usage
void analyze_memory_topology(size_t total_memory_mb) {
auto probe = MemoryAccessProbe {total_memory_mb};
auto latencies = probe.measure_random_access(1'000'000);
// Apply statistical analysis to detect clusters in latency distribution
// that might indicate NUMA boundaries
// ...
}
This code provides a foundation for detecting potential NUMA boundaries by measuring memory access latencies across a large memory space. In a true uniform memory access environment, we would expect a tight, unimodal distribution of latencies. In a hidden NUMA environment, however, we might observe a multi-modal distribution, with distinct peaks corresponding to different memory regions.
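One way to make that check concrete is a crude two-cluster split of the measured latencies. The sketch below (the largest-gap split and the 1.25 separation ratio are arbitrary assumptions, not a rigorous statistical test) flags distributions that look bimodal:

#include <algorithm>
#include <numeric>
#include <vector>

// Crude bimodality check: split the sorted latencies at the largest gap and
// compare the means of the two halves; a large ratio hints at a mix of local
// and remote accesses
bool looks_bimodal(std::vector<double> latencies, double separation = 1.25) {
    if (latencies.size() < 2) {
        return false;
    }
    std::ranges::sort(latencies);
    // Find the split point with the largest jump between adjacent samples
    size_t split = 1;
    double best_gap = 0.0;
    for (size_t i = 1; i < latencies.size(); ++i) {
        if (latencies[i] - latencies[i - 1] > best_gap) {
            best_gap = latencies[i] - latencies[i - 1];
            split = i;
        }
    }
    auto mean = [](auto first, auto last) {
        return std::accumulate(first, last, 0.0) / static_cast<double>(last - first);
    };
    auto low = mean(latencies.begin(), latencies.begin() + split);
    auto high = mean(latencies.begin() + split, latencies.end());
    return high > separation * low;
}

In practice one would use many samples and compare cluster quality across several candidate cluster counts, as discussed in section 9.2, but this captures the intuition.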
4. The Limitations of General-Purpose Allocators
While allocators like jemalloc employ sophisticated techniques such as thread-local caching and size-class segregation, they operate without knowledge of the underlying NUMA topology in virtualized environments. Their optimization strategies are based on general patterns of memory usage rather than specific hardware characteristics.
The performance gap becomes particularly evident in database engines, where memory access patterns are both predictable and critical to performance. Let's analyze why:
- Query Processing Locality: Database query execution typically works with related data structures (predicates, intermediate results, hash tables) that benefit from being co-located in memory.
- Object Lifecycle Correlation: Many database objects have synchronized lifecycles tied to transactions or query execution, making them ideal candidates for region-based allocation (a minimal arena sketch follows this list).
- Size Distribution Specialization: Database engines use predictable object sizes for specific components (record headers, index entries), allowing for highly optimized fixed-size allocators.
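As noted in the second point, region-based allocation pairs naturally with these lifecycles. Below is a minimal query-scoped arena sketch (the QueryArena name and the default block size are illustrative, not taken from any particular engine): everything allocated while a query runs is released in one shot when the arena is destroyed at end of query.

#include <cstddef>
#include <memory>
#include <vector>

// Minimal bump-pointer arena tied to a query's lifetime (sketch):
// allocations are O(1) pointer bumps and all memory is released together
// (oversized requests and error handling omitted)
class QueryArena {
public:
    explicit QueryArena(size_t block_size = 1 << 20) : block_size_(block_size) {}

    void* allocate(size_t size, size_t alignment = alignof(std::max_align_t)) {
        offset_ = (offset_ + alignment - 1) & ~(alignment - 1); // Align up
        if (blocks_.empty() || offset_ + size > block_size_) {
            blocks_.push_back(std::make_unique<std::byte[]>(block_size_));
            offset_ = 0;
        }
        void* p = blocks_.back().get() + offset_;
        offset_ += size;
        return p;
    }

    // Destroying the arena at end of query frees every allocation at once
private:
    size_t block_size_;
    size_t offset_ = 0;
    std::vector<std::unique_ptr<std::byte[]>> blocks_;
};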
5. Designing NUMA-Aware Memory Pools
5.1 Database-Specific Slab Allocator
To address these challenges, we can implement specialized memory pools that adapt to hidden NUMA topologies. Here's a C++23 implementation of a database-specific slab allocator optimized for common query execution objects:
// A specialized slab allocator for database query objects
class QueryObjectSlabAllocator {
public:
// Size classes optimized for common database operation objects
enum class SizeClass : uint8_t {
kTinyObjects = 0, // 64 bytes (predicate, scalar values)
kSmallObjects, // 256 bytes (tuple headers, small strings)
kMediumObjects, // 1024 bytes (intermediate buffers)
kLargeObjects, // 4096 bytes (hash table buckets)
kHugeObjects, // 16384 bytes (sort buffers, large intermediate results)
kSizeClassCount
};
// Initialize slabs for each thread and each size class. SlabInfo holds a
// std::mutex and cannot be moved, so the per-thread vector is sized at
// construction rather than resized afterwards.
QueryObjectSlabAllocator()
: thread_slabs_(std::thread::hardware_concurrency()) {
// Initialize size class information
// std::array cannot be indexed with an enum class directly, hence the casts
// 64-byte objects, 64KB slabs
size_class_info_[static_cast<size_t>(SizeClass::kTinyObjects)] = {64, 64 * 1024};
// 256-byte objects, 256KB slabs
size_class_info_[static_cast<size_t>(SizeClass::kSmallObjects)] = {256, 256 * 1024};
// 1KB objects, 1MB slabs
size_class_info_[static_cast<size_t>(SizeClass::kMediumObjects)] = {1024, 1024 * 1024};
// 4KB objects, 4MB slabs
size_class_info_[static_cast<size_t>(SizeClass::kLargeObjects)] = {4096, 4 * 1024 * 1024};
// 16KB objects, 16MB slabs
size_class_info_[static_cast<size_t>(SizeClass::kHugeObjects)] = {16384, 16 * 1024 * 1024};
}
// Allocate an object of the specified size class
void* allocate(SizeClass size_class) {
auto thread_id = get_thread_id_();
auto size_class_idx = static_cast<size_t>(size_class);
auto& slab = thread_slabs_[thread_id].slabs[size_class_idx];
// Try to allocate from the current slab
auto lock = std::scoped_lock {slab.mutex};
// Check if we need to allocate a new slab
if (slab.current_offset + size_class_info_[size_class_idx].object_size >
slab.current_slab_size || slab.current_slab == nullptr) {
allocate_new_slab_(thread_id, size_class);
}
// Allocate from the current slab
void* result = static_cast<char*>(slab.current_slab) + slab.current_offset;
slab.current_offset += size_class_info_[size_class_idx].object_size;
// Track locality statistics
++slab.allocations;
return result;
}
// Free an object (note: in a slab allocator, individual objects aren't
// immediately freed)
void deallocate(void* ptr, SizeClass size_class) {
auto thread_id = get_thread_id_();
auto size_class_idx = static_cast<size_t>(size_class);
auto& slab = thread_slabs_[thread_id].slabs[size_class_idx];
// In a real implementation, we might track used/free objects
// For simplicity, we're just counting deallocations
auto lock = std::scoped_lock {slab.mutex};
++slab.deallocations;
// If a slab becomes mostly empty, we might recycle it
maybe_reclaim_slab_(thread_id, size_class);
}
// Structure to hold statistics about slab usage (declared before
// get_slab_stats so it is visible in that function's return type)
struct SlabStats {
size_t thread_id;
SizeClass size_class;
size_t total_slabs;
size_t active_slabs;
size_t allocations;
size_t deallocations;
size_t foreign_accesses; // Accesses from other threads
};
// Get statistics about memory usage and locality
auto get_slab_stats() -> std::vector<SlabStats> {
auto stats = std::vector<SlabStats> {};
for (auto thread_id = 0uz; thread_id < thread_slabs_.size(); ++thread_id) {
for (auto sc = 0uz; sc < static_cast<size_t>(SizeClass::kSizeClassCount); ++sc) {
auto& slab = thread_slabs_[thread_id].slabs[sc];
auto lock = std::scoped_lock {slab.mutex};
stats.emplace_back(SlabStats {
.thread_id = thread_id,
.size_class = static_cast<SizeClass>(sc),
.total_slabs = slab.total_slabs,
.active_slabs = slab.active_slabs,
.allocations = slab.allocations,
.deallocations = slab.deallocations,
.foreign_accesses = slab.foreign_accesses
});
}
}
return stats;
}
private:
// Information about each size class
struct SizeClassInfo {
size_t object_size;
size_t slab_size;
};
// Per-thread, per-size-class slab information
struct SlabInfo {
void* current_slab = nullptr;
size_t current_slab_size = 0;
size_t current_offset = 0;
size_t total_slabs = 0;
size_t active_slabs = 0;
size_t allocations = 0;
size_t deallocations = 0;
size_t foreign_accesses = 0;
std::mutex mutex {};
std::vector<void*> all_slabs {}; // Keep track of all allocated slabs
};
// Per-thread slab collection
struct ThreadSlabs {
std::array<SlabInfo, static_cast<size_t>(SizeClass::kSizeClassCount)> slabs;
};
std::vector<ThreadSlabs> thread_slabs_;
std::array<SizeClassInfo, static_cast<size_t>(SizeClass::kSizeClassCount)> size_class_info_;
// Thread-local storage for quick thread ID lookup
static inline thread_local size_t cached_thread_id = std::numeric_limits<size_t>::max();
size_t get_thread_id_() const {
// Fast path if we already know our thread ID
if (cached_thread_id < thread_slabs_.size()) {
return cached_thread_id;
}
// Slow path: find our thread ID
auto hasher = std::hash<std::thread::id> {};
auto this_thread_id = hasher(std::this_thread::get_id());
cached_thread_id = this_thread_id % thread_slabs_.size();
return cached_thread_id;
}
void allocate_new_slab_(size_t thread_id, SizeClass size_class) {
auto size_class_idx = static_cast<size_t>(size_class);
auto& slab = thread_slabs_[thread_id].slabs[size_class_idx];
// Allocate memory for a new slab - using huge pages can improve locality
auto slab_size = size_class_info_[size_class_idx].slab_size;
auto new_slab = std::aligned_alloc(4096, slab_size); // Align to page size (error handling omitted)
// Update slab information
slab.current_slab = new_slab;
slab.current_slab_size = slab_size;
slab.current_offset = 0;
slab.all_slabs.emplace_back(new_slab);
++slab.total_slabs;
++slab.active_slabs;
}
void maybe_reclaim_slab_(size_t thread_id, SizeClass size_class) {
auto size_class_idx = static_cast<size_t>(size_class);
auto& slab = thread_slabs_[thread_id].slabs[size_class_idx];
// If we've deallocated most objects from this slab class, we might
// want to reclaim some memory
if (slab.deallocations > slab.allocations * 0.8 && slab.active_slabs > 1) {
// In a real implementation, we would identify empty slabs and free them
// For simplicity, we're just updating counters
--slab.active_slabs;
}
}
};
This implementation shows a specialized slab allocator with several key features:
- Size-Class Optimization: The allocator is tailored for common database object sizes, reducing internal fragmentation.
- Thread-Local Slabs: Each thread has its own set of slabs, maximizing memory locality even when the NUMA topology is hidden.
- Allocation Tracking: The allocator tracks various statistics that could be used to infer potential NUMA boundaries and adapt accordingly.
- Efficient Reuse: The slab design allows for efficient object reuse without the overhead of individual deallocations.
5.2 Mathematical Analysis of Locality in Slab Allocation
Let's formalize the locality benefits of this approach with a mathematical model. Given a system with $T$ threads potentially distributed across $N$ NUMA nodes, we can define:
- $P(t_i, n_j)$ as the probability that thread $t_i$ is running on a core in NUMA node $n_j$
- $P(m_k, n_j)$ as the probability that memory block $m_k$ is allocated in NUMA node $n_j$
For a thread-local allocator, when thread $t_i$ allocates memory, we want to maximize $P(m_k, n_j)$ given that the thread is running on a core in node $n_j$ (i.e., maximize the conditional probability $P(m_k, n_j | t_i \text{ on } n_j)$).
In an ideal scenario with perfect NUMA awareness, this conditional probability would be 1, meaning all memory is allocated locally to the running thread. In our thread-local slab allocator, we approximate this by ensuring that each thread allocates from its own dedicated memory region.
5.3 Huge Page Utilization for Database Memory Pools
Beyond basic NUMA awareness, database memory pools can benefit significantly from utilizing Huge Pages, even in virtualized environments. Huge Pages (typically 2MB or 1GB, compared to standard 4KB pages) provide two major benefits in the context of hidden NUMA topologies, discussed after the following sketch of a page-size-aware slab allocator:
// Enhanced slab allocator with Huge Page support
class HugePageSlabAllocator {
public:
enum class PageSize : uint8_t {
kStandard = 0, // 4KB pages
kHuge2MB, // 2MB huge pages (supported by most VMs)
kHuge1GB // 1GB huge pages (limited support)
};
explicit HugePageSlabAllocator(PageSize page_size = PageSize::kHuge2MB)
: page_size_(page_size) {
// Initialize based on selected page size
switch (page_size_) {
case PageSize::kHuge2MB:
page_size_bytes_ = 2 * 1024 * 1024;
break;
case PageSize::kHuge1GB:
page_size_bytes_ = 1024 * 1024 * 1024;
break;
default:
page_size_bytes_ = 4 * 1024;
break;
}
// Initialize thread-local slabs, aligned to the page size
initialize_slabs_();
}
void* allocate(size_t size) {
// Implementation details...
return nullptr;
}
private:
PageSize page_size_;
size_t page_size_bytes_;
size_t required_size_ = 0; // Total bytes to reserve for the pool, set before initialize_slabs_()
void initialize_slabs_() {
// Allocate memory using appropriate huge page method
if (page_size_ == PageSize::kStandard) {
// Standard 4KB page allocation
allocate_standard_pages_();
} else {
// Huge page allocation
allocate_huge_pages_();
}
}
void allocate_standard_pages_() {
// Plain 4KB-page allocation path (details omitted)
}
void allocate_huge_pages_() {
// Requires <sys/mman.h> on Linux; try mmap with the explicit MAP_HUGETLB flag first
void* memory = mmap(nullptr, required_size_, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
if (memory == MAP_FAILED) {
// Fallback to transparent huge pages if explicit allocation fails
memory = mmap(nullptr, required_size_, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
// Advise the kernel to use huge pages if possible
madvise(memory, required_size_, MADV_HUGEPAGE);
}
// Initialize memory regions
// ...
}
};
The benefits of using Huge Pages in database memory pools include:
- TLB Coverage: A single 2MB huge page entry covers 512 standard 4KB pages in the Translation Lookaside Buffer (TLB), dramatically reducing TLB misses during database operations that scan large memory regions. This is particularly important for analytical queries that process large datasets sequentially (a back-of-the-envelope TLB-reach calculation follows this list).
- NUMA Locality Improvement: Huge Pages are typically allocated from contiguous physical memory, increasing the likelihood that the entire page resides within the same NUMA node, even when the NUMA topology is hidden. This reduces the probability of accidental cross-node memory access.
- Performance Measurements: In virtualized database workloads, properly configured Huge Pages often yield 15-30% improvements in memory-intensive operations, with greater benefits for analytical queries than for transactional workloads.
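As a back-of-the-envelope illustration of the first point (the TLB size is a hypothetical figure; real capacities vary by microarchitecture), assume a data TLB with 1,536 entries. The working set reachable without TLB misses grows from
$$1536 \times 4\,\text{KB} = 6\,\text{MB} \quad \text{to} \quad 1536 \times 2\,\text{MB} = 3\,\text{GB}$$
when 2MB pages back the buffer pool, which is why scan-heavy operators benefit most.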
The implementation challenges include:
- VM Configuration Requirements: Many hypervisors require specific configuration to expose Huge Page capabilities to guests. Database systems should detect available Huge Page support at startup and adapt accordingly.
- Availability Monitoring: Database engines should monitor Huge Page availability at runtime and adapt allocation strategies:
#include <fstream>
#include <sstream>
#include <string>
// Linux-specific sketch: check /proc/meminfo to verify that huge pages of the
// requested size are configured and currently free. A fuller check would also
// consult /sys/kernel/mm/hugepages for non-default page sizes.
bool is_huge_page_available(size_t requested_huge_page_size) {
auto meminfo = std::ifstream {"/proc/meminfo"};
auto line = std::string {};
auto default_huge_page_kb = size_t {0};
auto free_huge_pages = size_t {0};
while (std::getline(meminfo, line)) {
auto ss = std::stringstream {line};
auto key = std::string {};
ss >> key;
if (key == "Hugepagesize:") {
ss >> default_huge_page_kb; // Reported in kB
} else if (key == "HugePages_Free:") {
ss >> free_huge_pages;
}
}
// Available only if the default huge page size matches the request and at
// least one such page is currently free
return default_huge_page_kb * 1024 == requested_huge_page_size && free_huge_pages > 0;
}
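For example, an engine might probe for huge page support once at startup and degrade gracefully. This short usage sketch simply reuses the HugePageSlabAllocator and is_huge_page_available sketches above:

// Choose the pool's page size once at startup, falling back when the
// requested huge page size is not configured in the guest
auto select_page_size() -> HugePageSlabAllocator::PageSize {
    if (is_huge_page_available(1024ull * 1024 * 1024)) {
        return HugePageSlabAllocator::PageSize::kHuge1GB;
    }
    if (is_huge_page_available(2 * 1024 * 1024)) {
        return HugePageSlabAllocator::PageSize::kHuge2MB;
    }
    return HugePageSlabAllocator::PageSize::kStandard;
}

auto pool = HugePageSlabAllocator {select_page_size()};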
5.4 Cross-Thread Memory Management Challenges
While thread-local slabs minimize contention, several real-world database scenarios challenge the ideal model of perfect thread locality:
// Enhanced slab allocator with cross-thread handling
class ThreadAwareSlabAllocator {
public:
void* allocate(size_t size) {
auto thread_id = get_thread_id_();
auto size_class = get_size_class_(size);
// Try thread-local allocation first (fast path)
void* result = thread_local_pools_[thread_id].try_allocate(size_class);
if (result != nullptr) {
return result;
}
// If local pool is exhausted, handle gracefully
return handle_thread_local_pool_exhaustion_(thread_id, size_class, size);
}
void deallocate(void* ptr, size_t size) {
auto owner_thread_id = find_owner_thread_(ptr);
auto current_thread_id = get_thread_id_();
// Fast path: Deallocating memory owned by current thread
if (owner_thread_id == current_thread_id) {
thread_local_pools_[current_thread_id].deallocate(ptr, size);
return;
}
// Cross-thread deallocation case
handle_cross_thread_deallocation_(ptr, size, owner_thread_id);
}
private:
std::vector<ThreadLocalPool> thread_local_pools_;
std::mutex cross_thread_mutex_;
std::unordered_map<size_t, std::vector<void*>> deferred_deallocations_;
CrossThreadStats cross_thread_stats_; // Counter struct shown later in this section
void handle_cross_thread_deallocation_(void* ptr, size_t size,
size_t owner_thread_id) {
// Cross-thread deallocation performance metrics
cross_thread_stats_.increment_count();
// Option 1: Queue for later processing by owner thread
auto lock = std::scoped_lock {cross_thread_mutex_};
deferred_deallocations_[owner_thread_id].push_back(ptr);
// Option 2: Handle immediately with higher contention risk
// thread_local_pools_[owner_thread_id].deallocate_from_other_thread(ptr, size);
}
// Periodically called by each thread to reclaim its memory
void process_deferred_deallocations_(size_t thread_id) {
auto pending = std::vector<void*> {};
{
auto lock = std::scoped_lock {cross_thread_mutex_};
pending = std::move(deferred_deallocations_[thread_id]);
deferred_deallocations_[thread_id].clear();
}
// Process pending deallocations
for (void* ptr : pending) {
thread_local_pools_[thread_id].deallocate(ptr, 0); // Size needs tracking
}
// Potentially reclaim memory if needed
thread_local_pools_[thread_id].maybe_reclaim_memory_();
}
};
Key scenarios where thread-local allocation models break down in database systems:
- Producer-Consumer Patterns: Common in database engines where one thread produces objects (e.g., query parser) that are consumed by another (query executor). Here memory may be allocated by one thread but freed by another.
- Work Stealing: Dynamic task distribution systems frequently move memory objects between threads for load balancing, particularly during complex join or aggregation operations.
- Thread Pool Rebalancing: Long-running database instances often adjust thread allocations based on workload, causing memory ownership migrations when thread counts are adjusted up or down.
Strategies to mitigate these challenges:
- Ownership Transfer: Explicitly transfer object ownership between threads rather than allowing cross-thread deallocations.
- Deferred Processing: Queue cross-thread deallocations for batch processing by owner threads, reducing lock contention.
- Performance Accounting: Track cross-thread operations to identify suboptimal data sharing patterns:
struct CrossThreadStats {
std::atomic<uint64_t> deallocation_count {0};
std::atomic<uint64_t> ownership_transfers {0};
std::atomic<uint64_t> contention_events {0};
void increment_count() {
deallocation_count.fetch_add(1, std::memory_order_relaxed);
}
std::string get_performance_report() {
// Generate performance report with suggestions for optimization
auto ss = std::stringstream {};
ss << "Cross-thread deallocations: " << deallocation_count.load()
<< "\nOwnership transfers: " << ownership_transfers.load()
<< "\nContention events: " << contention_events.load();
// Add recommendations based on thresholds
if (deallocation_count.load() > 10000) {
ss << "\n\nRecommendation: Consider adjusting threading model to reduce "
<< "cross-thread object sharing.";
}
return ss.str();
}
};
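To show how the deferred-processing strategy might be wired into a worker, here is a usage sketch. It assumes a public drain method (process_deferred_deallocations) exposing the logic of process_deferred_deallocations_ above, plus a caller-supplied work callback; the 10 ms cadence is arbitrary.

#include <chrono>
#include <cstddef>
#include <functional>
#include <stop_token>

// Interleave query work with reclaiming memory that other threads freed on
// this worker's behalf (sketch; see the assumptions noted above)
void worker_loop(ThreadAwareSlabAllocator& allocator, size_t my_thread_index,
                 std::function<void()> run_one_task, std::stop_token stop) {
    auto last_drain = std::chrono::steady_clock::now();
    while (!stop.stop_requested()) {
        run_one_task(); // Execute one unit of query work
        auto now = std::chrono::steady_clock::now();
        if (now - last_drain > std::chrono::milliseconds {10}) {
            // Batch-reclaim deferred cross-thread deallocations
            allocator.process_deferred_deallocations(my_thread_index);
            last_drain = now;
        }
    }
}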
6. Memory Access Patterns in Database Engines
Database engines present unique memory access patterns that can be particularly sensitive to NUMA effects. Let's examine a typical query execution scenario:
// Simplified database query execution context
class QueryExecutor {
public:
auto execute_query(const std::string& query) -> ResultSet {
// Phase 1: Parse and prepare the query
auto plan = query_planner_.create_plan(query);
// Phase 2: Allocate execution resources
// This is where NUMA-aware allocation becomes critical
auto hash_tables = allocate_hash_tables_(plan);
auto intermediate_buffers = allocate_intermediate_buffers_(plan);
// Phase 3: Execute the query
for (const auto& op : plan.operations()) {
execute_operation_(op, hash_tables, intermediate_buffers);
}
// Phase 4: Gather and return results
return gather_results_(intermediate_buffers);
}
private:
QueryPlanner query_planner_;
QueryObjectSlabAllocator object_pool_;
// NUMA-aware hash table allocation
auto allocate_hash_tables_(const QueryPlan& plan) -> std::vector<HashTable> {
auto tables = std::vector<HashTable> {};
tables.reserve(plan.required_hash_tables());
for (const auto& table_spec : plan.hash_table_specs()) {
// Allocate hash table buckets from the slab allocator
tables.emplace_back(table_spec,
object_pool_.allocate(QueryObjectSlabAllocator::SizeClass::kLargeObjects));
}
return tables;
}
};
In this scenario, the performance impact of NUMA boundaries can be particularly severe during the hash join operations, where large hash tables are built and probed multiple times. If the hash tables span NUMA boundaries, the random access pattern of hash table probes will regularly incur the latency penalty of remote memory access.
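One common mitigation, sketched below under simplifying assumptions (the Row and HashPartition types are illustrative, and each partition's backing storage is presumed to come from the building worker's thread-local slab), is to partition the join hash table by key hash so that each worker builds and later probes only the partition it allocated itself:

#include <cstddef>
#include <cstdint>
#include <functional>
#include <span>
#include <utility>
#include <vector>

// Minimal row type for illustration
struct Row {
    uint64_t join_key;
};

// Each worker owns one partition; its backing storage would come from that
// worker's thread-local slab in the allocator sketched earlier
struct HashPartition {
    std::vector<std::pair<uint64_t, const Row*>> entries;
};

// Route build-side rows by key hash so that every worker probes only the
// partition it built, keeping the random probe traffic on locally allocated memory
void build_partition(std::span<const Row> rows, HashPartition& my_partition,
                     size_t worker_id, size_t num_workers) {
    for (const auto& row : rows) {
        auto h = std::hash<uint64_t> {}(row.join_key);
        if (h % num_workers == worker_id) {
            my_partition.entries.emplace_back(h, &row);
        }
    }
}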
7. Mathematical Analysis of Memory Access Latency
7.1 Formal Latency Model for Virtualized NUMA Systems
Let's develop a more mathematical model for memory access latency in virtualized environments with hidden NUMA topologies.
Let $\Omega$ be a set of physical NUMA nodes in the host system, where $|\Omega| = N$. Each node $\omega \in \Omega$ has a set of cores $C_\omega$ and a set of memory pages $M_\omega$.
Define a memory access function $A: C \times M \rightarrow \mathbb{R}^+$ that maps a core-memory pair to an access latency, where $C = \cup_{\omega \in \Omega} C_\omega$ is the set of all cores and $M = \cup_{\omega \in \Omega} M_\omega$ is the set of all memory pages.
For cores $c \in C_\omega$ and memory pages $m \in M_\omega$ in the same NUMA node, we have:
$$A(c, m) = L_{local} + \varepsilon_{c,m}$$
where $L_{local}$ is the baseline local access latency and $\varepsilon_{c,m}$ is a small random variable representing noise and contention.
For cores $c \in C_{\omega_i}$ and memory pages $m \in M_{\omega_j}$ in different NUMA nodes, we have:
$$A(c, m) = L_{local} + L_{remote}(\omega_i, \omega_j) + \varepsilon_{c,m}$$
where $L_{remote}(\omega_i, \omega_j)$ is the additional latency for remote access between nodes $\omega_i$ and $\omega_j$.
Now, let's define a virtualization mapping function $V: C_{vm} \times M_{vm} \rightarrow C \times M$ that maps virtual cores and memory pages to physical cores and memory pages, where $C_{vm}$ is the set of virtual cores and $M_{vm}$ is the set of virtual memory pages.
The observed latency for a virtual memory access is then:
$$A_{vm}(c_{vm}, m_{vm}) = A(V(c_{vm}, m_{vm}))$$
However, the guest OS doesn't know $V$, which is the source of the "hidden NUMA" problem.
7.2 Statistical Analysis of Latency Distribution
Given this model, we can analyze the distribution of memory access latencies as observed from within the VM. Let's define a random variable $X$ representing the latency of a random memory access from a specific virtual core $c_{vm}$ to a random virtual memory page $m_{vm}$.
The probability density function of $X$ is:
$$\begin{aligned} f_X(x) = & \sum_{\omega \in \Omega} P(V(c_{vm}, m_{vm}) \in C_\omega \times M_\omega) \cdot f_{local}(x) \\ & + \sum_{\omega_i, \omega_j \in \Omega, \omega_i \neq \omega_j} P(V(c_{vm}, m_{vm}) \in C_{\omega_i} \times M_{\omega_j}) \cdot f_{remote}(x, \omega_i, \omega_j) \end{aligned}$$
where $f_{local}(x)$ is the probability density function of local access latencies and $f_{remote}(x, \omega_i, \omega_j)$ is the probability density function of remote access latencies between nodes $\omega_i$ and $\omega_j$.
For a uniform random distribution of memory accesses across all virtual memory, the expected latency is:
$$\begin{gathered} E[X] = \sum_{\omega \in \Omega} P(V(c_{vm}) \in C_\omega) \cdot S_\omega \\ \text{where } S_\omega = S_{\omega,1} + S_{\omega,2} \\ S_{\omega,1} = P(V(m_{vm}) \in M_\omega) \cdot L_{local} \\ S_{\omega,2} = \sum_{\omega' \in \Omega, \omega' \neq \omega} P(V(m_{vm}) \in M_{\omega'}) \cdot (L_{local} + L_{remote}(\omega, \omega')) \end{gathered}$$
where $V(c_{vm})$ and $V(m_{vm})$ are the physical core and memory page mappings, respectively.
With thread-local allocation strategies, we aim to maximize $P(V(m_{vm}) \in M_\omega | V(c_{vm}) \in C_\omega)$, which reduces the expected latency toward $L_{local}$.
8. Memory Pool Performance Analysis
We can model the expected performance gain from NUMA-aware object pools by considering the probability distribution of memory accesses across NUMA nodes.
Let's define:
- $p_{local}$ as the probability that an object is accessed by the same thread that allocated it
- $p_{same\_node}$ as the probability that an object is accessed by a thread on the same NUMA node
- $p_{remote}$ as the probability that an object is accessed by a thread on a different NUMA node
For a database workload, we typically observe that $p_{local} \gg p_{same\_node} > p_{remote}$, meaning most accesses are local to the allocating thread.
With a thread-local-first strategy, the expected memory access latency becomes:
$$E[L] = p_{local} \cdot L_{local} + p_{same\_node} \cdot L_{same\_node} + p_{remote} \cdot L_{remote}$$
Compared to a naive allocator, which might have equal probabilities of allocating from any node, this represents a significant performance improvement. The magnitude of this improvement depends on the specific workload characteristics and the underlying hardware topology.
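As an illustrative calculation (all figures below are assumed, not measured): take $L_{local} = 80$ ns, $L_{same\_node} = 90$ ns, and $L_{remote} = 160$ ns on a two-node host. A thread-local-first allocator achieving $p_{local} = 0.9$, $p_{same\_node} = 0.05$, and $p_{remote} = 0.05$ gives
$$E[L] = 0.9 \cdot 80 + 0.05 \cdot 90 + 0.05 \cdot 160 = 84.5 \text{ ns},$$
whereas a placement-oblivious allocator that lands half of all accesses on the remote node averages roughly $0.5 \cdot 80 + 0.5 \cdot 160 = 120$ ns, a difference of about 30% in mean access latency.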
9. Detecting NUMA Effects in Virtual Environments
While we cannot directly observe the NUMA topology in many virtualized environments, we can infer its presence through careful performance analysis. One approach is to measure memory access latencies across different regions of memory:
void detect_numa_effects() {
constexpr size_t kMemorySize = 16ULL * 1024 * 1024 * 1024; // 16 GB
constexpr size_t kChunkSize = 64 * 1024 * 1024; // 64 MB chunks
constexpr size_t kNumChunks = kMemorySize / kChunkSize;
// Allocate a large memory region
auto memory = std::make_unique<std::byte[]>(kMemorySize);
// Create thread-local buffers on each hardware thread
auto num_threads = std::thread::hardware_concurrency();
auto threads = std::vector<std::thread> {};
auto latencies = std::vector<std::vector<double>>(num_threads);
for (auto thread_id = 0uz; thread_id < num_threads; ++thread_id) {
threads.emplace_back([thread_id, &memory, &latencies, kChunkSize, kNumChunks]() {
// Pin this thread to a specific hardware thread
pin_thread_to_core(thread_id);
auto thread_latencies = std::vector<double> {};
thread_latencies.reserve(kNumChunks);
// Measure access time to each memory chunk
for (auto chunk = 0uz; chunk < kNumChunks; ++chunk) {
auto start = std::chrono::high_resolution_clock::now();
// Access memory with a strided pattern to defeat prefetching
for (auto offset = 0uz; offset < kChunkSize; offset += 4096) {
volatile std::byte value = memory[chunk * kChunkSize + offset];
(void)value; // Prevent unused variable warning
}
auto end = std::chrono::high_resolution_clock::now();
auto elapsed = std::chrono::duration<double, std::micro> {end - start};
thread_latencies.push_back(elapsed.count());
}
latencies[thread_id] = std::move(thread_latencies);
});
}
// Wait for threads to complete
for (auto& thread : threads) {
thread.join();
}
// Analyze latencies to detect potential NUMA boundaries
analyze_latencies(latencies);
}
By analyzing the resulting latency measurements, we can identify patterns that suggest the presence of NUMA boundaries, such as clusters of memory regions with similar access times from specific threads.
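The analyze_latencies helper is left undefined above; a minimal sketch (the 1.3× threshold is an arbitrary assumption) might count, per thread, how many chunks are markedly slower than that thread's median chunk:

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

// For each thread, count chunks whose access time is markedly above that
// thread's median; a consistent fast/slow split across threads hints at a
// hidden NUMA boundary (sketch only)
void analyze_latencies(const std::vector<std::vector<double>>& latencies) {
    for (size_t thread_id = 0; thread_id < latencies.size(); ++thread_id) {
        auto sorted = latencies[thread_id];
        if (sorted.empty()) {
            continue;
        }
        std::ranges::sort(sorted);
        auto median = sorted[sorted.size() / 2];
        auto slow_chunks = std::ranges::count_if(
            latencies[thread_id],
            [&](double t) { return t > 1.3 * median; });
        std::cout << "thread " << thread_id << ": " << slow_chunks << " of "
                  << latencies[thread_id].size() << " chunks look remote\n";
    }
}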
9.1 Thread Affinity Limitations in Virtual Environments
The above code tries to pin threads to specific cores using pin_thread_to_core(), but this approach has significant limitations in virtualized environments:
#include <sched.h>
#include <pthread.h>
#include <iostream>
// Attempt to pin a thread to a specific core in a VM
bool pin_thread_to_core(int32_t core_id) {
auto cpuset = cpu_set_t {};
CPU_ZERO(&cpuset);
CPU_SET(core_id, &cpuset);
// Apply the CPU affinity mask to the current thread
auto result = pthread_setaffinity_np(pthread_self(),
sizeof(cpu_set_t), &cpuset);
if (result != 0) {
std::cerr << "Error setting thread affinity: "
<< result << std::endl;
return false;
}
return true;
}
This approach has several important caveats in virtualized environments:
- Hypervisor Scheduling Dynamics: Even when pthread_setaffinity_np() successfully sets the affinity mask within the guest OS, the hypervisor frequently remaps vCPUs to different physical cores, particularly during load balancing operations. This remapping can occur every few milliseconds, negating the intended locality benefits.
- vCPU Migration: The hypervisor prioritizes overall host utilization over guest affinity preferences, often causing vCPUs to move between physical NUMA domains without the guest's knowledge.
- Effectiveness Measurement: In practice, thread affinity in VMs typically provides only 20-40% of the benefit observed on bare metal systems, depending on the hypervisor's scheduling policies and host load conditions.
A more reliable approach combines affinity hints with runtime monitoring:
class AdaptiveThreadManager {
public:
AdaptiveThreadManager() : check_interval_ms_(100) {
monitoring_thread_ = std::thread {&AdaptiveThreadManager::monitor_migrations_, this};
}
~AdaptiveThreadManager() {
shutdown_ = true;
if (monitoring_thread_.joinable()) {
monitoring_thread_.join();
}
}
// Request affinity but track if it's being respected
void request_thread_affinity(std::thread::id thread_id, int32_t preferred_core) {
auto lock = std::scoped_lock {mutex_};
thread_preferences_[thread_id] = preferred_core;
apply_affinity_mask_(thread_id, preferred_core);
}
private:
std::mutex mutex_;
std::atomic<bool> shutdown_ {false};
std::thread monitoring_thread_;
std::unordered_map<std::thread::id, int32_t> thread_preferences_;
std::unordered_map<std::thread::id, int32_t> thread_actual_cores_;
int32_t check_interval_ms_;
void monitor_migrations_() {
while (!shutdown_) {
// Check if threads are where we expect them to be
std::this_thread::sleep_for(std::chrono::milliseconds {check_interval_ms_});
check_thread_placements_();
}
}
void check_thread_placements_() {
auto lock = std::scoped_lock {mutex_};
for (const auto& [thread_id, preferred_core] : thread_preferences_) {
auto current_core = get_current_core_for_thread_(thread_id);
if (current_core != preferred_core) {
// Thread has been migrated by the hypervisor
thread_actual_cores_[thread_id] = current_core;
// Potentially update memory allocation strategies based on observed location
handle_thread_migration_(thread_id, preferred_core, current_core);
// Optionally, attempt to reapply affinity
apply_affinity_mask_(thread_id, preferred_core);
}
}
}
};
This adaptive approach acknowledges the reality of virtualized environments and adjusts memory allocation strategies based on observed thread location rather than assuming thread affinity requests will be honored.
9.2 Real-Time Adaptation to Hidden NUMA Topologies
Building on the detection mechanisms, a truly robust database memory system should continuously adapt to the hidden topology:
class AdaptiveNumaAllocator {
public:
AdaptiveNumaAllocator() : adaptation_interval_ms_(5000) {
// Start with a naive allocation model
current_topology_model_ = NumaTopologyModel::kUnknown;
// Start adaptation thread
adaptation_thread_ = std::thread {&AdaptiveNumaAllocator::adaptation_loop_, this};
}
~AdaptiveNumaAllocator() {
shutdown_ = true;
if (adaptation_thread_.joinable()) {
adaptation_thread_.join();
}
}
private:
enum class NumaTopologyModel {
kUnknown,
kSingleNode,
kTwoNodeSymmetric,
kTwoNodeAsymmetric,
kMultiNode
};
std::atomic<NumaTopologyModel> current_topology_model_;
std::atomic<bool> shutdown_ {false};
std::thread adaptation_thread_;
int32_t adaptation_interval_ms_;
// Memory access latency data
struct LatencyMeasurement {
std::vector<std::vector<double>> thread_to_region_latency;
};
LatencyMeasurement latest_measurement_;
void adaptation_loop_() {
while (!shutdown_) {
// Sleep for the adaptation interval
std::this_thread::sleep_for(
std::chrono::milliseconds {adaptation_interval_ms_});
// Measure memory access latencies
latest_measurement_ = measure_memory_access_latencies_();
// Analyze measurements to infer NUMA topology
auto inferred_topology = infer_numa_topology_(latest_measurement_);
// Update allocation strategy if topology model has changed
if (inferred_topology != current_topology_model_) {
update_allocation_strategy_(inferred_topology);
current_topology_model_ = inferred_topology;
}
}
}
auto infer_numa_topology_(const LatencyMeasurement& measurement)
-> NumaTopologyModel {
// Apply clustering algorithm to identify latency patterns
auto clusters = identify_latency_clusters(measurement);
// Analyze cluster patterns to infer topology
if (clusters.size() <= 1) {
return NumaTopologyModel::kSingleNode;
} else if (clusters.size() == 2) {
// Check if the clusters are roughly symmetric
return is_symmetric_(clusters) ?
NumaTopologyModel::kTwoNodeSymmetric :
NumaTopologyModel::kTwoNodeAsymmetric;
} else {
return NumaTopologyModel::kMultiNode;
}
}
void update_allocation_strategy_(NumaTopologyModel topology) {
switch (topology) {
case NumaTopologyModel::kSingleNode:
// Optimize for single-node access patterns
apply_single_node_strategy_();
break;
case NumaTopologyModel::kTwoNodeSymmetric:
// Apply dual-node optimization with balanced resource allocation
apply_symmetric_two_node_strategy_();
break;
case NumaTopologyModel::kTwoNodeAsymmetric:
// Favor the node with better average access characteristics
apply_asymmetric_two_node_strategy_();
break;
case NumaTopologyModel::kMultiNode:
// Complex multi-node strategy
apply_multi_node_strategy_();
break;
default:
// Conservative strategy when topology is unknown
apply_conservative_strategy_();
break;
}
}
};
Key aspects of this adaptive approach include:
- Statistical Detection: Using statistical analysis to identify memory access patterns that suggest underlying NUMA boundaries:
auto identify_latency_clusters(const LatencyMeasurement& measurement)
-> std::vector<LatencyCluster> {
// Apply k-means clustering to memory access latencies
auto all_latencies = std::vector<double> {};
for (const auto& thread_latencies : measurement.thread_to_region_latency) {
all_latencies.insert(all_latencies.end(),
thread_latencies.begin(),
thread_latencies.end());
}
// Initial number of clusters to try (can be adaptive)
auto k_clusters = 2;
// Apply k-means algorithm
auto clusters = k_means_clustering(all_latencies, k_clusters);
// Validate clusters using silhouette coefficient or similar metric
auto quality = evaluate_cluster_quality(clusters, all_latencies);
// If quality is poor, try different numbers of clusters
if (quality < 0.5) {
for (auto k = 3; k <= 4; ++k) {
auto new_clusters = k_means_clustering(all_latencies, k);
auto new_quality = evaluate_cluster_quality(new_clusters, all_latencies);
if (new_quality > quality) {
quality = new_quality;
clusters = new_clusters;
}
}
}
return clusters;
}
- Temporal Stability Analysis: Tracking how thread-to-memory affinities change over time to detect hypervisor scheduling patterns and adapt allocation strategies accordingly.
- Dynamic Thread Grouping: As the system detects which threads tend to be scheduled together on the same physical NUMA node (even if the VM is unaware of this), it can dynamically adjust memory allocation policies to maximize locality based on these observed patterns (a correlation-based grouping sketch follows this list).
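A minimal sketch of such grouping, assuming each thread carries a per-region latency profile like the one gathered in detect_numa_effects() (the 0.8 correlation threshold is arbitrary): threads whose profiles are strongly correlated are treated as co-located and can share an allocation policy.

#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation between two equally sized latency profiles
double profile_correlation(const std::vector<double>& a, const std::vector<double>& b) {
    auto n = static_cast<double>(a.size());
    double sa = 0, sb = 0, saa = 0, sbb = 0, sab = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        sa += a[i]; sb += b[i];
        saa += a[i] * a[i]; sbb += b[i] * b[i]; sab += a[i] * b[i];
    }
    auto cov = sab / n - (sa / n) * (sb / n);
    auto var_a = saa / n - (sa / n) * (sa / n);
    auto var_b = sbb / n - (sb / n) * (sb / n);
    return cov / std::sqrt(var_a * var_b + 1e-12);
}

// Greedily group threads whose latency profiles correlate strongly; each
// group is then treated as one inferred NUMA domain for allocation policy
std::vector<std::vector<size_t>> group_threads(
    const std::vector<std::vector<double>>& profiles, double threshold = 0.8) {
    std::vector<std::vector<size_t>> groups;
    for (size_t t = 0; t < profiles.size(); ++t) {
        bool placed = false;
        for (auto& group : groups) {
            if (profile_correlation(profiles[t], profiles[group.front()]) > threshold) {
                group.push_back(t);
                placed = true;
                break;
            }
        }
        if (!placed) {
            groups.push_back({t});
        }
    }
    return groups;
}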
10. Practical Implementation Considerations
When implementing specialized memory pools for database engines in virtualized environments, several practical considerations emerge:
- Memory Fragmentation: Custom pools must carefully manage fragmentation, especially for long-running database instances.
- Cache Line Optimization: Aligning allocations to cache line boundaries (typically 64 bytes) can significantly reduce false sharing.
- Allocation Size Distribution: Analyzing the distribution of allocation sizes allows for optimizing the most common cases.
- Thread Affinity: Even without explicit NUMA knowledge, maintaining consistent thread-to-core affinity can improve locality.
Let's examine a practical implementation of a size-optimized object pool:
// A memory pool optimized for common database object sizes
class DatabaseObjectPool {
public:
DatabaseObjectPool() {
// Initialize pools for common database object sizes
// These sizes are derived from empirical analysis of allocation patterns
pools_[0] = std::make_unique<FixedSizePool>(64, 10000); // Small objects (e.g., scalar values)
pools_[1] = std::make_unique<FixedSizePool>(128, 5000); // Medium objects (e.g., tuple headers)
pools_[2] = std::make_unique<FixedSizePool>(256, 2500); // Large objects (e.g., string buffers)
pools_[3] = std::make_unique<FixedSizePool>(1024, 1000); // Extra large (e.g., complex predicates)
pools_[4] = std::make_unique<FixedSizePool>(4096, 500); // Query context objects
pools_[5] = std::make_unique<FixedSizePool>(16384, 100); // Hash table chunks
}
void* allocate(size_t size) {
// Find the appropriate pool based on size
auto pool_index = get_pool_index_(size);
if (pool_index < kNumPools) {
return pools_[pool_index]->allocate();
}
// Fall back to standard allocation for unusual sizes
return ::operator new(size);
}
void deallocate(void* ptr, size_t size) {
auto pool_index = get_pool_index_(size);
if (pool_index < kNumPools) {
pools_[pool_index]->deallocate(ptr);
} else {
::operator delete(ptr);
}
}
private:
static constexpr size_t kNumPools = 6;
std::array<std::unique_ptr<FixedSizePool>, kNumPools> pools_;
size_t get_pool_index_(size_t size) const {
if (size <= 64) {
return 0;
}
if (size <= 128) {
return 1;
}
if (size <= 256) {
return 2;
}
if (size <= 1024) {
return 3;
}
if (size <= 4096) {
return 4;
}
if (size <= 16384) {
return 5;
}
return kNumPools; // Indicates no suitable pool
}
};
This implementation provides optimized allocation for the most common object sizes in database engines, reducing fragmentation and improving locality by keeping similar-sized objects together.
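The FixedSizePool type referenced above is not shown; a minimal free-list sketch consistent with that usage (single-threaded, with growth and error handling omitted) could look like this:

#include <cstddef>
#include <memory>
#include <vector>

// Minimal fixed-size pool: one contiguous arena carved into equally sized
// slots, with a LIFO free list of recycled slots (sketch, not thread-safe)
class FixedSizePool {
public:
    FixedSizePool(size_t object_size, size_t capacity)
        : object_size_(object_size),
          arena_(std::make_unique<std::byte[]>(object_size * capacity)),
          capacity_(capacity) {}

    void* allocate() {
        if (!free_list_.empty()) {
            void* slot = free_list_.back();
            free_list_.pop_back();
            return slot;
        }
        if (next_slot_ < capacity_) {
            return arena_.get() + (next_slot_++) * object_size_;
        }
        return nullptr; // Pool exhausted; a fuller version would grow the arena
    }

    void deallocate(void* ptr) { free_list_.push_back(ptr); }

private:
    size_t object_size_;
    std::unique_ptr<std::byte[]> arena_;
    size_t capacity_;
    size_t next_slot_ = 0;
    std::vector<void*> free_list_;
};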
11. Conclusion
The challenge of optimizing database performance in virtualized environments with hidden NUMA topologies requires a multi-faceted approach that combines theoretical understanding with practical engineering. While general-purpose allocators like jemalloc provide a solid foundation, specialized memory pools that adapt to the specific characteristics of database workloads and hidden hardware topologies can deliver substantial performance improvements.
The key insights from our analysis include:
- Thread Locality is Critical: Even without explicit NUMA knowledge, designing allocation strategies around thread locality can significantly improve performance. However, as our analysis of thread affinity limitations in VMs demonstrates, we must account for the reality that hypervisors frequently rebalance vCPUs across physical cores.
- Workload-Specific Optimization: Understanding the specific memory access patterns of database operations allows for targeted optimization, particularly when combined with Huge Page utilization to reduce TLB misses.
- Adaptive Approaches: Self-tuning memory management systems can detect and adapt to underlying hardware characteristics even when they're obscured by virtualization. Real-time adaptation through statistical analysis of memory access patterns provides a robust approach when static configuration is insufficient.
- Size-Class Specialization: Optimizing for common allocation sizes reduces fragmentation and improves memory locality, while carefully handling cross-thread memory management avoids excessive lock contention.
- Huge Page Integration: Incorporating Huge Pages into database memory pools combines NUMA locality benefits with TLB coverage improvements, providing multiplicative performance gains for analytical workloads.
As cloud infrastructure continues to evolve, the gap between logical abstraction and physical reality is likely to persist. Database engine developers who understand and address this gap through sophisticated memory management strategies will be better positioned to deliver optimal performance across diverse deployment environments.
By combining thread-local allocation strategies, workload-specific memory pools, Huge Page utilization, cross-thread deallocation optimization, and runtime adaptation techniques, database engines can thrive even in the challenging landscape of virtualized NUMA systems, where the physical topology remains hidden behind layers of abstraction.