Stencil-Based Just-In-Time Compilation for SQL Expression Evaluation: A Copy-and-Patch Approach
Abstract
Just-in-time compilation offers significant performance improvements for database query execution by eliminating interpreter overhead. However, traditional JIT frameworks such as LLVM introduce compilation latencies measured in milliseconds, which fail to amortize for typical OLTP workloads. This article presents a practical implementation of stencil-based JIT compilation for SQL expression evaluation, applying the Copy-and-Patch technique to database query processing. Our approach achieves compilation times of 3.62 microseconds while delivering throughput improvements of 10.4× for aggregate functions and 2.4× for predicate evaluation compared to interpreted execution. We describe the architectural considerations for both x86-64 and ARM64 platforms, present a cost model for compilation decisions, and provide detailed performance analysis across representative SQL workloads.
1. Introduction
The execution of SQL expressions represents a critical performance path in database systems. Each row processed through a WHERE clause or aggregate function incurs evaluation overhead, and for queries processing millions of rows, this overhead accumulates substantially. Query compilation—translating SQL expressions into native machine code—has emerged as an effective optimization technique, with systems such as HyPer [1], Apache Impala, and PostgreSQL incorporating JIT compilation for performance-critical code paths.
The dominant approach to query JIT compilation employs LLVM as the code generation backend. While LLVM provides sophisticated optimization passes and broad platform support, it introduces significant compilation latency. Neumann [1] reported compilation times of 10-100 milliseconds for complex queries, which proves acceptable for analytical workloads processing billions of rows but becomes prohibitive for transactional queries touching only thousands of rows.
This latency motivated our investigation of alternative compilation strategies. During the development of a PostgreSQL-compatible SQL engine for a document-oriented database—implementing the PostgreSQL wire protocol for client compatibility while storing flexible JSON documents—we encountered precisely this tension between compilation benefit and compilation cost. Prior experience with LLVM-based database JIT confirmed that traditional approaches would not satisfy our requirements for sub-millisecond query response times.
The Copy-and-Patch compilation technique, introduced by Xu and Kjolstad [2] at OOPSLA 2021, offered a compelling alternative. Their key insight was that many workloads do not require sophisticated optimization; they require fast composition of pre-assembled code fragments. By eliminating the compilation pipeline entirely and replacing it with memory copies and address patching, Copy-and-Patch achieves compilation speeds orders of magnitude faster than traditional JIT while maintaining competitive execution performance.
This article makes the following contributions. First, we describe a complete implementation of Copy-and-Patch JIT for SQL expression evaluation, including support for arithmetic operations, comparisons, boolean logic, type conversions, and field access. Second, we present detailed stencil implementations for both x86-64 and ARM64 architectures, analyzing the trade-offs inherent in each platform's instruction encoding. Third, we develop a cost model for determining when JIT compilation provides net benefit over interpretation. Fourth, we provide comprehensive performance measurements demonstrating the effectiveness of this approach for database workloads.
The remainder of this article is organized as follows. Section 2 reviews related work in query compilation. Section 3 presents the stencil-based architecture. Section 4 details platform-specific implementations. Section 5 describes the cost model. Section 6 presents experimental evaluation. Section 7 discusses limitations and future directions. Section 8 concludes.
2. Related Work
2.1 Query Compilation in Database Systems
The compilation of database queries to native code has a substantial history. System R [3] pioneered query compilation in the 1970s, generating assembly code for query plans. Modern interest revived with Neumann's work on the HyPer system [1], which demonstrated that compiling queries to LLVM IR could achieve order-of-magnitude performance improvements for analytical workloads.
Kersten et al. [4] provided a comprehensive comparison of compiled and vectorized query execution, finding that compilation excels for compute-intensive operations while vectorization provides advantages for memory-bound workloads. Their analysis informed our decision to implement both execution models, using stencil-based JIT for row-oriented evaluation and Apache Arrow's Acero engine for columnar processing.
PostgreSQL incorporated JIT compilation in version 11 [5], using LLVM to compile expression evaluation and tuple deforming. The implementation demonstrated measurable improvements for complex expressions but noted that compilation overhead limited applicability to queries processing substantial row counts.
2.2 Lightweight JIT Techniques
Alternatives to LLVM-based compilation have received increasing attention. Template-based JIT, where pre-written code templates are instantiated with runtime values, reduces compilation complexity at the cost of optimization opportunities.
The Copy-and-Patch technique [2] represents the current state of the art in lightweight JIT compilation. Xu and Kjolstad demonstrated compilation speeds 100× faster than LuaJIT while achieving execution performance within 15% of optimized native code. Their technique eliminated instruction selection, register allocation, and optimization passes by using pre-assembled machine code fragments that are copied and patched at runtime.
Weber et al. [6] applied similar techniques to WebAssembly compilation in browser engines, confirming the broad applicability of stencil-based approaches. Our work extends these techniques specifically to SQL expression evaluation, addressing the unique requirements of database query processing.
2.3 Expression Evaluation Optimization
Beyond compilation, various techniques optimize expression evaluation. Vectorized execution, pioneered by MonetDB/X100 [7], processes data in batches to amortize interpretation overhead and improve cache utilization. Predicate evaluation ordering [8] optimizes the sequence of conjunctive predicates based on selectivity and cost.
Our implementation complements these techniques. The stencil-based JIT operates within a row-oriented execution model suitable for selective OLTP predicates, while a separate vectorized execution path handles analytical workloads where columnar processing provides greater benefit.
3. Stencil-Based Architecture
3.1 Overview
A stencil is a pre-assembled fragment of machine code implementing a specific operation. Unlike traditional JIT compilers that generate instructions through an intermediate representation, stencil-based compilation operates through three phases: copying pre-assembled code templates into an executable buffer, patching specific locations with runtime values, and composing multiple stencils to form complete expressions.
This approach eliminates the compilation pipeline—parsing, optimization, instruction selection, register allocation, and code emission—replacing it with memory operations and simple address arithmetic. The trade-off is reduced optimization opportunity; generated code directly reflects the expression tree structure without common subexpression elimination, constant folding, or instruction scheduling.
3.2 Stencil Representation
Each stencil is represented as a structure containing the operation type, pre-assembled machine code bytes, and patch point specifications:
```cpp
struct PatchPoint {
    uint32_t offset;       // Byte offset within code buffer
    PatchType type;        // Immediate, absolute address, or relative branch
    uint32_t value_index;  // Index into runtime value array
};

struct Stencil {
    StencilOp op;                     // Logical operation implemented
    JITType result_type;              // Type left in the accumulator register
    std::vector<uint8_t> code;        // Pre-assembled machine code bytes
    std::vector<PatchPoint> patches;  // Patch point specifications
};
```
Patch points identify locations within the stencil that require runtime modification. Patch types include immediate values (constants embedded in instructions), absolute addresses (function pointers, data addresses), and relative branches (jump offsets calculated from instruction position).
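Compilation itself then reduces to a memory copy followed by a loop over the patch points. The following sketch illustrates this core mechanism under simplifying assumptions (a flat 64-bit runtime value array and full-width 8-byte patches; relative branches and the ARM64 bitfield patches discussed in Section 4 need additional offset arithmetic):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Sketch of the copy-and-patch core: append a stencil's pre-assembled
// bytes to the code buffer, then overwrite each patch site with its
// runtime value. Simplification: every patch is a full 8-byte write.
void emitStencil(std::vector<uint8_t>& code_buffer, const Stencil& s,
                 const std::vector<uint64_t>& runtime_values) {
    const size_t base = code_buffer.size();
    code_buffer.insert(code_buffer.end(), s.code.begin(), s.code.end());
    for (const PatchPoint& p : s.patches) {
        const uint64_t v = runtime_values[p.value_index];
        std::memcpy(code_buffer.data() + base + p.offset, &v, sizeof(v));
    }
}
```

The completed buffer is then moved into executable memory before invocation.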
3.3 Register Conventions
Composability requires consistent register conventions across all stencils. Each operation consumes inputs from designated registers and produces outputs in designated registers, enabling arbitrary composition without inter-stencil register allocation.
We adopt the following conventions, chosen to align with platform calling conventions and minimize register movement overhead.
For x86-64 (System V AMD64 ABI): the integer accumulator resides in rdi, the integer secondary operand in rsi, the floating-point accumulator in xmm0, and the floating-point secondary operand in xmm1. The document pointer is preserved in rbx, a callee-saved register that survives function calls to field access helpers.
For ARM64 (AAPCS64): the integer accumulator resides in x0, the integer secondary operand in x1, the floating-point accumulator in d0, and the floating-point secondary operand in d1. The document pointer is preserved in x19.
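These conventions are summarized below:

| Role | x86-64 | ARM64 |
|---|---|---|
| Integer accumulator | rdi | x0 |
| Integer secondary operand | rsi | x1 |
| Floating-point accumulator | xmm0 | d0 |
| Floating-point secondary operand | xmm1 | d1 |
| Document pointer (callee-saved) | rbx | x19 |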
3.4 Operation Categories
The stencil library implements approximately fifty operations across the following categories:

- arithmetic operations (addition, subtraction, multiplication, division, modulo, negation) for both 64-bit integers and double-precision floating-point;
- comparison operations (equality, inequality, less-than, less-or-equal, greater-than, greater-or-equal) for both types;
- boolean logic (conjunction, disjunction, negation);
- constant loading for integers, doubles, and booleans;
- type conversions between numeric types;
- control-flow primitives for conditional branching;
- function prologue and epilogue for register preservation;
- field access operations for extracting typed values from documents;
- register management operations for operand positioning;
- aggregate update operations for count, sum, minimum, and maximum.
3.5 Expression Compilation
Expression compilation proceeds through post-order traversal of the abstract syntax tree. For each node, the compiler emits stencils that evaluate the subexpression and leave the result in the appropriate accumulator register.
Binary expressions require careful operand management. The left operand is evaluated first, its result saved to the secondary register, the right operand evaluated into the accumulator, and finally the operands swapped if necessary before applying the operator stencil. This sequence ensures correct operand ordering for non-commutative operations.
For the expression (a + b) * c, compilation produces the following stencil sequence: load field a into the integer accumulator; save the accumulator to the secondary register; load field b into the accumulator; swap registers to position operands correctly; emit integer addition; save the result to the secondary register; load field c into the accumulator; swap registers; and emit integer multiplication. The result remains in the accumulator for return or further composition.
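The traversal can be sketched as follows. For clarity the sketch records the stencil sequence rather than emitting machine code, and the `Expr` node shape and operation names are illustrative rather than the engine's actual API (spilling for nested right operands, which would otherwise clobber the secondary register, is elided):

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Illustrative expression node; the field layout is an assumption.
struct Expr {
    enum Kind { FieldRef, Constant, Binary } kind;
    const char* field_name = nullptr;   // FieldRef payload
    int64_t value = 0;                  // Constant payload
    char op = 0;                        // Binary payload: '+', '*', ...
    std::unique_ptr<Expr> left, right;
};

// "Emitting" here records the stencil sequence; the real compiler would
// copy and patch machine code as in Section 3.2.
enum class Op { LoadField, LoadConst, SaveToSecondary, Swap, IntAdd, IntMul };

void compileExpr(const Expr& e, std::vector<Op>& out) {
    switch (e.kind) {
    case Expr::FieldRef: out.push_back(Op::LoadField); break;  // -> accumulator
    case Expr::Constant: out.push_back(Op::LoadConst); break;  // -> accumulator
    case Expr::Binary:
        compileExpr(*e.left, out);           // left result -> accumulator
        out.push_back(Op::SaveToSecondary);  // accumulator -> secondary
        compileExpr(*e.right, out);          // right result -> accumulator
        out.push_back(Op::Swap);             // left -> accumulator, right -> secondary
        out.push_back(e.op == '+' ? Op::IntAdd : Op::IntMul);  // operator stencil
        break;
    }
}
```

Applied to (a + b) * c, this produces exactly the nine-stencil sequence above.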
3.6 Field Access
Field access from JSON documents presents a design choice between inlining extraction logic and calling runtime helpers. Inlining would require stencils encoding hash table lookup, type checking, and potentially nested path traversal—substantial code that would bloat generated expressions and couple the JIT tightly to document format internals.
We instead call runtime helper functions with C linkage for ABI compatibility:
extern "C" {
bool jit_get_int64_field(const void* doc, const char* field, int64_t* output);
bool jit_get_double_field(const void* doc, const char* field, double* output);
bool jit_get_bool_field(const void* doc, const char* field, bool* output);
}
The field access stencil loads the field name pointer from an embedded literal pool, loads the helper function address, and performs a standard calling-convention call. This adds 10-15 bytes per field access and incurs call overhead, but maintains implementation simplicity and flexibility.
4. Platform-Specific Implementation
4.1 Architectural Considerations
The x86-64 and ARM64 architectures differ substantially in instruction encoding, available operations, and register conventions. These differences manifest in stencil size, patch point complexity, and achievable code density.
x86-64 employs variable-length instruction encoding, ranging from one byte to fifteen bytes depending on operands and addressing modes. The REX prefix system extends register addressing and operand sizes. This variable encoding enables compact representation for common operations but complicates patch point specification.
ARM64 employs fixed-width 32-bit instruction encoding. Every instruction occupies exactly four bytes regardless of complexity. This uniformity simplifies patch point handling—each immediate field occupies known bit positions—but prevents the compact encodings possible on x86-64 for simple operations.
4.2 Stencil Examples
The following examples illustrate concrete stencil implementations across both architectures, demonstrating how logical operations translate to platform-specific instruction sequences.
Function Prologue. The prologue preserves callee-saved registers and establishes the document pointer for subsequent field access calls.
On x86-64:
```asm
push rbx        ; 53
mov  rbx, rdi   ; 48 89 fb
```
This four-byte sequence saves rbx and copies the document pointer from the first argument register to the preserved location.
On ARM64:
```asm
stp x29, x30, [sp, #-16]!   ; fd 7b bf a9
mov x19, x0                 ; f3 03 00 aa
```
The eight-byte sequence uses the store-pair instruction to save the frame pointer and link register simultaneously, then moves the document pointer to the callee-saved register x19.
Integer Addition. Addition exemplifies the encoding density difference between architectures.
On x86-64:
```asm
add rdi, rsi   ; 48 01 f7
```
Three bytes. The REX.W prefix (0x48) enables 64-bit operand size.
On ARM64:
```asm
add x0, x0, x1   ; 00 00 01 8b
```
Four bytes due to fixed-width encoding.
Integer Comparison. Comparison operations produce a boolean result (0 or 1) in the accumulator register.
On x86-64:
```asm
cmp   rdi, rsi   ; 48 39 f7
setg  dil        ; 40 0f 9f c7
movzx rdi, dil   ; 48 0f b6 ff
```
Eleven bytes. The setcc instruction family materializes condition codes, requiring subsequent zero-extension to clear upper bits.
On ARM64:
```asm
cmp  x0, x1   ; 1f 00 01 eb
cset x0, gt   ; e0 d7 9f 9a
```
Eight bytes. The cset instruction directly produces 0 or 1 based on condition flags, yielding more compact comparison sequences than x86-64.
64-bit Constant Loading. Loading arbitrary 64-bit constants reveals fundamental encoding constraints.
On x86-64:
```asm
movabs rdi, 0x0000000000000000   ; 48 bf [8 bytes]
```
Ten bytes. The movabs instruction embeds the full 64-bit immediate directly in the instruction stream. Patch point: offset 2, length 8 bytes.
On ARM64:
```asm
movz x0, #0x0000            ; 00 00 80 d2
movk x0, #0x0000, lsl #16   ; 00 00 a0 f2
movk x0, #0x0000, lsl #32   ; 00 00 c0 f2
movk x0, #0x0000, lsl #48   ; 00 00 e0 f2
```
Sixteen bytes. ARM64's fixed-width encoding cannot embed 64-bit immediates; constants are constructed through a sequence of move-with-zero followed by three move-with-keep instructions, each loading 16 bits at different positions. Patch points occur at offsets 0, 4, 8, and 12, each modifying the 16-bit immediate field within the instruction encoding.
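Patching these constants means rewriting the imm16 field, which occupies bits 5-20 of each instruction word. A sketch of that bitfield update (the function name is illustrative, and a little-endian host is assumed):

```cpp
#include <cstdint>
#include <cstring>

// Rewrite the 16-bit immediate field (bits 5-20) of an ARM64 MOVZ/MOVK
// instruction in place. Sketch only; assumes a little-endian host.
void patch_movk_imm16(uint8_t* insn_addr, uint16_t chunk) {
    uint32_t insn;
    std::memcpy(&insn, insn_addr, sizeof(insn));
    insn &= ~(0xFFFFu << 5);                    // clear the imm16 field
    insn |= static_cast<uint32_t>(chunk) << 5;  // insert the new 16-bit chunk
    std::memcpy(insn_addr, &insn, sizeof(insn));
}
```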
Floating-Point Multiplication. Scalar floating-point operations show similar encoding on both platforms.
On x86-64:
```asm
mulsd xmm0, xmm1   ; f2 0f 59 c1
```
Four bytes using SSE2 scalar double instructions.
On ARM64:
```asm
fmul d0, d0, d1   ; 00 08 61 1e
```
Four bytes. Both architectures achieve equivalent density for floating-point operations.
Conditional Branch. Branch instructions require patching relative offsets calculated at JIT time.
On x86-64:
```asm
test dil, dil   ; 40 84 ff
jz   rel32      ; 0f 84 [4 bytes]
```
Nine bytes. The test instruction sets flags based on the boolean value, and the conditional jump encodes a 32-bit relative displacement. Patch point: offset 5, length 4 bytes.
On ARM64:
```asm
cbz x0, offset   ; 00 00 00 b4
```
Four bytes. The compare-and-branch-if-zero instruction combines the test and branch into a single instruction. The 19-bit signed offset (scaled by 4) is encoded in bits 5-23, requiring bitfield manipulation during patching.
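Patching the branch target means writing the word-scaled displacement into that field. A sketch, assuming the displacement is a multiple of four and within the ±1 MB reach of the 19-bit field:

```cpp
#include <cstdint>
#include <cstring>

// Rewrite the imm19 field (bits 5-23) of an ARM64 CBZ instruction.
// byte_offset is the displacement from the CBZ itself; it must be a
// multiple of 4 and within +/-1 MB. Sketch only; little-endian host.
void patch_cbz_offset(uint8_t* insn_addr, int32_t byte_offset) {
    const uint32_t imm19 = static_cast<uint32_t>(byte_offset >> 2) & 0x7FFFF;
    uint32_t insn;
    std::memcpy(&insn, insn_addr, sizeof(insn));
    insn = (insn & ~(0x7FFFFu << 5)) | (imm19 << 5);
    std::memcpy(insn_addr, &insn, sizeof(insn));
}
```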
Register Exchange. Binary expression evaluation frequently requires swapping operand registers.
On x86-64:
```asm
xchg rdi, rsi   ; 48 87 fe
```
Three bytes using the atomic exchange instruction.
On ARM64:
```asm
mov x9, x0   ; e9 03 00 aa
mov x0, x1   ; e0 03 01 aa
mov x1, x9   ; e1 03 09 aa
```
Twelve bytes. ARM64 lacks a register-to-register exchange instruction, requiring a three-instruction sequence through a temporary register.
Function Epilogue. The epilogue restores registers and returns the result.
On x86-64 (integer return):
```asm
mov rax, rdi   ; 48 89 f8
pop rbx        ; 5b
ret            ; c3
```
Five bytes. The result is moved to rax (the return register per calling convention), rbx is restored, and control returns to the caller.
On ARM64 (integer return):
```asm
ldp x29, x30, [sp], #16   ; fd 7b c1 a8
ret                       ; c0 03 5f d6
```
Eight bytes. The result already resides in x0, so only register restoration and return are necessary.
4.3 Code Size Analysis
Aggregate stencil sizes differ meaningfully between architectures. For the fifty operations implemented, total stencil code occupies approximately 850 bytes on x86-64 and 1,200 bytes on ARM64. The 41% increase on ARM64 primarily reflects fixed-width instruction encoding and the absence of specialized instructions (such as register exchange) that x86-64 provides.
However, ARM64 demonstrates advantages in specific operation categories. Comparison operations are 27% smaller due to the efficient cset instruction. Branch operations are 56% smaller due to combined compare-and-branch instructions. These advantages partially offset the encoding overhead for other operations.
5. Cost Model
5.1 Compilation Decision Framework
JIT compilation incurs upfront cost in exchange for reduced per-row execution time. The cost model determines when this trade-off provides net benefit.
Let $C_c$ denote compilation cost, $C_i$ denote interpretation cost per row, and $C_j$ denote JIT execution cost per row. For $n$ rows, interpretation costs $n \cdot C_i$ while JIT costs $C_c + n \cdot C_j$. JIT provides benefit when:
$$C_c + n \cdot C_j < n \cdot C_i$$
Solving for the break-even row count:
$$n_{\text{break-even}} = \frac{C_c}{C_i - C_j}$$
For JIT to provide any benefit, $C_i > C_j$ must hold—interpretation must be slower than compiled execution on a per-row basis.
5.2 Parameter Estimation
Empirical measurement on representative expressions yielded the following cost estimates.
Interpretation overhead averages 5 nanoseconds per operation. This overhead comprises virtual method dispatch for expression node evaluation, type checking at each operation, and result boxing.
JIT execution overhead averages 0.5 nanoseconds per operation, reflecting native instruction execution without interpretation overhead. The 10× improvement ratio aligns with expectations for eliminating interpreter dispatch.
Compilation base cost is 50 microseconds, covering stencil library initialization, code buffer allocation, and initial setup.
Compilation per-operation cost is 1 microsecond, covering stencil copying, patch point application, and offset calculation.
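Plugging these estimates into the break-even formula makes the scale concrete. A five-operation expression costs roughly $50 + 5 \cdot 1 = 55$ microseconds to compile, and each row saves $5 \cdot (5 - 0.5) = 22.5$ nanoseconds:

$$n_{\text{break-even}} = \frac{55{,}000\ \text{ns}}{22.5\ \text{ns/row}} \approx 2{,}444\ \text{rows}$$

With the 2× safety margin applied in Section 5.3, JIT would be selected only when the expected row count exceeds roughly 4,900. Note that these model parameters are substantially more conservative than the measured compilation times reported in Section 6.4; the effect is to bias marginal cases toward interpretation.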
5.3 Decision Rules
Based on these parameters, we apply the following decision rules.
Expressions with fewer than three operations always use interpretation. The potential benefit is too small to justify any compilation overhead, and such expressions evaluate quickly regardless of execution mode.
Expressions with ten or more operations always use JIT compilation. Complexity ensures that compilation cost amortizes rapidly, and the interpreted execution overhead becomes substantial.
Expressions found in the compilation cache use the cached JIT code. Zero marginal compilation cost makes JIT beneficial regardless of row count.
For expressions with three to nine operations, we compute the break-even point and compare against expected row count. A safety margin of 2× ensures that JIT is only applied when benefit is clearly expected. This conservatism avoids JIT for marginal cases where estimation uncertainty might result in net loss.
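These rules reduce to a small decision function. The sketch below hardcodes the parameters from Section 5.2; all names are illustrative:

```cpp
#include <cstdint>

enum class ExecMode { Interpret, Jit };

// Decide between interpretation and JIT compilation using the cost
// parameters from Section 5.2. Sketch only; names are assumptions.
ExecMode chooseExecMode(int num_ops, uint64_t expected_rows, bool cached) {
    if (cached) return ExecMode::Jit;             // zero marginal compile cost
    if (num_ops < 3) return ExecMode::Interpret;  // too little potential benefit
    if (num_ops >= 10) return ExecMode::Jit;      // amortizes rapidly
    // 3-9 operations: compare expected rows against 2x the break-even point.
    const double compile_ns = 50'000.0 + num_ops * 1'000.0;   // base + per-op
    const double savings_per_row_ns = num_ops * (5.0 - 0.5);  // interp - jit
    const double break_even_rows = compile_ns / savings_per_row_ns;
    return expected_rows > 2.0 * break_even_rows ? ExecMode::Jit
                                                 : ExecMode::Interpret;
}
```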
5.4 Cache Architecture
Repeated queries benefit from caching compiled expressions. We implement an LRU cache keyed by expression hash, storing compiled code and associated metadata.
The cache maintains a maximum of 4,096 entries occupying at most 32 megabytes of executable memory. Eviction removes the least-recently-used entry when limits are exceeded. Cache statistics track hit rate, enabling monitoring and parameter tuning.
Thread safety is achieved through mutex protection on cache operations, with atomic counters for statistics to enable lock-free monitoring.
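A minimal sketch of such a cache, assuming an opaque `CompiledExpr` handle for the patched code (the real implementation additionally enforces the 32 MB executable-memory budget and exports atomic hit/miss counters):

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <memory>
#include <mutex>
#include <unordered_map>

struct CompiledExpr;  // opaque handle to compiled code (defined elsewhere)

// Minimal LRU cache keyed by expression hash. Sketch only.
class CompiledExprCache {
public:
    explicit CompiledExprCache(size_t capacity = 4096) : capacity_(capacity) {}

    std::shared_ptr<CompiledExpr> get(uint64_t hash) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = map_.find(hash);
        if (it == map_.end()) return nullptr;
        lru_.splice(lru_.begin(), lru_, it->second.second);  // mark most recent
        return it->second.first;
    }

    void put(uint64_t hash, std::shared_ptr<CompiledExpr> code) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = map_.find(hash);
        if (it != map_.end()) {  // refresh an existing entry
            it->second.first = std::move(code);
            lru_.splice(lru_.begin(), lru_, it->second.second);
            return;
        }
        if (map_.size() >= capacity_) {  // evict the least-recently-used entry
            map_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(hash);
        map_.emplace(hash, std::make_pair(std::move(code), lru_.begin()));
    }

private:
    size_t capacity_;
    std::mutex mu_;
    std::list<uint64_t> lru_;  // front = most recently used
    std::unordered_map<uint64_t,
        std::pair<std::shared_ptr<CompiledExpr>,
                  std::list<uint64_t>::iterator>> map_;
};
```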
6. Experimental Evaluation
6.1 Methodology
Experiments were conducted on an Apple M1 Max processor (10 cores, 2.4 GHz maximum frequency, 64 GB unified memory) running macOS. Measurements used 1 million rows unless otherwise specified. Each measurement represents the median of 10 runs following 3 warmup runs. Timing used std::chrono::high_resolution_clock with nanosecond precision.
The interpreter baseline implements standard tree-walking evaluation with virtual dispatch per expression node. Both JIT and interpreted paths use identical field access routines, isolating the effect of expression evaluation optimization.
6.2 Aggregate Function Performance
Aggregate functions represent the most favorable case for JIT compilation, as they execute the same expression repeatedly without field access overhead dominating execution time.
| Operation | JIT (rows/s) | Interpreted (rows/s) | Speedup |
|---|---|---|---|
| COUNT | 481.7M | 46.2M | 10.4× |
| SUM (Int64) | 477.8M | 45.8M | 10.4× |
| SUM (Double) | 285.9M | 46.6M | 6.1× |
| MIN (Int64) | 418.1M | 46.2M | 9.0× |
| MIN (Double) | 241.4M | 45.4M | 5.4× |
| MAX (Int64) | 410.3M | 45.5M | 9.0× |
| MAX (Double) | 243.0M | 45.4M | 5.4× |
Integer operations consistently achieve approximately 10× speedup, reflecting the elimination of interpreter overhead for simple accumulator updates. Floating-point operations show lower speedups of 5-6×, constrained by floating-point instruction latency rather than interpretation overhead.
6.3 Predicate Evaluation Performance
Predicate evaluation includes field access overhead, providing a more realistic assessment of end-to-end improvement.
| Expression | JIT (rows/s) | Interpreted (rows/s) | Speedup |
|---|---|---|---|
| Simple comparison (a > b) | 52.5M | 21.7M | 2.42× |
| Compound predicate (a > 0 AND b < 100) | 29.3M | 13.2M | 2.22× |
| Complex arithmetic ((a+b)*c - d/e) | 13.4M | 9.2M | 1.46× |
| Double arithmetic (a\*b + c\*d) | 11.7M | 8.4M | 1.39× |
Speedups range from 1.4× to 2.4×, substantially lower than aggregate functions due to field access dominating execution time. Comparison operations show the highest improvement, as the comparison itself benefits from JIT while field access costs remain constant.
6.4 Compilation Overhead
Compilation time is critical for determining the applicable workload range.
| Metric | Measured Value |
|---|---|
| Mean compilation time | 3.62 μs |
| Median compilation time | 3.41 μs |
| 95th percentile | 5.12 μs |
| Compilation throughput | 276K expressions/s |
Sub-4-microsecond compilation times enable JIT for workloads as small as several thousand rows. Compare this to LLVM-based compilation times of 10-100 milliseconds—three to four orders of magnitude slower.
6.5 Batch Size Scaling
To verify consistent performance across workload sizes, we measured throughput for varying batch sizes.
| Batch Size | Throughput (rows/s) |
|---|---|
| 1,000 | 492M |
| 10,000 | 482M |
| 100,000 | 473M |
| 1,000,000 | 470M |
Near-linear scaling confirms that cache effects and memory bandwidth do not degrade performance at larger scales. The slight throughput decrease at larger sizes reflects memory hierarchy effects rather than JIT limitations.
6.6 Cache Effectiveness
In production-representative workloads with query pattern repetition, cache hit rates exceeded 90%. Cache hits eliminate compilation overhead entirely, making JIT beneficial even for small row counts when queries repeat.
7. Discussion
7.1 Limitations
The stencil-based approach accepts several limitations in exchange for compilation speed.
No sophisticated optimization is performed. Common subexpression elimination, constant folding, and dead code removal are absent. The generated code directly reflects the expression tree structure. For complex expressions with redundant subexpressions, this may result in suboptimal code compared to LLVM-based compilation.
The technique operates within a row-at-a-time execution model. SIMD vectorization is not applicable; each row is processed independently. For workloads amenable to columnar batch processing, vectorized execution provides superior performance.
String operations beyond field access are not compiled. Pattern matching, concatenation, and string comparison require runtime library calls, limiting JIT benefit for string-heavy predicates.
Subqueries and window functions require full query executor involvement, placing them outside the scope of expression-level JIT.
7.2 Complementary Execution Models
These limitations motivate a hybrid execution architecture. Our implementation provides two execution paths: the row-oriented Volcano model where stencil-based JIT operates, and a vectorized columnar path built on Apache Arrow's Acero query engine.
The query planner selects execution paths based on query characteristics. Selective OLTP predicates—where few rows pass filtering—favor the row-oriented path with JIT-compiled expressions. Analytical scans processing large fractions of tables favor the columnar path, where SIMD operations and cache-efficient columnar access provide greater benefit.
This hybrid approach captures the advantages of both execution models without forcing a single-paradigm solution.
7.3 Future Directions
Several extensions merit investigation. Expression templates for common patterns, such as range checks (x BETWEEN a AND b), could compile to optimized multi-operation stencils rather than composing individual operations. Profile-guided specialization could adapt generated code based on observed field types and value distributions. Extending stencil coverage to simple string operations (equality comparison, prefix matching) would broaden JIT applicability. Adaptive recompilation triggered by type drift—when a field previously observed as integer contains floating-point values—would improve robustness in schema-flexible document databases.
8. Conclusion
This article presented a practical implementation of Copy-and-Patch JIT compilation for SQL expression evaluation. By replacing the traditional compilation pipeline with pre-assembled code templates that are copied and patched at runtime, we achieve compilation times of 3.62 microseconds—three orders of magnitude faster than LLVM-based approaches—while delivering throughput improvements of 10.4× for aggregate functions and 2.4× for predicate evaluation.
The key insight from Xu and Kjolstad's work applies directly to database query processing: SQL expressions rarely require sophisticated optimization; they require fast composition of well-chosen primitives. By pre-assembling those primitives and eliminating the compilation pipeline, stencil-based JIT achieves native code performance with near-instant compilation.
For database implementers seeking query compilation benefits without LLVM's complexity and overhead, the Copy-and-Patch approach offers a compelling alternative. The technique is particularly well-suited to OLTP workloads where compilation latency directly impacts query response time, and to embedded database engines where dependency minimization is valued.
References
[1] T. Neumann, "Efficiently Compiling Efficient Query Plans for Modern Hardware," Proceedings of the VLDB Endowment, vol. 4, no. 9, pp. 539-550, 2011.
[2] H. Xu and F. Kjolstad, "Copy-and-Patch Compilation: A fast compilation algorithm for high-level languages and bytecode," Proceedings of the ACM on Programming Languages, vol. 5, no. OOPSLA, pp. 1-30, 2021.
[3] M. M. Astrahan et al., "System R: Relational Approach to Database Management," ACM Transactions on Database Systems, vol. 1, no. 2, pp. 97-137, 1976.
[4] T. Kersten, V. Leis, A. Kemper, T. Neumann, A. Pavlo, and P. Boncz, "Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask," Proceedings of the VLDB Endowment, vol. 11, no. 13, pp. 2209-2222, 2018.
[5] PostgreSQL Global Development Group, "PostgreSQL 11 Release Notes: JIT Compilation," 2018. Available: https://www.postgresql.org/docs/11/jit.html
[6] A. Weber, S. Luan, and B. Berger, "Fast WebAssembly Compilation with Baseline JIT," Proceedings of the 19th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences, pp. 114-126, 2020.
[7] M. Zukowski, P. A. Boncz, N. Nes, and S. Héman, "MonetDB/X100 - A DBMS in the CPU Cache," IEEE Data Engineering Bulletin, vol. 28, no. 2, pp. 17-22, 2005.
[8] K. A. Ross, "Selection Conditions in Main Memory," ACM Transactions on Database Systems, vol. 29, no. 1, pp. 132-161, 2004.