Bridging Computing Eras: A 64-bit Extension of the 8086 Architecture

J

08 Apr 2025 — 27 min read

1. Introduction

The evolution of computer architectures represents one of the most fascinating chapters in computing history. While modern processors have embraced radical design changes to achieve unprecedented performance, there remains significant educational and research value in understanding the architectural foundations upon which they were built.

My childhood memories of developing on the Intel 8088 are still remarkably vivid. The late nights spent coding in assembly language, the thrill of seeing my programs execute correctly, and the frustration of debugging with limited tools—these experiences shaped my understanding of computing at a fundamental level. It's this deep nostalgia that motivated me to start this personal project: developing an emulator for a virtual 64-bit extension of the 8086 architecture. This emulator, which I'm implementing in modern C++, seeks to bridge the gap between those formative early computing experiences and today's powerful architectures.

This essay examines a unique architectural experiment: extending the classic Intel 8086 architecture to 64-bit capabilities while maintaining its fundamental design principles and instruction compatibility.

2. Architectural Foundations and Extensions

The original 8086 microprocessor, introduced in 1978, featured a 16-bit architecture with segmented memory addressing and a relatively simple instruction set. Our extension preserves these core characteristics while expanding capabilities in several critical dimensions:

- Register width expansion from 16 to 64 bits
- Unified 64-bit addressing model that transparently supports both segmented and linear memory access
- Full virtual memory system with paging, protection, and address translation
- Modern performance features including branch prediction, speculative execution, pipelining, and multi-level caching

This approach creates a bridge between computing paradigms – honoring historical principles while enabling modern capabilities.

Our C++ emulator implements this foundation through a modular architecture:

class CPU_8086_64 {
private:
  // CPU state
  struct Registers {
    uint64_t rax{0}, rbx{0}, rcx{0}, rdx{0};
    uint64_t rsi{0}, rdi{0}, rbp{0}, rsp{0};
    uint16_t cs{0}, ds{0}, es{0}, ss{0}, fs{0}, gs{0};
    uint64_t rip{0}; // 64-bit instruction pointer
    uint64_t rflags{0};

    // Control registers (similar to modern x86-64)
    uint64_t cr0{0}; // Contains paging enable bit
    uint64_t cr3{0}; // Page directory base register
    uint64_t cr4{0}; // Contains additional feature control bits
  } regs;

  // Architectural mode - we only support 64-bit mode
  enum class CPUMode {
    MODE_64BIT  // Only one mode supported
  };

  CPUMode current_mode{CPUMode::MODE_64BIT};

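  Memory& memory;  // Reference to memory subsystem
  BranchPredictor branch_predictor;
  SpeculativeExecutionEngine speculative_engine;
  CacheSystem cache_system;
  Pipeline pipeline;  // Declared last: it holds references to the components above

public:
  // Members are initialized in declaration order, so the initializer list
  // follows it. Pipeline and CacheSystem receive the arguments their
  // constructors (sections 6 and 7) expect.
  explicit CPU_8086_64(Memory& mem)
      : memory(mem),
        branch_predictor(),
        speculative_engine(*this, memory, branch_predictor),
        cache_system(memory),
        pipeline(*this, memory, branch_predictor, speculative_engine, cache_system) {}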

  // The CPU always operates in 64-bit mode
  CPUMode get_mode() const {
    return current_mode;
  }

  void execute_cycle() {
    // Main execution cycle integrating pipeline, branch prediction,
    // speculative execution, and cache interactions
    pipeline.cycle();
  }

  std::expected<void, EmulationError> run_until_halt();
};

This foundation combines the simplicity of the 8086's instruction set with modern architectural features. The processor maintains backward compatibility with 8086 code while allowing new software to utilize 64-bit capabilities. Unlike x86-64, our architecture operates exclusively in 64-bit mode, which simplifies many aspects of the design while still preserving compatibility with 16-bit code through the register model.

3. Memory Model Implementation

Our architecture implements a comprehensive memory management system that bridges several eras of computer architecture: the segmentation model of the original 8086, the flat 32-bit addressing of the 386, and the advanced virtual memory capabilities of modern 64-bit systems.

3.1 Comprehensive Memory System

The memory system includes virtual memory capabilities essential for modern operating systems:

class Memory {
private:
    // Constants
    static constexpr size_t PAGE_SIZE = 4096;  // 4KB pages (standard size)
    static constexpr size_t PAGE_MASK = PAGE_SIZE - 1;

    // Physical memory organized in page-sized chunks for efficiency
    struct PhysicalPage {
        std::vector<uint8_t> data;

        PhysicalPage() : data(PAGE_SIZE, 0) {}
    };

    // Sparse physical memory representation (only allocate used pages)
    std::unordered_map<uint64_t, PhysicalPage> physical_memory;

    // Virtual memory components
    struct PageTableEntry {
        uint64_t physical_frame : 40;  // 40 bits for physical frame number
        uint64_t present : 1;          // Is page present in physical memory?
        uint64_t writable : 1;         // Is writing allowed?
        uint64_t user_accessible : 1;  // Can user-mode code access this page?
        uint64_t write_through : 1;    // Write-through caching enabled?
        uint64_t cache_disabled : 1;   // Is caching disabled for this page?
        uint64_t accessed : 1;         // Has page been accessed?
        uint64_t dirty : 1;            // Has page been modified?
        uint64_t huge_page : 1;        // Is this a huge page (2MB/1GB)?
        uint64_t global : 1;           // Is page global (shared across processes)?
        uint64_t executable : 1;       // Can instructions be executed from this page?
        uint64_t available : 12;       // Bits available for OS use
    };

    // Process-specific address spaces
    struct AddressSpace {
        std::unordered_map<uint64_t, PageTableEntry> page_table;
        uint64_t process_id;

        // Memory-mapped regions (files, devices, etc.)
        struct MappedRegion {
            uint64_t start_address;
            uint64_t end_address;
            std::string source;        // File path or device identifier
            bool writable;
            bool executable;
        };
        std::vector<MappedRegion> mapped_regions;
    };

    // All address spaces
    std::unordered_map<uint64_t, AddressSpace> address_spaces;

    // Currently active address space
    AddressSpace* current_address_space = nullptr;

    // TLB for address translation caching
    struct TLBEntry {
        uint64_t physical_page;
        bool valid;
        uint8_t permissions;
        std::chrono::steady_clock::time_point last_access;
    };
    std::unordered_map<uint64_t, TLBEntry> tlb;

    // Swap space for pages moved to disk
    struct SwapEntry {
        uint64_t virtual_address;
        std::vector<uint8_t> data;
    };
    std::unordered_map<uint64_t, SwapEntry> swap_space;

    bool paging_enabled = false;

    // Page replacement algorithm state
    std::list<uint64_t> lru_pages;  // For LRU replacement policy

public:
    Memory() {
        // Create default address space (for system/BIOS)
        address_spaces[0] = AddressSpace{.process_id = 0};
        current_address_space = &address_spaces[0];
    }

    // Switch to a different address space (context switch)
    void switch_address_space(uint64_t process_id) {
        if (!address_spaces.contains(process_id)) {
            // Create new address space if needed
            address_spaces[process_id] = AddressSpace{.process_id = process_id};
        }

        current_address_space = &address_spaces[process_id];
        flush_tlb(); // Clear TLB on context switch
    }

    void flush_tlb() {
        tlb.clear();
    }

    uint8_t read_byte(uint64_t virtual_address) {
        try {
            // Translate virtual address to physical
            uint64_t physical_address = translate_address(virtual_address);

            // Access physical memory
            uint64_t page_number = physical_address / PAGE_SIZE;
            uint64_t page_offset = physical_address & PAGE_MASK;

            // Ensure the page exists in physical memory
            if (!physical_memory.contains(page_number)) {
                physical_memory[page_number] = PhysicalPage();
            }

            // Update LRU information
            update_page_access(page_number);

            return physical_memory[page_number].data[page_offset];
        }
        catch (PageFaultException& e) {
            // Handle page fault (load from disk if available)
            handle_page_fault(e.get_virtual_address());
            // Retry after handling the fault
            return read_byte(virtual_address);
        }
    }

    // Process traditional 8086 segmented addressing
    uint8_t read_byte_segmented(uint16_t segment, uint16_t offset) {
        // Convert segment:offset to linear address using 8086 formula
        uint32_t linear_address = (static_cast<uint32_t>(segment) << 4) + offset;
        return read_byte(linear_address);
    }

    // Additional memory management methods...
};

This memory model provides several important features:

- Physical Memory Efficiency: Uses page-sized chunks (4KB) allocated only when needed
- Virtual Memory System: Full support for address translation, permissions, and page faults
- Multi-Process Support: Maintains separate address spaces for different processes
- TLB for Performance: Caches address translations to speed up memory access
- Demand Paging: Pages are loaded into physical memory only when accessed
- Page Swapping: Least recently used pages can be moved to disk when memory is full
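
The Memory class calls update_page_access and keeps an lru_pages list, but the bookkeeping itself is not shown. A minimal sketch of how such LRU tracking might work is given here; the LruTracker class and its touch/evict methods are hypothetical names chosen for illustration, not part of the emulator's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <optional>
#include <unordered_map>

// Hypothetical helper: tracks page recency so that the least recently
// used page can be chosen as the eviction victim.
class LruTracker {
    std::list<uint64_t> order_;  // front = most recently used
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> where_;

public:
    // Record an access: move the page to the front, inserting it if new.
    void touch(uint64_t page) {
        auto it = where_.find(page);
        if (it != where_.end()) {
            order_.erase(it->second);
        }
        order_.push_front(page);
        where_[page] = order_.begin();
    }

    // Remove and return the least recently used page, if any.
    std::optional<uint64_t> evict() {
        if (order_.empty()) return std::nullopt;
        uint64_t victim = order_.back();
        order_.pop_back();
        where_.erase(victim);
        assert(order_.size() == where_.size());  // both views stay in sync
        return victim;
    }

    std::size_t size() const { return order_.size(); }
};
```

Pairing the list with a map of iterators keeps both touch and evict at constant time on average, which matters because every emulated memory access would update the recency order.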

3.2 Segmented Addressing and Backward Compatibility

The traditional 8086 segmented addressing model uses a segment:offset pair to calculate a physical memory address using the formula:

$$\text{Physical Address} = \text{Segment} \times 16 + \text{Offset}$$

In the original 8086, both segment and offset were 16-bit values, limiting the addressable memory to 1MB (20-bit addresses). Our architecture preserves this calculation method while transparently mapping the resulting address into the unified 64-bit address space.

Let's examine several concrete examples of segmented addressing and how they work in our unified architecture:

3.2.1 Code Segment (CS) Addressing Example

The Code Segment register (CS) is used for instruction fetching. Consider this scenario:

; In a legacy program
cs = 0x1000
ip = 0x0234  ; Instruction Pointer

With these values:

- The physical address calculation is: 0x1000 × 16 + 0x0234 = 0x10000 + 0x0234 = 0x10234
- This address points to a location within the first 1MB of memory
- In our unified architecture, this same instruction could also be accessed directly using the 64-bit linear address 0x0000000000010234

When the CPU executes this legacy code, it transparently performs this translation, allowing instruction fetching to work identically to the original 8086 while actually operating within the 64-bit address space.

3.2.2 Data Segment (DS) Addressing Example

Consider a data access using the Data Segment register:

; Accessing data using segmented addressing
ds = 0x2000
mov ax, [0x1234]  ; Effective address is DS:0x1234

Here:

- The physical address calculation is: 0x2000 × 16 + 0x1234 = 0x20000 + 0x1234 = 0x21234
- The same memory location could be accessed using linear addressing: mov ax, [0x0000000000021234]
- Both methods access the exact same physical memory location

This dual-accessibility enables fascinating compatibility scenarios where legacy code using segmented addressing can interact with modern code using linear addressing, both operating on the same memory.

3.2.3 Multiple Segment Access Example

One of the unique aspects of segmented addressing is that the same physical memory location can be accessed using different segment:offset combinations. For example:

Physical address 0x12345 could be accessed as:
- 0x1230:0x0045 (0x1230 × 16 + 0x0045 = 0x12345)
- 0x1000:0x2345 (0x1000 × 16 + 0x2345 = 0x12345)
- 0x0800:0xA345 (0x0800 × 16 + 0xA345 = 0x12345)

Our architecture preserves this flexibility within the first 1MB of memory while also allowing direct 64-bit access to the same locations.
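
Because many segment:offset pairs alias the same byte, tools such as debuggers often reduce a pointer to one canonical form before comparing addresses. The sketch below checks the three aliases listed above and shows one possible normalization (largest segment, offset below 16). FarPointer, to_linear, and normalize are illustrative names, not identifiers from the emulator.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical segment:offset pair as used by 8086 real-mode code.
struct FarPointer {
    uint16_t segment;
    uint16_t offset;
};

// 8086 translation: segment * 16 + offset.
constexpr uint32_t to_linear(FarPointer p) {
    return (static_cast<uint32_t>(p.segment) << 4) + p.offset;
}

// Canonical alias: the largest possible segment and an offset below 16.
// Only meaningful for addresses inside the first 1MB.
inline FarPointer normalize(uint32_t linear) {
    assert(linear <= 0xFFFFF);
    return FarPointer{static_cast<uint16_t>(linear >> 4),
                      static_cast<uint16_t>(linear & 0xF)};
}
```

With this convention, 0x1230:0x0045, 0x1000:0x2345, and 0x0800:0xA345 all normalize to 0x1234:0x0005.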

3.2.4 Stack Operations Example

The Stack Segment (SS) and Stack Pointer (SP) are used for stack operations:

; Stack setup
ss = 0x6000
sp = 0x1000  ; Stack grows downward from this offset

; Push operation
push ax      ; Decrements SP to 0x0FFE, then stores AX at SS:SP (0x6000:0x0FFE)

The physical address calculation for the stack operation is:

0x6000 × 16 + 0x0FFE = 0x60000 + 0x0FFE = 0x60FFE

In our architecture, this same stack location could be accessed directly using the 64-bit address 0x0000000000060FFE, which might be useful for debugging tools or system management code that needs to examine the stack directly.
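
The push sequence above can be traced in a few lines of C++. This is a sketch under simplifying assumptions: LegacyStack is a hypothetical type, memory is modeled as a sparse byte map rather than the emulator's Memory class, and SP wraps within its 64KB segment as it does on the 8086.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical model of an 8086-style stack living in a sparse byte store.
struct LegacyStack {
    uint16_t ss{0};
    uint16_t sp{0};
    std::unordered_map<uint32_t, uint8_t> bytes;  // stand-in for memory

    uint32_t linear(uint16_t offset) const {
        return (static_cast<uint32_t>(ss) << 4) + offset;
    }

    // PUSH: decrement SP by 2 (wrapping within the 64KB segment),
    // then store the value little-endian at SS:SP.
    void push16(uint16_t value) {
        sp = static_cast<uint16_t>(sp - 2);
        bytes[linear(sp)] = static_cast<uint8_t>(value & 0xFF);
        bytes[linear(static_cast<uint16_t>(sp + 1))] = static_cast<uint8_t>(value >> 8);
    }

    // POP: load the value at SS:SP, then increment SP by 2.
    uint16_t pop16() {
        uint8_t lo = bytes[linear(sp)];
        uint8_t hi = bytes[linear(static_cast<uint16_t>(sp + 1))];
        sp = static_cast<uint16_t>(sp + 2);
        return static_cast<uint16_t>(lo | (hi << 8));
    }
};
```

Starting from ss = 0x6000 and sp = 0x1000, a single push16 leaves sp at 0x0FFE and writes the two bytes at linear addresses 0x60FFE and 0x60FFF.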

3.2.5 Far Pointers and Jumps

The 8086 architecture introduced the concept of "far" pointers and jumps that explicitly specify both segment and offset:

; Far jump example
jmp 0x2000:0x0100  ; Jump to address 0x20100 (0x2000 × 16 + 0x0100)

Our architecture preserves this capability while internally mapping to the 64-bit address space:

; The same destination could be reached with a 64-bit direct jump
jmp 0x0000000000020100  ; Equivalent to the far jump above

This allows both addressing mechanisms to coexist, providing seamless compatibility for legacy code while enabling modern code to use the full 64-bit address space.

4. Register Extension Strategy

The register extension strategy follows established principles seen in the evolution from x86 to x86-64. The original 16-bit registers (ax, bx, cx, dx, si, di, bp, sp) are preserved as the lower portions of their 64-bit counterparts (rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp).

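Our emulator implements this register hierarchy using a union of anonymous structs. This relies on anonymous structs and union type punning, which mainstream compilers accept as extensions rather than standard C++, and it assumes a little-endian host: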

class RegisterFile {
private:
  // For each general purpose register, create a union
  // that allows access at different widths
  union {
    uint64_t rax;
    struct {
      uint32_t eax;
      uint32_t rax_high;
    };
    struct {
      uint16_t ax;
      uint16_t ax_high;
      uint32_t rax_upper;
    };
    struct {
      uint8_t al;
      uint8_t ah;
      uint16_t ax_upper;
      uint32_t rax_upper_alias;  // Renamed: member names must be unique across the anonymous structs
    };
  } rax_register{};

  // Similar definitions for other registers...

public:
  // Get/set methods with appropriate semantics
  void set_rax(uint64_t value) { rax_register.rax = value; }
  void set_eax(uint32_t value) {
    rax_register.eax = value;
    rax_register.rax_high = 0; // Upper 32-bits cleared
  }
  void set_ax(uint16_t value) {
    rax_register.ax = value;
    // Higher parts unmodified
  }

  uint64_t get_rax() const { return rax_register.rax; }
  uint32_t get_eax() const { return rax_register.eax; }
  uint16_t get_ax() const { return rax_register.ax; }
  uint8_t get_al() const { return rax_register.al; }
  uint8_t get_ah() const { return rax_register.ah; }
};

This arrangement allows seamless interoperation between 16-bit, 32-bit, and 64-bit code as demonstrated in the BIOS:

test_cpu:
    ; 64-bit register test
    mov rax, 0x55aa55aa55aa55aa
    mov rbx, rax
    cmp rax, rbx
    jne cpu_error

    ; Basic arithmetic operation test
    mov rax, 1
    add rax, 2
    cmp rax, 3
    jne cpu_error

Here, full 64-bit registers are used for testing. In contrast, the bootloader mixes 8-bit, 16-bit, and 64-bit registers:

print_string:
    push rax
    push rbx

    mov ah, 0x0e             ; Teletype output function (8-bit register usage)
    mov bx, 0x0007           ; Page 0, white text (16-bit register usage)

print_loop:
    mov al, [rsi]            ; Load from 64-bit address
    test al, al              ; Test for NULL character
    jz print_done

    int 0x10                 ; BIOS video service
    inc rsi                  ; Increment 64-bit pointer
    jmp print_loop

This function uses 8-bit and 16-bit register parts (ah, al, bx) for BIOS compatibility while using a 64-bit register (rsi) for memory addressing, showing how legacy and modern code patterns work together.
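
The sub-register write rules used in this section (32-bit writes clear the upper half, 16-bit and 8-bit writes preserve everything else) can also be expressed with masks and shifts on a single 64-bit value. The sketch below is an alternative formulation, not the emulator's RegisterFile; Reg64 and its method names are hypothetical. It avoids any dependence on host byte order or union layout.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical portable register: one 64-bit value, sub-registers via masks.
class Reg64 {
    uint64_t value_{0};

public:
    void set64(uint64_t v) { value_ = v; }
    // 32-bit writes zero the upper half, following the x86-64 convention.
    void set32(uint32_t v) { value_ = v; }
    // 16-bit and 8-bit writes leave all other bits untouched.
    void set16(uint16_t v) { value_ = (value_ & ~uint64_t{0xFFFF}) | v; }
    void set_low8(uint8_t v) { value_ = (value_ & ~uint64_t{0xFF}) | v; }
    void set_high8(uint8_t v) {
        value_ = (value_ & ~uint64_t{0xFF00}) | (uint64_t{v} << 8);
    }

    uint64_t get64() const { return value_; }
    uint32_t get32() const { return static_cast<uint32_t>(value_); }
    uint16_t get16() const { return static_cast<uint16_t>(value_); }
    uint8_t low8() const { return static_cast<uint8_t>(value_); }
    uint8_t high8() const { return static_cast<uint8_t>(value_ >> 8); }
};
```

In the print_string routine, for instance, writing 0x0e through set_high8 would model mov ah, 0x0e: only bits 8 through 15 change, and the value saved by push rax stays recoverable in the remaining bits.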

5. Branch Prediction and Speculative Execution

Our 64-bit extension implements sophisticated branch prediction and speculative execution mechanisms that were absent in the original 8086 but are crucial for modern processor performance.

5.1 Branch Prediction Implementation

The branch predictor attempts to guess the outcome of conditional branches before they are resolved:

class BranchPredictor {
private:
    // Two-level adaptive branch predictor
    // Level 1: Branch History Table (BHT)
    // Level 2: Pattern History Table (PHT)

    // Branch History Table: maps branch addresses to history patterns
    std::unordered_map<uint64_t, uint16_t> branch_history_table;

    // Pattern History Table: maps history patterns to predictions
    // We use 2-bit saturating counters (00=strongly not taken, 11=strongly taken)
    std::array<std::array<uint8_t, 4>, 65536> pattern_history_table;

    // Length of branch history register (in bits)
    static constexpr size_t HISTORY_LENGTH = 16;

    // Mask for extracting relevant bits from history
    static constexpr uint16_t HISTORY_MASK = (1 << HISTORY_LENGTH) - 1;

public:
    BranchPredictor() {
        // Initialize pattern history table with weak taken predictions
        for (auto& row : pattern_history_table) {
            for (auto& entry : row) {
                entry = 0b10; // Weakly taken
            }
        }
    }

    bool predict_branch(uint64_t branch_address) {
        // Get history pattern for this branch
        uint16_t history = 0;
        if (branch_history_table.contains(branch_address)) {
            history = branch_history_table[branch_address];
        }

        // Look up prediction in pattern history table
        uint8_t counter_value = pattern_history_table[history % pattern_history_table.size()]
                                                     [branch_address % 4];

        // Predict taken if counter >= 2
        return counter_value >= 2;
    }

    void update_predictor(uint64_t branch_address, bool taken) {
        // Get current history for this branch
        uint16_t history = 0;
        if (branch_history_table.contains(branch_address)) {
            history = branch_history_table[branch_address];
        }

        // Update counter in pattern history table
        uint8_t& counter = pattern_history_table[history % pattern_history_table.size()]
                                               [branch_address % 4];

        if (taken) {
            // Increment counter, saturate at 3
            if (counter < 3) counter++;
        } else {
            // Decrement counter, saturate at 0
            if (counter > 0) counter--;
        }

        // Update history pattern (shift left and add new outcome)
        history = ((history << 1) | (taken ? 1 : 0)) & HISTORY_MASK;
        branch_history_table[branch_address] = history;
    }
};

This two-level adaptive predictor tracks per-branch outcome patterns over time, which generally lets it outperform simpler schemes, such as a single saturating counter per branch, on branches with regular repeating behavior. For example, in a memory testing loop:

memory_test_loop:
    mov [rax], rdx           ; Test with 64-bit address and data
    cmp [rax], rdx
    jne memory_test_error
    not rdx
    add rax, 8
    sub rcx, 8
    jnz memory_test_loop

The branch predictor will quickly learn that the jne memory_test_error branch is almost never taken during normal operation, and the jnz memory_test_loop branch is taken many times before finally not being taken. This allows the CPU to speculatively execute the next iteration of the loop before knowing if the branch will be taken.
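
To see why a loop-closing branch is cheap to predict, it helps to isolate one 2-bit saturating counter and run it against that pattern. The sketch below is deliberately simpler than the two-level predictor above: TwoBitCounter and mispredictions_for_loop are hypothetical names, and there is no history table, only the counter.

```cpp
#include <cassert>
#include <cstdint>

// A single 2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken.
struct TwoBitCounter {
    uint8_t state{2};  // start weakly taken, matching the initialization above

    bool predict() const { return state >= 2; }

    void update(bool taken) {
        if (taken && state < 3) ++state;
        if (!taken && state > 0) --state;
    }
};

// Run a loop branch that is taken (iterations - 1) times and then falls
// through, and count how often the counter mispredicts.
inline unsigned mispredictions_for_loop(TwoBitCounter& counter, unsigned iterations) {
    assert(iterations > 0);
    unsigned misses = 0;
    for (unsigned i = 0; i < iterations; ++i) {
        bool taken = (i + 1 < iterations);
        if (counter.predict() != taken) ++misses;
        counter.update(taken);
    }
    return misses;
}
```

For a 100-iteration loop this counter mispredicts only the final fall-through, and because the counter drops from 3 to 2 rather than to 1, it still predicts taken when the loop is entered again.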

5.2 Speculative Execution Engine

Building on the branch predictor, our architecture implements speculative execution, allowing the processor to continue executing instructions beyond branches before knowing their outcome:

class SpeculativeExecutionEngine {
private:
    // Reference to CPU components
    RegisterFile& registers;
    Memory& memory;
    BranchPredictor& branch_predictor;

    // Log entry for a speculative memory operation
    // (defined before SpeculativeState, which stores a vector of these)
    struct MemoryOperation {
        bool is_write;
        uint64_t address;
        uint8_t original_value;
        uint8_t new_value;
    };
    std::vector<MemoryOperation> current_speculative_operations;

    // Stack to store CPU state for speculation rollback
    struct SpeculativeState {
        RegisterFile registers;
        uint64_t instruction_pointer;
        std::vector<MemoryOperation> memory_operations;
    };

    std::vector<SpeculativeState> speculation_stack;

    // Maximum speculation depth
    static constexpr size_t MAX_SPECULATION_DEPTH = 16;

public:
    SpeculativeExecutionEngine(RegisterFile& regs, Memory& mem, BranchPredictor& pred)
        : registers(regs), memory(mem), branch_predictor(pred) {}

    bool can_speculate() {
        return speculation_stack.size() < MAX_SPECULATION_DEPTH;
    }

    void begin_speculation(uint64_t branch_address) {
        if (!can_speculate()) return;

        // Predict branch outcome
        bool prediction = branch_predictor.predict_branch(branch_address);

        // Save current CPU state
        speculation_stack.push_back({
            registers,                   // Copy of registers
            registers.get_rip(),         // Current instruction pointer
            current_speculative_operations // Memory operations so far
        });

        current_speculative_operations.clear();

        // If branch is predicted not taken, instruction pointer already correct
        // If branch is predicted taken, we need to update the instruction pointer
        if (prediction) {
            // For simplicity, assume branch target is the next instruction's operand
            uint64_t target = get_branch_target(branch_address);
            registers.set_rip(target);
        }
    }

    // Additional methods for speculative memory access and rollback...

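    void commit_speculation(uint64_t branch_address, bool actual_outcome) {
        if (speculation_stack.empty()) return;

        // Read the prediction before training the predictor; querying it
        // afterwards could return a different answer than the one acted on.
        // (Recording the prediction in begin_speculation would be more robust.)
        bool prediction = branch_predictor.predict_branch(branch_address);

        // Update branch predictor with the actual outcome
        branch_predictor.update_predictor(branch_address, actual_outcome);

        if (prediction == actual_outcome) {
            // Prediction was correct. Keep the writes logged so far so that an
            // enclosing speculation can still undo them if it rolls back.
            auto merged = std::move(speculation_stack.back().memory_operations);
            merged.insert(merged.end(), current_speculative_operations.begin(),
                          current_speculative_operations.end());
            speculation_stack.pop_back();
            if (speculation_stack.empty()) merged.clear();  // No longer speculating
            current_speculative_operations = std::move(merged);
            return;
        }

        // Prediction was wrong, need to rollback
        rollback_speculation();
    }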

    void rollback_speculation() {
        if (speculation_stack.empty()) return;

        // Restore CPU state
        auto state = speculation_stack.back();
        registers = state.registers;

        // Undo memory operations (in reverse order)
        for (auto it = current_speculative_operations.rbegin();
             it != current_speculative_operations.rend(); ++it) {
            if (it->is_write) {
                memory.write_byte(it->address, it->original_value);
            }
        }

        // Restore previous memory operations log
        current_speculative_operations = state.memory_operations;

        // Pop the speculation stack
        speculation_stack.pop_back();
    }
};

This implementation allows the processor to execute instructions speculatively beyond branches, rolling back if the prediction was incorrect. This provides significant performance benefits in real hardware, and in our emulation environment it allows for accurate representation of modern processor behavior.
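
The rollback step depends on undoing writes newest-first, since the same address may be written more than once during speculation. The following sketch isolates that undo-log idea in a single-level form; SpeculativeMemory is a hypothetical stand-in that models memory as a sparse byte map and does not handle nested speculation.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical byte-addressed memory with an undo log for speculative writes.
class SpeculativeMemory {
    struct UndoEntry {
        uint64_t address;
        uint8_t original;
    };

    std::unordered_map<uint64_t, uint8_t> bytes_;  // unwritten bytes read as 0
    std::vector<UndoEntry> undo_log_;
    bool speculating_{false};

public:
    uint8_t read(uint64_t address) const {
        auto it = bytes_.find(address);
        return it == bytes_.end() ? 0 : it->second;
    }

    // While speculating, remember the value being overwritten.
    void write(uint64_t address, uint8_t value) {
        if (speculating_) undo_log_.push_back({address, read(address)});
        bytes_[address] = value;
    }

    void begin() {
        assert(!speculating_);  // single-level sketch only
        speculating_ = true;
    }

    // Prediction was correct: keep the writes, drop the log.
    void commit() {
        undo_log_.clear();
        speculating_ = false;
    }

    // Prediction was wrong: restore original values, newest write first.
    void rollback() {
        for (auto it = undo_log_.rbegin(); it != undo_log_.rend(); ++it) {
            bytes_[it->address] = it->original;
        }
        undo_log_.clear();
        speculating_ = false;
    }
};
```

If address 0x100 holds 7 and is speculatively written with 8 and then 9, replaying the log in reverse restores 8 first and 7 last, which is the pre-speculation value; replaying it forward would leave 8 behind.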

6. Pipelined Execution Model

The original 8086 overlapped only instruction prefetch (through a small byte queue) with an otherwise sequential execution unit. Our architecture instead implements a pipelined execution model that breaks instruction processing into stages that can operate in parallel. This provides a much more accurate representation of modern processor behavior in our emulation environment.

class Pipeline {
private:
    // Pipeline stages
    struct FetchStage {
        uint64_t instruction_address;
        RawInstruction raw_data;
        bool valid;
    };

    struct DecodeStage {
        Instruction decoded;
        uint64_t instruction_address;
        bool valid;
    };

    struct ExecuteStage {
        Instruction instruction;
        OperandValues operands;
        uint64_t instruction_address;
        bool valid;
    };

    struct MemoryStage {
        Instruction instruction;
        OperandValues operands;
        uint64_t memory_address;
        uint64_t memory_value;
        bool memory_access_required;
        bool is_write;
        uint64_t instruction_address;
        bool valid;
    };

    struct WriteBackStage {
        Instruction instruction;
        RegisterValue result;
        RegisterID destination;
        bool register_write_required;
        uint64_t instruction_address;
        bool valid;
    };

    // Pipeline registers (value-initialized so every stage starts with valid == false)
    FetchStage fetch_stage{};
    DecodeStage decode_stage{};
    ExecuteStage execute_stage{};
    MemoryStage memory_stage{};
    WriteBackStage writeback_stage{};

    // Components
    CPU_8086_64& cpu;
    Memory& memory;
    BranchPredictor& branch_predictor;
    SpeculativeExecutionEngine& speculative_engine;
    CacheSystem& cache_system;

    // Forwarding and hazard detection
    bool data_hazard_detected{false};
    bool control_hazard_detected{false};

public:
    Pipeline(CPU_8086_64& cpu_ref, Memory& mem_ref,
             BranchPredictor& pred, SpeculativeExecutionEngine& spec,
             CacheSystem& cache)
        : cpu(cpu_ref), memory(mem_ref), branch_predictor(pred),
          speculative_engine(spec), cache_system(cache) {}

    void cycle() {
        // Execute pipeline stages in reverse order to avoid overwriting inputs
        writeback_stage_execute();
        memory_stage_execute();
        execute_stage_execute();
        decode_stage_execute();
        fetch_stage_execute();

        // Handle hazards and stalls
        resolve_hazards();
    }

    // Implementation of pipeline stages and hazard resolution...
};

This five-stage pipeline allows for detailed emulation of how modern processors handle instruction execution, with up to five instructions in different stages of processing simultaneously. While in a hardware implementation this would yield significant throughput improvements, in our emulation environment the primary benefit is architectural accuracy. This accuracy enables realistic reproduction of timing characteristics, execution patterns, and microarchitectural behaviors that would be present in actual hardware.

The pipeline implementation includes sophisticated hazard detection and resolution mechanisms:

- Data Hazards: When an instruction depends on the result of a previous instruction that hasn't completed
- Control Hazards: When a branch instruction changes the flow of execution
- Structural Hazards: When multiple instructions need the same hardware resource

Data forwarding is used to resolve many data hazards, allowing a result to be "forwarded" from one pipeline stage to another without waiting for it to be written to the register file:

OperandValues read_operands_with_forwarding(const Instruction& inst) {
    OperandValues values;

    // Read from registers, but check for available forwarding paths first
    for (size_t i = 0; i < inst.source_registers.size(); i++) {
        RegisterID src_reg = inst.source_registers[i];

        // Try to forward from the memory stage first: it holds the younger
        // instruction, so its result is the most recent value of the register
        if (memory_stage.valid &&
            memory_stage.instruction.type == InstructionType::ALU &&
            memory_stage.instruction.destination == src_reg) {
            values.sources[i] = memory_stage.operands.result;
            continue;
        }

        // Otherwise try the writeback stage (older instruction)
        if (writeback_stage.valid &&
            writeback_stage.register_write_required &&
            writeback_stage.destination == src_reg) {
            values.sources[i] = writeback_stage.result;
            continue;
        }

        // No forwarding path available, read from register file
        values.sources[i] = cpu.read_register(src_reg);
    }

    return values;
}

When combined with branch prediction and speculative execution, the pipeline provides a much more accurate representation of modern processor behavior compared to the original 8086's sequential execution model. This architectural accuracy allows the emulator to faithfully reproduce the timing characteristics, hazards, and execution patterns of an actual hardware implementation, even though in an emulation environment these features don't translate to performance improvements and may actually introduce computational overhead.
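
Forwarding cannot cover every data hazard. In a classic five-stage design, a load's value only exists after the memory stage, so an instruction that immediately consumes it has to wait one cycle. The sketch below shows how such a load-use check might look; InFlight, Decoded, RegId, and needs_load_use_stall are hypothetical names, and the emulator's resolve_hazards may be organized quite differently.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

using RegId = uint8_t;  // hypothetical register identifier

// Summary of the instruction currently in the execute stage.
struct InFlight {
    std::optional<RegId> destination;  // register written, if any
    bool is_load{false};               // result only available after the memory stage
};

// Summary of the instruction waiting in decode.
struct Decoded {
    std::vector<RegId> sources;
};

// A load in the execute stage cannot forward its result in time for the
// instruction right behind it, so that instruction must stall one cycle.
inline bool needs_load_use_stall(const Decoded& next, const InFlight& in_execute) {
    if (!in_execute.is_load || !in_execute.destination) return false;
    for (RegId src : next.sources) {
        if (src == *in_execute.destination) return true;
    }
    return false;
}
```

An ALU instruction writing the same register would not trigger the stall, because its result is already available for forwarding by the time the dependent instruction reaches the execute stage.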

7. CPU Cache System

Modern processors rely heavily on cache hierarchies to bridge the speed gap between fast CPU cores and relatively slow main memory. Our 64-bit extension of the 8086 architecture implements a comprehensive multi-level cache system that was absent in the original design.

7.1 Cache Hierarchy Implementation

The cache system is organized in multiple levels with increasing size and latency:

class CacheSystem {
private:
    // Cache configuration
    struct CacheConfig {
        size_t size;             // Total size in bytes
        size_t line_size;        // Cache line size in bytes
        size_t associativity;    // Number of ways in set-associative cache
        size_t access_latency;   // Simulated access time in CPU cycles
        ReplacementPolicy policy;// LRU, FIFO, Random, etc.
    };

    // Cache line represents smallest unit of transfer between cache levels
    struct CacheLine {
        bool valid{false};       // Is this line valid?
        bool dirty{false};       // Has this line been modified?
        uint64_t tag{0};         // Tag bits of the address
        std::vector<uint8_t> data; // Actual data stored in this line
        uint64_t last_access{0}; // For LRU replacement
        uint64_t access_count{0};// For statistics
    };

    // Set of cache lines with the same index
    struct CacheSet {
        std::vector<CacheLine> lines;

        CacheSet(size_t ways, size_t line_size)
            : lines(ways, CacheLine{.data = std::vector<uint8_t>(line_size)}) {}
    };

    // Complete cache level structure
    struct CacheLevel {
        std::string name;           // L1, L2, L3, etc.
        CacheConfig config;         // Configuration parameters
        std::vector<CacheSet> sets; // All sets in this cache level
        uint64_t hits{0};           // Statistics: cache hits
        uint64_t misses{0};         // Statistics: cache misses

        CacheLevel(const std::string& name, const CacheConfig& config)
            : name(name), config(config) {
            size_t num_sets = config.size / (config.line_size * config.associativity);
            sets.reserve(num_sets);
            for (size_t i = 0; i < num_sets; i++) {
                sets.emplace_back(config.associativity, config.line_size);
            }
        }

        // Compute set index from address
        size_t get_set_index(uint64_t address) const {
            return (address / config.line_size) % sets.size();
        }

        // Compute tag from address
        uint64_t get_tag(uint64_t address) const {
            return address / (config.line_size * sets.size());
        }

        // Compute offset within a line from address
        size_t get_offset(uint64_t address) const {
            return address % config.line_size;
        }
    };

    // Cache levels in the hierarchy
    CacheLevel l1_data;
    CacheLevel l1_instruction;
    CacheLevel l2_unified;
    CacheLevel l3_unified;

    // Memory interface
    Memory& memory;

    // Cache coherency state
    enum class CoherencyState {
        MODIFIED,
        EXCLUSIVE,
        SHARED,
        INVALID
    };

    // Global cycle counter for timing simulation
    uint64_t current_cycle{0};

    // Cache coherency protocol (MESI)
    struct CoherencyProtocol {
        // Implementation of MESI protocol operations
        void handle_read(CacheLine& /*line*/, CoherencyState& state) {
            if (state == CoherencyState::INVALID) {
                // Read miss - transition to SHARED (a full MESI model would
                // choose EXCLUSIVE when no other cache holds the line)
                state = CoherencyState::SHARED;
            }
            // Other states remain unchanged on read
        }

        void handle_write(CacheLine& line, CoherencyState& state) {
            if (state != CoherencyState::MODIFIED) {
                // Write hit on SHARED/EXCLUSIVE or write miss on INVALID:
                // all three transition to MODIFIED and mark the line dirty
                state = CoherencyState::MODIFIED;
                line.dirty = true;
            }
            // MODIFIED state remains unchanged on write
        }

        void handle_invalidation(CoherencyState& state) {
            state = CoherencyState::INVALID;
        }
    } coherency_protocol;

public:
    CacheSystem(Memory& mem)
        : l1_data("L1-Data", {32 * 1024, 64, 8, 1, ReplacementPolicy::LRU}),
          l1_instruction("L1-Instruction", {32 * 1024, 64, 8, 1, ReplacementPolicy::LRU}),
          l2_unified("L2", {256 * 1024, 64, 8, 10, ReplacementPolicy::LRU}),
          l3_unified("L3", {8 * 1024 * 1024, 64, 16, 40, ReplacementPolicy::LRU}),
          memory(mem) {}

    // Read data from cache hierarchy or memory
    uint8_t read_byte(uint64_t address, bool is_instruction) {
        current_cycle++;

        // Try L1 cache first
        CacheLevel& l1 = is_instruction ? l1_instruction : l1_data;
        auto [l1_hit, l1_value] = cache_lookup(l1, address);

        if (l1_hit) {
            // L1 cache hit
            return l1_value;
        }

        // Try L2 cache next
        auto [l2_hit, l2_value] = cache_lookup(l2_unified, address);
        if (l2_hit) {
            // L2 hit - fill L1 cache
            cache_fill(l1, address, fetch_cache_line(l2_unified, address));
            return l2_value;
        }

        // Try L3 cache next
        auto [l3_hit, l3_value] = cache_lookup(l3_unified, address);
        if (l3_hit) {
            // L3 hit - fill L2 and L1 caches
            auto line_data = fetch_cache_line(l3_unified, address);
            cache_fill(l2_unified, address, line_data);
            cache_fill(l1, address, line_data);
            return l3_value;
        }

        // Cache miss - fetch the whole line from main memory once and
        // take the requested byte from it (avoids a second memory read)
        std::vector<uint8_t> line_data = fetch_memory_line(address);
        uint8_t value = line_data[l1.get_offset(address)];

        // Fill cache hierarchy
        cache_fill(l3_unified, address, line_data);
        cache_fill(l2_unified, address, line_data);
        cache_fill(l1, address, line_data);

        return value;
    }

    // Write data to cache hierarchy and memory
    void write_byte(uint64_t address, uint8_t value) {
        current_cycle++;

        // Try L1 data cache first
        bool l1_hit = update_cache(l1_data, address, value);

        if (!l1_hit) {
            // Check if present in L2
            bool l2_hit = update_cache(l2_unified, address, value);

            if (!l2_hit) {
                // Check if present in L3
                bool l3_hit = update_cache(l3_unified, address, value);

                if (!l3_hit) {
                    // Not in any cache level - write allocate policy
                    // Fetch line from memory first
                    std::vector<uint8_t> line_data = fetch_memory_line(address);

                    // Update the line with the new value
                    size_t offset = l1_data.get_offset(address);
                    line_data[offset] = value;

                    // Fill the entire hierarchy
                    cache_fill(l3_unified, address, line_data);
                    cache_fill(l2_unified, address, line_data);
                    cache_fill(l1_data, address, line_data, true); // Mark as dirty
                }
            }
        }

        // For write-through policy, update memory immediately
        // For write-back policy, memory is updated when dirty line is evicted
        if (is_write_through()) {
            memory.write_byte(address, value);
        }
    }

    // Simulate cache flush operation
    void flush_cache() {
        // Write back all dirty lines to memory
        flush_level(l1_data);
        flush_level(l1_instruction);
        flush_level(l2_unified);
        flush_level(l3_unified);
    }

    // Get cache statistics
    CacheStatistics get_statistics() const {
        return {
            .l1d_hit_rate = calculate_hit_rate(l1_data),
            .l1i_hit_rate = calculate_hit_rate(l1_instruction),
            .l2_hit_rate = calculate_hit_rate(l2_unified),
            .l3_hit_rate = calculate_hit_rate(l3_unified),
            .average_memory_access_time = calculate_amat()
        };
    }

    // Additional helper methods for cache operations...
};

This cache implementation provides several crucial features that were absent in the original 8086:

Multi-level Hierarchy: Separate L1 instruction and data caches, unified L2 and L3 caches
Set-associative Organization: Configurable associativity for each cache level
Cache Coherency: MESI protocol implementation to maintain data consistency
Replacement Policies: Support for LRU, FIFO, and random replacement strategies
Write Policies: Configurable write-through or write-back behavior
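
To make the set-index, tag, and offset arithmetic concrete, here is a minimal standalone sketch that mirrors `get_set_index`, `get_tag`, and `get_offset` for the L1 geometry used above (32 KB, 64-byte lines, 8 ways, hence 64 sets). The `AddressParts` struct and `decompose` function are illustrative names introduced for this example and are not part of the emulator.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper type: the three fields a cache extracts from an address.
struct AddressParts {
    uint64_t tag;
    size_t set_index;
    size_t offset;
};

// Same arithmetic as CacheLevel::get_tag / get_set_index / get_offset.
constexpr AddressParts decompose(uint64_t address, size_t cache_size,
                                 size_t line_size, size_t ways) {
    const size_t num_sets = cache_size / (line_size * ways);
    return AddressParts{
        address / (line_size * num_sets),
        static_cast<size_t>((address / line_size) % num_sets),
        static_cast<size_t>(address % line_size),
    };
}
```

For address 0x12345, `decompose(0x12345, 32 * 1024, 64, 8)` yields tag 18, set index 13, and offset 5, which recombine as 18 * 4096 + 13 * 64 + 5 = 0x12345.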

7.2 Cache-Memory Interaction

The interaction between the cache system and memory subsystem is crucial for performance in modern architectures:

// Memory access through cache hierarchy
uint8_t CPU_8086_64::read_memory_byte(uint64_t address) {
    // Check if paging is enabled
    if (paging_enabled) {
        address = translate_virtual_address(address);
    }

    // Access through cache hierarchy
    return cache_system.read_byte(address, false);
}

// Instruction fetch through cache hierarchy
RawInstruction CPU_8086_64::fetch_instruction(uint64_t address) {
    RawInstruction instruction;

    // Translate address if necessary
    if (paging_enabled) {
        address = translate_virtual_address(address);
    }

    // Fetch bytes through instruction cache
    for (size_t i = 0; i < instruction.size(); i++) {
        instruction.bytes[i] = cache_system.read_byte(address + i, true);
    }

    return instruction;
}

This implementation allows us to model important performance characteristics of modern processors:

Cache Miss Penalties: The latency of memory access varies dramatically based on where the data is found (L1, L2, L3, or main memory)
Spatial Locality: Sequential memory accesses benefit because each miss fills a whole 64-byte cache line, so neighboring bytes are already cached when they are touched
Temporal Locality: Recently accessed data is kept in the faster cache levels
Cold Start Effects: Initial program execution experiences more cache misses until the working set is loaded
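
The miss-penalty point can be quantified with the usual average memory access time (AMAT) recurrence. The sketch below is one plausible shape for the `calculate_amat()` helper referenced earlier, not the emulator's actual code. It assumes the fourth value in each cache configuration (1, 10, 40) is the hit latency in cycles, that miss rates are local to each level, and that main memory costs 200 cycles, which is an invented figure for illustration.

```cpp
// Latencies are in cycles; miss_rate is local (misses / accesses at that level).
struct LevelTiming {
    double hit_latency;
    double miss_rate;
};

// AMAT = L1 + m1 * (L2 + m2 * (L3 + m3 * memory))
inline double average_memory_access_time(const LevelTiming& l1,
                                         const LevelTiming& l2,
                                         const LevelTiming& l3,
                                         double memory_latency) {
    const double l3_cost = l3.hit_latency + l3.miss_rate * memory_latency;
    const double l2_cost = l2.hit_latency + l2.miss_rate * l3_cost;
    return l1.hit_latency + l1.miss_rate * l2_cost;
}
```

With miss rates of 10%, 50%, and 50% for L1, L2, and L3, the result is 1 + 0.1 * (10 + 0.5 * (40 + 0.5 * 200)) = 9 cycles, which shows how a small L1 miss rate still dominates the average.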

In our emulation environment, these cache behaviors provide critical architectural accuracy. While the cache implementation doesn't improve emulation performance (and may actually slow it down due to the added complexity), it allows the emulator to accurately model the timing behaviors of real hardware. This is especially important when emulating code that might behave differently based on cache characteristics, such as:

; Memory access pattern optimized for cache locality
memory_test_optimal:
    mov rcx, TEST_SIZE       ; Number of bytes to test (a multiple of 8)
    mov rax, TEST_BUFFER     ; Buffer address
    mov rdx, 0x5555555555555555 ; Initial test pattern (alternating bits)

test_loop_sequential:
    mov [rax], rdx           ; Write test pattern
    cmp [rax], rdx           ; Read and verify
    jne memory_test_error
    not rdx                  ; Toggle test pattern
    add rax, 8               ; Move to next 64-bit word
    sub rcx, 8
    jnz test_loop_sequential

This sequential access pattern would show excellent cache performance compared to a random access pattern, which is accurately represented in our emulation through the cache system.
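
The difference between access patterns can be demonstrated with a much smaller model than the full `CacheSystem`. The following sketch tracks only tags in a tiny set-associative cache with LRU replacement and counts hits and misses. `TinyCache` and `run_pattern` are hypothetical names, the 1 KB geometry is arbitrary, and a 64-byte stride stands in for a random pattern so that the result is deterministic.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical tag-only cache model with LRU replacement.
struct TinyCache {
    struct Line { bool valid{false}; uint64_t tag{0}; uint64_t last_used{0}; };

    size_t line_size;
    size_t ways;
    size_t num_sets;
    std::vector<Line> lines;   // num_sets * ways entries, set-major order
    uint64_t clock{0};
    uint64_t hits{0};
    uint64_t misses{0};

    TinyCache(size_t size, size_t line_size_, size_t ways_)
        : line_size(line_size_), ways(ways_),
          num_sets(size / (line_size_ * ways_)),
          lines(num_sets * ways_) {}

    void access(uint64_t address) {
        ++clock;
        const uint64_t line_number = address / line_size;
        const size_t set = static_cast<size_t>(line_number % num_sets);
        const uint64_t tag = line_number / num_sets;
        Line* victim = &lines[set * ways];
        for (size_t w = 0; w < ways; ++w) {
            Line& line = lines[set * ways + w];
            if (line.valid && line.tag == tag) {
                line.last_used = clock;
                ++hits;
                return;
            }
            // Prefer an empty way; otherwise keep the least recently used one.
            if (victim->valid && (!line.valid || line.last_used < victim->last_used)) {
                victim = &line;
            }
        }
        ++misses;
        *victim = Line{true, tag, clock};
    }
};

// Touch `count` addresses starting at 0, `stride` bytes apart.
inline TinyCache run_pattern(size_t count, uint64_t stride) {
    TinyCache cache(1024, 64, 2);  // 1 KB, 64-byte lines, 2-way: 8 sets
    for (size_t i = 0; i < count; ++i) {
        cache.access(i * stride);
    }
    return cache;
}
```

Calling `run_pattern(512, 8)` walks 64-bit words sequentially and misses only once per 64-byte line, while `run_pattern(512, 64)` touches a new line on every access and never hits.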

7.3 Cache Coherency Implications

Introducing caches also requires handling the coherency problems that arise when multiple agents (CPU, DMA controllers, multiple cores) access the same memory:

// Example: DMA transfer that bypasses the cache
void DMAController::transfer(uint64_t source, uint64_t destination, size_t size) {
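    // Signal the cache that memory regions will be modified.
    // Note: with a write-back cache, dirty lines covering the source range
    // must also be written back first, or the uncached reads below see stale data.
    cache_system.invalidate_region(destination, size);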

    // Perform direct memory-to-memory transfer
    for (size_t i = 0; i < size; i++) {
        uint8_t value = memory.read_byte_uncached(source + i);
        memory.write_byte_uncached(destination + i, value);
    }

    // Flush write buffers
    memory.flush_write_buffers();
}
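
The `invalidate_region` call above has to turn a byte range into the set of cache lines it overlaps, because a transfer rarely starts or ends on a line boundary. A minimal sketch of that step follows. The function name `lines_in_region` and the default 64-byte line size are assumptions for illustration; the emulator's real helper is not shown in this essay.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the base address of every cache line overlapping [start, start + size).
inline std::vector<uint64_t> lines_in_region(uint64_t start, size_t size,
                                             size_t line_size = 64) {
    std::vector<uint64_t> bases;
    if (size == 0) {
        return bases;
    }
    const uint64_t last_byte = start + size - 1;
    const uint64_t first = start - (start % line_size);
    const uint64_t last = last_byte - (last_byte % line_size);
    // Loop exits on equality so it cannot wrap near the top of the address space.
    for (uint64_t base = first;; base += line_size) {
        bases.push_back(base);
        if (base == last) {
            break;
        }
    }
    return bases;
}
```

A two-byte transfer to 0x103F, for example, straddles a boundary and touches the lines at 0x1000 and 0x1040, so both must be invalidated in every cache level.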

By introducing these cache mechanisms, our architecture provides a much more accurate model of modern processor behavior. The interplay between the pipeline, branch prediction, speculative execution, and the cache hierarchy creates a complete picture of how contemporary processors achieve their performance despite the memory wall challenge.
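
The `CoherencyProtocol` shown in section 7.1 covers the local side of MESI (a core's own reads and writes). For completeness, here is a hedged sketch of the other half: how a line's state reacts when another agent reads or writes the same address. These are the standard textbook MESI snoop transitions. The names `SnoopResult`, `on_remote_read`, and `on_remote_write` are invented for this example.

```cpp
enum class CoherencyState { MODIFIED, EXCLUSIVE, SHARED, INVALID };

// Hypothetical result type for a snooped bus transaction.
struct SnoopResult {
    CoherencyState next_state;
    bool write_back;  // true when this cache holds dirty data it must supply
};

// Another agent reads the line: dirty data is written back, the copy is downgraded.
inline SnoopResult on_remote_read(CoherencyState state) {
    switch (state) {
        case CoherencyState::MODIFIED:  return {CoherencyState::SHARED, true};
        case CoherencyState::EXCLUSIVE: return {CoherencyState::SHARED, false};
        case CoherencyState::SHARED:    return {CoherencyState::SHARED, false};
        case CoherencyState::INVALID:   return {CoherencyState::INVALID, false};
    }
    return {CoherencyState::INVALID, false};
}

// Another agent writes the line: the local copy becomes INVALID in every case.
inline SnoopResult on_remote_write(CoherencyState state) {
    return {CoherencyState::INVALID, state == CoherencyState::MODIFIED};
}
```

In this model a DMA write behaves like `on_remote_write` applied to every line in the destination range.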

8. Single 64-bit Mode Bootloader

A key architectural difference from traditional x86-64 is that our 8086-64 architecture operates exclusively in 64-bit mode. This simplifies the bootloader design significantly, as there's no need for the complex mode transitions (real mode → protected mode → long mode) required in x86-64 systems.

8.1 Bootloader Design

The bootloader is loaded by the BIOS at the traditional location (0x7C00) but immediately makes use of 64-bit registers and instructions:

; Bootloader entry point (loaded by BIOS at 0x7C00)
org 0x7C00
section .text

boot_start:
    ; Set up stack (using 64-bit address space)
    mov rsp, 0x0000000000007C00  ; 64-bit stack pointer setup

    ; Display welcome message
    mov rsi, welcome_msg
    call print_string

    ; Gather memory map information (for 64-bit memory management)
    call detect_memory

    ; Set up page tables
    call setup_page_tables

    ; Load kernel from disk to memory
    call load_kernel

    ; Transfer control to kernel
    jmp kernel_entry

; String output function (using 64-bit registers)
print_string:
    push rax
    push rbx

    ; Prepare BIOS teletype function
    mov ah, 0x0e
    mov bx, 0x0007

print_loop:
    mov al, [rsi]            ; Load character from 64-bit address
    test al, al              ; Test for NULL character
    jz print_done

    int 0x10                 ; BIOS video service
    inc rsi                  ; Increment 64-bit pointer
    jmp print_loop

print_done:
    pop rbx
    pop rax
    ret

This bootloader uses traditional BIOS interrupts but processes addresses and operations with 64-bit registers. Since there's no real mode or protected mode, we don't need the GDT (Global Descriptor Table) setup or mode switching code that would be present in a traditional x86-64 bootloader.

8.2 Memory Detection and Initialization

All physical memory is directly accessible in the 64-bit address space, which simplifies memory management:

; Detect available memory (using E820)
detect_memory:
    mov rdi, memory_map       ; Memory map buffer pointer
    mov qword [entry_count], 0 ; Entry counter lives in memory (ecx is clobbered below)

    ; Prepare E820 memory map call
    xor ebx, ebx              ; Continuation value: 0 starts the enumeration

memory_detect_loop:
    mov eax, 0xE820
    mov edx, 0x534D4150      ; "SMAP" signature
    mov ecx, 24              ; Entry size (bytes)
    int 0x15

    jc memory_detect_done    ; If carry flag set, done
    cmp eax, 0x534D4150      ; BIOS echoes "SMAP" in eax on success
    jne memory_detect_done

    add rdi, 24              ; Move to next entry (64-bit address increment)
    inc qword [entry_count]  ; Count the entry that was just stored
    test ebx, ebx            ; If ebx=0, that was the last entry
    jnz memory_detect_loop

memory_detect_done:
    ret

; Set up page tables (direct 64-bit mapping)
setup_page_tables:
    ; Create PML4 table and make it point to PDPT entry
    mov rdi, 0x10000         ; PML4 table location
    mov rax, 0x11000 | 3     ; PDPT address + present/writable bits
    mov [rdi], rax

    ; Set up PDPT using 1GB pages (for simplicity)
    mov rdi, 0x11000         ; PDPT location
    mov rax, 0x0 | 0x83      ; Start of memory + 1GB page + present/writable bits
    mov [rdi], rax

    ; The kernel load address (0x200000) lies inside the first 1GB page,
    ; so the identity mapping above already covers it. A 1GB page must be
    ; 1GB-aligned, so 0x200000 cannot itself be the base of one.

    ret
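
The flag values in the assembly above (0x3 and 0x83) can be spelled out as named bits. The sketch below assumes the 8086-64 page-table entry format follows the x86-64 convention that those constants suggest: bit 0 present, bit 1 writable, bit 7 page size. All identifiers here are illustrative. Under that convention a 1 GB page base has to be 1 GB aligned, so the helper rejects a base such as 0x200000.

```cpp
#include <cstdint>
#include <optional>

constexpr uint64_t kPresent  = 1ull << 0;   // 0x01
constexpr uint64_t kWritable = 1ull << 1;   // 0x02
constexpr uint64_t kPageSize = 1ull << 7;   // 0x80: entry maps a large page
constexpr uint64_t kOneGiB   = 1ull << 30;

// Entry pointing at a next-level table (the table must be 4 KB aligned).
inline std::optional<uint64_t> table_entry(uint64_t table_address) {
    if (table_address % 4096 != 0) {
        return std::nullopt;
    }
    return table_address | kPresent | kWritable;
}

// PDPT entry mapping a 1 GB page (the base must be 1 GB aligned).
inline std::optional<uint64_t> huge_page_entry(uint64_t physical_base) {
    if (physical_base % kOneGiB != 0) {
        return std::nullopt;
    }
    return physical_base | kPageSize | kPresent | kWritable;
}
```

Here `table_entry(0x11000)` produces 0x11003, the value stored in the first PML4 slot, and `huge_page_entry(0)` produces 0x83, the identity mapping of the first gigabyte.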

8.3 Kernel Loading Process

With the page tables in place, the bootloader loads the 64-bit kernel from disk using the BIOS extended (LBA) read service:

; Load kernel from disk
load_kernel:
    ; Prepare kernel location info
    mov rsi, kernel_info_msg
    call print_string

    ; Prepare disk read
    mov rax, 0                 ; Function = reset
    mov dl, [boot_drive]
    int 0x13
    jc disk_error

    ; Use extended disk read (LBA mode)
    mov ah, 0x42
    mov dl, [boot_drive]
    mov rsi, disk_packet      ; 64-bit address usage
    int 0x13
    jc disk_error

    ; Kernel load success message
    mov rsi, kernel_loaded_msg
    call print_string

    ret

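; Disk packet structure (for LBA usage, EDD 3.0 flat-address form;
; assumes the emulated BIOS supports the 64-bit buffer field)
disk_packet:
    db 24                    ; Packet size (24 bytes including the flat address)
    db 0                     ; Reserved
    dw 64                    ; Number of sectors to read (adjust based on kernel size)
    dw 0xFFFF                ; Offset 0xFFFF and segment 0xFFFF select
    dw 0xFFFF                ;   the 64-bit flat buffer address below
    dq 1                     ; LBA starting address (kernel location)
    dq 0x200000              ; Flat destination address (kernel load location,
                             ;   clear of the page tables at 0x10000-0x11FFF)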
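
On the emulator side, servicing this request comes down to translating an LBA and sector count into a byte range of the disk image. The following is a sketch of how a `DiskEmulator` might do that, assuming 512-byte sectors and a disk image held in memory as a flat byte vector. `read_sectors` and `kSectorSize` are hypothetical names, and the real class is not shown in this essay.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

constexpr size_t kSectorSize = 512;  // assumed sector size

// Copies `count` sectors starting at `lba` out of a flat disk image.
// Returns std::nullopt when the request runs past the end of the image.
inline std::optional<std::vector<uint8_t>> read_sectors(
        const std::vector<uint8_t>& image, uint64_t lba, uint16_t count) {
    const uint64_t total_sectors = image.size() / kSectorSize;
    if (lba > total_sectors || count > total_sectors - lba) {
        return std::nullopt;
    }
    const auto first = image.begin() +
                       static_cast<std::ptrdiff_t>(lba * kSectorSize);
    const auto last = first + static_cast<std::ptrdiff_t>(count * kSectorSize);
    return std::vector<uint8_t>(first, last);
}
```

A failed read (the `std::nullopt` case) is what the emulated BIOS would report by setting the carry flag, which the bootloader checks with `jc disk_error`.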

8.4 Kernel Transition

Finally, the bootloader transfers control to the kernel, passing the memory map pointer, the entry count, and the page table address in registers:

; Transfer control to kernel
kernel_entry:
    ; Pass necessary information to kernel in registers
    mov rdi, memory_map      ; Memory map pointer
    mov rsi, [entry_count]   ; Memory map entry count
    mov rdx, 0x10000         ; PML4 table address

    ; Jump to kernel entry point
    mov rax, KERNEL_ENTRY_POINT
    jmp rax                  ; Direct jump to 64-bit address

Key features of this implementation include:

Single Mode Operation: No traditional mode transitions, operating in 64-bit mode from the start
Simplified Memory Model: Immediate use of 64-bit linear address space
Legacy BIOS Compatibility: Able to invoke BIOS interrupts while using 64-bit registers
Direct Page Table Setup: Establishing paging structures without complex mode transitions

8.5 Overall System Initialization Flow

The complete boot process flows as follows:

BIOS powers on and performs POST (Power-On Self Test)
BIOS loads first sector of boot device to 0x7C00
CPU begins executing bootloader code in 64-bit mode
Bootloader collects memory map and sets up page tables
Bootloader loads 64-bit kernel from disk
Bootloader transfers control to kernel
Kernel continues execution in 64-bit mode

This design bypasses the complex mode transitions of traditional x86-64 architecture while maintaining compatibility with legacy BIOS. This approach is particularly useful for educational purposes, allowing users to experience both the simplicity of the 8086 and the power of modern 64-bit processors.

In our emulator, the bootloader is implemented as follows:

void Emulator8086_64::load_bootloader(const std::string& bootloader_path) {
  // Read bootloader binary file
  std::ifstream bootloader_file(bootloader_path, std::ios::binary);
  if (!bootloader_file) {
    throw std::runtime_error("Failed to open bootloader file");
  }

  // Read bootloader content and load into emulated memory
  std::vector<uint8_t> bootloader_data(
      (std::istreambuf_iterator<char>(bootloader_file)),
      std::istreambuf_iterator<char>());

  // Load bootloader into emulated memory at the appropriate address
  uint64_t bootloader_load_address = 0x7C00;
  for (size_t i = 0; i < bootloader_data.size(); i++) {
    memory.write_byte(bootloader_load_address + i, bootloader_data[i]);
  }

  // Verify that last two bytes of bootloader sector are 0x55, 0xAA (boot signature)
  if (bootloader_data.size() >= 512 &&
      bootloader_data[510] == 0x55 &&
      bootloader_data[511] == 0xAA) {
    std::cout << "Valid boot signature found\n";
  } else {
    std::cerr << "Warning: No valid boot signature\n";
  }
}

The bootloader is loaded at address 0x7C00 by the emulated BIOS, and after the BIOS code completes, the CPU's instruction pointer (RIP) is set to this address. The emulator then executes the bootloader code to continue system initialization and kernel loading.

9. Virtual Memory Management

The relationship between the bootloader-provided memory map and the operating system's virtual memory manager is a critical aspect of system design.

9.1 Memory Map Utilization

The memory map information collected by the bootloader (via E820 or similar mechanisms) provides essential details to the operating system:

Physical Memory Availability: Which memory regions are available and which are reserved
Hardware Memory Regions: Locations of ACPI, APIC, IOAPIC, and other hardware-related memory areas
Memory Holes: Gaps in the physical address space that hold no usable RAM, such as the legacy PC region between 640 KB and 1 MB

The virtual memory manager must utilize this information to function correctly. Otherwise:

It might overwrite memory already in use by hardware, causing system instability
It might attempt to use memory regions that don't physically exist
It could miss opportunities for memory optimization

class VirtualMemoryManager {
private:
    // Memory map from bootloader
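    struct MemoryMapEntry {
        uint64_t base_address;
        uint64_t length;
        uint32_t type;  // 1=available, 2=reserved, 3=ACPI, 4=NVS, 5=bad
        uint32_t extended_attributes;  // ACPI 3.0 field; completes the 24-byte entry
    };
    // Must match the 24-byte entries the bootloader requests from E820
    static_assert(sizeof(MemoryMapEntry) == 24, "unexpected MemoryMapEntry layout");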
    std::vector<MemoryMapEntry> memory_map;

public:
    // Receive memory map info from bootloader
    void initialize_from_bootloader(void* memory_map_ptr, size_t entry_count) {
        auto* entries = static_cast<MemoryMapEntry*>(memory_map_ptr);
        memory_map.assign(entries, entries + entry_count);

        // Initialize physical memory manager based on memory map
        initialize_physical_memory_manager();
    }

    void initialize_physical_memory_manager() {
        // Register only available memory regions with physical memory manager
        for (const auto& entry : memory_map) {
            if (entry.type == 1) {  // Available memory
                physical_memory_manager.add_free_region(
                    entry.base_address, entry.length);
            }
        }
    }
};
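
Regions reported by E820 are not guaranteed to be page aligned, so a physical memory manager normally trims each available region to whole pages before handing it out. A minimal sketch of that step follows, assuming 4 KB pages. `Region`, `kPage`, and `count_usable_pages` are illustrative names, not the emulator's API.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified view of one memory map entry.
struct Region {
    uint64_t base;
    uint64_t length;
    uint32_t type;  // 1 = available, anything else is skipped
};

constexpr uint64_t kPage = 4096;  // assumed page size

// Counts the whole 4 KB pages that fit inside the available (type 1) regions.
inline uint64_t count_usable_pages(const std::vector<Region>& map) {
    uint64_t pages = 0;
    for (const Region& r : map) {
        if (r.type != 1) {
            continue;
        }
        const uint64_t start = (r.base + kPage - 1) / kPage * kPage;  // round up
        const uint64_t end = (r.base + r.length) / kPage * kPage;     // round down
        if (end > start) {
            pages += (end - start) / kPage;
        }
    }
    return pages;
}
```

For the conventional-memory region from 0 to 0x9FC00, this yields 159 pages: the trailing 0xC00 bytes do not fill a page and are left unused.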

9.2 Page Table Transition

Regarding the bootloader's page tables:

Temporary Usage: The operating system typically uses the bootloader's page tables only temporarily. The bootloader's page tables are designed to provide minimal functionality.
Kernel Replacement: The kernel initializes its own memory management system and creates a more sophisticated page table structure.
Careful Transition: The transition between page tables must be handled carefully. Improper transition can lead to page faults or system crashes.

The operating system performs this transition as part of its initialization:

class KernelInitialization {
private:
    uint64_t bootloader_pml4_address;

public:
    void start_kernel(uint64_t memory_map_ptr, uint64_t entry_count,
                     uint64_t bootloader_page_tables) {
        // Store bootloader page table address
        bootloader_pml4_address = bootloader_page_tables;

        // Pass memory map info to virtual memory manager
        virtual_memory_manager.initialize_from_bootloader(
            reinterpret_cast<void*>(memory_map_ptr), entry_count);

        // Create new kernel page tables
        uint64_t kernel_pml4 = create_kernel_page_tables();

        // Switch to new page tables
        switch_page_tables(kernel_pml4);

        // Continue with kernel initialization
        // ...
    }

    uint64_t create_kernel_page_tables() {
        // Allocate new page table structures
        uint64_t pml4_addr = virtual_memory_manager.allocate_page();

        // Important: Copy some existing mappings to ensure continuity
        copy_essential_mappings(bootloader_pml4_address, pml4_addr);

        // Set up kernel mappings
        map_kernel_memory(pml4_addr);

        return pml4_addr;
    }

    void switch_page_tables(uint64_t new_pml4) {
        // Update CR3 register to switch page tables
        asm volatile("mov %0, %%cr3" : : "r"(new_pml4) : "memory");
    }
};

In our 8086-64 emulator, this is modeled as:

class CPU_8086_64 {
private:
    uint64_t cr3;  // Current page table base register

public:
    void set_page_table_base(uint64_t pml4_address) {
        cr3 = pml4_address;
        // Invalidate TLB to clear previous address translation info
        flush_tlb();
    }

    uint64_t translate_virtual_address(uint64_t virtual_address) {
        // Use the current page table stored in cr3 for address translation
        uint64_t pml4_index = (virtual_address >> 39) & 0x1FF;
        uint64_t pdpt_index = (virtual_address >> 30) & 0x1FF;
        uint64_t pd_index = (virtual_address >> 21) & 0x1FF;
        uint64_t pt_index = (virtual_address >> 12) & 0x1FF;
        uint64_t page_offset = virtual_address & 0xFFF;

        // 4-level paging for physical address translation
        // ...

        return physical_address;
    }
};
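
The elided walk in `translate_virtual_address` can be illustrated with a self-contained sketch. It assumes x86-64-style entries (bit 0 present, bit 7 page size, address in bits 12 to 51) and models physical memory as a sparse map of 64-bit words. `PhysMem`, `read_qword`, and `walk` are hypothetical names, and permission checks and the TLB are left out.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Sparse stand-in for physical memory: 8-byte-aligned address -> 64-bit word.
using PhysMem = std::unordered_map<uint64_t, uint64_t>;

constexpr uint64_t kPresent  = 1ull << 0;
constexpr uint64_t kPageSize = 1ull << 7;
constexpr uint64_t kAddrMask = 0x000FFFFFFFFFF000ull;

inline uint64_t read_qword(const PhysMem& mem, uint64_t address) {
    const auto it = mem.find(address);
    return it == mem.end() ? 0 : it->second;
}

// Walks PML4 -> PDPT -> PD -> PT. Returns std::nullopt on a non-present entry,
// which is where a real CPU would raise a page fault.
inline std::optional<uint64_t> walk(const PhysMem& mem, uint64_t cr3,
                                    uint64_t virtual_address) {
    uint64_t table = cr3 & kAddrMask;
    for (int shift = 39; shift >= 12; shift -= 9) {
        const uint64_t index = (virtual_address >> shift) & 0x1FF;
        const uint64_t entry = read_qword(mem, table + index * 8);
        if (!(entry & kPresent)) {
            return std::nullopt;
        }
        if (shift == 12) {  // final level: 4 KB page
            return (entry & kAddrMask) | (virtual_address & 0xFFF);
        }
        if ((entry & kPageSize) && shift != 39) {  // 1 GB or 2 MB page
            const uint64_t offset_mask = (1ull << shift) - 1;
            return ((entry & kAddrMask) & ~offset_mask) |
                   (virtual_address & offset_mask);
        }
        table = entry & kAddrMask;
    }
    return std::nullopt;  // not reached
}
```

With the bootloader's tables (PML4 at 0x10000 pointing to a PDPT at 0x11000 whose first entry is 0x83), `walk` maps virtual 0x200123 to physical 0x200123 and reports a fault for any address at or above 1 GB.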

The virtual memory manager must use the bootloader-provided memory map but should eventually replace the bootloader's page tables with its own more sophisticated structures during kernel initialization.

10. Interrupt System Preservation

The interrupt-based I/O model of the original 8086 is maintained almost unchanged in this architecture. The Interrupt Vector Table (IVT) starting at address 0x0000 contains handlers for 256 possible interrupts, though the entry format is extended to accommodate 64-bit addresses.

In our emulator, we implement the mechanism for triggering and processing interrupts:

class CPU_8086_64 {
private:
  // Other CPU members...

  void handle_interrupt(uint8_t interrupt_num) {
    // Push flags, CS, and IP to stack (for IRET)
    push(rflags);
    push(cs);
    push(rip);

    // Disable interrupts
    rflags &= ~FLAGS_IF;

    // Read handler address from IVT
    uint64_t handler_address = memory.read<uint64_t>(interrupt_num * 8);

    // Jump to handler
    rip = handler_address;
  }

public:
  void execute_cycle() {
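    // Handle pending interrupts. pending_interrupt is assumed to be a
    // std::optional<uint8_t>: using 0 as "none" would make vector 0
    // (divide error) impossible to deliver.
    if ((rflags & FLAGS_IF) && pending_interrupt.has_value()) {
      handle_interrupt(*pending_interrupt);
      pending_interrupt.reset();
      return;
    }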

    // Continue with normal execution cycle
    // ...
  }
};
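
Interrupt entry is only half of the mechanism: the handler returns with IRET, which must pop the three values in the reverse of the order they were pushed. The sketch below pairs the two operations in a tiny stand-in CPU so the symmetry is visible. `MiniCpu` is hypothetical, its stack is a plain vector rather than emulated memory, and the position of the interrupt flag (bit 9) is assumed to match the original 8086 FLAGS layout.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the CPU state involved in interrupt entry and exit.
struct MiniCpu {
    static constexpr uint64_t FLAGS_IF = 1ull << 9;  // assumed: same bit as 8086 IF

    uint64_t rip{0};
    uint64_t cs{0};
    uint64_t rflags{FLAGS_IF};
    std::array<uint64_t, 256> ivt{};  // one 64-bit handler address per vector
    std::vector<uint64_t> stack;

    void interrupt(uint8_t vector) {
        stack.push_back(rflags);
        stack.push_back(cs);
        stack.push_back(rip);
        rflags &= ~FLAGS_IF;  // handlers start with interrupts disabled
        rip = ivt[vector];
    }

    void iret() {
        // Pop in reverse push order: RIP, then CS, then RFLAGS.
        rip = stack.back();    stack.pop_back();
        cs = stack.back();     stack.pop_back();
        rflags = stack.back(); stack.pop_back();
    }
};
```

Because IRET restores the saved RFLAGS, the interrupt flag is re-enabled automatically when the handler returns, with no explicit STI needed.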

The BIOS implementation initializes the IVT with handler addresses:

init_ivt:
    ; Interrupts 0x00 - 0x1F (CPU exceptions)
    mov rax, ivt_addr        ; IVT at start of memory (0x0000)
    mov rcx, 0x20
    mov rdx, exception_handler

init_exception_ivt:
    mov [rax], rdx
    add rax, 8
    dec rcx
    jnz init_exception_ivt

    ; Specific interrupt assignments
    mov qword [ivt_addr + 8*0x10], int10_handler  ; Video services
    mov qword [ivt_addr + 8*0x13], int13_handler  ; Disk services
    mov qword [ivt_addr + 8*0x16], int16_handler  ; Keyboard services

    ret                      ; Return to the caller (init_ivt is invoked with call)

This approach preserves the familiar interrupt interface while allowing the emulator to properly handle the execution of assembly code interrupt handlers.

11. BIOS Implementation

The 64-bit BIOS represents a faithful extension of the traditional BIOS model. Our emulator loads the BIOS assembly code into memory and executes it:

class Emulator8086_64 {
private:
  Memory memory;
  CPU_8086_64 cpu;
  DiskEmulator disk;
  BranchPredictor branch_predictor;
  SpeculativeExecutionEngine speculative_engine;
  CacheSystem cache_system;  // Declared before pipeline: members are constructed
  Pipeline pipeline;         // in declaration order, and pipeline uses cache_system

  // Optional components
  std::unique_ptr<InstructionTracer> tracer;
  std::unique_ptr<SecurityAnalyzer> security_analyzer;

public:
  Emulator8086_64(const std::string& bios_path, const std::string& disk_image_path)
    : memory(),
      cpu(memory),
      disk(disk_image_path),
      branch_predictor(),
      speculative_engine(cpu.get_registers(), memory, branch_predictor),
      cache_system(memory),
      pipeline(cpu, memory, branch_predictor, speculative_engine, cache_system) {

    // Load the BIOS binary
    load_bios(bios_path);

    // Set CPU to start at BIOS entry point
    cpu.set_rip(0xFFFF0000);
  }

  void load_bios(const std::string& bios_path) {
    // Read BIOS binary file
    std::ifstream bios_file(bios_path, std::ios::binary);
    if (!bios_file) {
      throw std::runtime_error("Failed to open BIOS file");
    }

    // Read BIOS content and load into emulated memory
    std::vector<uint8_t> bios_data((std::istreambuf_iterator<char>(bios_file)),
                                   std::istreambuf_iterator<char>());

    // Load BIOS into emulated memory at the appropriate address
    uint64_t bios_load_address = 0xFFFF0000;
    for (size_t i = 0; i < bios_data.size(); i++) {
      memory.write_byte(bios_load_address + i, bios_data[i]);
    }
  }

  void boot() {
    try {
      // Start CPU execution at BIOS entry point
      // BIOS will initialize the system and load the bootloader
      cpu.run();

      std::cout << "Emulation completed successfully\n";
    } catch (const EmulationException& e) {
      std::cerr << "Emulation error: " << e.what() << '\n';
    }
  }
};

This implementation correctly separates the hardware emulation (in C++) from the software being emulated (the BIOS and bootloader written in assembly). The BIOS code performs the traditional initialization sequence:

bios_start:
    ; Initial system settings
    cli                          ; Disable interrupts
    xor rax, rax                 ; rax = 0
    mov rsp, stack_start         ; Set stack pointer

    ; Perform POST (Power-On Self Test)
    call perform_post

    ; Initialize screen
    call init_screen

    ; Initialize interrupt vector table
    call init_ivt

    ; Initialize and test memory
    call init_memory

    ; Initialize BIOS data area
    call init_bios_data

    ; Output message
    mov rsi, boot_message
    call print_string

    ; Search for boot device and load boot sector
    call load_boot_sector

    ; Transfer control to boot sector (in legacy region)
    mov rax, boot_sector        ; 0x7c00
    jmp rax

The emulator doesn't need to implement the actual BIOS or bootloader functionality in C++. It simply provides the hardware environment in which the assembly code executes. This approach maintains a clear separation between hardware emulation and software execution.

12. Educational Significance

Perhaps the greatest value of this architecture lies in its educational potential. By maintaining the 8086's simple instruction set within a unified 64-bit environment with advanced features like pipelining, branch prediction, speculative execution, and multi-level caching, it creates an ideal learning environment for understanding computer architecture evolution.

Our emulator includes comprehensive instrumentation for educational purposes:

class InstructionTracer {
private:
  CPU_8086_64& cpu;
  std::ofstream trace_file;

  // Step execution statistics
  struct ExecutionStats {
    uint64_t instructions_executed{0};
    uint64_t memory_reads{0};
    uint64_t memory_writes{0};
    uint64_t jumps_taken{0};
    uint64_t jumps_not_taken{0};
    uint64_t page_faults{0};
    uint64_t branch_mispredictions{0};
    uint64_t speculation_rollbacks{0};
    uint64_t cache_hits_l1{0};
    uint64_t cache_hits_l2{0};
    uint64_t cache_hits_l3{0};
    uint64_t cache_misses{0};
    std::unordered_map<std::string, uint64_t> instruction_counts;
  } stats;

public:
  // Tracing and visualization methods...

  void before_instruction(uint64_t address, const Instruction& inst) {
    // Format the current CPU state
    std::string reg_state = format_register_state();
    std::string memory_accesses = format_memory_accesses();  // assumed helper, like the other format_* methods
    std::string pipeline_state = format_pipeline_state();
    std::string speculation_status = format_speculation_status();
    std::string cache_state = format_cache_state();

    // Log to trace file; width 18 = "0x" prefix plus 16 hex digits
    trace_file << std::format("{:#018x},{},{},{},{},{},{},{}\n",
                address,
                inst.mnemonic,
                inst.operands_str,
                reg_state,
                memory_accesses,
                pipeline_state,
                speculation_status,
                cache_state);

    // Update statistics
    stats.instructions_executed++;
    stats.instruction_counts[inst.mnemonic]++;
  }

  // Generate visualization of processor state over time
  void generate_execution_visualization(const std::string& output_path) {
    // Create interactive HTML visualization showing:
    // - Pipeline state over time
    // - Branch prediction outcomes
    // - Cache hit/miss patterns
    // - Memory access patterns
  }

  // Generate summary report
  void generate_summary_report(const std::string& output_path) {
    // Output detailed execution statistics and analysis
  }
};

This instrumentation allows students to observe and understand:

The interaction between segmentation and paging in the memory hierarchy
How branch prediction and speculative execution affect performance
The impact of pipeline hazards and stalls
How cache hierarchies improve memory access performance
The effects of different memory access patterns on cache efficiency
How virtual memory systems handle page faults
The potential security implications of microarchitectural features
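
As one example of what a summary report could derive from the `ExecutionStats` counters, the sketch below computes branch prediction accuracy and cache hit rates. It is not the emulator's `generate_summary_report`. It assumes each `cache_hits_lN` counter records accesses satisfied at that level and `cache_misses` records accesses that went to main memory. `Counters`, `Summary`, and `summarize` are names invented here.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical subset of the ExecutionStats counters.
struct Counters {
    uint64_t jumps_taken{0};
    uint64_t jumps_not_taken{0};
    uint64_t branch_mispredictions{0};
    uint64_t cache_hits_l1{0};
    uint64_t cache_hits_l2{0};
    uint64_t cache_hits_l3{0};
    uint64_t cache_misses{0};
};

struct Summary {
    double branch_accuracy;   // fraction of branches predicted correctly
    double l1_hit_rate;       // fraction of accesses served by L1
    double overall_hit_rate;  // fraction served by any cache level
};

inline double ratio(uint64_t part, uint64_t whole) {
    return whole == 0 ? 0.0 : static_cast<double>(part) / static_cast<double>(whole);
}

inline Summary summarize(const Counters& c) {
    const uint64_t branches = c.jumps_taken + c.jumps_not_taken;
    const uint64_t mispredicted = std::min(c.branch_mispredictions, branches);
    const uint64_t hits = c.cache_hits_l1 + c.cache_hits_l2 + c.cache_hits_l3;
    const uint64_t accesses = hits + c.cache_misses;
    return Summary{
        ratio(branches - mispredicted, branches),
        ratio(c.cache_hits_l1, accesses),
        ratio(hits, accesses),
    };
}
```

Comparing these ratios across two runs of the same program, for instance with different access patterns, makes the effect of locality and predictable branches directly measurable.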

The emulator serves as a bridge between architectural concepts, allowing students to experience the evolution of processor design in a single integrated system.

13. Conclusion

The 64-bit extension of the 8086 architecture represents more than a technical curiosity—it embodies a bridge across computing eras. By preserving the elegant simplicity of the original design while extending its capabilities to match modern requirements, this architecture creates a unique platform for education, research, and reflection on the evolution of computing systems.

The complete emulator system brings all these components together:

int main() {
  // Load BIOS and disk image
  Emulator8086_64 emulator("bios_8086_64.bin", "disk.img");

  // Optional: enable instruction tracing
  emulator.enable_tracing("execution_trace.csv");

  // Optional: enable security analysis
  emulator.enable_security_analysis();

  // Start the emulation
  emulator.boot();

  return 0;
}

This emulator implementation brings to life the theoretical architecture described in this essay, allowing anyone to experience the bridge between computing eras through hands-on exploration. The unified architecture is particularly elegant - it demonstrates how architectural principles evolve over time while preserving compatibility with earlier designs.

Key features that make this project unique:

True Architectural Bridge: Seamlessly connects 16-bit segmented addressing and 64-bit virtual memory
Single 64-bit Mode: Operates exclusively in 64-bit mode, which simplifies system design while keeping the 8086 instruction set and BIOS interrupt interface
Modern Performance Features: Models branch prediction, speculative execution, pipelining, and multi-level caching on top of the 8086 programming model
Educational Value: Demonstrates the evolution of processor design within a single integrated system
Architectural Accuracy: Models the behavior and timing characteristics of modern processors, even though these features slow the emulator down rather than speed it up

We are aware that, without a protected mode, any code can invoke or install interrupt handlers freely, which is a security problem. The single-mode design keeps the architecture simple, but it makes proper privilege separation hard to implement. Future iterations will introduce mechanisms that let the operating system control interrupt handling without compromising the clean architectural model.

This project demonstrates that architectural elegance is timeless. The clarity and simplicity of the 8086 design remain valuable even when extended with sophisticated features of modern processors. As we continue to push the boundaries of computing performance, these foundational principles will undoubtedly continue to inform and inspire future architectural innovations.