The Memory Maze - Navigating Hierarchical Memory Technology, Caches, and Virtual Memory

 — #Computer Architecture #CS Basics

Hey memory explorers! We know computers need to store vast amounts of data and instructions. But there's a catch: the fastest memory is expensive and small, while large-capacity memory is slower and cheaper. How do systems square this circle? The answer lies in Hierarchical Memory Technology. It’s a layered approach that aims to provide the illusion of a single, large, fast memory.

Fundamental Properties of Memory Hierarchies

The effectiveness of memory hierarchies hinges on a few core principles:

1. Locality of Reference

Computer programs don't access memory locations randomly. They exhibit locality of reference, which has two main types:

  • Temporal Locality: If a program accesses a memory location, it's very likely to access that same location again soon. Think of loop variables or instructions within a loop.
  • Spatial Locality: If a program accesses a memory location, it's very likely to access nearby memory locations soon. This is common with sequential instruction execution and accessing elements in arrays.

Memory hierarchies exploit this by keeping recently and frequently accessed data, along with data near it, in faster, smaller levels of memory closer to the processor.
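
To see both kinds of locality in one place, here is a tiny Python sketch (the array size and loop are purely illustrative): the running total and loop index are reused on every iteration (temporal locality), while the array elements are visited in the order they are laid out (spatial locality).

```python
# Illustrative only: summing an array exhibits both kinds of locality.
data = list(range(1024))        # elements laid out one after another

total = 0
for i in range(len(data)):
    total += data[i]            # 'total' and 'i' reused each iteration -> temporal locality
                                # data[i] sits next to data[i-1]        -> spatial locality
```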

2. Inclusion Property

Generally, the inclusion property states that the contents of any level in the hierarchy are a subset of the contents of the levels below it. For example, any data found in the L1 cache must also be present in the L2 cache (if it exists), and anything in the L2 cache must also be in main memory. This makes management simpler.

  • Exception: Some designs might use a multilevel exclusion policy where, for example, L1 data is never found in L2. This can potentially increase the effective size of the combined caches but complicates data movement.

3. Coherence Property

When data can exist in multiple places (e.g., in main memory and also in a cache, or in multiple caches in a multiprocessor system), coherence becomes crucial. It means that all copies of the same data item must eventually reflect the same value. If one copy is modified (e.g., in a cache), other copies must be either updated or invalidated to prevent the system from using stale data.

  • For instance, if a DMA controller modifies a block in main memory, any cached copy of that block becomes stale. The cache's valid bit for that block might be cleared, or more sophisticated coherence mechanisms might be employed to update or invalidate the cache. If a write-back cache holds a dirty copy of that block, the situation is even more complex, as the main memory and cache versions diverge until write-back.
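
As a rough sketch of the invalidation idea (the class and names below are illustrative, not a real coherence protocol), a DMA write to a memory block simply clears the valid bit on any cached copy of that block:

```python
# Toy sketch of invalidation: a DMA write to main memory clears the
# valid bit on any cached copy of that block (all names are illustrative).
class CacheLine:
    def __init__(self, tag):
        self.valid = True
        self.tag = tag

memory = {7: "old contents"}       # main memory, keyed by block number
cache = {7: CacheLine(tag=7)}      # block 7 currently cached and valid

def dma_write(block, data):
    memory[block] = data           # DMA updates main memory directly
    line = cache.get(block)
    if line is not None and line.valid:
        line.valid = False         # cached copy is now stale: invalidate it

dma_write(7, "new contents")
print(memory[7], cache[7].valid)   # -> new contents False
```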

Cache Memory Organizations: A Quick Recap

We've discussed caches before, but let's quickly touch upon how blocks from main memory are placed (mapped) into the much smaller cache.

  • Direct Mapping: Each main memory block can only go into one specific cache line. This is determined by cache_line_index = main_memory_block_number MOD number_of_cache_lines. It's simple but can lead to conflict misses if multiple active memory blocks map to the same cache line. A tag stored with each cache line identifies which specific memory block is currently occupying it.
  • Fully Associative Mapping: A main memory block can be placed in any available cache line. This is the most flexible and avoids conflict misses but requires comparing the incoming block's tag with all tags in the cache simultaneously, making it complex and expensive for large caches.
  • Set-Associative Mapping: A compromise. The cache is divided into sets, and each set contains a fixed number (k) of cache lines (a k-way set-associative cache). A main memory block first maps to a specific set (set_index = main_memory_block_number MOD number_of_sets), and then it can be placed in any of the k lines within that set. This reduces conflict misses compared to direct mapping while being less complex than fully associative mapping. Tag comparison is done only within the selected set.
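
To see how these mappings fall out of the block number, here is a small sketch with made-up cache geometry: a k-way set-associative lookup computes a set index and a tag; with k = 1 it reduces to direct mapping, and with a single set containing every line it behaves like a fully associative cache.

```python
# Sketch: deriving index and tag from a main-memory block number.
# The geometry below (64 lines, 4-way) is purely illustrative.
NUM_LINES = 64
WAYS = 4                      # k-way set-associative; WAYS = 1 -> direct-mapped
NUM_SETS = NUM_LINES // WAYS  # WAYS = NUM_LINES -> one set -> fully associative

def map_block(block_number):
    set_index = block_number % NUM_SETS   # which set the block must go to
    tag = block_number // NUM_SETS        # identifies the block within that set
    return set_index, tag

print(map_block(100))   # e.g. block 100 -> set 4, tag 6 (with 16 sets)
```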

Techniques for Reducing Cache Misses

A cache miss means the processor has to wait for data from a slower level of the hierarchy (the next cache level or main memory), stalling the pipeline. Reducing misses is critical for performance. Misses can be categorized (the "3Cs"):

  1. Compulsory Misses (Cold Start Misses): Occur on the very first access to a block; the block has never been in the cache. These are unavoidable for the first reference, though larger block sizes and prefetching can reduce how many occur.
  2. Capacity Misses: Occur because the cache is not large enough to hold all the blocks needed by the program during a particular phase. Blocks are discarded and later re-fetched.
  3. Conflict Misses (Collision Misses): Occur in direct-mapped or set-associative caches when multiple active blocks map to the same cache line or set, and one evicts another that is still needed.

Techniques to reduce these misses include:

  • Larger Block Size: Can reduce compulsory misses and improve spatial locality by fetching more data per miss. However, very large blocks can increase the miss penalty (more to transfer), waste cache space if only a small part of the block is used (pollution), and potentially increase conflict misses.
  • Larger Cache Size: Directly reduces capacity misses by allowing more of the program's working set to reside in the cache. The main trade-offs are increased cost, physical size, power consumption, and potentially a slightly longer hit time.
  • Higher Associativity: Reduces conflict misses by providing more choices for block placement within a set (for set-associative) or anywhere (for fully associative). However, it increases the complexity of tag comparison and replacement logic, which can increase hit time and cost.
  • Multilevel Caches (L1, L2, sometimes L3): Using multiple levels of cache is a very effective strategy. L1 is small and fast for minimal hit time. L2 is larger and a bit slower, designed to catch most L1 misses, thus reducing the effective miss penalty of L1. L3 serves a similar purpose for L2 misses.
  • Victim Caches: A small, fully associative cache that sits between a main cache (e.g., L1) and the next level of memory (e.g., L2). It stores blocks that were recently evicted (victims) from the main cache. On a miss in the main cache, the victim cache is checked. If the block is there, it's a "victim hit," which is much faster than going to L2. This primarily helps reduce conflict misses in the main cache.
  • Hardware Prefetching: The hardware monitors memory access patterns. If it detects a pattern (e.g., sequential access to array elements or a fixed stride), it proactively fetches blocks into the cache before the CPU explicitly requests them. This can convert potential misses into hits.
  • Software Prefetching: The compiler or programmer inserts special prefetch instructions into the code. These instructions tell the hardware to load specific data into the cache in anticipation of its use. This requires more programmer/compiler effort but can be very effective if access patterns are known.
  • Compiler Optimizations: The compiler can reorder code and data structures to improve spatial and temporal locality, thereby reducing cache misses. For example, changing loop structures (like the loop interchange sketched below) or data layout can make access patterns more cache-friendly.
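
As an example of that last point, here is a minimal loop-interchange sketch (the matrix size is invented, and the Python lists simply stand in for a row-major 2-D array): traversing row by row visits elements in the order they are stored, so consecutive accesses fall into the same or adjacent cache blocks.

```python
# Illustrative loop interchange; the nested lists stand in for a row-major array.
N = 512
matrix = [[1] * N for _ in range(N)]

# Cache-unfriendly: column-by-column traversal strides across rows.
total = 0
for j in range(N):
    for i in range(N):
        total += matrix[i][j]

# Cache-friendly after loop interchange: row-by-row traversal.
total = 0
for i in range(N):
    for j in range(N):
        total += matrix[i][j]
```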

(Challenge: Which type of cache miss (compulsory, capacity, conflict) do you think is hardest to reduce significantly without major cost implications?)

Virtual Memory Organization

Just as caches bridge the CPU-main memory speed gap, virtual memory bridges the main memory-secondary storage (disk) size/cost gap. It gives programs the illusion of having a very large, private main memory (the virtual address space), even if the physical main memory is much smaller.

Key concepts:

  • The CPU generates virtual addresses. These are translated into physical addresses (actual locations in main memory) by a hardware Memory Management Unit (MMU), usually with support from the Operating System (OS).
  • Paging: Both the virtual address space and physical memory are divided into fixed-size blocks called pages and page frames, respectively. A common page size is 4KB or 8KB. Data is moved between disk and main memory in units of pages.
  • Page Table: A data structure, typically stored in main memory and managed by the OS, that maps virtual pages to physical page frames. The processor usually has a Page Table Base Register (PTBR) that points to the start of the active process's page table.
    • A virtual address is split into a virtual page number (VPN) and an offset within that page.
    • The VPN is used as an index into the page table to find the Page Table Entry (PTE) for that virtual page.
    • Each PTE contains:
      • The physical page frame number (PFN) if the page is in main memory.
      • A valid bit (or present bit): Indicates if the page is currently loaded in a physical page frame.
      • A modified bit (or dirty bit): Indicates if the page has been written to since being loaded. If dirty, it must be saved back to disk before its frame is reused.
      • Protection bits: Control access rights (e.g., read-only, read-write, execute-only).
  • Translation Lookaside Buffer (TLB): Accessing the page table in main memory for every memory reference would add significant overhead (an extra memory access). The TLB is a small, fast, typically associative cache within the MMU that stores recently used PTEs.
    • When the MMU gets a virtual address, it first checks the TLB for the VPN.
    • TLB Hit: The PFN is retrieved quickly from the TLB.
    • TLB Miss: The MMU (or OS, depending on the hardware design) accesses the page table in main memory to fetch the PTE. This PTE is then loaded into the TLB (possibly replacing an older entry), and the translation continues. The TLB must be kept consistent with the page table; for example, if the OS changes a PTE, it must invalidate the corresponding TLB entry.
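
As a quick sketch of the address split itself (the 4KB page size and the example address are illustrative), the offset is just the low-order bits of the virtual address and the VPN is everything above them:

```python
# Splitting a virtual address, assuming illustrative 4KB (2**12 byte) pages.
PAGE_SIZE = 4096
OFFSET_BITS = 12                         # log2(PAGE_SIZE)

def split_virtual_address(va):
    vpn = va >> OFFSET_BITS              # virtual page number
    offset = va & (PAGE_SIZE - 1)        # byte offset within the page
    return vpn, offset

print(split_virtual_address(0x0040_2ABC))   # -> (1026, 2748), i.e. VPN 0x402, offset 0xABC
```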

Virtual Memory Mapping and Management Techniques

The Mapping Process

  1. The CPU generates a virtual address (VA).
  2. The MMU splits the VA into a Virtual Page Number (VPN) and an Offset.
  3. The MMU checks the TLB for an entry matching the VPN.
    • TLB Hit: The Physical Page Frame Number (PFN) is retrieved from the TLB.
    • TLB Miss: The MMU uses the PTBR and the VPN to calculate the address of the Page Table Entry (PTE) in main memory. It fetches the PTE. If the PTE is valid, it's loaded into the TLB.
  4. If the fetched PTE's valid bit is 0, a page fault occurs (see below).
  5. If valid, the PFN from the PTE (now in TLB or directly from memory) is combined with the original Offset to form the Physical Address (PA).
  6. The physical address is used to access main memory.
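
The whole walk can be sketched in a few lines of Python; the dictionary TLB, the page-table entries, and the 4KB page size below are simplified stand-ins for the hardware structures described above, not a real MMU interface.

```python
# Simplified sketch of the translation walk above (all structures are illustrative).
PAGE_SIZE, OFFSET_BITS = 4096, 12

tlb = {}                                             # vpn -> pfn
page_table = {0x402: {"valid": True, "pfn": 0x15}}   # vpn -> PTE

class PageFault(Exception):
    pass

def translate(va):
    vpn, offset = va >> OFFSET_BITS, va & (PAGE_SIZE - 1)
    if vpn in tlb:                          # TLB hit: fast path
        pfn = tlb[vpn]
    else:                                   # TLB miss: walk the page table
        pte = page_table.get(vpn)
        if pte is None or not pte["valid"]:
            raise PageFault(hex(va))        # OS must bring the page in (step 4)
        pfn = pte["pfn"]
        tlb[vpn] = pfn                      # fill the TLB for next time
    return (pfn << OFFSET_BITS) | offset    # PFN + offset -> physical address

print(hex(translate(0x0040_2ABC)))          # -> 0x15abc
```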

Page Fault Handling

A page fault is an exception raised by the hardware (MMU) when a program tries to access a part of its virtual address space that is not currently in physical memory (i.e., the PTE's valid bit is 0). The OS then performs these steps:

  1. Interrupt the current process.
  2. Determine the virtual address that caused the fault.
  3. Verify that the virtual address is valid and that the access does not violate the protection bits; if either check fails, the OS raises an error (e.g., a segmentation or protection fault) instead of loading a page.
  4. Locate the required page on the disk.
  5. Find a free page frame in main memory.
  6. If no frame is free, select a victim page in main memory using a page replacement policy (see next section).
  7. If the victim page is dirty (modified bit is set), write it back to the disk.
  8. Read the required page from disk into the now-available page frame.
  9. Update the page table entry for the newly loaded page (set the valid bit and PFN, clear the dirty bit), mark the victim's PTE as no longer present, and invalidate any old TLB entry for the victim page.
  10. Restart the instruction that caused the page fault.
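
A hedged sketch of those steps in Python (the "disk", frame pool, and page table below are toy data structures invented for illustration, not real OS interfaces):

```python
# Toy walk-through of the page-fault steps; every structure here is illustrative.
disk = {0x402: "needed page contents"}        # backing store, keyed by VPN
memory = {0x07: "victim page contents"}       # physical frames, keyed by PFN
frames = {0x07: 0x111}                        # pfn -> resident vpn (no free frames)
free_frames = []
page_table = {0x111: {"valid": True, "pfn": 0x07, "dirty": True},
              0x402: {"valid": False, "pfn": None, "dirty": False}}

def handle_page_fault(vpn):
    if free_frames:                                   # step 5: free frame available?
        pfn = free_frames.pop()
    else:                                             # step 6: pick a victim (naively: any resident page)
        pfn, victim_vpn = next(iter(frames.items()))
        victim = page_table[victim_vpn]
        if victim["dirty"]:
            disk[victim_vpn] = memory[pfn]            # step 7: write back dirty victim
        victim["valid"] = False                       # victim no longer resident; its TLB entry is invalidated
    memory[pfn] = disk[vpn]                           # step 8: read the needed page from disk
    frames[pfn] = vpn
    page_table[vpn] = {"valid": True, "pfn": pfn, "dirty": False}   # step 9
    # step 10: the faulting instruction is then restarted

handle_page_fault(0x402)
print(page_table[0x402])                              # -> {'valid': True, 'pfn': 7, 'dirty': False}
```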

Memory Management by the OS

The OS manages the allocation of page frames, handles page faults, maintains page tables, and enforces memory protection. It also decides when to swap pages in and out, trying to keep the most actively used pages in memory.

Memory Replacement Policies

When a cache is full or all page frames in main memory are occupied, and a new block/page needs to be brought in, an existing one must be evicted or replaced. The choice of which item to replace is determined by a replacement policy. The goal is to replace the item least likely to be needed in the near future.

Common policies include:

  • LRU (Least Recently Used): Replaces the block or page that has not been accessed for the longest period. This exploits temporal locality. It performs well but can be complex and costly to implement perfectly for caches with high associativity or for managing many page frames. Hardware usually implements approximations.
  • FIFO (First-In, First-Out): Replaces the block or page that has been in the cache or memory the longest. Simpler to implement than LRU but might evict frequently used items that were loaded long ago.
  • Random: Selects a random block/page for replacement. Very simple, and its performance can be surprisingly good, often close to more complex schemes, especially when access patterns are erratic.
  • LFU (Least Frequently Used): Replaces the block/page that has been accessed the fewest times. Requires counters for each entry.
  • Clock Algorithm (Second Chance): An approximation of LRU often used for page replacement. It maintains a "use bit" (or reference bit) for each page frame. When a replacement is needed, the OS scans frames in a circular manner. If a frame's use bit is 1, it's cleared, and the frame gets a "second chance." If the use bit is 0, that page is selected for replacement.
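
To make the clock algorithm concrete, here is a compact sketch (the frames, use bits, and starting hand position are invented for illustration):

```python
# Sketch of the clock (second-chance) scan; frames and use bits are illustrative.
frames = ["A", "B", "C", "D"]        # resident pages, one per frame
use_bit = [1, 0, 1, 1]               # reference bits maintained by the hardware/OS
hand = 0                             # the clock "hand" position

def choose_victim():
    global hand
    while True:
        if use_bit[hand] == 1:       # recently used: clear the bit, give a second chance
            use_bit[hand] = 0
            hand = (hand + 1) % len(frames)
        else:                        # use bit is 0: this page is the victim
            victim = hand
            hand = (hand + 1) % len(frames)
            return victim

print(frames[choose_victim()])       # -> 'B' (first frame found with use bit 0)
```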

When replacing a dirty page/block (one that has been modified), it must be written back to the lower level of memory (disk for pages, main memory for cache blocks using write-back) before the new page/block can be loaded. This adds to the miss/fault penalty.

Conclusion: The Intricate Dance of Data

The memory system is a sophisticated hierarchy, not just a single monolithic block. From the lightning-fast registers and multi-level caches built with SRAM, to the vast DRAM-based main memory, and finally to persistent secondary storage, each level plays a role. Key principles like locality, inclusion, and coherence guide their design. Caches employ clever mapping functions and miss-reduction strategies. Virtual memory dramatically expands the usable address space through techniques like paging, address translation via MMUs and TLBs, and careful OS management of page faults and replacement policies. Together, these technologies orchestrate an intricate dance of data, striving to keep the processor fed and programs running smoothly.

Key Takeaways:

  • Memory hierarchies balance speed, size, and cost by using different memory technologies at different levels.
  • Locality of reference (temporal and spatial) is the fundamental reason why hierarchies work.
  • Cache memory uses mapping functions (direct, associative, set-associative) to store frequently accessed data closer to the CPU, with various techniques to reduce misses.
  • Virtual memory allows programs to use an address space larger than physical memory by using disk storage as an extension, managed through paging, page tables, and the MMU/TLB.
  • Replacement policies (like LRU) are crucial for deciding what data to evict when cache or memory is full.