The Computer's Memory - A Deep Dive into Storage, Speed, and Hierarchy
Hey memory masters and data devotees! We've journeyed through the CPU's core and its arithmetic capabilities. But where does all the data and all those instructions live? That's the domain of the Memory System. It's not just a simple box of bits; it's a sophisticated, hierarchical structure designed to balance speed, size, and cost. Let's unlock its secrets!
Basic Concepts: The ABCs of Memory
At its core, memory is all about storing and retrieving information. The processor accesses memory using an address. If a processor uses k-bit addresses, it can address up to 2^k memory locations. Data is typically accessed in fixed-size groups called words (e.g., 32-bit or 64-bit), though most systems are byte-addressable, meaning each byte has a unique address. Key speed metrics are:
- Memory Access Time: The time elapsed between initiating a read/write request and the data being available/stored.
- Memory Cycle Time: The minimum time delay required between the initiation of two successive memory operations.
The fundamental challenge is the speed disparity: processors are much faster than large main memories. This leads to the memory hierarchy concept: a layered approach using different memory types to provide both high speed and large capacity economically. The typical hierarchy is: Processor Registers -> L1 Cache -> L2 Cache -> Main Memory -> Secondary Storage (Disk). As we move down the hierarchy, size increases, speed decreases, and cost per bit decreases.
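To make the 2^k relationship concrete, here's a minimal C sketch (the address widths chosen are purely illustrative) that prints how many locations a k-bit address can reach:

```c
#include <stdio.h>
#include <stdint.h>

/* Number of distinct locations addressable with k address bits: 2^k. */
static uint64_t addressable_locations(unsigned k)
{
    return (uint64_t)1 << k;            /* valid for k < 64 */
}

int main(void)
{
    unsigned widths[] = { 16, 32, 48 }; /* illustrative address widths */
    for (int i = 0; i < 3; i++)
        printf("%2u-bit addresses -> %llu byte-addressable locations\n",
               widths[i], (unsigned long long)addressable_locations(widths[i]));
    return 0;
}
```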
Semiconductor RAM Memories: The Workhorse
Random Access Memory (RAM) is the main operational memory where active programs and data reside. It's characterized by the ability to access any location in a constant amount of time.
Internally, memory chips consist of an array of memory cells, each storing one bit. These cells are organized into rows, where each row typically corresponds to one memory word. An address decoder selects a specific row (word line), and sense/write circuits connect to bit lines for reading or writing data.
Static RAM (SRAM)
- Structure: SRAM cells are essentially latches or flip-flops, commonly built using multiple transistors (often six). They retain their state as long as power is supplied.
- Properties:
- Volatile (data lost when power is off).
- Very fast access times (a few nanoseconds).
- Higher cost and lower cell density (more transistors per bit) compared to DRAM.
- Often used for cache memories due to their speed.
Dynamic RAM (DRAM)
- Structure: DRAM cells are simpler, typically consisting of a single transistor and a capacitor to store a bit as electric charge.
- Properties:
- Do not retain state indefinitely; the capacitor charge leaks away, so the cells must be periodically refreshed.
- Volatile.
- Slower access times than SRAM.
- Higher cell density and lower cost per bit than SRAM, making them suitable for main memory.
DRAM Evolution and Features:
- Asynchronous DRAM: Early DRAMs used asynchronous control signals. To reduce the number of pins on the chip, the address is often multiplexed: the row address is sent first and latched by the Row Address Strobe (RAS) signal, followed by the column address latched by the Column Address Strobe (CAS) signal.
- Fast Page Mode (FPM DRAM): Once a row is selected (latched by RAS), multiple consecutive column locations within that row can be accessed rapidly by only asserting CAS and providing new column addresses, without reasserting RAS. This is efficient for transferring blocks of data.
- Synchronous DRAM (SDRAM): Operation is synchronized with an external clock signal (typically the processor bus clock). This allows for more precise timing control and higher speeds. SDRAMs often feature:
- Output latches for data.
- Burst Mode: After an initial address is provided, SDRAM can automatically generate a sequence of addresses for consecutive locations, controlled by the clock, for fast block transfers.
- DDR SDRAM (Double Data Rate SDRAM): Achieves higher bandwidth by transferring data on both the rising and falling edges of the clock signal. Modern systems use DDR3, DDR4, DDR5, etc., each generation offering increased speed and efficiency. Often, the internal cell array is organized into multiple banks to allow concurrent access preparations, further improving throughput.
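To make the multiplexed row/column addressing and the burst idea above a bit more tangible, here is a rough sketch. It assumes a chip with 1024 columns per row (an invented geometry): the row address is latched once, then a burst of consecutive column addresses follows.

```c
#include <stdio.h>

#define COL_BITS 10                       /* assumed: 1024 columns per row */
#define COL_MASK ((1u << COL_BITS) - 1)

/* Print the row that is opened once (latched by RAS) and the sequence of
 * column addresses that follow (each latched by CAS) -- the idea behind
 * fast page mode and SDRAM burst transfers.                              */
static void burst_read(unsigned start, unsigned burst_len)
{
    unsigned row = start >> COL_BITS;
    unsigned col = start & COL_MASK;

    printf("open row %u once, then burst columns:", row);
    for (unsigned i = 0; i < burst_len; i++)
        printf(" %u", (col + i) & COL_MASK);   /* stays within the open row */
    printf("\n");
}

int main(void)
{
    burst_read(0x7BF0, 8);   /* illustrative start address and burst length */
    return 0;
}
```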
(Think Point: Why is refreshing necessary for DRAM but not for SRAM?) Answer: DRAM stores bits as charge on capacitors, which leak over time. SRAM uses flip-flops which hold their state as long as power is applied.
Read-Only Memories (ROMs): Permanent Storage
Unlike RAM, ROMs are non-volatile, meaning they retain their stored information even when power is turned off. This makes them essential for storing firmware, such as the computer's boot program (BIOS/UEFI) or embedded system software.
- ROM (Masked ROM): Data is written into the ROM during the manufacturing process. Not user-alterable.
- PROM (Programmable ROM): Can be programmed once by the user using special equipment.
- EPROM (Erasable Programmable ROM): Can be erased by exposing the chip to ultraviolet (UV) light and then reprogrammed. Useful during development.
- EEPROM (Electrically Erasable Programmable ROM): Can be erased and reprogrammed electrically, often byte-by-byte, without being removed from the circuit.
- Flash Memory: A type of EEPROM where read operations can access individual cells, but write/erase operations are performed on entire blocks of cells. This makes it denser and cheaper per bit than traditional EEPROM. Its low power consumption and robustness make it ideal for SSDs, USB drives, and memory cards.
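A toy model of the block-oriented behaviour that sets Flash apart from byte-alterable EEPROM follows; the block size and function names are invented for illustration, not a real device's interface.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096   /* assumed erase-block size in bytes */
#define NUM_BLOCKS 4

static uint8_t flash[NUM_BLOCKS][BLOCK_SIZE];

/* Reads are fine-grained: any individual byte can be fetched. */
static uint8_t flash_read(unsigned block, unsigned offset)
{
    return flash[block][offset];
}

/* Writes are coarse-grained: changing anything means erasing the whole
 * block (all bytes go to the erased state, 0xFF) and reprogramming it. */
static void flash_rewrite_block(unsigned block,
                                const uint8_t new_data[BLOCK_SIZE])
{
    memset(flash[block], 0xFF, BLOCK_SIZE);       /* block erase (modelled explicitly) */
    memcpy(flash[block], new_data, BLOCK_SIZE);   /* program the whole block */
}

int main(void)
{
    static uint8_t image[BLOCK_SIZE] = { [10] = 0x5A };   /* mostly zero data */
    flash_rewrite_block(0, image);
    printf("byte 10 of block 0 = 0x%02X\n", flash_read(0, 10));
    return 0;
}
```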
Speed, Size, and Cost: The Eternal Triangle
The design of any memory system is governed by the trade-off between these three factors:
- Speed (Access Time): Faster memory (e.g., SRAM) is more expensive.
- Size (Capacity): Larger capacity memory (e.g., DRAM, magnetic disks) tends to be slower and cheaper per bit.
- Cost: The ultimate constraint.
The memory hierarchy is the primary architectural strategy to achieve a system that appears to have the speed of the fastest memory, the capacity of the largest memory, and an intermediate cost.
Memory Module Design: Building Larger Memories
Individual memory chips often don't provide enough capacity or the desired word width for a computer system. Therefore, multiple chips are combined on a small circuit board called a memory module.
- Wider Word: To create a memory with a 32-bit word using 8-bit wide chips, four chips would be used in parallel. Each chip contributes 8 bits to the 32-bit word for any given address.
- Increased Capacity: To increase the number of words, multiple such banks of chips are used. Chip Select (CS) signals, derived from higher-order address bits, are used to select which bank (or row of chips) is active for a given memory access.
- SIMMs and DIMMs:
- SIMM (Single In-line Memory Module): Has memory chips on one or both sides, but the contacts on the two sides of its edge connector are electrically connected, so both sides carry the same signals.
- DIMM (Dual In-line Memory Module): Has separate electrical contacts on each side of the edge connector, allowing for wider data paths (e.g., 64-bit or more). These modules plug into sockets on the motherboard, enabling easy upgrades and system configuration.
- Memory Controller: Processors usually output a full address. For DRAMs that require multiplexed row and column addresses, a memory controller circuit (often part of the chipset or integrated into the CPU) is needed. It takes the full address from the processor, splits it into row and column components, and generates the necessary RAS, CAS, and other timing signals for the DRAM modules.
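As a hedged illustration of what such a memory controller might do, the sketch below carves a processor-supplied word address into a chip-select field and the row/column fields sent to the DRAMs. The two-bank organization and field widths are assumptions for illustration, not a real controller's layout.

```c
#include <stdio.h>

/* Assumed organization: two banks of chips (1 chip-select bit), each bank
 * a DRAM array with 12 row bits and 10 column bits. Four 8-bit chips in a
 * bank are accessed in parallel to form one 32-bit word, so they all see
 * the same row/column address and each supplies one byte.                */
#define CS_BITS   1
#define ROW_BITS 12
#define COL_BITS 10

typedef struct {
    unsigned long chip_select;   /* which bank of chips to enable */
    unsigned long row;           /* sent first, latched by RAS    */
    unsigned long col;           /* sent second, latched by CAS   */
} dram_fields;

static dram_fields decode_address(unsigned long word_addr)
{
    dram_fields f;
    f.col         = word_addr & ((1ul << COL_BITS) - 1);
    f.row         = (word_addr >> COL_BITS) & ((1ul << ROW_BITS) - 1);
    f.chip_select = (word_addr >> (COL_BITS + ROW_BITS)) & ((1ul << CS_BITS) - 1);
    return f;
}

int main(void)
{
    dram_fields f = decode_address(0x2ACF31ul);   /* an arbitrary word address */
    printf("chip select = %lu, row = %lu, column = %lu\n",
           f.chip_select, f.row, f.col);
    return 0;
}
```

Because every chip in the selected bank receives the same row and column address, each one contributes its 8 bits to the full 32-bit word in parallel.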
Cache Memories: Speeding Up Access
The cache is a small, fast memory (typically SRAM) located between the processor and the main memory (DRAM). Its purpose is to store frequently accessed data and instructions, making them available to the processor much faster than fetching from main memory. This works due to the principle of locality of reference:
- Temporal Locality: Recently accessed items are likely to be accessed again soon.
- Spatial Locality: Items whose addresses are near a recently accessed item are likely to be accessed soon.
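A tiny example of both kinds of locality in action, assuming nothing more than a plain array walk with a running sum:

```c
#include <stdio.h>

#define N 1024

int main(void)
{
    static int data[N];   /* zero-initialized array to walk through */
    long sum = 0;

    for (int i = 0; i < N; i++) {
        /* Spatial locality: data[i] sits right after data[i-1], so it is
         * usually already in the cache line fetched on an earlier miss.  */
        /* Temporal locality: sum and i are reused on every iteration.    */
        sum += data[i];
    }
    printf("sum = %ld\n", sum);
    return 0;
}
```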
Cache Operation Basics
- When the processor requests data, it first checks the cache.
- Cache Hit: If the data is found in the cache, it's a hit. The data is read quickly from the cache.
- Cache Miss: If the data is not in the cache, it's a miss. A block of data (containing the requested word and adjacent words) is copied from main memory into a cache line (also called a cache block). The requested word is then forwarded to the processor.
- Load-through (or Early Restart): The requested word is sent to the CPU as soon as it arrives from memory, without waiting for the rest of the block to be loaded into the cache, reducing stall time.
- Valid Bit: Each cache line has a valid bit. When the system is powered on, all valid bits are cleared. When a block is loaded, its valid bit is set.
- Write Policies (on a Write Hit):
- Write-Through: Data is written to both the cache and main memory simultaneously. Simple, but can be slow if the processor has to wait for main memory write. Often uses a write buffer to store pending writes, allowing the processor to continue without stalling.
- Write-Back (Copy-Back): Data is written only to the cache. A dirty bit (or modified bit) for the cache line is set. The modified block is written back to main memory only when the cache line is replaced. Faster for frequent writes to the same block, but more complex.
- Write Policies (on a Write Miss):
- Write-Allocate: The block is first loaded into the cache (like a read miss), and then the write operation proceeds on the cached copy. Typically used with write-back.
- No-Write-Allocate (or Write-Around): The data is written directly to main memory, bypassing the cache for this write. Typically used with write-through.
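The write-hit policies above can be sketched in a few lines of C; the cache-line layout and helper names here are hypothetical, not any real cache's interface.

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool     valid;
    bool     dirty;       /* meaningful only for write-back */
    unsigned tag;
    unsigned data[16];    /* one 16-word block */
} cache_line;

/* Stand-in for a bus transaction to main memory. */
static void write_to_main_memory(unsigned addr, unsigned value)
{
    printf("main memory write: [0x%04X] <- %u\n", addr, value);
}

/* Write hit, write-through policy: update both the cache and memory. */
static void write_hit_write_through(cache_line *line, unsigned word,
                                    unsigned addr, unsigned value)
{
    line->data[word] = value;
    write_to_main_memory(addr, value);    /* memory stays up to date */
}

/* Write hit, write-back policy: update the cache only and mark the line
 * dirty; memory is updated later, when the line is evicted.             */
static void write_hit_write_back(cache_line *line, unsigned word,
                                 unsigned value)
{
    line->data[word] = value;
    line->dirty = true;
}

int main(void)
{
    cache_line line = { .valid = true, .tag = 0x1F };
    write_hit_write_through(&line, 3, 0x1F3C, 42);
    write_hit_write_back(&line, 3, 99);
    printf("line dirty after write-back hit: %s\n", line.dirty ? "yes" : "no");
    return 0;
}
```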
Mapping Functions: Where to Put a Block?
Since the cache is much smaller than main memory, a mapping function decides where an incoming memory block should be placed in the cache. Consider a system with a conceptual 2K-word cache organized as 128 blocks, each containing 16 words, and a 64K-word main memory, organized as 4K blocks of 16 words each. A 16-bit main memory address can then be divided: 4 bits for the Word offset within a block, 7 bits for a cache Block index (in direct mapping, since 128 = 2^7 blocks), and the remaining 5 high-order bits for a Tag. (A small sketch of these field breakdowns follows the list below.)
- Direct Mapping:
- A memory block j can only be placed in a specific cache block, typically determined by (j mod m), where m is the number of blocks in the cache (e.g., 128).
- The low-order bits of the memory block number serve as the cache block index.
- The high-order bits of the memory block number are stored as a tag in the cache line along with the data and a valid bit.
- Operation: The processor's address is split. The index bits point to a cache line. The tag bits from the address are compared with the tag stored in that cache line. If they match and the valid bit is set, it's a hit.
- Pros: Simple and inexpensive to implement.
- Cons: Contention can occur if two frequently used memory blocks map to the same cache line, leading to repeated misses (thrashing) even if other cache lines are free.
- Associative Mapping (Fully Associative):
- A memory block can be placed in any available cache line.
- The memory address is split into a Word offset and a Tag. The entire remaining part of the address after the offset forms the tag.
- Operation: To find a block, its tag is simultaneously compared with the tags in all cache lines.
- Pros: Most flexible, highest cache utilization, avoids contention issues of direct mapping.
- Cons: Very expensive to implement due to the parallel comparison logic needed for all cache lines. Only feasible for small caches.
- Set-Associative Mapping:
- A compromise between direct and fully associative mapping. The cache is divided into v sets, each containing k blocks (this is a k-way set-associative cache).
- A memory block j is first mapped to a specific set using a calculation like set number = (j mod v).
- The block can then be placed in any of the k blocks within that determined set.
- The memory address is split into Word offset, Set index, and Tag.
- Operation: The Set index bits select a set. The Tag from the address is then compared simultaneously with the tags of all k blocks within that set.
- Example: A 2-way set-associative cache with 128 total blocks would have 64 sets, each containing 2 blocks. A memory block maps to one of these 64 sets and can reside in either of the two blocks within that set.
- Pros: Better hit rate than direct mapping for a similar cost, less complex than fully associative.
- Cons: More complex than direct mapping.
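Using the conceptual numbers from the example above (16-bit addresses, 16-word blocks, 128 cache blocks), here is a minimal sketch of how an address splits into tag, index/set, and word-offset fields for the direct-mapped and 2-way set-associative cases:

```c
#include <stdio.h>

/* Conceptual cache from the example above: 16-bit addresses, 16-word
 * blocks, 128 cache blocks (so 64 sets when 2-way set-associative).   */
#define OFFSET_BITS   4   /* 16 words per block             */
#define DM_INDEX_BITS 7   /* 128 lines, direct-mapped       */
#define SA_INDEX_BITS 6   /* 64 sets, 2-way set-associative */

static void direct_mapped_fields(unsigned addr)
{
    unsigned offset = addr & ((1u << OFFSET_BITS) - 1);
    unsigned index  = (addr >> OFFSET_BITS) & ((1u << DM_INDEX_BITS) - 1);
    unsigned tag    = addr >> (OFFSET_BITS + DM_INDEX_BITS);   /* 5 bits left */
    printf("direct-mapped  : tag=%2u  line=%3u  word=%2u\n", tag, index, offset);
}

static void two_way_fields(unsigned addr)
{
    unsigned offset = addr & ((1u << OFFSET_BITS) - 1);
    unsigned set    = (addr >> OFFSET_BITS) & ((1u << SA_INDEX_BITS) - 1);
    unsigned tag    = addr >> (OFFSET_BITS + SA_INDEX_BITS);   /* 6 bits left */
    printf("2-way set-assoc: tag=%2u  set =%3u  word=%2u\n", tag, set, offset);
}

int main(void)
{
    unsigned addr = 0xA7C4;   /* an arbitrary 16-bit word address */
    direct_mapped_fields(addr);
    two_way_fields(addr);
    return 0;
}
```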
Replacement Algorithms
When a cache miss occurs and the cache (or the target set in set-associative mapping) is full, a decision must be made about which existing block to replace.
- Direct Mapping: Replacement is trivial – there's only one possible line for any given memory block.
- Associative & Set-Associative:
- LRU (Least Recently Used): Replaces the block in the set that has not been accessed for the longest time. Generally performs well but is complex to implement perfectly for higher associativities (requires tracking access order).
- FIFO (First-In, First-Out): Replaces the block that has been in the set the longest. Simpler than LRU.
- Random: Randomly selects a block within the set to replace. Very simple to implement and often performs nearly as well as LRU.
(Challenge Question: If you have a 4-way set-associative cache with 256 sets, how many total blocks are in the cache? How many tag comparisons are needed for each cache access?) Answer: 256 sets * 4 blocks/set = 1024 blocks. Four parallel tag comparisons are needed within the selected set.
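Returning to LRU from the list above: one simple (if not hardware-realistic) way to track recency within a single set is an age counter per block, as in this sketch.

```c
#include <stdio.h>

#define WAYS 4

/* age[w] counts accesses since block w of this set was last used.
 * On every access the touched way is reset to 0 and all others age. */
static void touch(unsigned age[WAYS], unsigned way)
{
    for (unsigned w = 0; w < WAYS; w++)
        age[w]++;
    age[way] = 0;
}

/* The victim is the way with the largest age, i.e. least recently used. */
static unsigned lru_victim(const unsigned age[WAYS])
{
    unsigned victim = 0;
    for (unsigned w = 1; w < WAYS; w++)
        if (age[w] > age[victim])
            victim = w;
    return victim;
}

int main(void)
{
    unsigned age[WAYS] = { 0 };
    touch(age, 0);   /* access pattern within one set: ways 0, 1, 3, 1 */
    touch(age, 1);
    touch(age, 3);
    touch(age, 1);
    printf("way to replace next: %u\n", lru_victim(age));   /* way 2, never used */
    return 0;
}
```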
Memory Interleaving: Boosting Bandwidth
Memory interleaving aims to increase the effective memory bandwidth by dividing the main memory into multiple independent modules and arranging addresses so that consecutive words can reside in different modules. This allows simultaneous access to several modules, especially useful for block transfers (like cache line fills).
- High-Order Interleaving: The high-order bits of the memory address select the module, and the low-order bits select the word within the module. Consecutive addresses (e.g., 0, 1, 2, 3...) are in the same module until that module fills up. This doesn't help much for single block transfers of consecutive words but can be useful if multiple independent memory requests for different large regions are pending.
- Low-Order Interleaving: The low-order bits of the address select the module, and the high-order bits select the word within that module. Consecutive words (e.g., word 0, word 1, word 2, word 3) are in different modules (module 0, module 1, module 2, module 3, then module 0 again for word 4, assuming 4 modules). This is highly effective for block transfers, as all modules can be activated simultaneously to fetch different parts of the block in parallel.
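A small sketch of low-order interleaving with four modules (the module count is an assumption): the low-order address bits pick the module, so consecutive words land in different modules and can be fetched in parallel.

```c
#include <stdio.h>

#define MODULES 4   /* assumed number of memory modules */

int main(void)
{
    /* Low-order interleaving: module = address mod MODULES,
     * location within module = address / MODULES.           */
    for (unsigned addr = 0; addr < 8; addr++)
        printf("word %u -> module %u, location %u within module\n",
               addr, addr % MODULES, addr / MODULES);
    return 0;
}
```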
Memory Performance Considerations (Including Caches)
Optimizing memory system performance is a continuous effort.
- AMAT (Average Memory Access Time): A key metric. For a single-level cache: AMAT = HitTime + MissRate × MissPenalty. For a system with L1 and L2 caches: AMAT = HitTime_L1 + MissRate_L1 × (HitTime_L2 + MissRate_L2 × MissPenalty_L2). (A worked example appears after this list.)
- Reducing Miss Penalty:
- Multilevel Caches (L1, L2, L3): An L2 cache between L1 and main memory catches many L1 misses, reducing the effective miss penalty for L1. L3 caches serve a similar role for L2 misses.
- Critical Word First / Early Restart: On a miss, fetch the requested word first and send it to the CPU immediately, then fill the rest of the cache line. Reduces the processor's wait time.
- Victim Caches: A small, fully associative cache that stores blocks recently evicted from the main cache (typically L1). If an L1 miss occurs, the victim cache is checked first, potentially providing a quick "hit" on an evicted block. This is especially useful for reducing conflict misses in direct-mapped or low-associativity caches.
- Reducing Hit Time:
- Small and Simple Caches: Smaller L1 caches with direct mapping can have lower hit times, as less logic is involved in decoding and tag comparison.
- Reducing Miss Rate:
- Larger Block Size: Can improve spatial locality by bringing in more potentially useful adjacent data. However, if too large, it can increase the miss penalty (more data to transfer) and cache pollution (bringing in unused data that evicts useful data).
- Larger Cache Size: Generally reduces capacity misses (misses that occur because the cache isn't large enough to hold all currently active data), but increases cost and potentially hit time.
- Higher Associativity: Reduces conflict misses (misses that occur because too many active blocks map to the same set/line) but increases hit time (more comparisons) and complexity.
- Prefetching: Attempting to bring data/instructions into the cache before they are explicitly requested.
- Hardware Prefetching: Circuitry detects access patterns (e.g., sequential access, strides) and automatically fetches subsequent blocks.
- Software Prefetching: Special "prefetch" instructions inserted by the compiler or programmer hint to the hardware to load specific data into the cache.
- Write Buffers: These queues hold write operations, allowing the processor to continue execution without waiting for slower main memory writes (in a write-through system) or for a dirty block to be written back (in a write-back system). This effectively hides write latency.
- Lockup-Free Cache: A cache that can continue to service other memory requests (hits or other misses) while one or more misses are already being processed from main memory. It requires internal buffers and logic to track multiple outstanding misses.
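Here is the worked AMAT example promised above; the hit times, miss rates, and miss penalty are made-up illustrative numbers, not measurements of any real system.

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative numbers only (cycles and fractions), not real hardware data. */
    double hit_time_l1  = 1.0;     /* cycles */
    double miss_rate_l1 = 0.05;
    double hit_time_l2  = 10.0;    /* cycles */
    double miss_rate_l2 = 0.20;    /* fraction of L1 misses that also miss in L2 */
    double miss_penalty = 100.0;   /* cycles to main memory */

    /* AMAT = HitTime_L1 + MissRate_L1 × (HitTime_L2 + MissRate_L2 × MissPenalty_L2) */
    double amat = hit_time_l1
                + miss_rate_l1 * (hit_time_l2 + miss_rate_l2 * miss_penalty);

    printf("AMAT = %.2f cycles\n", amat);   /* 1 + 0.05 * (10 + 0.2 * 100) = 2.5 */
    return 0;
}
```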
Virtual Memories: Expanding the Horizon
Virtual memory techniques allow programs to use a logical address space that can be much larger than the actual physical main memory installed in the computer. Only the active portions of programs and data are kept in main memory, while the rest resides on a slower, larger secondary storage device (like a disk).
- Address Translation: The processor generates virtual addresses (also called logical addresses). These are translated into physical addresses (actual main memory locations) by a hardware unit called the Memory Management Unit (MMU), often working in conjunction with the operating system.
- Paging: The virtual address space and physical main memory are divided into fixed-size blocks called pages and page frames, respectively (e.g., 4KB, 8KB). Data is transferred between disk and main memory in units of pages.
- Page Table: This crucial data structure, managed by the OS and used by the MMU, stores the mapping between virtual pages and their corresponding physical page frames in main memory.
- The starting physical address of the page table itself is usually stored in a special processor register (e.g., Page Table Base Register).
- A virtual address is typically split into a virtual page number (high-order bits) and an offset within that page (low-order bits).
- The virtual page number is used as an index into the page table to find the entry for that page.
- Each page table entry (PTE) contains:
- The physical page frame number (if the page is in main memory).
- Control bits:
- Valid bit (or Present bit): Indicates if the page is currently in main memory.
- Modified (Dirty) bit: Indicates if the page has been written to since being loaded into memory. If set, the page must be written back to disk before its page frame can be reused.
- Protection bits (e.g., Read, Write, Execute permissions).
- Translation Lookaside Buffer (TLB): Accessing the page table in main memory for every memory reference would be slow (it would mean two memory accesses: one for the PTE, one for the actual data). The TLB is a small, fast, typically associative cache within the MMU that stores recently used page table entries.
- When a virtual address is generated, the MMU first checks the TLB for the corresponding virtual page number.
- TLB Hit: If found, the physical page frame number is retrieved quickly from the TLB, and the physical address is formed by combining it with the offset.
- TLB Miss: If not found, the MMU (or OS, depending on design) accesses the full page table in main memory to get the PTE. This PTE is then loaded into the TLB (possibly replacing an existing TLB entry using an algorithm like LRU), and the translation proceeds. The TLB must be kept coherent with the main page table; if the OS changes a PTE, corresponding TLB entries must be invalidated.
- Page Fault: If the valid bit in the PTE (whether found in the TLB or looked up in the main page table) indicates that the required page is not currently in main memory, a page fault exception occurs.
- The processor suspends the current program and transfers control to the operating system.
- The OS locates the required page on the disk.
- It finds a free page frame in main memory. If main memory is full, a page replacement algorithm (like LRU or a clock-based approximation) selects a victim page to be removed.
- If the selected victim page is "dirty" (modified bit is set), it must be written back to the disk.
- The required page is then read from disk into the chosen (now free) page frame.
- The page table (and potentially the TLB) is updated with the new mapping and status information for the loaded page.
- The interrupted program is then allowed to resume execution, re-trying the instruction that caused the page fault.
(Quick Quiz: How does virtual memory give the illusion of a larger memory than physically present?) Answer: By keeping only the currently needed parts (pages) of a program in physical RAM and swapping other parts to/from disk as needed, it allows programs to address a much larger virtual space.
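To pull the translation pieces together, here is a deliberately simplified sketch of the path described above: split the virtual address into page number and offset, check a tiny TLB, then fall back to the page table. The table sizes, structures, and the crude TLB replacement are assumptions for illustration; a real MMU does all of this in hardware.

```c
#include <stdbool.h>
#include <stdio.h>

#define PAGE_BITS   12                /* assumed 4 KB pages           */
#define PAGE_SIZE   (1u << PAGE_BITS)
#define NUM_PAGES   16                /* tiny illustrative page table */
#define TLB_ENTRIES 4

typedef struct { bool valid; unsigned frame; } pte;                     /* page table entry */
typedef struct { bool valid; unsigned vpn; unsigned frame; } tlb_entry;

static pte       page_table[NUM_PAGES];
static tlb_entry tlb[TLB_ENTRIES];

/* Translate a virtual address to a physical one; false means page fault. */
static bool translate(unsigned vaddr, unsigned *paddr)
{
    unsigned vpn    = vaddr >> PAGE_BITS;
    unsigned offset = vaddr & (PAGE_SIZE - 1);

    /* 1. TLB check: the fast path. */
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *paddr = (tlb[i].frame << PAGE_BITS) | offset;   /* TLB hit */
            return true;
        }
    }

    /* 2. TLB miss: consult the page table in memory. */
    if (vpn >= NUM_PAGES || !page_table[vpn].valid)
        return false;                                        /* page fault */

    /* 3. Refill the TLB (crude replacement: always slot 0) and finish. */
    tlb[0] = (tlb_entry){ .valid = true, .vpn = vpn,
                          .frame = page_table[vpn].frame };
    *paddr = (page_table[vpn].frame << PAGE_BITS) | offset;
    return true;
}

int main(void)
{
    page_table[2] = (pte){ .valid = true, .frame = 7 };      /* map page 2 -> frame 7 */

    unsigned paddr;
    if (translate(0x2ABC, &paddr))                           /* page 2, offset 0xABC */
        printf("virtual 0x2ABC -> physical 0x%X\n", paddr);  /* expect 0x7ABC */
    if (!translate(0x5ABC, &paddr))
        printf("virtual 0x5ABC -> page fault (page not resident)\n");
    return 0;
}
```

In a real system, the page-fault path would then hand control to the operating system, which brings the page in from disk and retries the access, as outlined in the steps above.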
Conclusion: The Complex Web of Storage
The memory system is far more than just a passive storage area. It's an active, intricate hierarchy involving various technologies (SRAM, DRAM, Flash), clever architectural designs (caches, interleaving, memory modules), and sophisticated management techniques (mapping functions, replacement algorithms, virtual memory with MMUs and page tables). All these elements work together to try and deliver data and instructions to the blazingly fast processor as quickly and efficiently as possible, managing the fundamental trade-offs between speed, size, and cost.
Key Takeaways:
- The memory hierarchy (registers, cache, main memory, secondary storage) is essential for balancing performance, capacity, and cost.
- SRAM is fast and used for caches; DRAM is denser, cheaper, and used for main memory, requiring refresh.
- ROM and its variants (PROM, EPROM, EEPROM, Flash) provide non-volatile storage for firmware and persistent data.
- Caches exploit locality using mapping functions (direct, associative, set-associative) and replacement algorithms (LRU, Random) to keep frequently used data close to the CPU.
- Memory interleaving and various cache optimization techniques (write buffers, prefetching, victim caches, critical-word-first) aim to improve bandwidth and reduce latency.
- Virtual memory, through paging and hardware/software address translation (MMU, Page Tables, TLB), provides a large logical address space, efficiently managing the limited physical memory and enabling features like multitasking and memory protection.