Memory Lane - How Your OS Manages Computer Memory!

 — #OS #CS Basics

Hey memory explorers! Ever wonder how your computer juggles multiple applications, each demanding its own slice of RAM? That vital task falls to the Operating System's Memory Manager. Main memory is where programs must reside to be executed by the CPU, but it's a finite, precious resource. Let's trace the evolution of memory management techniques, from the early days to the sophisticated virtual memory systems we rely on today.

Early Days: Simple Allocation Schemes

In the beginning, memory management was straightforward, often dealing with one program at a time or simple batch processing.

  • Resident Monitor: In early batch systems, a small portion of memory was permanently occupied by the monitor – a rudimentary OS responsible for sequencing jobs. The rest of memory held the currently executing user program. Protection was minimal but crucial to prevent user programs from corrupting the monitor.

  • Multiprogramming with Fixed Partitions: To run multiple programs concurrently (multiprogramming), an early approach was to divide memory into several fixed-size partitions. When a job arrived, it was loaded into a partition large enough to hold it. The degree of multiprogramming was limited by the number of partitions. This was simple but inefficient – if a program didn't fill its partition, the leftover space (internal fragmentation) was wasted. Also, the fixed partition sizes might not match program needs well.

  • Multiprogramming with Variable Partitions (Contiguous Allocation): A more flexible approach treated memory as one large block initially. When a process arrived, a chunk of memory exactly its size (or slightly larger depending on allocation units) was carved out for it. This kept each process in a single contiguous section of memory. When a process terminated, its memory became a hole of free space. The OS maintained a list of these holes.

    • Dynamic Storage Allocation: How should the OS pick a hole for a new process? (A first-fit sketch in C appears after this list.)
      • First-Fit: Scan the list of holes and allocate the first one that is large enough. Fast.
      • Best-Fit: Scan the entire list and allocate the smallest hole that is large enough. Aims to leave the smallest possible leftover hole, but can create many tiny, unusable holes. Requires searching the whole list unless ordered by size.
      • Worst-Fit: Scan the entire list and allocate the largest hole. Aims to leave a large leftover hole that might be useful later. Also requires a full search.
    • Problem: Over time, memory becomes checkered with holes of various sizes scattered throughout – this is external fragmentation. Total free memory might be sufficient for a new process, but no single hole is large enough. Requires compaction (shuffling allocated memory to consolidate free space), which is time-consuming and requires processes to be relocatable. First-fit and best-fit are generally better than worst-fit regarding both time and fragmentation.
  • Hardware Support: Base and Limit Registers: To support contiguous allocation and protect processes from each other and the OS, simple hardware was introduced. A base register holds the starting physical address of a process's partition, and a limit register holds the size of the partition (range). Every memory address generated by the CPU (a logical address) automatically had the base register's value added to it by the Memory Management Unit (MMU) to get the physical address. The MMU also checked if the logical address was less than the limit register value; if not, it was an error (trap to OS). This allowed processes to be loaded into different physical locations (relocation) and protected memory boundaries. Some architectures used multiple base/limit registers, perhaps for code and data separately.
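Before moving on, two minimal sketches may help make these early schemes concrete. First, hole allocation: the snippet below is a first-fit walk over a singly linked list of free holes. The struct layout and names are illustrative, not taken from any particular OS; best-fit and worst-fit differ only in the scan policy (smallest or largest sufficient hole instead of the first one found).

```c
#include <stddef.h>

/* One free hole in the variable-partition scheme (illustrative layout). */
struct hole {
    size_t start;        /* starting address of the free region */
    size_t size;         /* size of the free region in bytes    */
    struct hole *next;   /* next hole in the free list          */
};

/* First-fit: take the first hole large enough for `request`, shrinking it
 * so the remainder stays on the free list. Returns the allocated start
 * address, or (size_t)-1 if no single hole is big enough. */
size_t first_fit_alloc(struct hole *free_list, size_t request)
{
    for (struct hole *h = free_list; h != NULL; h = h->next) {
        if (h->size >= request) {
            size_t addr = h->start;
            h->start += request;   /* carve the allocation off the front       */
            h->size  -= request;   /* the leftover may be a tiny, awkward hole */
            return addr;
        }
    }
    return (size_t)-1;             /* total free space may suffice, but no
                                      single hole does: external fragmentation */
}
```

A real allocator would also unlink holes that shrink to zero and coalesce adjacent holes when memory is freed.

Second, relocation and protection with base/limit registers. This is work the MMU does in hardware on every access; the C below is only a model of that check, with illustrative field names.

```c
#include <stdint.h>
#include <stdbool.h>

/* Per-process relocation state loaded into the MMU on a context switch. */
struct mmu_regs {
    uint32_t base;    /* start of the process's partition in physical memory */
    uint32_t limit;   /* size of the partition                               */
};

/* Translate a CPU-generated logical address. Returning false models the
 * "trap to the OS" case when the address falls outside the partition. */
bool relocate(const struct mmu_regs *r, uint32_t logical, uint32_t *physical)
{
    if (logical >= r->limit)        /* protection check against the limit register */
        return false;               /* hardware would raise a trap here            */
    *physical = r->base + logical;  /* relocation: add the base register           */
    return true;
}
```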

Sophisticated Schemes: Paging and Segmentation

Contiguous allocation suffered from fragmentation. Two more advanced techniques emerged to tackle this, allowing a process's physical memory to be non-contiguous.

  1. Paging:

    • Concept: Divides physical memory into fixed-size blocks called frames and logical memory (the process's address space) into blocks of the same size called pages. When a process runs, its pages can be loaded into any available frames in physical memory.
    • Address Translation: The logical address generated by the CPU is split into two parts: a page number (p) and a page offset (d). (A small translation sketch in C follows this list.)
      • The page number acts as an index into a page table. Each process has its own page table.
      • The page table entry corresponding to page p contains the base address of the physical memory frame where that page is stored.
      • This frame number is combined with the page offset (which is the same in logical and physical addresses) to form the final physical memory address.
    • Hardware Support: Requires a Page Table Base Register (PTBR) to point to the start of the page table in memory and often a Page Table Length Register (PTLR) to indicate its size.
    • Pros: Eliminates external fragmentation entirely. Allocation is easy – just find any N free frames for a program needing N pages. Allows for sharing of code pages between processes.
    • Cons: Still suffers from internal fragmentation (the last page of a process might not be full, wasting part of a frame). Page table lookup requires an extra memory access for every logical address reference, potentially doubling memory access time.
    • Performance Boost: TLBs: To overcome the two-access problem, a special hardware cache called the Translation Look-aside Buffer (TLB) or associative memory is used. It stores recent page-to-frame translations. When the CPU generates a logical address, the MMU checks the TLB first. If the page number is found (a TLB hit), the frame number is retrieved instantly without accessing the page table in memory. If it's not found (a TLB miss), the page table is consulted, and the translation is usually added to the TLB, replacing another entry. High hit ratios (e.g., 99%) make the average memory access time close to one physical memory access plus the small TLB lookup time.
    • Large Address Spaces: For 32-bit or especially 64-bit systems, a single-level page table can become enormous (megabytes or gigabytes!). Solutions include:
      • Hierarchical Paging: Page the page table itself (e.g., two-level or three-level paging).
      • Hashed Page Tables: Hash the virtual page number to find an entry in a hash table, which contains linked lists of elements (virtual page #, frame #, next pointer) for collision handling. Good for sparse address spaces.
      • Inverted Page Tables: Have one entry per physical frame, storing which process and virtual page occupies it. Reduces table size but increases search time (usually mitigated by hashing).
  2. Segmentation:

    • Concept: Views memory from the user's perspective. A program is a collection of logical segments, like main program, procedure, function, stack, heap, symbol table. Each segment is a logical unit with a specific purpose and potentially varying size.
    • Address Translation: A logical address is a two-part tuple: <segment-number, offset>.
      • The segment number indexes into a segment table.
      • Each segment table entry contains the base (starting physical address) and limit (length) of the segment.
      • The offset is added to the base to get the physical address. The offset is also checked against the limit for validity.
    • Hardware Support: Requires a Segment Table Base Register (STBR) and Segment Table Length Register (STLR).
    • Pros: Supports the user's view of memory. Easier to implement protection and sharing at the segment level (e.g., share a code segment, protect a data segment). Segments can grow or shrink.
    • Cons: Segment sizes vary, leading back to the external fragmentation problem, similar to variable partition contiguous allocation. Compaction might be needed.
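To tie the paging mechanics together, here is a small sketch of address translation with a TLB consulted before the page table. The page size, table sizes, and structure names are invented for illustration (and the TLB replacement policy here is a naive index hash); real hardware does all of this inside the MMU.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS   12                       /* 4 KiB pages (illustrative)           */
#define PAGE_SIZE   (1u << PAGE_BITS)
#define NUM_PAGES   256                      /* tiny single-level table for the demo */
#define TLB_ENTRIES 8

struct pte  { uint32_t frame; bool valid; };                 /* page-table entry     */
struct tlbe { uint32_t page; uint32_t frame; bool valid; };  /* cached translation   */

struct pte  page_table[NUM_PAGES];           /* one entry per logical page           */
struct tlbe tlb[TLB_ENTRIES];                /* associative cache of translations    */

/* Translate a logical address to a physical one.
 * Returns false for an invalid page, which would trap to the OS. */
bool translate(uint32_t logical, uint32_t *physical)
{
    uint32_t p = logical >> PAGE_BITS;       /* page number                          */
    uint32_t d = logical & (PAGE_SIZE - 1);  /* offset: identical in both addresses  */

    /* 1. TLB lookup: a hit avoids the extra page-table memory access. */
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].page == p) {
            *physical = (tlb[i].frame << PAGE_BITS) | d;
            return true;
        }
    }

    /* 2. TLB miss: consult the page table (one more memory reference). */
    if (p >= NUM_PAGES || !page_table[p].valid)
        return false;

    uint32_t frame = page_table[p].frame;
    tlb[p % TLB_ENTRIES] = (struct tlbe){ .page = p, .frame = frame, .valid = true };
    *physical = (frame << PAGE_BITS) | d;
    return true;
}
```

Segmentation's translation step is essentially the base/limit check shown earlier, applied per segment-table entry rather than per process.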

Some architectures, like Intel IA-32, supported both segmentation and paging, where segments were themselves paged.

The Magic of Virtual Memory: Beyond Physical Limits

The limitations of early schemes and the power of paging led to the concept of Virtual Memory.

  • Core Idea: Separates the user's logical memory view from the actual physical memory. The logical address space can now be much larger than the available physical RAM.
  • How? Not all of a program needs to be in physical memory at the same time. Only the parts currently being executed or accessed need to be resident. The rest can stay on a fast backing store (usually a disk).
  • Benefits:
    • Programs are no longer constrained by the amount of physical RAM.
    • More programs can run concurrently (higher degree of multiprogramming) because each uses less physical memory.
    • Less I/O is needed to load or swap processes, potentially speeding up execution start times and overall throughput.
    • Allows for efficient process creation (e.g., sharing pages) and flexible memory allocation (e.g., sparse address spaces).

Virtual memory is typically implemented using Demand Paging.

Demand Paging: Virtual Memory in Action

Demand paging is the cornerstone of most modern virtual memory systems. It combines paging with the idea of loading pages only when they are needed ("on demand").

  • Lazy Swapper (Pager): Instead of loading an entire process into memory at startup, the pager brings a page into physical memory only when it's actually accessed.

  • Valid-Invalid Bit: The page table is augmented with a valid-invalid bit for each entry.

    • valid (v): The associated page is currently in a physical memory frame.
    • invalid (i): The page is either not part of the process's logical address space or it is part of it but currently resides only on the backing store.
  • Page Fault: What happens when a process tries to access a page marked as invalid in the page table?

    1. The MMU hardware detects the invalid bit during address translation and generates a trap (an interrupt) to the OS – this is a page fault.
    2. The OS takes over. It checks an internal table to see if the access was to an illegal address (outside the process's logical space) or just to a page currently on disk. If illegal, the process is terminated.
    3. If the page is valid but on disk, the OS locates it on the backing store.
    4. It finds a free frame in physical memory. (If no frame is free, it initiates page replacement – see next section).
    5. The OS schedules a disk operation to read the desired page from the backing store into the chosen free frame.
    6. While the disk I/O happens (which is slow), the OS usually switches the CPU to another ready process.
    7. When the disk I/O completes, the OS receives an interrupt.
    8. The OS updates the process's page table entry for the newly loaded page, setting the valid-invalid bit to valid and recording the frame number.
    9. The OS restarts the instruction that was interrupted by the page fault. The process can now continue as if the page had always been in memory.
  • Performance: Demand paging is efficient only if the page fault rate (p) is low. The Effective Access Time is EAT = (1 – p) × memory access time + p × page fault service time. Since servicing a page fault involves disk I/O and takes milliseconds, while a memory access takes nanoseconds, even a tiny p dominates the EAT. Keeping p extremely low is crucial (a worked example follows this list).

  • Copy-on-Write (COW): An optimization, especially for process creation (fork()). Instead of copying all parent pages for the child, both parent and child initially share the pages, marked as read-only. If either process tries to write to a shared page, a page fault occurs, and only then is a private copy of that specific page made for the writing process.
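To see how sensitive EAT is to the fault rate, plug in illustrative textbook-style numbers: a memory access of 200 ns and a page-fault service time of 8 ms (8,000,000 ns). Then EAT = (1 – p) × 200 + p × 8,000,000 = 200 + 7,999,800 × p nanoseconds. With p = 0.001 (one fault per thousand references), EAT ≈ 8,200 ns, roughly a 40-fold slowdown. To keep the degradation below 10%, p would need to stay under about one fault per 400,000 references.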

Handling Page Faults: Page Replacement Algorithms

What if a page fault occurs and there are no free frames available? The OS must choose an existing frame to replace. It writes the contents of the victim frame to the backing store (if it has been modified – checked using a modify/dirty bit) and then loads the desired page into the now-free frame.
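Putting the pieces together, the replacement path looks roughly like the sketch below. This is a hedged outline, not a real kernel path: select_victim(), write_page_to_disk(), read_page_from_disk(), and update_page_table() are hypothetical helpers standing in for OS-specific machinery.

```c
#include <stdbool.h>

/* Illustrative OS-side bookkeeping for one frame of physical memory. */
struct frame_info {
    int  owner_pid;     /* process whose page currently occupies this frame */
    int  page_number;   /* which of that process's pages it holds           */
    bool dirty;         /* modify (dirty) bit: has the page been written?   */
};

/* Hypothetical helpers a real kernel would provide. */
int  select_victim(void);                               /* replacement algorithm (see below) */
void write_page_to_disk(int pid, int page);             /* flush a dirty victim              */
void read_page_from_disk(int pid, int page, int frame); /* bring in the faulting page        */
void update_page_table(int pid, int page, int frame, bool valid);

extern struct frame_info frames[];

/* Service a page fault for (pid, page) when no frame is free. */
void replace_page(int pid, int page)
{
    int victim = select_victim();               /* FIFO, LRU, clock, ...                    */
    struct frame_info *f = &frames[victim];

    if (f->dirty)                               /* write back only if the page was modified */
        write_page_to_disk(f->owner_pid, f->page_number);

    update_page_table(f->owner_pid, f->page_number, victim, false); /* invalidate old owner */

    read_page_from_disk(pid, page, victim);     /* slow disk I/O; the CPU runs other work   */

    f->owner_pid   = pid;
    f->page_number = page;
    f->dirty       = false;
    update_page_table(pid, page, victim, true); /* mark valid; restart faulting instruction */
}
```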

The choice of which page to replace is determined by a Page Replacement Algorithm. The goal is to choose a victim page that is least likely to be needed soon, minimizing future page faults. Common algorithms include:

  • First-In, First-Out (FIFO): Replaces the page that has been in memory the longest. Simple to implement (using a queue) but often performs poorly, as it might replace a heavily used page that was loaded long ago. Suffers from Belady's Anomaly: adding more frames can sometimes increase the page fault rate!
  • Optimal (OPT or MIN): Replaces the page that will not be used for the longest period in the future. This gives the lowest possible page fault rate and does not suffer from Belady's Anomaly. However, it's impossible to implement in practice because it requires predicting the future. It serves as a benchmark for evaluating other algorithms.
  • Least Recently Used (LRU): Replaces the page that has not been used for the longest period. It uses the recent past as an approximation of the near future, leveraging the locality of reference principle (processes tend to reuse pages they've used recently). Performs well in practice and doesn't suffer from Belady's Anomaly.
    • Implementation: Requires hardware support. Can use counters (each page table entry gets a timestamp on access; replace page with oldest timestamp – requires search) or a stack (move referenced page to top; replace page at bottom – no search, but updates are complex). Both add overhead.
  • LRU Approximations: Because true LRU is complex/expensive, approximations are often used:
    • Reference Bit Algorithm: Each page has a reference bit, set by hardware whenever the page is accessed. Periodically, the OS clears the bits. To replace, choose a page with the bit currently set to 0. Doesn't know the order of use, just whether it was used recently.
    • Second-Chance (Clock) Algorithm: A FIFO-like approach using the reference bit. Organizes pages in a circular queue with a pointer (the clock hand). When a victim is needed, check the page at the pointer: if its reference bit is 0, replace it; if it is 1, clear the bit (give the page a second chance) and advance the pointer to the next page, repeating the check. (A code sketch appears after this list.)
    • Enhanced Second-Chance Algorithm: Uses both the reference bit and the modify bit (dirty bit) as an ordered pair (reference, modify). Prefers to replace pages in this order: (0,0) -> (0,1) -> (1,0) -> (1,1), potentially cycling through the queue multiple times.
  • Counting Algorithms: Keep a count of references per page.
    • Least Frequently Used (LFU): Replace page with the smallest reference count. Assumes actively used pages have high counts.
    • Most Frequently Used (MFU): Replace page with the highest reference count. Assumes pages with low counts were just brought in and haven't been used yet. Less common.
  • Page-Buffering Algorithms: Keep a pool of free frames. When a fault occurs, load the needed page into a free frame immediately. Select a victim frame later and write it out (if dirty) when convenient, adding it to the free pool. Reduces fault service time.
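As a concrete example of one approximation from the list above, here is a compact sketch of second-chance (clock) victim selection. The frame count and the way the reference bit gets set are simplified for illustration; in reality the hardware sets the bit on each access.

```c
#include <stdbool.h>

#define NUM_FRAMES 64                 /* illustrative number of physical frames */

static bool ref_bit[NUM_FRAMES];      /* reference bit per frame, set on access        */
static int  hand = 0;                 /* the clock hand: position in the circular queue */

/* Model of the hardware setting the reference bit when a frame's page is used. */
void note_reference(int frame)
{
    ref_bit[frame] = true;
}

/* Second-chance (clock) selection: sweep the circular list of frames.
 * A frame with its reference bit set gets the bit cleared (a second chance)
 * and the hand moves on; the first frame found with the bit clear is evicted. */
int clock_select_victim(void)
{
    for (;;) {
        if (!ref_bit[hand]) {
            int victim = hand;
            hand = (hand + 1) % NUM_FRAMES;
            return victim;
        }
        ref_bit[hand] = false;        /* clear the bit: second chance granted */
        hand = (hand + 1) % NUM_FRAMES;
    }
}
```

If every reference bit happens to be set, the hand clears them all and comes back around, so the worst case degenerates to plain FIFO.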

Allocation of Frames: How Many Pages Does Each Process Get?

With virtual memory and page replacement, the OS also needs to decide how many physical frames to allocate to each running process.

  • Minimum: Every process needs a minimum number of frames to execute instructions correctly (e.g., an instruction might reference multiple memory locations across page boundaries).
  • Maximum: Limited by the total available physical memory.

Allocation Schemes:

  • Fixed Allocation:
    • Equal Allocation: Split m frames equally among n processes (m/n frames each). Simple.
    • Proportional Allocation: Allocate frames proportional to the process size (or priority); larger or higher-priority processes get more frames. (A small worked example follows this list.)
  • Priority Allocation: Use a proportional scheme based on priority, not size. If a page fault occurs, a high-priority process might replace a frame from a lower-priority process (global replacement) or only from its own allocated frames (local replacement).
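A quick worked example of proportional allocation, with illustrative numbers: suppose m = 62 free frames and two processes of size 10 pages and 127 pages. Each process i gets about (si / Σs) × m frames, so process 1 receives 10/137 × 62 ≈ 4 frames and process 2 receives 127/137 × 62 ≈ 57 frames; the larger process gets the lion's share of memory.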

Global vs. Local Replacement:

  • Global: A process can replace any frame in memory, even one belonging to another process. Leads to higher system throughput (as pages are selected from a larger pool) but means a process's performance can be affected by unrelated processes.
  • Local: A process can only replace one of its own allocated frames. Provides more consistent per-process performance but can lead to underutilization if some processes don't need all their allocated frames while others are faulting heavily.

Avoiding Meltdown: Thrashing

If a process doesn't have enough frames to hold its current working set or locality (the set of pages it's actively using), it will constantly page fault. It replaces a page it needs almost immediately, leading to more faults. This state of excessive paging and low CPU utilization is called Thrashing. The process spends more time paging than executing!

If the OS observes low CPU utilization, it might mistakenly increase the degree of multiprogramming, admitting more processes, which further reduces the frames available per process and worsens the thrashing!

How to deal with thrashing?

  • Locality Model: Processes move between localities (sets of actively used pages). Thrashing occurs when the combined size of all processes' current localities exceeds the total available physical memory. The OS should ensure each process has enough frames for its current locality.
  • Working-Set Model (WSM): Estimates the locality size (WSS) for each process using a working-set window (Δ, a number of recent page references). The set of unique pages referenced in the last Δ references is the working set. If the total demand D = Σ WSSi exceeds the number of physical frames m, the OS should suspend some processes to prevent thrashing. (A sketch follows this list.)
  • Page-Fault Frequency (PFF): A more direct approach. Monitor the page fault rate for each process. Establish upper and lower bounds for an acceptable rate.
    • If rate > upper bound -> process needs more frames.
    • If rate < lower bound -> process might have too many frames (can take one away).
    • If the rate is high and no free frames are available, suspend a process.
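Below is a small, self-contained sketch of the working-set idea: it computes WSS directly by scanning the last Δ references of a toy reference string. The window size and reference string are made up, and a real OS approximates this with reference bits sampled on a timer interrupt rather than an exact scan.

```c
#include <stdio.h>

#define MAX_PAGE 256   /* assume page numbers below this for the demo */

/* Working-set size: the number of distinct pages referenced in the last
 * `delta` references up to position `pos` of the reference string. */
int working_set_size(const int *refs, int pos, int delta)
{
    int seen[MAX_PAGE] = {0};
    int wss = 0;
    int start = pos - delta + 1;
    if (start < 0)
        start = 0;
    for (int i = start; i <= pos; i++) {
        if (!seen[refs[i]]) {    /* count each page only once within the window */
            seen[refs[i]] = 1;
            wss++;
        }
    }
    return wss;
}

int main(void)
{
    /* Toy reference string: the process shifts from one locality to another. */
    int refs[] = {1, 2, 1, 3, 2, 1, 2, 3, 7, 8, 7, 9, 8, 7, 9, 8};
    int n = (int)(sizeof refs / sizeof refs[0]);
    int delta = 6;   /* working-set window Δ */

    for (int t = 0; t < n; t++)
        printf("t=%2d page=%d WSS=%d\n", t, refs[t], working_set_size(refs, t, delta));
    return 0;
}
```

Summing the WSS of every runnable process gives the total demand D; if D exceeds the available frames m, suspending (swapping out) a process is the standard remedy.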

The Memory Management Tapestry

From the simplest partitioning schemes to the complex interplay of demand paging, page replacement, frame allocation, and thrashing prevention, memory management is a rich and critical area of OS design. Virtual memory allows modern systems to run large, numerous applications efficiently, but it relies on these sophisticated techniques working together seamlessly behind the scenes.