
Powering Modern Data Systems - CDC, Retries, DB Scaling & Locking Deep Dive

#interview #SDE

Hey everyone, and welcome back to the blog! In today's data-intensive world, building applications that are not only functional but also resilient, scalable, and capable of handling real-time data streams is paramount. Whether you're an engineer here in the bustling tech ecosystem of Bengaluru or anywhere else, you're likely grappling with these challenges.

Today, we're going to explore an interconnected set of techniques fundamental to modern data architectures: Change Data Capture (CDC) for keeping data in sync, Retry Mechanisms with Backoff Strategies for building resilient services, various Database Scaling Strategies to ensure your data tier can handle growth, and a crucial element for data integrity in concurrent environments: Database Locking. Let's dive in!


I. Change Data Capture (CDC): Tracking Data in Real-Time

In many systems, we need to react to changes happening in our source databases almost instantly. This is where Change Data Capture comes in.

  • What is CDC? Change Data Capture (CDC) is a crucial design pattern used to ensure that changes made to a source data store (like an operational database) can be identified, captured, and propagated to downstream systems or consumers in near real-time. It's all about tracking the "deltas" – inserts, updates, and deletes.

  • Why is CDC Important? CDC enables a host of valuable use cases:

    • Real-time Analytics: Feeding data changes into analytical systems for up-to-the-minute insights.
    • Data Replication & Synchronization: Keeping multiple data stores (e.g., data warehouses, caches, search indexes) consistent with the source database.
    • Microservice Data Consistency: Allowing different microservices that own their own data to react to changes in other services' data.
    • Auditing: Providing a stream of all data modifications for audit and compliance purposes.
  • How CDC Works (Common Methods): There are several ways to implement CDC, each with pros and cons:

    1. Timestamp-based CDC: (Relies on last_updated columns; simple but can miss deletes and cause DB load).
    2. Trigger-based CDC: (Uses DB triggers to capture changes; captures all changes but adds overhead).
    3. Log-based CDC (Most Popular & Efficient):
      • How: This method involves reading changes directly from the database's native transaction log (the redo log in Oracle, the binary log (binlog) in MySQL, or the write-ahead log (WAL) in PostgreSQL).
      • A CDC tool (e.g., Debezium) uses a source connector to read these logs, interpret the changes, and transform them into a stream of events.
      • Pros: Very low overhead on the source database, captures all committed changes accurately and in order, near real-time.
      • Cons: Log formats are database-specific, parsing can be complex.
  • Typical Log-based CDC Architecture with Kafka: A common pattern involves a CDC Connector reading database logs, publishing change events to Apache Kafka, and various downstream Consumers (databases, caches, search indexes) processing these events.
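
To make this concrete, here's a minimal consumer-side sketch in Python using the kafka-python client. It assumes a log-based connector (e.g., Debezium) is already publishing JSON change events to a Kafka topic; the topic name, broker address, and downstream actions are hypothetical placeholders, and the event envelope follows Debezium's common before/after/op shape.

```python
# Minimal consumer-side sketch: reading Debezium-style change events from Kafka.
# Assumptions: a log-based CDC connector already publishes JSON events to the
# (hypothetical) topic "inventory.public.orders", and kafka-python is installed.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "inventory.public.orders",             # hypothetical CDC topic name
    bootstrap_servers="localhost:9092",    # hypothetical broker address
    value_deserializer=lambda b: json.loads(b.decode("utf-8")) if b else None,
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value
    if event is None:                       # tombstone record (e.g., after a delete)
        continue
    payload = event.get("payload", event)   # Debezium may wrap changes in a "payload" envelope
    op = payload.get("op")                  # "c" = create, "u" = update, "d" = delete, "r" = snapshot read
    before, after = payload.get("before"), payload.get("after")
    if op in ("c", "r", "u"):
        print(f"Upsert downstream copy: {after}")
    elif op == "d":
        print(f"Delete downstream copy of: {before}")
```

Because change events can be redelivered, the downstream writes should themselves be idempotent upserts and deletes, which ties neatly into the retry discussion in the next section.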


II. Retry Mechanisms & Backoff Strategies: Building Resilient Systems

In distributed systems, transient failures are common. Robust systems handle these gracefully using retries, but naive retries can be harmful. (The importance of "timeouts, retries, and backoff with jitter" is highlighted by Amazon's engineering practices).

  • The Need for Retries & Idempotency: Retrying a failed operation often succeeds when the failure was only temporary. However, operations should ideally be idempotent, i.e., safe to retry because repeated executions produce the same outcome as a single one.
  • Backoff Strategies (to avoid overwhelming services):
    • Exponential Backoff: Increase wait time exponentially after each failure (e.g., 1s, 2s, 4s).
    • Exponential Backoff with Jitter (Gold Standard): Adds randomness to the exponential delay (delay = min(cap, base * (2 ** attempt)) + random_jitter) to prevent synchronized retries (thundering herd); a minimal sketch follows this list.
    • Considerations: Maximum number of retries, maximum backoff time, and distinguishing transient vs. permanent errors.
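
Here's a minimal sketch of that formula in Python. The TransientError class and call_flaky_service are hypothetical placeholders; the point is the delay calculation, the cap, and the limit on attempts.

```python
# A minimal retry helper implementing exponential backoff with jitter,
# mirroring delay = min(cap, base * (2 ** attempt)) + random_jitter.
# TransientError and call_flaky_service are hypothetical placeholders.
import random
import time

class TransientError(Exception):
    """Stand-in for failures worth retrying (timeouts, 5xx responses, etc.)."""

def retry_with_backoff(operation, max_attempts=5, base=1.0, cap=30.0):
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error as permanent
            delay = min(cap, base * (2 ** attempt)) + random.uniform(0, base)
            time.sleep(delay)

# Usage (hypothetical idempotent service call):
# result = retry_with_backoff(lambda: call_flaky_service(order_id=42))
```

A popular variant is "full jitter", where the entire delay is drawn at random from the range [0, min(cap, base * 2 ** attempt)], which spreads out retries even further.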

III. Database Scaling Strategies: Handling Growth and Load

Databases are often bottlenecks. Several strategies exist to scale them:

  1. Indexing:

    • What: Auxiliary data structures (commonly B-trees) that speed up data retrieval by avoiding full table scans.
    • Helps: Drastically speeds up read queries.
    • Trade-offs: Slows down writes/updates (indexes also need updating); consumes storage.
  2. Materialized Views:

    • What: Stores pre-computed results of a query.
    • Helps: Speeds up complex/frequent queries.
    • Trade-offs: Data can be stale until refreshed; refresh consumes resources; uses storage.
  3. Vertical Scaling (Scaling Up):

    • What: Increasing resources (CPU, RAM, disk) of the existing database server.
    • Pros: Simpler initially.
    • Cons: Hard limits, high cost beyond a point, single point of failure if not HA.
  4. Replication (Read Replicas):

    • What: Creating read-only copies of the primary database. Writes go to primary, reads to replicas.
    • Helps: Scales read throughput, reduces load on primary.
    • Trade-offs: Replication lag (eventual consistency for reads), complexity in routing read/write queries.
  5. Sharding (Horizontal Partitioning):

    • What: Dividing data by rows across multiple database servers (shards).
    • Helps: Scales both reads and writes, distributes data for massive datasets.
    • Strategies: Range-based, Hash-based, Directory-based sharding (a hash-based routing sketch follows this list).
    • Trade-offs: Significant complexity (cross-shard joins, referential integrity, re-sharding is painful).
  6. Database Caching:

    • What: Using in-memory caches (e.g., Redis) or database internal caches (buffer pool) for frequently accessed data/queries.
    • Helps: Reduces read latency and database load.
    • Trade-offs: Cache invalidation, consistency, potential cache-related issues (penetration, avalanche).
  7. Denormalization:

    • What: Intentionally adding redundant data to tables to reduce complex joins for specific read queries.
    • Helps: Speeds up read performance for targeted queries.
    • Trade-offs: Increased storage, data redundancy (risk of inconsistency), more complex writes.
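
As a quick illustration of the sharding strategy above, here's a sketch of hash-based shard routing in Python. The shard endpoints and the get_connection helper are hypothetical; the key idea is that a stable hash of the shard key deterministically picks a shard.

```python
# A minimal sketch of hash-based shard routing: a stable hash of the shard key
# (here, user_id) picks which database shard a row lives on.
# The shard endpoints and get_connection helper are hypothetical placeholders.
import hashlib

SHARD_DSNS = [
    "postgresql://shard0.example.internal/app",
    "postgresql://shard1.example.internal/app",
    "postgresql://shard2.example.internal/app",
    "postgresql://shard3.example.internal/app",
]

def shard_for(user_id: str) -> str:
    # Use a stable hash (not Python's per-process randomized built-in hash) so the
    # same key always routes to the same shard across processes and restarts.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARD_DSNS)
    return SHARD_DSNS[index]

# Usage: all queries for this user go to the shard its key hashes to.
# conn = get_connection(shard_for("user-12345"))   # get_connection is hypothetical
```

Note that simple hash-modulo routing is exactly why re-sharding is painful: changing the number of shards remaps most keys, which is why consistent hashing or directory-based schemes are often preferred in practice.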

IV. Database Locking: Managing Concurrent Access and Ensuring Integrity

When multiple transactions try to access and modify data simultaneously, we need mechanisms to prevent chaos (like lost updates or dirty reads) and ensure data integrity, often guided by ACID properties. Database locks are a fundamental concurrency control mechanism.

  • Why Locks? To control how multiple transactions interact with shared data, ensuring that operations are performed in a safe and consistent manner, according to the chosen isolation level.

Lock Granularity: How Much to Lock?

Locks can be applied at different levels, creating a trade-off between concurrency and overhead:

  • Table-level Lock: The entire table is locked.
    • Pros: Simple to implement, low overhead for managing locks.
    • Cons: Severely limits concurrency; an exclusive table lock blocks every other transaction from the entire table.
  • Page-level Lock: Locks a disk page (which typically contains multiple rows).
    • Pros: Fewer locks to manage than row-level locks.
    • Cons: Can still cause contention if unrelated rows on the same page are frequently accessed by different transactions.
  • Row-level Lock: Locks only the specific rows being accessed or modified.
    • Pros: Offers the highest degree of concurrency, as unrelated rows are not blocked.
    • Cons: Higher overhead for the database to manage a potentially large number of individual row locks.
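
To see row-level locking in action, here's a minimal sketch against PostgreSQL using psycopg2 and SELECT ... FOR UPDATE. The connection string, table, and column names are hypothetical.

```python
# A minimal sketch of row-level locking in PostgreSQL via SELECT ... FOR UPDATE,
# using psycopg2. The DSN, table, and column names are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect("dbname=shop user=app")   # hypothetical connection string

with conn:                      # commits on success, rolls back on exception
    with conn.cursor() as cur:
        # Lock only the row we intend to change; other rows stay available
        # to concurrent transactions (row-level granularity).
        cur.execute(
            "SELECT quantity FROM inventory WHERE item_id = %s FOR UPDATE",
            (42,),
        )
        (quantity,) = cur.fetchone()   # assumes the row exists
        cur.execute(
            "UPDATE inventory SET quantity = %s WHERE item_id = %s",
            (quantity - 1, 42),
        )
# The row lock is released when the transaction commits or rolls back.
```

Only the selected row is locked, so concurrent transactions touching other rows proceed unblocked.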

Common Lock Modes/Types:

  • Shared Lock (S Lock):
    • If transaction A locks some data, transaction B can only read the data but cannot modify it.
    • Multiple transactions can hold S locks on the same resource simultaneously, allowing concurrent reads.
    • Prevents other transactions from acquiring an Exclusive (X) lock on the resource.
  • Exclusive Lock (X Lock):
    • If an X lock is applied to some data, other transactions can neither read nor change the data.
    • Only one transaction can hold an X lock on a resource at a time. It's required for write operations (UPDATE, DELETE, INSERT).
    • Prevents any other transaction from acquiring S locks or X locks on the same resource.
  • Update Lock (U Lock):
    • Purpose: Used to prevent a common form of deadlock that can occur when multiple transactions attempt to read a resource, then decide to update it (which would require promoting their S lock to an X lock). If two transactions try this simultaneously, they can deadlock waiting for each other to release their S locks.
    • How it Works: A transaction reads data with a U lock. U locks are compatible with S locks (other transactions can still read with S locks) but not with other U locks or X locks on the same resource. If the transaction holding the U lock decides to modify the data, the U lock can then be promoted to an X lock (assuming no other S locks are present from other transactions that would conflict with an X lock).
  • Schema Locks (Sch-S, Sch-M):
    • Purpose: Protect the definition (schema) of database objects (like tables, indexes) during Data Definition Language (DDL) operations (e.g., ALTER TABLE, CREATE INDEX).
    • Sch-S (Schema Stability Lock): Acquired by operations that depend on a stable schema (like compiling a query). Prevents concurrent DDL operations that would alter the object's schema but allows other DML or query operations.
    • Sch-M (Schema Modification Lock): An exclusive lock required by DDL operations that change an object's schema. Prevents all other operations on that object.
  • Bulk Update Lock (BU Lock):
    • Purpose: Used during bulk data load operations (like BULK INSERT or bcp) to optimize performance.
    • How it Works: Typically allows concurrent inserts into an empty table or a table with no indexes by multiple threads of the bulk operation, while preventing other transactions from accessing the table during the load.
  • Key-Range Locks (Predicate Locks):
    • Purpose: Used, especially in conjunction with the serializable isolation level, to prevent phantom reads.
    • How it Works: Instead of just locking existing rows, these locks protect a range of index values based on a query's predicate (e.g., WHERE age > 30 AND age < 40). This prevents other transactions from inserting new rows that would match this predicate (and thus appear as "phantoms" if the original query were re-run) or from modifying existing rows in a way that would make them match or stop matching the predicate.
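
To tie the S/U/X rules together, here's a toy compatibility check in Python. It simply encodes the rules described above and is not modeled on any particular database engine's lock manager.

```python
# A toy compatibility check for the S, U, and X lock modes described above.
# This illustrates the rules only; it is not any specific database's implementation.
COMPATIBLE = {
    ("S", "S"): True,   # concurrent readers are fine
    ("S", "U"): True,   # U locks coexist with readers...
    ("U", "S"): True,
    ("U", "U"): False,  # ...but not with other U locks (prevents the S->X upgrade deadlock)
    ("S", "X"): False,  # writers exclude readers
    ("X", "S"): False,
    ("U", "X"): False,
    ("X", "U"): False,
    ("X", "X"): False,  # only one writer at a time
}

def can_grant(requested, held_modes):
    """Can a transaction acquire `requested` given locks already held by others?"""
    return all(COMPATIBLE[(held, requested)] for held in held_modes)

print(can_grant("S", ["S", "S"]))   # True  - many shared readers
print(can_grant("X", ["S"]))        # False - a reader blocks the writer
print(can_grant("U", ["S"]))        # True  - U is compatible with S
print(can_grant("U", ["U"]))        # False - two U locks would risk deadlock on upgrade
```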

Deadlocks

It's important to remember that with locking comes the risk of deadlocks – where two or more transactions are blocked indefinitely, each waiting for a resource held by the other. Databases have mechanisms to detect deadlocks and usually resolve them by choosing one transaction as a "victim" and rolling it back.
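
The victim transaction typically needs to be retried by the application, which loops back to the retry-with-backoff idea from section II. Here's a minimal sketch assuming PostgreSQL and psycopg2; run_transfer, the table, and the connection are hypothetical placeholders.

```python
# A minimal sketch of retrying a transaction that was chosen as a deadlock victim,
# assuming PostgreSQL with psycopg2; run_transfer and its table are hypothetical.
import random
import time

from psycopg2 import errors

def run_transfer(conn, from_id, to_id, amount):
    # Hypothetical transactional work that may deadlock with concurrent transfers.
    with conn.cursor() as cur:
        cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s", (amount, from_id))
        cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s", (amount, to_id))

def transfer_with_retry(conn, from_id, to_id, amount, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            with conn:                      # one transaction per attempt
                run_transfer(conn, from_id, to_id, amount)
            return
        except errors.DeadlockDetected:
            # We were chosen as the deadlock victim and rolled back; back off and
            # retry the whole transaction (exponential backoff with jitter, as in section II).
            if attempt == max_attempts - 1:
                raise
            time.sleep(min(8.0, 0.1 * (2 ** attempt)) + random.uniform(0, 0.1))
```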


Key Takeaways

  • Change Data Capture (CDC) is vital for real-time data integration, with log-based CDC being a popular method.
  • Retry Mechanisms with Exponential Backoff and Jitter are crucial for building resilient distributed services.
  • Scaling databases requires a multi-faceted approach: Indexing and Materialized Views for query optimization; Vertical Scaling for more power; Replication for read scalability; Database Caching for reducing latency; Denormalization for specific read optimizations; and Sharding for massive scale.
  • Database Locks (with varying granularities like Table, Page, Row and modes like Shared, Exclusive, Update, Schema, Key-Range) are fundamental for concurrency control, ensuring data integrity and consistency (ACID properties) when multiple transactions access data simultaneously.
  • Understanding these techniques and their trade-offs is essential for designing robust, scalable, and reliable data systems.

The journey to building such systems is one of continuous learning, applying the right tools and techniques to meet specific challenges, especially in the demanding tech landscapes we see today.