Talking to Your Database - Relational Algebra, Calculus, and the Power of SQL

March 24, 2023 — #DBMS #CS Basics

Hey data wranglers! We've designed our database using E/R models and mapped it to a relational schema. Fantastic! But how do we actually interact with this data? How do we ask questions, retrieve information, and make changes? We need a language!

Today, we're diving into the fascinating world of Relational Query Languages. We'll start with the formal, theoretical foundations:

Relational Algebra: A procedural language that defines how to get the results step-by-step.
Relational Calculus: A non-procedural language that defines what results we want, without specifying the steps.

And then, we'll focus on the practical, universally adopted standard:

SQL (Structured Query Language): The declarative language used by virtually all relational database systems today. We'll cover defining our database (DDL), manipulating data (DML), and some powerful query techniques.

Let's learn how to talk to our databases!

Relational Algebra: The Step-by-Step Approach

Think of Relational Algebra as a set of operations, like mathematical operators, but designed specifically for tables (relations). It's procedural because expressions specify the sequence of operations to perform. Each operation takes one or two relations as input and produces a new relation as output.

Core Unary Operators (Operate on one relation):

SELECT (σ): Filters rows.
- Purpose: To choose tuples (rows) from a relation that satisfy a specific condition (predicate).
- Syntax: σ <selection_condition> (<relation_name>)
- Condition: Can use comparison operators (=, ≠, <, ≤, >, ≥) and logical connectors (AND (∧), OR (∨), NOT (¬)).
- Example: σ Major='CompSci' ∧ GPA>3.5 (STUDENT) - Selects Computer Science students with a GPA above 3.5.
- Result: A relation with the same attributes as the original, but only the rows meeting the condition.
PROJECT (π): Selects columns.
- Purpose: To choose specific attributes (columns) from a relation.
- Syntax: π <attribute_list> (<relation_name>)
- Key Feature: Automatically eliminates duplicate rows in the result.
- Example: π Name, Major (STUDENT) - Selects only the Name and Major columns from the STUDENT table.
- Result: A relation containing only the specified columns, with duplicate rows removed.
RENAME (ρ): Gives names.
- Purpose: To assign a name to the result of a relational algebra expression (useful for intermediate results) or to rename a relation and/or its attributes.
- Syntax:
  - ρ NewRelationName (Expression)
  - ρ NewRelationName(NewAttrName1, NewAttrName2, ...) (Expression)
  - ρ (NewAttrName1, NewAttrName2, ...) (Relation)
- Example: ρ CS_Students (σ Major='CompSci' (STUDENT)) - Names the result of the selection.

Core Binary Operators (Operate on two relations):

Union Compatibility: For UNION, INTERSECTION, and DIFFERENCE, the two input relations must be union compatible:
1. They must have the same number of attributes (same degree).
2. The domains of corresponding attributes must be compatible.
UNION (∪): Combines rows.
- Purpose: Produces a relation containing all tuples that appear in either or both of the two union-compatible input relations. Duplicates are eliminated.
- Syntax: Relation1 ∪ Relation2
- Example: π CustName (ActiveCustomers) ∪ π CustName (InactiveCustomers) - Gives a list of all unique customer names.
SET DIFFERENCE (-): Removes rows.
- Purpose: Produces a relation containing all tuples that appear in the first relation but not in the second relation (must be union compatible).
- Syntax: Relation1 - Relation2
- Example: π StudentID (AllStudents) - π StudentID (GraduatedStudents) - Gives IDs of currently enrolled students.
CARTESIAN PRODUCT (×) (or Cross Product): Combines all possible row pairs.
- Purpose: Produces a relation containing all possible combinations of tuples by concatenating each tuple from the first relation with every tuple from the second relation.
- Syntax: Relation1 × Relation2
- Result: Degree = Degree(R1) + Degree(R2); Cardinality = Cardinality(R1) * Cardinality(R2).
- Use: Rarely used alone. It forms the basis for JOIN operations when combined with SELECT.
- Example: STUDENT × COURSE - Creates a massive table pairing every student with every course, mostly meaningless until filtered (joined).

Derived Operators (Expressible using core operators):

INTERSECTION (∩): Finds common rows.
- Purpose: Produces a relation containing only tuples that appear in both input relations (must be union compatible).
- Syntax: Relation1 ∩ Relation2
- Equivalence: R1 ∩ R2 = R1 - (R1 - R2)
- Example: π StudentID (TookMathCourse) ∩ π StudentID (TookPhysicsCourse) - Gives IDs of students who took both.
JOIN (⋈): Intelligently combines related rows (The Powerhouse!).
- Purpose: Combines tuples from two relations based on a related condition between their attributes. Far more useful than Cartesian Product alone.
- Theta Join (R ⋈θ S): The most general form. Combines tuples from R and S where the condition θ (using any comparison operator) holds true between attributes of R and S. Equivalent to σ θ (R × S).
- Equi Join: A Theta Join where the condition θ only uses equality (=). Often results in duplicate columns for the join attributes in the output.
- Natural Join (R ⋈ S): The most common join. It's an equi join on all attributes having the same name in both relations, followed by projecting out one copy of the duplicate join columns.
  - Example: STUDENT ⋈ ENROLLMENT (assuming both have StudentID) - Combines student info with their enrollment info, matching on StudentID, without repeating the StudentID column.
- Outer Joins (⟕, ⟖, ⟗): Extensions of join to keep tuples that don't have matching counterparts in the other relation, padding the missing side with NULLs.
  - Left Outer Join (R ⟕ S): Keeps all tuples from the left relation (R).
  - Right Outer Join (R ⟖ S): Keeps all tuples from the right relation (S).
  - Full Outer Join (R ⟗ S): Keeps all tuples from both relations.
- Self Join: Joining a relation with itself (using RENAME to create distinct aliases). Useful for finding relationships within the same table (e.g., EMPLOYEE E1 ⋈ E1.ManagerID = E2.EmpID EMPLOYEE E2 to find employee-manager pairs).
DIVISION (÷): For "for all" queries.
- Purpose: Used for queries involving universal quantification (e.g., "Find students who have taken all CompSci courses"). If R(A, B) and S(B), then R ÷ S gives the set of A values from R that are associated with every B value present in S.
- Conditions: The attributes of the divisor relation (S) must be a subset of the attributes of the dividend relation (R).
- Example: (π StudentID, CourseID (ENROLLMENT)) ÷ (π CourseID (σ Dept='CompSci'(COURSE))) - Finds IDs of students who enrolled in every CompSci course.
ASSIGNMENT (←): Breaks down complex queries.
- Purpose: Allows assigning the result of a relational algebra expression to a temporary relation variable. Makes complex queries easier to write and understand sequentially.
- Syntax: TempVariable ← Expression
- Example:
```
CS_Courses ← π CourseID (σ Dept='CompSci'(COURSE))
StudentCourses ← π StudentID, CourseID (ENROLLMENT)
Result ← StudentCourses ÷ CS_Courses
```

Extensions:

Generalized Projection: Allows arithmetic functions in the projection list (e.g., π Salary * 1.1 (EMPLOYEE)).
Aggregation (g): Performs aggregate functions (like SUM, AVG, COUNT, MAX, MIN), often combined with grouping attributes (similar to SQL's GROUP BY). Syntax: grouping_attributes g aggregate_functions (Relation).

Relational Algebra provides a formal way to specify queries and is used internally by database systems for query optimization.

Relational Calculus: The Declarative "What"

Relational Calculus offers a different, non-procedural perspective. Instead of specifying the steps to get the result, you describe the properties of the desired result using mathematical logic (predicate calculus).

Two Flavors:

Tuple Relational Calculus (TRC):
- Variables: Range over tuples (rows) of relations.
- Syntax: { T | P(T) } - "Find all tuples T such that predicate P(T) is true."
- Formula P(T): Built using:
  - Atoms: T ∈ RelationName, T.attribute op S.attribute, T.attribute op constant.
  - Logical Connectives: ¬ (NOT), ∧ (AND), ∨ (OR), ⇒ (IMPLIES).
  - Quantifiers: ∃T (There exists a tuple T), ∀T (For all tuples T). Variables bound by quantifiers are bound, others are free. The result usually depends on the free variables.
- Example (Find names of sailors with rating > 7): { T | ∃S ∈ Sailors (S.rating > 7 ∧ T.sname = S.sname) } (Here, T is implicitly a tuple with just the sname attribute).
- Safety: Need to write "safe" expressions that guarantee finite results and only use values from the database domains.
Domain Relational Calculus (DRC):
- Variables: Range over domain values (attribute values) rather than entire tuples.
- Syntax: { <x1, x2, ..., xn> | P(x1, x2, ..., xn) } - "Find all tuples of values <x1,...xn> such that predicate P is true."
- Formula P(...): Built using:
  - Atoms: <x1,...,xk> ∈ RelationName, x op y, x op constant.
  - Logical Connectives: Same as TRC.
  - Quantifiers: ∃x (There exists a domain value x), ∀x (For all domain values x).
- Example (Find names of sailors with rating > 7): { <N> | ∃I, R, A (<I, N, R, A> ∈ Sailors ∧ R > 7) } (Find names N such that there exist an ID I, rating R, and age A where the tuple <I,N,R,A> is in Sailors and R is greater than 7).

Relational Calculus provides a declarative way to specify queries and is formally equivalent in expressive power to Relational Algebra. It forms the theoretical basis for languages like SQL.

SQL: The Industry Standard Query Language

While Algebra and Calculus provide the theoretical underpinning, SQL (Structured Query Language) is the de facto standard language for interacting with relational databases. It's primarily declarative, meaning you specify what data you want, and the DBMS figures out the best way to get it (using its internal query optimizer, often based on relational algebra concepts).

SQL Sublanguages (Conceptual):

DDL (Data Definition Language): For defining and managing the database structure.
DML (Data Manipulation Language): For retrieving and modifying data.
DCL (Data Control Language): For managing permissions and access (GRANT, REVOKE).
TCL (Transaction Control Language): For managing transactions (COMMIT, ROLLBACK, SAVEPOINT).

Let's focus on DDL and DML.

Data Definition in SQL (DDL): Creating the Structure

DDL commands define the database schema.

CREATE TABLE: Defines a new table and its attributes.

CREATE TABLE Students (
    StudentID INT PRIMARY KEY, -- Defines primary key
    FirstName VARCHAR(50) NOT NULL, -- Specifies data type and NOT NULL constraint
    LastName VARCHAR(50),
    Major VARCHAR(30),
    GPA DECIMAL(3, 2), -- Example: Numeric type with precision
    DeptID INT,
    UNIQUE (FirstName, LastName), -- Example: UNIQUE constraint
    -- Foreign key definition
    FOREIGN KEY (DeptID) REFERENCES Departments(DeptID)
        ON DELETE SET NULL -- Optional: Action on deletion in parent table
        ON UPDATE CASCADE -- Optional: Action on update in parent table
);

Key Definitions:
- PRIMARY KEY: Uniquely identifies rows, cannot be NULL. Can be defined inline (for single attribute) or separately (for composite keys).
- FOREIGN KEY: Establishes a link to another table's primary key, enforcing referential integrity. REFERENCES specifies the parent table and column(s). ON DELETE/ON UPDATE clauses define actions when the referenced PK changes.
- UNIQUE: Ensures values in a column (or combination of columns) are unique, but allows NULLs (usually one NULL is permitted).
- NOT NULL: Ensures a column cannot contain NULL values.
Other DDL:
- ALTER TABLE: Modifies an existing table (e.g., ADD COLUMN, DROP COLUMN, ADD CONSTRAINT, MODIFY COLUMN).
- DROP TABLE: Deletes a table and all its data.

Data Manipulation in SQL (DML): Working with Data

DML commands are used to query and modify the data within the tables.

SELECT: Retrieves data. This is the most frequently used SQL command.

SELECT StudentID, FirstName, GPA -- Columns to retrieve (* for all)
FROM Students                -- Table to retrieve from
WHERE Major = 'CompSci'       -- Condition to filter rows (like σ)
  AND GPA > 3.0
ORDER BY LastName, FirstName; -- Optional: Sort the results

INSERT: Adds new rows to a table.

-- Insert single row with explicit values
INSERT INTO Students (StudentID, FirstName, LastName, Major, GPA, DeptID)
VALUES (101, 'Alice', 'Smith', 'CompSci', 3.8, 10);

-- Insert results from another query
INSERT INTO HighGPAStudents (StudentID, Name)
SELECT StudentID, FirstName || ' ' || LastName -- Concatenation example
FROM Students
WHERE GPA >= 3.9;

UPDATE: Modifies existing rows.

UPDATE Students
SET Major = 'Computer Science' -- Column(s) to change and new value(s)
WHERE Major = 'CompSci';       -- Condition to select rows to update

DELETE: Removes rows from a table.

DELETE FROM Students
WHERE StudentID = 101; -- Condition to select rows to delete

Advanced Querying in SQL:

Joins: Combining data from multiple tables is done using JOIN clauses in the FROM section:

SELECT S.FirstName, D.DeptName
FROM Students S
INNER JOIN Departments D ON S.DeptID = D.DeptID -- Join condition
WHERE S.GPA > 3.5;
-- Other types: LEFT JOIN, RIGHT JOIN, FULL OUTER JOIN

Nested Queries (Subqueries): Queries embedded inside other SQL statements. Very powerful for complex logic.

In WHERE Clause:

-- Find students in departments located in 'Building A'
SELECT FirstName FROM Students
WHERE DeptID IN (SELECT DeptID FROM Departments WHERE Location = 'Building A');

-- Find students with GPA higher than the average GPA
SELECT FirstName, GPA FROM Students
WHERE GPA > (SELECT AVG(GPA) FROM Students);

In FROM Clause: (Treat the subquery result as a temporary table)

SELECT T.Major, T.AvgGPA
FROM (SELECT Major, AVG(GPA) as AvgGPA FROM Students GROUP BY Major) T
WHERE T.AvgGPA > 3.2;

In SELECT Clause: (Usually returns a single value)

SELECT StudentID, FirstName,
       (SELECT COUNT(*) FROM Enrollments E WHERE E.StudentID = S.StudentID) AS CourseCount
FROM Students S;

Aggregation: Performing calculations across groups of rows.
- Aggregate Functions: COUNT() (counts rows/non-null values), SUM(), AVG(), MAX(), MIN().
- GROUP BY Clause: Groups rows that have the same values in specified columns so aggregates can be applied to each group.
- HAVING Clause: Filters the results of GROUP BY based on the aggregate values (similar to WHERE, but operates after aggregation).
```
SELECT Major, COUNT(*) AS NumStudents, AVG(GPA) AS AverageGPA
FROM Students
WHERE GPA IS NOT NULL -- Filter rows *before* grouping
GROUP BY Major        -- Group by major
HAVING COUNT(*) > 10; -- Filter groups *after* aggregation
```

Beyond SQL: Procedural Extensions (PL/SQL Example)

While SQL is powerful for data definition and manipulation, sometimes complex business logic requires procedural capabilities (like loops, conditions, variables). Many DBMS vendors provide procedural extensions to SQL. Oracle's is PL/SQL (Procedural Language/SQL), Microsoft SQL Server has T-SQL, PostgreSQL has PL/pgSQL.

Purpose: Combine SQL's data access with procedural programming constructs.
Features: Variables, constants, conditional statements (IF-THEN-ELSE), loops (FOR, WHILE), exception handling, creation of stored procedures, functions, and triggers (code that automatically runs on data modification events).
Use Cases: Implementing complex validation rules, automating multi-step processes, encapsulating business logic within the database via stored procedures.

-- Simple PL/SQL Block Example (Conceptual Oracle Syntax)
DECLARE
  v_student_gpa students.gpa%TYPE; -- Declare variable of same type as column
  v_student_id students.studentid%TYPE := 101;
BEGIN
  SELECT gpa INTO v_student_gpa
  FROM students
  WHERE studentid = v_student_id;

  IF v_student_gpa >= 3.5 THEN
    DBMS_OUTPUT.PUT_LINE('Student ' || v_student_id || ' is on the Dean''s List.');
  ELSE
    DBMS_OUTPUT.PUT_LINE('Student ' || v_student_id || ' GPA: ' || v_student_gpa);
  END IF;
EXCEPTION
  WHEN NO_DATA_FOUND THEN
    DBMS_OUTPUT.PUT_LINE('Student ' || v_student_id || ' not found.');
END;
/

PL/SQL and its counterparts allow developers to build more sophisticated logic directly within the database server.

Conclusion: Choosing Your Language

We've journeyed through the formal foundations of Relational Algebra and Calculus, understanding the how and the what of database querying. These underpin the powerful, declarative, and practical standard: SQL. Mastering SQL's DDL for defining structure and its DML for manipulating data (including joins, subqueries, and aggregation) is essential for anyone working with relational databases. And for more complex tasks, procedural extensions like PL/SQL offer even greater capabilities.

Understanding these languages allows us to effectively unlock the information stored within our carefully designed databases.

Happy querying, and may your data always be accurate and accessible!