Database normalization is the process of organizing and structuring data in a relational database to reduce data redundancy and improve data integrity. The main goal of normalization is to eliminate or minimize data anomalies, such as update, insert, and delete anomalies that can arise when a database is not properly structured. The process involves decomposing tables into smaller, more manageable structures and establishing relationships between them.
Normalization can be achieved through a series of guidelines called Normal Forms:
- First Normal Form (1NF): Requires that all values in a table must be atomic and there should be no repeating groups of columns.
- Second Normal Form (2NF): Requires that a table be in 1NF and that all non-prime attributes (columns that are not a part of the primary key) are fully functionally dependent on the entire primary key.
- Third Normal Form (3NF): Requires that a table be in 2NF and that there are no transitive dependencies: non-prime attributes should not depend on other non-prime attributes.
While achieving higher normal forms is desired depending on specific needs, in many cases a database that meets 3NF is typically considered as well-normalized.
1NF: First Normal Form
First Normal Form (1NF) is the most basic level of normalization. It ensures that a table's structure is simple and that all of its data is stored in a way that avoids duplication and allows for efficient querying. The key requirement for a table to be in 1NF is that all its columns must contain atomic values, and there should be no repeating groups of columns.
Let's look at an example to illustrate the concept of 1NF:
Student_ID | Courses | Grades |
---|---|---|
101 | Math, Physics, Chemistry | A, B, C |
102 | English, History, Geography | B, C, A |
103 | Biology, Chemistry, Physics | C, A, B |
In this table, the "Courses" column violates 1NF because it contains multiple values (comma-separated lists) in each cell. Each student can be associated with multiple courses, but these courses should be stored in separate rows, not combined into a single cell.
Student_ID | Course | Grade |
---|---|---|
101 | Math | A |
101 | Physics | B |
101 | Chemistry | C |
102 | English | B |
102 | History | C |
102 | Geography | A |
103 | Biology | C |
103 | Chemistry | A |
103 | Physics | B |
Now, each row in the table represents a single course for a specific student, and the table is 1NF. Each attribute ("Student_ID", "Course", "Grade") contains atomic values, and there are no repeating groups. This structure makes it easier to query the database and avoids the issues associated with storing multiple values in a single cell.
2NF: Second Normal Form
Second Normal Form (2NF) is a level of database normalization that builds upon the foundation of First Normal Form (1NF). The primary goal of 2NF is to eliminate partial dependencies within a table by ensuring that all non-prime attributes are fully functionally dependent on the entire primary key.
To understand 2NF, let's briefly revisit the concept of functional dependency. In relational database, one attribute (or a set of attributes) is said to be functionally dependent on another attribute if the value of the first attribute uniquely determines the value of the second attribute.
Now, for a table to be in 2NF:
- It must already be in 1NF.
- There should be no partial dependencies, meaning every non-prime attribute must be fully functionally dependent of the entire primary key.
Here's an example to illustrate the transition from 1NF to 2NF:
Student_ID | Course | Instructor | Grade |
---|---|---|---|
101 | Operating System | John Doe | B |
101 | DB | Jane Smith | A- |
102 | Java Programming | Bob Brown | A |
103 | C | Alice Johnson | C+ |
103 | C++ | Charlie Brown | B+ |
In this table, ("Student_ID", "Course") is the composite primary key. For example, to determine which value is stored in the "Grade" column, we have to know both "Student_ID" and "Course"; knowing only one of them, say "Student_ID" of 101, is not enough to find "Grade." However, "Instructor" is partially dependent on the primary key because it depends only on "Course", not on the entire composite key.
To move this table to 2NF, we decompose it into two separate tables:
Student_ID | Course | Grade |
---|---|---|
101 | Operating System | B |
101 | DB | A- |
102 | Java Programming | A |
103 | C | C+ |
103 | C++ | B+ |
Course | Instructor |
---|---|
Operating System | John Doe |
DB | Jane Smith |
Java Programming | Bob Brown |
C | Alice Johnson |
C++ | Charlie Brown |
Now, both tables are in 2NF. The "Employee_Name" attribute is fully functionally dependent on the primary key ("Employee_ID") in the "Employees" table, and the "Projects" table has a single-values primary key ("Project_ID").
The 1NF and 2NF particularly useful for heavy read-only cases. For systems with millions of records, lots of complex querying and no need to add new data or update existing records, 1NF and 2NF offer faster execution without the downside risk of conflicting data that might exist in a 3NF or more.
3NF: Third Normal Form
In the context of database normalization, a transitive dependency refers to a situation such that, given three attributes A, B, and C, A determines B (A → B) and B determines C (B → C) implies A determines C (A → C).
Third Normal Form (3NF) is a level of database normalization that builds upon the requirements of 2NF. It eliminates any transitive dependencies within a table, ensuring that non-prime attributes are transitively dependent on the primary key.
Let's use an example to illustrate the transformation from 2NF to 3NF:
Course | Textbook | Price |
---|---|---|
Experimental Design | Montgomery Design and Analysis of Experiments | 156.05 |
Linear Regression | Weisberg, S. Applied Linear Regression | 86.11 |
Statistical Theory | Casella and Berger Statistical Inference | 86.49 |
Bayesian Statistics | Gelman, A. Bayesian Data Analysis | 61.59 |
In this table, having "Course" value implies having "Textbook" value. Then the "Textbook" value uniquely determines the "Price." Thus, if we know which "Course" it is, we also know the "Price" of the textbook for the course. This situation is referred to as a transitive dependency, and thereby the table violates 3NF.
To achieve 3NF, you should decompose the table as follows:
Course | Textbook |
---|---|
Experimental Design | Montgomery Design and Analysis of Experiments |
Linear Regression | Weisberg, S. Applied Linear Regression |
Statistical Theory | Casella and Berger Statistical Inference |
Bayesian Statistics | Gelman, A. Bayesian Data Analysis |
Textbook | Price |
---|---|
Montgomery Design and Analysis of Experiments | 156.05 |
Weisberg, S. Applied Linear Regression | 86.11 |
Casella and Berger Statistical Inference | 86.49 |
Gelman, A. Bayesian Data Analysis | 61.59 |
The 3NF tables are ideal when lots of data entry and updates are involved, as the 3NF design minimizes the possible duplication of data and the associated risk of data conflicts. However, if your goal is to build a read-heavy table, you may consider 1NF or 2NF, as 3NF requires table joins to retrieve the whole information.
0 Comments