Database Normalization

Database normalization is the process of organizing and structuring data in a relational database to reduce data redundancy and improve data integrity. The main goal of normalization is to eliminate or minimize data anomalies, such as update, insert, and delete anomalies that can arise when a database is not properly structured. The process involves decomposing tables into smaller, more manageable structures and establishing relationships between them.

Normalization can be achieved through a series of guidelines called Normal Forms:

  • First Normal Form (1NF): Requires that all values in a table must be atomic and there should be no repeating groups of columns.
  • Second Normal Form (2NF): Requires that a table be in 1NF and that all non-prime attributes (columns that are not a part of the primary key) are fully functionally dependent on the entire primary key.
  • Third Normal Form (3NF): Requires that a table be in 2NF and that there are no transitive dependencies: non-prime attributes should not depend on other non-prime attributes.

While achieving higher normal forms is desired depending on specific needs, in many cases a database that meets 3NF is typically considered as well-normalized. 


1NF: First Normal Form

First Normal Form (1NF) is the most basic level of normalization. It ensures that a table's structure is simple and that all of its data is stored in a way that avoids duplication and allows for efficient querying. The key requirement for a table to be in 1NF is that all its columns must contain atomic values, and there should be no repeating groups of columns.

Let's look at an example to illustrate the concept of 1NF:

Student_IDCoursesGrades
101Math, Physics, ChemistryA, B, C
102English, History, GeographyB, C, A
103Biology, Chemistry, PhysicsC, A, B

In this table, the "Courses" column violates 1NF because it contains multiple values (comma-separated lists) in each cell. Each student can be associated with multiple courses, but these courses should be stored in separate rows, not combined into a single cell.

Student_IDCourseGrade
101MathA
101PhysicsB
101ChemistryC
102EnglishB
102HistoryC
102GeographyA
103BiologyC
103ChemistryA
103PhysicsB

Now, each row in the table represents a single course for a specific student, and the table is 1NF. Each attribute ("Student_ID", "Course", "Grade") contains atomic values, and there are no repeating groups. This structure makes it easier to query the database and avoids the issues associated with storing multiple values in a single cell.


2NF: Second Normal Form

Second Normal Form (2NF) is a level of database normalization that builds upon the foundation of First Normal Form (1NF). The primary goal of 2NF is to eliminate partial dependencies within a table by ensuring that all non-prime attributes are fully functionally dependent on the entire primary key.

To understand 2NF, let's briefly revisit the concept of functional dependency. In relational database, one attribute (or a set of attributes) is said to be functionally dependent on another attribute if the value of the first attribute uniquely determines the value of the second attribute.

Now, for a table to be in 2NF:

  • It must already be in 1NF.
  • There should be no partial dependencies, meaning every non-prime attribute must be fully functionally dependent of the entire primary key.

Here's an example to illustrate the transition from 1NF to 2NF:

Student_IDCourseInstructorGrade
101Operating SystemJohn DoeB
101DBJane SmithA-
102Java ProgrammingBob BrownA
103CAlice JohnsonC+
103C++Charlie BrownB+

In this table, ("Student_ID", "Course") is the composite primary key. For example, to determine which value is stored in the "Grade" column, we have to know both "Student_ID" and "Course"; knowing only one of them, say "Student_ID" of 101, is not enough to find "Grade." However, "Instructor" is partially dependent on the primary key because it depends only on "Course", not on the entire composite key


To move this table to 2NF, we decompose it into two separate tables:

Student_IDCourseGrade
101Operating SystemB
101DBA-
102Java ProgrammingA
103CC+
103C++B+

CourseInstructor
Operating SystemJohn Doe
DBJane Smith
Java ProgrammingBob Brown
CAlice Johnson
C++Charlie Brown

Now, both tables are in 2NF. The "Employee_Name" attribute is fully functionally dependent on the primary key ("Employee_ID") in the "Employees" table, and the "Projects" table has a single-values primary key ("Project_ID").



The 1NF and 2NF particularly useful for heavy read-only cases. For systems with millions of records, lots of complex querying and no need to add new data or update existing records, 1NF and 2NF offer faster execution without the downside risk of conflicting data that might exist in a 3NF or more.


3NF: Third Normal Form

In the context of database normalization, a transitive dependency refers to a situation such that, given three attributes A, B, and C, A determines B (A → B) and B determines C (B → C) implies A determines C (A → C).

Third Normal Form (3NF) is a level of database normalization that builds upon the requirements of 2NF. It eliminates any transitive dependencies within a table, ensuring that non-prime attributes are transitively dependent on the primary key.

Let's use an example to illustrate the transformation from 2NF to 3NF:

CourseTextbookPrice
Experimental DesignMontgomery Design and Analysis of Experiments156.05
Linear RegressionWeisberg, S. Applied Linear Regression86.11
Statistical TheoryCasella and Berger Statistical Inference86.49
Bayesian StatisticsGelman, A. Bayesian Data Analysis61.59

In this table, having "Course" value implies having "Textbook" value. Then the "Textbook" value uniquely determines the "Price." Thus, if we know which "Course" it is, we also know the "Price" of the textbook for the course. This situation is referred to as a transitive dependency, and thereby the table violates 3NF. 


To achieve 3NF, you should decompose the table as follows:

CourseTextbook
Experimental DesignMontgomery Design and Analysis of Experiments
Linear RegressionWeisberg, S. Applied Linear Regression
Statistical TheoryCasella and Berger Statistical Inference
Bayesian StatisticsGelman, A. Bayesian Data Analysis

TextbookPrice
Montgomery Design and Analysis of Experiments156.05
Weisberg, S. Applied Linear Regression86.11
Casella and Berger Statistical Inference86.49
Gelman, A. Bayesian Data Analysis61.59


The 3NF tables are ideal when lots of data entry and updates are involved, as the 3NF design minimizes the possible duplication of data and the associated risk of data conflicts. However, if your goal is to build a read-heavy table, you may consider 1NF or 2NF, as 3NF requires table joins to retrieve the whole information.

Post a Comment

0 Comments