Database Normalization

Database normalization is the process of structuring data tables in a relational database. The primary purpose of the normalization is to reduce data redundancy and eliminate data anomalies. The normalization process involves decomposing a larger table into smaller, more manageable tables, and then establishing relationships between them. 

Normal forms provide a structured series of guidelines for achieving database normalization at specific levels of organization and efficiency:

  • First Normal Form (1NF): Ensures that each column contains atomic (indivisible) values and prohibits arrays in a cell; a table is two-dimensional. Also, each row should be uniquely identified.
  • Second Normal Form (2NF): Extends 1NF by requiring that all non-prime columns (columns not part of the primary key) are fully functionally dependent on the entire primary key.
  • Third Normal Form (3NF): Further builds upon 2NF by eliminating transitive dependencies. This ensures that non-prime columns depend only on the primary key and not on other non-key columns.
  • Boyce-Codd Normal Form (BCNF): Similar to 3NF but stricter, aimed at eliminating structures that could introduce rare logical inconsistencies in the data. It ensures that every non-trivial functional dependency of columns is on a superkey.
  • Forth Normal Form (4NF): Enhances BCNF by addressing certain types of complex multi-valued dependencies that are not handled by 3NF, ensuring every multi-valued dependency is dependent on a superkey.
  • Fifth Normal Form (5NF): 4NF plus every join dependency for the table is a result of the candidate keys. It addresses where certain join dependencies exist that are not covered by other normal forms.

While achieving higher normal forms is desired depending on specific needs, in many practical cases, 3NF is typically considered as well-normalized. 

1NF: First Normal Form

The First Normal Form (1NF) is the most basic level of normalization. It ensures that a table's structure is simple and that all of its data are stored in a way that allows for efficient querying. The key requirement for a table to be in 1NF is that all its columns must contain atomic values and each row must be uniquely identified. Let's look at an example to illustrate the concept of 1NF.

Job History Table:

Employee ID Start Date End Date Job ID Department ID Salary
101 21-SEP-07, 28-OCT-11 27-OCT-11, 15-MAR-15 AC_ACCOUNT, AC_MGR 110 17000, 17000
102 13-JAN-11 24-JUL-16 IT_PROG 60 17000
114 24-MAR-16 31-DEC-17 ST_CLERK 50 11000
122 01-JAN-17 31-DEC-17 ST_CLERK 50 7900
176 24-MAR-16, 01-JAN-1731-DEC-16, 31-DEC-17 SA_REP, SA_MAN 80 8600, 8600
20017-SEP-05, 01-JUL-12 17-JUN-11, 31-DEC-16AD_ASST, AC_ACCOUNT 110 4400, 4400
201 17-FEB-14 19-DEC-17 MK_REP 20 13000

This table violates 1NF. Three columns - Start Date, End Date, and Job ID - contain non-atomic values that could be further separated by commas. As a result, the Employee ID is not a unique identifier for each row: there can be multiple Start Date, End Date, and Job ID values associated with an Employee ID value.

To achieve 1NF, you should restructure this table, ensuring each column contains indivisible values. Specifically, separate the inner arrays in the Start Date, End Date, and Job ID columns into individual rows.

Job History Table (1NF):

Employee ID  Start DateEnd Date Job ID Department ID Salary
10121-SEP-0727-OCT-11AC_ACCOUNT11017000
101 28-OCT-1115-MAR-15 AC_MGR 110 17000
102 13-JAN-1124-JUL-16 IT_PROG 60 17000
114 24-MAR-16 31-DEC-17 ST_CLERK 50 11000
122 01-JAN-17 31-DEC-17 ST_CLERK 50 7900
176 24-MAR-16 31-DEC-16 SA_REP 80 8600
176 01-JAN-17 31-DEC-17 SA_MAN 80 8600
200 17-SEP-05 17-JUN-11 AD_ASST 110 4400
200 01-JUL-12 31-DEC-16 AC_ACCOUNT 110 4400
201 17-FEB-14 19-DEC-17 MK_REP 20 13000

This table now satisfies 1NF. Each cell in any of the columns contains indivisible atomic values. Also, the combination of the three columns (Employee ID, Start Date, End Date) can serve as a unique identifier for each row: there is no two rows where all three column values are equal. This simplifies queries as you don't need complex logic to handle multiple values in a cell. 

2NF: Second Normal Form

In relational databases, functional dependency is the fundamental concept that describes the relationship between columns in a table. It is a constraint that dictates how the values in one column determine the values in another column. In simpler terms, a functional dependency states that the value of one column uniquely determines the value of another column in the same table. For example, in the 1NF table we've seen earlier, one set of columns (Employee ID, Start Date, End Date) uniquely determines the value of Job ID: any one employee can hold one and only one job title during a given period.

The Second Normal Form (2NF) is built upon the foundation of 1NF. 2NF eliminates any partial dependencies in a table by ensuring that all non-prime columns are fully functionally dependent on the entire primary key. Here, partial dependency refers to a situation where a non-key column (column that is not part of the primary key) depends on only a part of the primary key, rather than on the entire primary key. For example, in the 1NF table mentioned earlier, the Salary column depends only on Employee ID, not on the entire primary key (which is the combination of Employee ID, Start Date, and End Date.) This is the scenario where we refer to as a partial dependency occurs. 

To move this table into 2NF, we should decompose it into two separate tables as follows.

Job History Table (2NF):

Employee ID  Start DateEnd Date Job ID Department ID
10121-SEP-0727-OCT-11AC_ACCOUNT110
101 28-OCT-1115-MAR-15 AC_MGR 110
102 13-JAN-1124-JUL-16 IT_PROG 60
114 24-MAR-16 31-DEC-17 ST_CLERK 50
122 01-JAN-17 31-DEC-17 ST_CLERK 50
176 24-MAR-16 31-DEC-16 SA_REP 80
176 01-JAN-17 31-DEC-17 SA_MAN 80
200 17-SEP-05 17-JUN-11 AD_ASST 90
200 01-JUL-12 31-DEC-16 AC_ACCOUNT 110
201 17-FEB-14 19-DEC-17 MK_REP 20

Salary Table (2NF):

Employee ID  Salary
101 17000
101 17000
102 17000
114 11000
122 7900
176 8600
176 8600
200 4400
200 4400
201 13000

Now, both tables are in 2NF. The 1NF and 2NF particularly useful for heavy read-only cases. For systems with millions of records, lots of complex querying and no need to add new data or update existing records, 1NF and 2NF offer faster record retrieval.

3NF: Third Normal Form

In the context of database normalization, a transitive dependency occurs when a non-prime column (a column that is not part of the primary key) depends on another non-prime column. To be more specific, consider three columns: A, B, and C. Given A acts as the primary key, A determines B and C. However, if the column B can also fully determine the column C independently, we say there is a transitive dependency.

Third Normal Form (3NF) is the level of database normalization built upon 2NF. It eliminates any transitive dependencies in a 2NF table, ensuring that non-prime attributes are transitively dependent only on the primary key. For example, let's consider the 2NF job history table we made earlier.

Job History Table (2NF):

Employee ID  Start DateEnd Date Job ID Department ID
10121-SEP-0727-OCT-11AC_ACCOUNT110
101 28-OCT-1115-MAR-15 AC_MGR 110
102 13-JAN-1124-JUL-16 IT_PROG 60
114 24-MAR-16 31-DEC-17 ST_CLERK 50
122 01-JAN-17 31-DEC-17 ST_CLERK 50
176 24-MAR-16 31-DEC-16 SA_REP 80
176 01-JAN-17 31-DEC-17 SA_MAN 80
200 17-SEP-05 17-JUN-11 AD_ASST 90
200 01-JUL-12 31-DEC-16 AC_ACCOUNT 110
201 17-FEB-14 19-DEC-17 MK_REP 20

In this table, notice that the Department ID is fully dependent on Job ID, whereas the composite primary key consists of Employee ID, Start Date, and End Date. That is, any Job ID value belongs to one and only one Department ID column value. Thus, knowing the job ID allows us to determine the corresponding department ID. This situation is referred to as a transitive dependency, and thereby the table violates 3NF. 

To achieve 3NF, you should decompose the table as follows.

Job History Table (3NF):

Employee ID  Start DateEnd Date Job ID
10121-SEP-0727-OCT-11AC_ACCOUNT
101 28-OCT-1115-MAR-15 AC_MGR
102 13-JAN-1124-JUL-16 IT_PROG
114 24-MAR-16 31-DEC-17 ST_CLERK
122 01-JAN-17 31-DEC-17 ST_CLERK
176 24-MAR-16 31-DEC-16 SA_REP
176 01-JAN-17 31-DEC-17 SA_MAN
200 17-SEP-05 17-JUN-11 AD_ASST
200 01-JUL-12 31-DEC-16 AC_ACCOUNT
201 17-FEB-14 19-DEC-17 MK_REP

Job ID Table (3NF):

Job ID Department ID
AC_ACCOUNT110
AC_MGR 110
AD_ASST 110
IT_PROG 60
SA_MAN 80
SA_REP 80
ST_CLERK 50
MK_REP 20

The 3NF tables are ideal when lots of data entry and updates are involved, as the 3NF design minimizes the possible duplication of data and the associated risk of data conflicts. However, if your goal is to build a read-heavy table, you may consider 1NF or 2NF, as 3NF requires table joins to retrieve the whole information.

Denormalization

Database normalization offers many benefits, but it also introduces some drawbacks. It minimizes data redundancy, improves data integrity, and enhances data security. However, normalized databases often necessitate intricate table joins for data retrieval, as information is distributed across multiple tables. This complexity may potentially impact query performance, particularly with more complex queries, due to the added overhead of joining tables.

Denormalization is the process of introducing intentional data redundancy, reversing the database normalization. This is usually employed to avoid expensive SQL joins between 3NF tables. The cost of denormalization is the advantages of normalization we have to give up. So, one should apply denormalization selectively as a last resort if the querying performance problems are experienced.

In Oracle, a materialized view is a database object introduced starting from version 8i. It achieves database denormalization by providing precomputed snapshot of the result set of a complex query. The database objects where it queries data from, such as tables, views, and other materialized views, are called detailed tables[1][2]. 

For data warehousing purposes, which involve maintaining a central repository of integrated data from multiple sources within an organization, commonly created materialized views include materialized aggregate views, single-table materialized aggregate views, and materialized join views. All these types of materialized views can leverage query rewrite, an optimization technique that transforms a user request written in terms of detailed tables into a semantically equivalent request that incorporates one or more materialized views.

Review the Table Structure

In any relational databases, including Oracle, tables act as the fundamental unit for data storage. Each table is defined with a table name and set of columns. You give each column a column name, a data type,  and a width. The width can be predetermined by the data type, as in DATE. If columns are of the NUMBER data type, define precision and scale instead of width. A row is a collection of column information corresponding to a single record.

Once you have created a table successfully in the database, you can review the table's structure with the DESCRIBE statement. The DESCRIBE statement, often abbreviated as DESC, isn't a SQL statement[3]; it's a SQL* Plus statement that is unique to Oracle. (Some other product vendors have since implemented DESC in their own SQL products.) The DESC statement isn't a SQL statement - it is not described in the Oracle Database SQL Language Reference manual. DESC falls under the category of one of the several SQL* Plus enhancements, unique to Oracle, and documented in Oracle's SQL* Plus User's Guide and Reference manual, which details the DESC statement. But even though it is not considered SQL, DESC is imported to understand since DESC is useful for quickly reviewing a table's structure.

One note about SQL* Plus statements: they don't require a semicolon at the end. You'll see DESC statements without a semicolon in the examples that follow. Note that a SQL* Plus statement concludes at the end of its first line unless a continuation character is placed at the end of the line to explicitly indicate that the statement continues to the next line. This is a significant difference from SQL statements, which require a semicolon at their end, and continue to multiple lines until the semicolon is encountered. The SQL* Plus use of the continuation character and the full syntax of SQL* Plus statements are beyond the scope of this book. The only point to note here is that SQL statements continue until a semicolon marks the end, spanning multiple lines if required. But a SQL* Plus statement does not behave that way and ends on its first line, with or without a semicolon, unless a continuation character is used, and we won't see the need for that in our examples that follow - we'll just see the DESC statement end without a semicolon. Also note: you can actually place a semicolon at the end of a SQL* Plus statement and - technically - it will have no effect. You won't get an error message, nor is it required. But don't forget: a semicolon is always required at the end of a SQL statement.

Let's take a look at an example of DESC. Consider the CREATE TABLE CRUISES statement you saw earlier:


Assuming this SQL statement was executed in the database successfully, resulting in the table CRUISES being stored in the database, then you could issue the following SQL* Plus command:


Figure displays the result. Notice the output list shows a three-column display.

The first column output listing is titled "Name" and shows the table's column names that you specified with the CREATE TABLE statement.

The second column in the output listing is titled "Null" and shows whether there is a NOT NULL constraint applied to that particular column in the table. In other words, will any row that's added to the database be allowed to omit this particular value?

The third column in the output listing is titled "Type" and shows the data type for the particular table's column in question.


For example, the DESC CRUISES output shows that the CRUISES table has a column titled CAPTAIN_ID, its data type is NUMBER, and it has a NOT NULL CONSTRAINT applied to it.


[1] Term used in the context of data warehousing purpose. Database objects where a materialized view query data from is called master tables when it is for replication purpose.  
[2] Materialized views can also serve as read-only copies of remote data on a local machine. For the purpose, primary key, rowid, object, and/or subqueries are commonly created as materialized views in replication environments.  
[3] And thereby it is distinguished from   

Post a Comment

0 Comments