What is a Subquery?
A subquery is a query nested within another query, enabling operations that rely on the results of the inner query. It is an essential tool for dynamically filtering, aggregating, or modifying query results. For example, let's consider the following SQL query:
This query retrieves the number of employees whose salary is at least the average in the employees table. Inside the WHERE clause, observe that a query calculating the average salary of the table (SELECT AVG(salary) FROM employees) is nested. Such a nested query is also known as an inner query, as it operates within another query. The inner query provides intermediate results that will be used by its surrounding query, known as the outer query. The outer query processes the results from its inner query and returns the final output. Using a subquery like this example dynamically computes the average salary so that the query adapts to changes in the data without requiring manual updates.
Single-Row vs. Multiple-Row Subqueries
When a subquery is intended to return only a single row value (or possibly no value), it is called a single-row subquery. For example, in the query we've seen earlier, what subquery does is returning the average salary value as a single number. So, this is a single-row subquery.
On the other hand, when a subquery can possibly return more than one rows, we call it a multiple-row subquery. Multiple-row subqueries are used most commonly in WHERE or HAVING clauses with subquery quantifiers. For example:
Here, the subquery extracts 35 employee IDs from the employees table, excluding those without commission_pct values. So, it returns more than one row in its result set. Then the outer query filters out, selecting only those employees whose manager_id matches any of the employee IDs returned by the subquery. Thus, the selected employees in the final output is managed by someone who receives a commission.
Subquery Quantifiers
The multiple-row subqueries are typically used in WHERE or HAVING clause to dynamically filter the outer query's results based on the intermediate result set returned by the subquery. In many times, however, the multiple values returned by a subquery should be compared to each single value in the outer query. To support this, Oracle offers some special operators, known as subquery quantifiers:
- ANY: Checks if at least one row from the subquery meets a specific condition. Think of it as "Does any row qualify?"
- ALL: Checks if every single row from the subquery meets a specific condition. In other words, "Do all rows qualify?"
For example, as shown in the figure below, the U.S. office has seven departments, with their respective average salaries:
Now, consider the following three queries:
-- Example use of ANYSELECT department_id, department_name, AVG(salary) AS avg_salaryFROM emp_details_viewWHERE country_name = 'Canada'GROUP BY department_id, department_nameHAVING AVG(salary) >= ANY (SELECT AVG(salary)FROM emp_details_viewWHERE country_name = 'United States of America'GROUP BY department_id);-- Example use of ALLSELECT department_id, department_name, AVG(salary) AS avg_salaryFROM emp_details_viewWHERE country_name = 'Canada'GROUP BY department_id, department_nameHAVING AVG(salary) < ALL (SELECT AVG(salary)FROM emp_details_viewWHERE country_name = 'United States of America'GROUP BY department_id);
The first query retrieves department and job information from the emp_details_view for departments in Canada, where the average salary meets or exceeds any of the average salaries for departments in the United States. The subquery calculates 7 average salary values, one for each department in the U.S. The ANY operator in the HAVING clause of the outer query then evaluates whether the average salary of each department in Canada is greater than or equal to at least one of these seven values. For instance, if a department, say Marketing (avg_salary: 9500), has an average salary that exceeds or matches at least one department in the U.S., the row is included in the result. Otherwise, it is excluded.
In the second query, on the other hand, the HAVING clause checks if the average salary of each department in Canada is less than all seven average salaries in the U.S. office. Unlike ANY, the ALL operator ensures that a department's average salary in Canada must be less than every single one of the seven department average salaries in the U.S. office. If this condition is satisfies, the row is included in the result set; otherwise, it is excluded.
Correlated Subqueries
Subqueries we've seen so far does not reference any columns from their outer queries. For example, in the case of the query on the retired_employees table, the subquery operates independently of its outer query. It is fully functional as a standalone query. This type of subquery is known as a self-contained subquery or non-correlated subquery, as it does not rely on or interact with columns from the outer query to perform its evaluation.
On the other hand, when a subquery reference a outer query column, it is referred to as correlated to the outer query. For example:
This is a clear example of a correlated subquery. The inner query is dependent of the outer query for its execution. Here's a breakdown of how the correlation works in this example:
- FROM clause of the outer query specifies the departments table as the data source and aliases it as dept.
- The inner query operates based on the dept.department_id from the outer query.
- Using the EXISTS condition, the outer query filters out its result set based on the inner query results.
Note that when there are more than one subqueries, each operates independently and is not affected by the others. So, it doesn't matter if you assign the same alias to all subqueries. For example:
Subqueries can be further nested, meaning you can place a subquery within another subquery. For example:
This query is double nested. The inner most query calculates the average salary of the entire employees. Then the intermediate query uses this value in its WHERE clause. This query is correlated to the outer most query, referencing dept.department_id. In essence, this query identifies department that have at least one high-earning employee, providing some insights into the distribution of high salaries across different departments.
Semi-Joins and Anti-Joins
A semi-join is a type of table join that filters rows from the outer query table based on the existence of matching rows in the inner query table. Unlike a traditional join, which combines rows from both tables based on a matching condition, a semi-join only checks for the existence of a match in the inner query table. Semi-joins can be more efficient, especially when dealing with large datasets, as it avoids unnecessary data retrieval and processing.
To implement a semi-join, you can use EXISTS or IN operator. For example:
This is a classic example of using an EXISTS subquery to filter rows based on a related table's data. It retrieves all customer information from customers table (aliased as cust) where there exist at least one order record in the orders table (aliased as ord). So, in the final result, customer records where no corresponding order record exists will be excluded. Since the EXISTS only checks for the existence of records, not their specific content, the inner query is simplified to return an arbitrary constant value (e.g., 1).
On the other hand, when the EXISTS or IN operator used in a subquery quantifier is negated by NOT (e.g., NOT EXISTS or NOT IN), it is called an anti-join. Anti-joins return rows from the outer query table that do not have any matching rows in the second table. They are often used to identify:
- Missing Records: Finding records in one table that are absent in another.
- Unique Values: Identifying unique values in a table that don't exist in another.
- Differences between Datasets: Comparing two datasets and finding discrepancies.
For example, the SQL query shown below operates by utilizing a NOT EXISTS subquery quantifier and identifies products that have not been ordered yet.
Using Subqueries for UPDATE, MERGE, and DELETE Statements
Correlated subqueries can exist in SELECT, UPDATE, and DELETE statements. They are "correlated" by way of a column reference from the parent query within the subquery. Note that a table alias is not required in the subquery if no column name conflict exists. We could have just referenced any column from the parent query, and as long as there isn't an identically named column in the subquery. However, as we've seen in the Pokémon example above, if there is any column name conflict, table alias is required.
Inline Views
When a subquery is employed in the FROM clause of the outer query, it is often referred to as an inline view, as it allows you to treat the result set of the subquery as a virtual table. For example:
This query retrieves detailed information about employees whose salaries exceed the average salary of their respective departments. The result set includes the department ID and name, employee ID, first and last names, the employee's salary, and the average salary for their department.
The inline view sal calculates the average salary for each department by joining the employees table (t1) and the departments table (t2) on department_id. It groups the data by department (t2.department_id) and computes the average salary for each group, rounding the result to two decimal places. This computed value (avg_salary) is then available to the outer query.
In the outer query, the employees (emp) and departments (dept) tables are joined with the inline view sal. The query filters rows using the WHERE clause to include only employees whose salaries (emp.salary) are higher than the average salary for their department (sal.avg_salary). This design allows for a detailed comparison between individual employee salaries and their department averages, making the query both informative and precise.
Creating Populated Tables
Incorporating a subquery into a CREATE TABLE statement facilitates the creation and population of a table from an existing data source simultaneously with its creation.
Mastering Subqueries: A Divide and Conquer Approach
fd
- Clearly define what should be included in the final result set.
- Identify which table and column is required for the result.
- Write queries in a modularized approach.
- Integrate the queries into the final query.
- Validate the query results.
For example,
0 Comments