A Comprehensive Guide to SQL SELECT Queries

Originally named SEQUEL (Structured English QUEry Language), SQL is a language designed for formulating your inquiries in a structured way. Much like asking questions in plain English, phrasing, details, and order of the words all matter. The more specific your query, the more precise and targeted the database's response will be.

When writing an SQL query, there are six key questions to ask yourself: 

  1. Which table are you retrieving data from?
  2. Which fields within the table are relevant to your query?
  3. Do you need to filter data based on specific criteria?
  4. Should the retrieved values be grouped and summarized into statistics?
  5. Do you want to filter the calculated statistics further?
  6. How should the results be sorted?

Together, these questions form a single overarching question: "What does your query do?" The answers to each of these questions shape the six primary clauses of the SELECT statement, which serves as the main tool to fetch data from the database. 

Formulating an SQL Query with the Customer Order Database Example

For a clearer understanding, let's explore an example scenario involving the analysis of sales data from an imaginary company's database.

The ERD above depicts customer order records from an imaginary company[1]. Now, suppose that you want to inquire the average revenue for each product, where the product name is not missing and the average amount is less than or equal to 100. Then you want to order the rows in descending order. We can approach this task, by answering the six key questions outlined earlier. Let's go through each question step by step.

Which table are you retrieving data from?

The very first question to address is from which table you are seeking information. In this example, the required data--product IDs, product names, and the average sales amount--is spread across two tables: order_items and products. The average sales amount is calculated from the unit_price * quantity values in the order_items table and the product names are stored in the products table. To combine this information, we need to "link" the two tables using the product IDs.

Answering this question makes up the FROM clause of the SELECT statement. The FROM clause specifies the tables from which data will be retrieved. In this example scenario, after identifying that the required data is stored in the order_items and products tables, you would put these tables and join them based on the common criteria--records with the same product_id values.

FROM order_items sales JOIN products prods ON sales.product_id = prods.product_id

In the FROM clause above, the two tables, order_items and products, are specified with aliases. In Oracle SQL, when using table joins, you must provide aliases for the tables to clarify which table each column belongs to throughout the SELECT statement. So, in this example, the order_items table is aliased as sales and the products table as prods.

Which fields within the table are relevant to your query?

The next question is to define which fields within the table are relevant to your inquiry. This step involves identifying the specific columns you need to retrieve based on your requirements. In this example, we want to calculate the average revenue per product, along with the product IDs and product names. These three fields should be listed in the SELECT clause, placed before the FROM clause we formulated earlier.

SELECT
prods.product_id,
prods.product_name,
AVG(sales.unit_price * sales.quantity) AS avg_revenue
FROM
order_items sales JOIN products prods ON sales.product_id = prods.product_id

The SELECT clause above lists all three fields we need. Observe that dot notation and table aliases are used to specify from which table each column is retrieved. 

Optionally, you can assign aliases to columns using the AS keyword[2]. For example, the AVG(sales.unit_price * sales.quantity) is aliased by avg_revenue. This practice makes the query results more readable and helps clarify the purpose of calculated fields. However, unlike table aliases, these column aliases cannot be used in the WHERE, HAVING, or GROUP BY clauses due to the logical order of internal query processing, which we will discuss shortly.

The SELECT and FROM clauses are the minimum requirements for a valid SQL query. The other four clauses--WHERE, GROUP BY, HAVING, and ORDER BY--are optional.

Aside: to retrieve all the columns from a table, you can write a query like: SELECT * FROM your_table. For example:

Do you need to filter data based on specific criteria?

Moving on to the third question, we need to exclude any rows where the column_name value is missing. This requirement can be addressed in the WHERE clause, formulated as follows:

SELECT
prods.product_id,
prods.product_name,
AVG(sales.unit_price * sales.quantity) AS avg_revenue
FROM
order_items sales JOIN products prods ON sales.product_id = prods.product_id
WHERE
prods.product_name IS NOT NULL

In the query shown above, we see that the WHERE clause is appended after the FROM clause. This clause filters the results by excluding rows where prods.product_name is NULL. Notice the use of dot notation with the table alias in the WHERE clause. Unlike column aliases defined in the SELECT clause, table aliases can--and should--be used to clarify which table the filtering criteria apply to.

Should the retrieved values be grouped and summarized into statistics?

In this example, we aim to calculate the average revenue for each product. To achieve this, the query needs a GROUP BY clause that groups the results by the product identifiers, ensuring the average is calculated for each product separately.

SELECT
prods.product_id,
prods.product_name,
AVG(sales.unit_price * sales.quantity) AS avg_revenue
FROM
order_items sales JOIN products prods ON sales.product_id = prods.product_id
WHERE
prods.product_name IS NOT NULL
GROUP BY
prods.product_id,
prods.product_name

You might wonder why both product_id and product_name are included in the GROUP BY clause. Since product_id column serves as the primary key, including the product_name column might appear redundant. However, in SQL, when using aggregate functions like AVG(), all columns in the SELECT clause that are not part of an aggregate function must appear in the GROUP BY clause. So, in this example, excluding the product_name column will result in an error:

Do you want to filter the calculated statistics further?

The next step is to set filtering criteria for the calculated statistics. In this example scenario, we want to retain only rows where the average revenue is less than or equal to 100. This filtering criterion differs from the one in the WHERE clause, as it applies to the aggregated result rather than individual rows. To implement this, we use the HAVING clause as follows:

SELECT
prods.product_id,
prods.product_name,
AVG(sales.unit_price * sales.quantity) AS avg_revenue
FROM
order_items sales JOIN products prods ON sales.product_id = prods.product_id
WHERE
prods.product_name IS NOT NULL
GROUP BY
prods.product_id,
prods.product_name
HAVING
AVG(sales.unit_price * sales.quantity) <= 100

Note that you cannot use the column alias in the HAVING clause either. This is because the HAVING clause is processed after the SELECT clause in the logical query order. Instead, you should use the original expression from the SELECT clause directly in the HAVING clause, as shown above. We will discuss the logical query processing order shortly.

How should the results be sorted?

Lastly, we want to re-order the fetched data in descending order by the calculated average revenue per product. This can be done using the ORDER BY clause like:

SELECT
prods.product_id,
prods.product_name,
AVG(sales.unit_price * sales.quantity) AS avg_revenue
FROM
order_items sales JOIN products prods ON sales.product_id = prods.product_id
WHERE
prods.product_name IS NOT NULL
GROUP BY
prods.product_id,
prods.product_name
HAVING
AVG(sales.unit_price * sales.quantity) <= 100
ORDER BY
avg_revenue DESC

-- NEVER FORGET THE CLOSING SEMICOLON!
;

In the query, the ORDER BY clause is appended after the HAVING clause. This clause should always be positioned last in the query, ensuring the final sorting of the result set after all filtering, grouping, and aggregation have been applied. Unlike other clauses, you can refer to columns by their aliases in the ORDER BY clause.

By default, the ORDER BY clause sorts data in ascending order. So, if you omit DESC keyword (which means "descending") at the end of the column specification, the result will be sorted in ascending order. While it's a matter of personal preference, I recommend explicitly specifying sorting order, even when sorting in ascending order. This approach makes the query clearer, particularly when there are multiple sorting levels. To explicitly indicate ascending order, you can use the ASC keyword instead of relying on the default behavior.

To specify multiple sorting levels, you can simply list the associated columns along with their ordering directions, like this:

In the query above, the resulting data set will first be sorted by avg_revenue in descending order, after rounding the value to 2 decimal places. If there is a tie in the rounded avg_revenue, the rows will be sorted by the product_name column in ascending order. 

It is worth noting that the ORDER BY clause does not alter the stored data itself; rather, it arranges the displayed output of a SELECT statement. The content in the table remains unchanged. This is because the SELECT statement only retrieves data from the database, without modifying how the data is stored.

Also note that the ORDER BY clause places the NULL at the last in an ascending order. The NULL value is considered the "largest" value in Oracle. For example:

Observe that the NULL value appears in the first row of the result set when the data is ordered in descending order. This indicates that NULL values are considered the largest. 

Reference Columns by Name vs Position in the ORDER BY clause

Oracle allows you to refer columns by their positions listed in the SELECT clause. If all columns are selected using an asterisk (*), column position refers to the column positions defined in the table. For example:

In this query, the ORDER BY clause uses positional references to sort the result set. The number "2" refers to the second column, my_column2, and the number "1" refers to the first column, my_column1.

However, referencing columns by their positions are generally not recommended for several reasons. First, they can make queries harder to read and maintain, especially as the query grows more complex. When column positions change or when new columns are added to the SELECT list, the positional references many no longer correspond to the intended columns, leading to errors or confusion. Using explicit column names instead of positional references improves clarity and ensures that the sorting behavior is clear, even when the structure of the SELECT list changes. Additionally, column names are self-descriptive, making the query more understandable for others who may work with it in the future.

Aside: Commenting Your Query

So far, we've walked through formulating an SQL query by addressing the six key questions. Collectively, these six questions answer the fundamental question of "What does your query do?" This information is worth being included as comments in your SQL script.

In general, it is recommended to include details about who wrote the query, when it was written, and what the query does. Additionally, it is beneficial to include any relevant context, such as the purpose of the query, any assumptions made during its creation, and potential considerations. This helps ensure the query is understandable and maintainable by others (or by yourself at later time).

Syntaxial vs Logical Order of the SELECT Statement

As mentioned earlier, referencing column aliases outside of the ORDER BY clause in a SELECT statement will result in an error. This occurs because the alias is not recognized by the system until the SELECT clause is processed, which happens right before the ORDER BY clause. The SELECT statement consists of 6 major clauses, and they must appear in the following syntaxial order:

  1. SELECT: Specifies the columns you want to retrieve from the database. 
  2. FROM: Defines the data source for the query.
  3. WHERE: Specifies a condition that must be met for a row to be included in the result set and narrow down the data based on specific criteria.
  4. GROUP BY: Group rows with the same values in the specified set of columns into summary rows. This clause is typically used with aggregate functions, such as SUM, COUNT, AVG, etc., to perform operations on each group of rows. 
  5. HAVING: Similar to the WHERE clause but is used with aggregate functions applied to the grouped rows[3].
  6. ORDER BY: Sort the result set in ascending or descending order.

On the other hand, Oracle processes a query in the following logical order:

  1. FROM: Specifies related data tables and retrieve all rows.
  2. WHERE: Filters the rows from the previous step.
  3. GROUP BY: Groups the filtered data from the previous step.
  4. HAVING: Applies additional filters to the grouped data.
  5. SELECT: Evaluates and projects the desired columns. 
  6. ORDER BY: Sorts the resulting records in the specific order.

In summary, the clauses must be written in a fixed syntaxial order (SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY). However, Oracle processes them in a logical order (FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY). This is because SQL is a set-based language, and processing the SELECT clause for all rows prior to narrowing down the targeted rows could be waist of computing resources. By applying filtering as soon as it gets possible, Oracle minimizes unnecessary computations. The ORDER BY clause is processed last because it operates on the columns and results defined in the SELECT clause. 

Memorizing syntaxial order of the clauses is not enough for mastering SELECT queries. Understanding the logical flow between the clauses is crucial, as it helps you avoid errors and ensures your queries are both efficient and accurate.

Using the DISTINCT Keyword to Eliminate Duplicate Rows

When analyzing data or generating reports, we often want to filter out duplicates and see only unique values in a result set. Particularly, when aggregating results with the GROUP BY and aggregation functions, it is beneficial to first examine the unique values of the grouping columns before proceeding with further analysis.

The DISTINCT[4]keyword in the SELECT clause of an SQL query achieves this by returning only distinct, non-redundant values from a column or a combination of columns. The syntax of the keyword is:

SELECT DISTINCT column_name FROM table_name;

When there are multiple columns, the DISTINCT keyword in Oracle ensures that the combination of values across all specified column is unique in the result set. For example:


The DISTINCT keyword is commonly used with the COUNT function. When used together, it returns the number of unique values in a column or in combinations of columns. For example:


This query uses the DISTINCT keyword in combination with the COUNT function to count the number of unique combinations of product_id and product_name column values. Observe that a SELECT statement is nested within another SELECT statement. This is known as the subquery. First, the inner query (SELECT DISTINCT product_id, product_name FROM products) retrieves all distinct combinations of the specified columns. The outer query then counts the total number of these distinct values and returns the result as num_uniq_product

Using the FETCH Clause to Obtain Top n Results

Sometimes, it is necessary to retrieve only a subset of rows from a query result, such as when you need to display the top n results based on a specific criterion. For example, if you're working with a sales database and you want to find the top 5 products with the highest sales, you can use the FETCH clause to limit the number of results returned. 

The basic syntax of FETCH clause in a SELECT statement is:

SELECT column1, column2, ...
FROM table_name
[WHERE conditions]
[ORDER BY column1, column2, ...]
FETCH {FIRST | NEXT} [n] [PERCENT] {ROW|ROWS} {ONLY | WITH TIES}
;

Here's a breakdown of each syntax element:

  • SELECT column1, column2, ...: Specifies the columns you want to retrieve.
  • FROM table_name: Identifies the table from which you want to retrieve.
  • [WHERE conditions]: Optional clause to filter rows based on specified condition. Rows excluded by the WHERE clause will not be considered by the FETCH clause. 
  • [ORDER BY column1, column2, ...]: Optional clause to define the sorting order of the result set.
  • FETCH {FIRST|NEXT} [n] {ROWS|PERCENT} {ONLY|WITH TIES}: The FETCH clause is used to limit the number of rows returned.
    • {FIRST|NEXT}: Specifies whether to retrieve the first set of rows. Both the keywords are valid and have no functional difference.
    • [n]: An optional numeric value representing the number of rows or percentage of rows to fetch. If no numeric value is provided for n, it defaults to one, which makes sense as omitting it compose an expression, "FETCH FIRST ROW ONLY".
    • [PERCENT]:
    • {ROW|ROWS}: Indicates whether to fetch a specific number of rows (ROWS) or a percentage of rows (PERCENT). Note that you can also say "ROW" instead of ROWS. There is no functional difference between ROW and ROWS; the choice is solemnly for grammatical readability.
    • {ONLY|WITH TIES}: Specifies whether to fetch only the specified number of rows (ONLY) or include additional rows tied by the sorting criteria (WITH TIES).

The FETCH clause is typically used in combination with the ORDER BY clause. Together, these two clause can be used to return the top n results. For example:

Prior to ROW or ROWS, you can include the PERCENT keyword, when n is intended to represent a percentage rather than a specific number of rows. In this case, the number of rows in the result set will be rounded up to the nearest whole number; incomplete rows will never be included. For example:


In the example table shown above, we have 828 rows in total. So, 1 percent of the number of rows in the table would be 8.28. However, when fetching 1 percent of the rows, Oracle returns the first 9 rows from the table.

At the end, you must have one of ONLY or WITH TIES. ONLY will return the specified number or percentage of rows and no more. On the other hand, WITH TIES will return all the rows returned by ONLY keyword, but possibly with some additional rows. They behave differently when the rows are sorted and there are tie values. For example:


In the figure shown above, we have two different results depending on which keyword is used. When I used ONLY, the returning set has just 5 rows. However, when I used WITH TIES, Oracle returned additional 3 rows. This is because those 3 rows have the same value with the fifth row.

Note that without ORDER BY, the WITH TIES produces the same result as if we had executed FETCH ... ONLY. It won't produce any syntax error, but just has no effect.

OFFSET

By default, the row limiting clause begins with the first returned row. However, you have the option to specify that the FETCH operation should start from a different position within the result set, at a specific number of rows into the range. Here's an example:


Unlike the first one, in the second example, the FETCH skips the first two rows and starts at the third row. The following lists what happens with various specifications of OFFSET:

  • Omitted: Defaults to OFFSET = 0. Range begins with the first row returned by the query.
  • OFFSET = a negative number: Treated as OFFSET = 0. Range begins with the first row returned by the query.
  • OFFSET = 0: Range begins with the first row returned by the query.
  • OFFSET = a positive number that is less than or equals to the total set of available rows: Range begins with OFFSET row.
  • OFFSET = a positive number that is greater than the total set of available rows: No rows returned.

[1] The Customer Order database is part of Oracle's sample schema. To learn more about the sample schema, visit oracle-samples/db-sample-schemas
[2] You can omit the AS keyword when aliasing columns. However, I personally do not recommend doing so, as omitting it can make your query hard to read. 
[3] So, Oracle first narrows down the rows in the result set using the WHERE clause, then applies aggregate functions with the GROUP BY clause, and finally applies additional filtering using the HAVING clause. 
[4] Alternatively, you can also use the keyword UNIQUE. Both DISTINCT and UNIQUE are functionally equivalent and have the same syntax. 

Post a Comment

0 Comments