Text Manipulations in Oracle SQL

When dealing with text strings stored in your database, we often need to perform text manipulations. Oracle provides powerful built-in functions specifically designed for processing such text strings. In this blog post, we will explore some Oracle character functions commonly used in day-to-day tasks!

UPPER(s1) and LOWER(s1)

If you want a string to be entirely in upper case or lowercase, you can use UPPER or LOWER functions. These functions are particularly useful for a text search. Suppose that you're searching a row containing 'nerdatandrew' from the table below:

MY_STRING
Lorem ipsum
nerdatandrew
Nerdatandrew
oracle database
NERDATANDREW
Data science

When retrieving data, suppose that you want to retrieve all the three rows: 'nerdatandrew', 'Nerdatandrew', and 'NERDATANDREW', but you are not sure if the data in the table is in uppercase, lowercase, or both. In such situations, you can use UPPER or LOWER function to avoid any exclusion due to the mixed cases. For example:


INITCAP(s1)

The INITCAP function converts the input text string 's1' into a mixed-case string, where the initial letter of each word is capitalized, and all subsequent characters are in lowercase letters. For example:


CONCAT(s1s2) and s1 || s2

Oracle offers two different ways to perform string concatenation. One of the ways is the CONCAT function, which takes two string arguments. For example:


The || operator works basically the same way. The only difference is that while CONCAT function can take only two arguments at a time, the || can be repeated as many times as desired. For example:


LPAD(s1ns2) and RPAD(s1ns2)

The LPAD function pads the left side of the input character string 's1' with character string 's2', until the resulting string is 'n' characters long. Similarly, the RPAD function pads the right side of the 's1' with 's2' until 's1' reaches to the length of 'n'. Both LPAD and RPAD requires 's1' and 'n'. 's2' is optional; if omitted, 's2' defaults to a single blank space.

In practice, you would find the functions are useful in:

  • Formatting Output:
    • Creating reports with aligned columns.
    • Displaying fixed-width data for readability.
  • Data Cleaning and Preparation:
    • Filing missing values with padding characters.
    • Standardizing string lengths for data consistency.
  • Visualizing Hierarchical Structures:
    • Visualizing tree-like structures with indentation using padding.

For example, let's suppose that you have a data that looks like below:


Data stored in the file_directory table inherently has a hierarchical order. One way to visualize this is using LPAD function:


In the query above, the line 4, LPAD('-', 5 * (LEVEL -1)) || file_name AS intended_foler creates an indented representation of the data hierarchy. In the line the LPAD function takes argument '-' as 's1', 5 * (LEVEL - 1) as 'n', and 's2' is omitted. This means that it will pad '-' with blank spaces 4 blank spaces per each level of the hierarchy of the row.

LTRIM(s1s2), RTRIM(s1s2), and TRIM

The LTRIM and RTRIM removes any occurrences of the 's2' characters from 's1' string, from either left side of 's1' (LTRIM) or right side of 's1' (RTRIM) exclusively. In the functions, the 's1' is a required argument, whereas the 's2' is an optional argument; if 's2' is omitted, it defaults to a single blank space. These functions are quite useful when you want to strip out any unnecessary blanks, periods, ellipses, and so on. For example, to remove any trailing zeros:


The TRIM function works just same as LTRIM or RTRIM, with a slightly different syntax:

TRIM(trim_info trim_char FROM trim_source);

where: 

  • trim_info should be one of these keywords: LEADING, TRAILING, BOTH
    • If omitted, defaults to BOTH.
  • trim_char is a single character to be trimmed.
    • If omitted, assumed to be a blank.
  • trim_source is the source string
    • If omitted, the TRIM function will return a NULL.

For example:


LENGTH(s)

The LENGTH function returns the number of characters in a given text string value. For example:


INSTR(s1s2pn)

The INSTR locates where a substring 's2' is in the source string 's1'. Both 's1' and 's2' are required in the function. Optionally, you can also specify the starting position in 's1' to start looking for occurrences of 's2' by 'p', and the number of occurrences of 's2' to locate by 'n'. For example, suppose that you want to look for the string 'is' within 'Mississippi', starting at the first character position, but looking for the second occurrence of 'is'.


The INSTR function is telling us that the second occurrence of 'is' starts at the fifth character in 'Mississippi'.

SUBSTR(spn)

The SUBSTR extracts a substring from its input string 's' (required), starting at the specified position 'p' (required), and continuing for 'n' number of characters (optional). If the length 'n' is omitted, then the substring starts as 'p' and runs through the end of 's'. For example:


Note: Indexing is one-based. For example, in a string 'Hello, Oracle!', the index for the first character H is 1, not 0.

SOUNDEX

SOUNDEX is a phonetic algorithm used to index names based on their pronunciation, rather than their spelling. It was developed in the early 1900s by Robert C. Russell and Margaret King Odell to help researchers find records with potentially misspelled names in large datasets. Oracle supports a built-in function to implement the SOUNDEX algorithm in SQL.

LetterSOUNDEX Code
B, F, P, V1
C, G, J, K, Q, S, X, Z2
D, T3
L4
M, N5
R6
A, E, H, I, O, U, W, YIgnored

Based on the coding scheme above, here's an example illustrating how to use the SOUNDEX function in Oracle SQL:


In the example above, two strings 'Smith' and 'Smythe' have similar pronunciations. It is likely the second string 'Smythe' is misspelled by data entrance. Occasionally, we want to retrieve all data containing both the correct name, Smith, and potentially misspelled one Smythe. In such cases, the SOUNDEX function would be very helpful.

When generating a SOUNDEX translation, the first letter of the input string remains unchanged. Then the subsequent letters are systematically translated into numeric codes based on the phonetic rules in the table shown earlier. The translation goes for three digits; any remaining letters beyond those used for the three digits are disregarded. The resulting four-character code, comprised of the original first letter and the three digits, serves as a phonetic representation of the string, emphasizing sound over exact spelling.

For example, in the case of 'Smith' and 'Smythe', the first character 'S' is preserved for both strings. Then the following m is encoded to 5, i and y are ignored, t is encoded to 3. The remaining h and e are also ignored according to the coding scheme. To ensure the length of the results, 0 will be padded to the output S53

One place where we can find the SOUNDEX is useful in practice would be the WHERE clause. For example, let's consider a table below:

TRAINER_IDFIRST_NAMELAST_NAMEHOMETOWNJOURNEY_STARTED
1AshKetchumPallet Town01-APR-96
2MistyWaterflowerCerulean City15-FEB-97
3BrockPewterPewter City10-JUL-98
4PauloAntrimPallet Town23-JUL-01
5GrigoryAndersonPallet Town16-AUG-01
6CaroleAndersonPallet Town26-JUN-02

Now, suppose that you want to retrieve trainers from the table whose last name sounds like "Andrew":


The query above fetches every records from tbl_employee table whose last name sounds like Andrew. Thus, although strings Antrim and Anderson does not exactly match with Andrew, we can retrieve those rows.

It is worth noting that the SOUNDEX is case-insensitive and is designed for use with American English. So, it won't work quite as well with words originated from other languages, whose pronunciation rules and practices are different. For example, a Vietnamese name Nguyen is pronounced "Nwen", but the SOUNDEX patterns for the two string Nguyen and Nwen are different.

Post a Comment

0 Comments