Lesson 12.1 · Reading time: ~10 min
This lesson focuses on practical string processing in SQL. You will learn how to clean text values, normalize casing, extract useful fragments, and build readable fields for analytics and reporting. We will walk through hands-on scenarios using the Sakila database. By the end of the lesson, you will be able to confidently prepare text data for analysis directly in SQL.
In the previous module, we discussed SQL code quality and query performance. Now we move to applied analytics: real datasets often contain text fields that should not only be selected but first transformed into a usable form.
Practical string processing is needed in reports, user segmentation, reference data cleanup, export preparation, and data quality checks. These are exactly the kinds of tasks analysts and developers face in day-to-day work.
Basic string functions are useful on their own, but real value appears when you apply them to concrete tasks. For example, the same email value can be used for data-quality checks, domain-based segmentation, and marketing reporting.
In practice, SQL string processing usually falls into four task types:
In most cases, text is processed step by step:
This approach makes queries more predictable and easier to debug.
SELECT
LOWER(TRIM(email)) AS email_normalized
FROM customer
LIMIT 5;
Result: email is trimmed and converted to lowercase.
The most common scenario is preparing a string for further analysis. For this, TRIM(), LOWER(), UPPER(), and REPLACE() are typically used.
SELECT
customer_id,
email,
LOWER(TRIM(email)) AS email_normalized
FROM customer
LIMIT 10;
Note: even if data already looks clean, normalization improves comparison, grouping, and downstream automation.
SELECT
address_id,
address,
TRIM(REPLACE(address, 'Street', 'St.')) AS address_cleaned
FROM address
LIMIT 10;
Result: the address becomes shorter and more consistent, which is useful for reports and interfaces.
After cleanup, you often need only a specific part of a string for analysis. In MySQL, SUBSTRING(), LEFT(), RIGHT(), and SUBSTRING_INDEX() are especially useful.
SELECT
customer_id,
email,
SUBSTRING_INDEX(LOWER(TRIM(email)), '@', -1) AS email_domain
FROM customer
LIMIT 10;
Result: the domain part is extracted from email, for example example.com.
SELECT
film_id,
title,
LEFT(title, 5) AS title_prefix,
RIGHT(title, 5) AS title_suffix
FROM film
LIMIT 10;
Note: these fragments can be useful for quick heuristics, naming-pattern checks, or short label generation.
In analytics, you often need readable labels rather than raw source fields. For that, CONCAT() and CONCAT_WS() are convenient.
SELECT
customer_id,
CONCAT_WS(
' | ',
CONCAT_WS(' ', first_name, last_name),
LOWER(TRIM(email)),
CONCAT('store=', store_id)
) AS customer_label
FROM customer
LIMIT 10;
Result: you get a compact text field suitable for admin reports, export files, and internal tools.
String processing is useful not only for formatting but also for basic validation. SQL does not replace a full validation system, but it helps quickly find suspicious values.
@SELECT
customer_id,
email
FROM customer
WHERE INSTR(LOWER(TRIM(email)), '@') = 0;
Result: the query returns records where email does not contain the required separator.
SELECT
film_id,
title,
CHAR_LENGTH(title) AS title_length
FROM film
WHERE CHAR_LENGTH(title) > 20
ORDER BY title_length DESC
LIMIT 10;
Note: checks like this are useful when looking for values that are too long for cards, UI screens, or export limits.
Now let us combine several techniques in one analytical query. Suppose we need to understand which domains are most common among customers.
SELECT
SUBSTRING_INDEX(LOWER(TRIM(email)), '@', -1) AS email_domain,
COUNT(*) AS customer_count
FROM customer
WHERE email IS NOT NULL
AND INSTR(LOWER(TRIM(email)), '@') > 0
GROUP BY SUBSTRING_INDEX(LOWER(TRIM(email)), '@', -1)
ORDER BY customer_count DESC, email_domain
LIMIT 15;
Result: you get the customer distribution by email domain. This is useful for initial audience exploration, anomaly detection, and communication segmentation.
This example highlights an important point: string functions are most powerful in chains. First we clean the value, then validate its structure, then extract the domain, and only then aggregate.
CTE or subquery for readability.SUBSTRING_INDEX() is convenient in MySQL, but other DBMSs may require different syntax.Key takeaways from this lesson:
TRIM, LOWER, REPLACE, SUBSTRING_INDEX, LEFT, RIGHT, and CONCAT_WS are especially useful in daily work.Because even visually clean data can contain extra spaces, inconsistent casing, or subtle format deviations. Normalization makes filtering, grouping, and comparisons more reliable.
Domains help segment users, identify corporate addresses, and detect data anomalies. It is a simple way to turn raw text into an analytical feature.
When fields are needed for reports, admin interfaces, exports, or intermediate analytics. In these cases, SQL-side label building reduces post-processing and keeps logic close to the data.
Typical tasks include text cleaning, format normalization, feature extraction, and data validation. In practice, string functions are often used before grouping, segmentation, and report field generation.
TRIM() and LOWER() before GROUP BY on text columns?Without normalization, the same value may appear in multiple groups because of different casing or extra spaces. Pre-cleaning improves aggregation correctness and reduces false differences.
SUBSTRING_INDEX() with a practical example?In MySQL, this function is convenient for quick extraction of a string part by delimiter. For example, you can extract an email domain and immediately use it for user segmentation or analytical reporting.
In the next lesson, we will move to SQL for data analysis and reporting and see how to turn prepared data into useful business insights.