SQL code copied to buffer
RU PT FR

Lesson 12.1 · Reading time: ~10 min

This lesson focuses on practical string processing in SQL. You will learn how to clean text values, normalize casing, extract useful fragments, and build readable fields for analytics and reporting. We will walk through hands-on scenarios using the Sakila database. By the end of the lesson, you will be able to confidently prepare text data for analysis directly in SQL.

Practical String Processing in SQL

In the previous module, we discussed SQL code quality and query performance. Now we move to applied analytics: real datasets often contain text fields that should not only be selected but first transformed into a usable form.

Practical string processing is needed in reports, user segmentation, reference data cleanup, export preparation, and data quality checks. These are exactly the kinds of tasks analysts and developers face in day-to-day work.

Practical string processing in SQL: text cleaning, email domain extraction, and analytical field building


Why Practical String Processing Matters

Basic string functions are useful on their own, but real value appears when you apply them to concrete tasks. For example, the same email value can be used for data-quality checks, domain-based segmentation, and marketing reporting.

In practice, SQL string processing usually falls into four task types:

  • cleaning text from extra spaces and repeated patterns;
  • normalizing case and format;
  • extracting parts of a string for analysis;
  • building new text fields for interfaces and reports.

A Basic Workflow for String Processing

In most cases, text is processed step by step:

  1. clean the value;
  2. normalize it to a consistent format;
  3. extract the required parts;
  4. use the result in analytics or reports.

This approach makes queries more predictable and easier to debug.

SELECT
   LOWER(TRIM(email)) AS email_normalized
FROM customer
LIMIT 5;

Result: email is trimmed and converted to lowercase.


Cleaning and Normalizing Text

The most common scenario is preparing a string for further analysis. For this, TRIM(), LOWER(), UPPER(), and REPLACE() are typically used.

Example: email normalization

SELECT
   customer_id,
   email,
   LOWER(TRIM(email)) AS email_normalized
FROM customer
LIMIT 10;

Note: even if data already looks clean, normalization improves comparison, grouping, and downstream automation.

Example: address cleanup

SELECT
   address_id,
   address,
   TRIM(REPLACE(address, 'Street', 'St.')) AS address_cleaned
FROM address
LIMIT 10;

Result: the address becomes shorter and more consistent, which is useful for reports and interfaces.


Extracting Useful Parts of a String

After cleanup, you often need only a specific part of a string for analysis. In MySQL, SUBSTRING(), LEFT(), RIGHT(), and SUBSTRING_INDEX() are especially useful.

Example: extract email domain

SELECT
   customer_id,
   email,
   SUBSTRING_INDEX(LOWER(TRIM(email)), '@', -1) AS email_domain
FROM customer
LIMIT 10;

Result: the domain part is extracted from email, for example example.com.

Example: extract film title prefix

SELECT
   film_id,
   title,
   LEFT(title, 5) AS title_prefix,
   RIGHT(title, 5) AS title_suffix
FROM film
LIMIT 10;

Note: these fragments can be useful for quick heuristics, naming-pattern checks, or short label generation.


Building Analytical Text Fields

In analytics, you often need readable labels rather than raw source fields. For that, CONCAT() and CONCAT_WS() are convenient.

Example: customer label for reporting

SELECT
   customer_id,
   CONCAT_WS(
      ' | ',
      CONCAT_WS(' ', first_name, last_name),
      LOWER(TRIM(email)),
      CONCAT('store=', store_id)
   ) AS customer_label
FROM customer
LIMIT 10;

Result: you get a compact text field suitable for admin reports, export files, and internal tools.


Validating Text Data Quality

String processing is useful not only for formatting but also for basic validation. SQL does not replace a full validation system, but it helps quickly find suspicious values.

Example: find emails without @

SELECT
   customer_id,
   email
FROM customer
WHERE INSTR(LOWER(TRIM(email)), '@') = 0;

Result: the query returns records where email does not contain the required separator.

Example: validate film title length

SELECT
   film_id,
   title,
   CHAR_LENGTH(title) AS title_length
FROM film
WHERE CHAR_LENGTH(title) > 20
ORDER BY title_length DESC
LIMIT 10;

Note: checks like this are useful when looking for values that are too long for cards, UI screens, or export limits.


Practical Example: Customer Segmentation by Email Domain

Now let us combine several techniques in one analytical query. Suppose we need to understand which domains are most common among customers.

SELECT
   SUBSTRING_INDEX(LOWER(TRIM(email)), '@', -1) AS email_domain,
   COUNT(*) AS customer_count
FROM customer
WHERE email IS NOT NULL
  AND INSTR(LOWER(TRIM(email)), '@') > 0
GROUP BY SUBSTRING_INDEX(LOWER(TRIM(email)), '@', -1)
ORDER BY customer_count DESC, email_domain
LIMIT 15;

Result: you get the customer distribution by email domain. This is useful for initial audience exploration, anomaly detection, and communication segmentation.

This example highlights an important point: string functions are most powerful in chains. First we clean the value, then validate its structure, then extract the domain, and only then aggregate.


Practical Recommendations

  • Normalize text before comparison and grouping.
  • If the same function appears many times in one query, consider a CTE or subquery for readability.
  • SUBSTRING_INDEX() is convenient in MySQL, but other DBMSs may require different syntax.
  • Do not try to solve all data-cleaning logic in one line; process text in stages.

Key takeaways from this lesson:

  • Practical SQL string processing is needed for cleaning, normalization, extraction, and validation of text data.
  • TRIM, LOWER, REPLACE, SUBSTRING_INDEX, LEFT, RIGHT, and CONCAT_WS are especially useful in daily work.
  • Before analysis, text should be normalized to a consistent format to avoid incorrect grouping and comparisons.
  • SQL can not only format strings but also quickly reveal data-quality issues.
  • The biggest value comes from combining functions in a clear, step-by-step workflow.

Frequently Asked Questions

Why normalize text if table values already look ready?

Because even visually clean data can contain extra spaces, inconsistent casing, or subtle format deviations. Normalization makes filtering, grouping, and comparisons more reliable.

Why is it useful to extract email domains for analysis?

Domains help segment users, identify corporate addresses, and detect data anomalies. It is a simple way to turn raw text into an analytical feature.

When is it better to build text fields in SQL rather than in the application?

When fields are needed for reports, admin interfaces, exports, or intermediate analytics. In these cases, SQL-side label building reduces post-processing and keeps logic close to the data.

Interview Questions

What kinds of tasks do you solve with string functions in analytical SQL?

Typical tasks include text cleaning, format normalization, feature extraction, and data validation. In practice, string functions are often used before grouping, segmentation, and report field generation.

Why is it useful to apply TRIM() and LOWER() before GROUP BY on text columns?

Without normalization, the same value may appear in multiple groups because of different casing or extra spaces. Pre-cleaning improves aggregation correctness and reduces false differences.

How would you explain the value of SUBSTRING_INDEX() with a practical example?

In MySQL, this function is convenient for quick extraction of a string part by delimiter. For example, you can extract an email domain and immediately use it for user segmentation or analytical reporting.

In the next lesson, we will move to SQL for data analysis and reporting and see how to turn prepared data into useful business insights.