IT · Deep dive

Data Transformation

The Basics

Data transformation projects draw on two groups of concepts: technical concepts about how data is stored and moved, and data quality concepts about what makes data good enough to use.

Technical concepts

ETL stands for extract, transform, load. It describes the three steps of moving data: pulling it from a source system, converting it into the right format, and loading it into a target system.
Data pipeline is an automated process that carries out ETL steps in sequence. Pipelines can run on a schedule or be triggered by an event, such as a new file arriving.
Data warehouse is a central storage system where cleaned and structured data from multiple sources is combined. It is designed for analysis rather than day-to-day operations.
Schema is the structure of a database: which tables exist, what columns they have, and how tables relate to each other.
Normalisation is the process of organising a database to reduce redundancy. It means splitting data into separate tables so that each piece of information is stored only once.
Star schema is a common structure for data warehouses. It has one central fact table, which holds the main measurements, surrounded by dimension tables, which hold descriptive information like dates, products, or locations.

Data quality concepts

Completeness measures whether all expected data is present. A dataset with many missing values is incomplete.
Consistency measures whether the same information is stored the same way across different systems. For example, a customer listed as "Netherlands" in one system and "NL" in another is inconsistent.
Accuracy measures whether the data correctly reflects reality. An address that no longer exists is inaccurate.
Validity measures whether data values fall within expected ranges or formats. A date of birth recorded as 32 January is invalid.

What You Will Do

Data profiling (assessment phase)

Before any transformation work begins, you need to understand what the data looks like. Data profiling means analysing source data to find problems. You will write queries to check how many rows are missing values, how many records appear more than once, and whether values fall within expected ranges. The output is usually a report that summarises the quality of each data source and flags the issues that need fixing before the transformation can proceed.

Designing transformation logic (design phase)

Once the problems are known, you help design the rules that will fix them. This means writing transformation logic: instructions that tell the pipeline how to clean, restructure, and map data from its source format to the target format. For example, a rule might say: if the country field is "NL", convert it to "Netherlands". You will document these rules clearly so that others can review and test them.

Building pipelines (build phase)

In the build phase, you turn the transformation logic into working code. This means writing SQL queries or Python scripts that extract data from source systems, apply the transformation rules, and load the result into the target system. You will also set up the pipeline to run automatically, handle errors, and log what it has done so that problems can be traced later.

Testing and validation (test phase)

Before the transformed data goes live, it needs to be checked carefully. You will write test cases that compare the output of the pipeline against expected results. This includes checking row counts, checking that key values have been mapped correctly, and running the data quality checks from the profiling phase again on the output. Any discrepancies need to be investigated and fixed before deployment.

Documentation and handover (deploy phase)

After deployment, the project needs to be handed over to the people who will maintain it. You will write documentation that explains how the pipeline works, what the transformation rules are, and what to do if something goes wrong. Clear documentation is often underdone on these projects but is important for keeping things running after the project team has moved on.

Methods and Models

SQL

SQL is the core tool for data transformation work. It is used for profiling, writing transformation logic, and validating outputs. A few patterns come up repeatedly.

Aggregations and group-by queries are used during profiling to summarise data and spot anomalies:

SELECT country, COUNT(*) as record_count
FROM customers
GROUP BY country
ORDER BY record_count DESC

Window functions are used to compare values across rows without collapsing the data. For example, to find duplicate records:

SELECT id, COUNT(*) OVER (PARTITION BY email) as duplicate_count
FROM customers

Joins are used to combine data from multiple tables. In a star schema, you will frequently join a fact table to its dimension tables to produce a flat dataset for analysis:

SELECT f.sales_amount, d.product_name, t.month
FROM fact_sales f
JOIN dim_product d ON f.product_id = d.product_id
JOIN dim_time t ON f.time_id = t.time_id

Python

Python is used alongside SQL, especially for tasks that are harder to do in pure SQL. The pandas library is the standard tool for transforming data in Python. A typical workflow involves loading data into a dataframe, applying transformation rules, and writing the result to a target system:

import pandas as pd

df = pd.read_csv("customers.csv")
df["country"] = df["country"].replace({"NL": "Netherlands", "DE": "Germany"})
df = df.dropna(subset=["email"])
df.to_csv("customers_clean.csv", index=False)

Python is also used to automate pipelines, handle errors, and connect to databases and APIs.

Data quality metrics

During profiling and validation, you will calculate simple metrics to measure data quality. A common one is the completeness rate for a column:

\text{Completeness} = \frac{\text{Number of non-null values}}{\text{Total number of rows}} \times 100\%

Similarly, a consistency check across two systems can be expressed as:

\text{Consistency} = \frac{\text{Number of matching records}}{\text{Total number of records}} \times 100\%

These metrics are used to set quality thresholds before a project goes live. For example, a team might decide that no column can have a completeness rate below 95% before the data is loaded into the warehouse.

Data modelling

Normalisation is a formal process with defined levels called normal forms. The first three are most relevant in practice:

First normal form (1NF): each column holds a single value and each row is unique.
Second normal form (2NF): all non-key columns depend on the full primary key, not just part of it.
Third normal form (3NF): no non-key column depends on another non-key column.

In practice, data warehouses often deliberately denormalise data into a star schema to make queries faster and simpler, even though this introduces some redundancy.

Good to Know

GDPR and data privacy

Data transformation projects almost always involve personal data, which means GDPR applies. The General Data Protection Regulation is a European law that sets rules for how personal data can be collected, stored, and used. On a transformation project, this has practical consequences. You need to know which fields contain personal data, make sure that data is not copied to systems where it should not be, and apply masking or anonymisation where needed. Data masking means replacing real values with fake but realistic ones, for example replacing a real name with a placeholder, so that the data can be used for testing without exposing real personal information. Understanding the basics of GDPR is expected on any project that handles personal data.

Data governance

Data governance refers to the rules and processes an organisation uses to manage its data. This includes defining who is responsible for which data, setting quality standards, and keeping track of where data comes from and how it has been changed. On a transformation project, you will often work within an existing governance framework, or help build one. A key concept is data lineage, which means documenting the path a piece of data has travelled from its source to its destination, including every transformation applied along the way. Good lineage documentation makes it much easier to trace errors and audit the data later.

Cloud platforms

Most modern data transformation projects run on cloud platforms rather than on-premise servers. Cloud platforms offer storage, computing power, and managed pipeline tools that can be scaled up or down as needed. In practice, this means the pipelines you build will often run on a cloud environment, and the data warehouse will be hosted there too. You do not need to be a cloud infrastructure specialist, but understanding the basics of how cloud storage and compute work will make you more effective on these projects.

Where this fits in a data career

Data transformation sits between data engineering and data analysis. Data engineers focus on building and maintaining the infrastructure that moves data around. Data analysts focus on using clean data to answer business questions. Transformation projects require both. This makes them a good entry point into a data career, because you get exposure to technical work like pipeline building and SQL, as well as analytical work like profiling and validation. Over time, people working on these projects tend to specialise in one direction or the other, depending on where their interests lie.