
Data Scientist

The Basics

Data science draws on two groups of concepts: statistical concepts about how models learn and generalise, and data concepts about the quality and structure of the input.

Statistical concepts

Supervised learning is a type of modelling where the model learns from labelled data, meaning data where the correct output is already known. The goal is to learn a mapping from inputs to outputs that generalises to new data.
Unsupervised learning is a type of modelling where there are no labels. The goal is to find structure in the data, such as groups of similar observations or underlying patterns.
Bias refers to the error introduced by a model that is too simple to capture the true relationship in the data. A high-bias model consistently misses the pattern.
Variance refers to the sensitivity of a model to small changes in the training data. A high-variance model fits the training data very closely but performs poorly on new data.
Overfitting occurs when a model learns the training data too well, including its noise, and fails to generalise. It is the result of high variance.
Underfitting occurs when a model is too simple to capture the pattern in the data. It is the result of high bias.
A train/test split divides the data into a training set, used to fit the model, and a test set, held back to evaluate how well the model generalises to new data.
Cross-validation is a more robust version of the train/test split. The data is divided into several folds, and the model is trained and evaluated once per fold, each time holding out a different fold for evaluation and training on the rest. The results are averaged to give a more reliable estimate of performance.
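The fold mechanics of cross-validation can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names `k_fold_indices` and `cross_validate`, the toy data, and the "predict the training mean" model are all assumptions made for the example.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Split the indices 0..n_samples-1 into k shuffled, near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]

def cross_validate(n_samples, k, fit_and_score, seed=0):
    """Average a score over k rounds, holding out a different fold each time."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        scores.append(fit_and_score(train_idx, test_idx))
    return sum(scores) / k

# Toy outcome values (hypothetical).
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

def mean_model_mse(train_idx, test_idx):
    """Fit: predict the training mean. Score: squared error on the held-out fold."""
    prediction = sum(y[i] for i in train_idx) / len(train_idx)
    return sum((y[i] - prediction) ** 2 for i in test_idx) / len(test_idx)

cv_mse = cross_validate(len(y), k=3, fit_and_score=mean_model_mse)
print(f"Cross-validated MSE: {cv_mse:.2f}")
```

In practice you would normally rely on a library for this; scikit-learn, for instance, provides `train_test_split` and `KFold` in `sklearn.model_selection`.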

Data concepts

A feature is an input variable used by the model. Features are the columns in your dataset that you use to make predictions.
Feature engineering is the process of creating new features from existing data to improve model performance. For example, extracting the day of the week from a date variable, or combining two variables into a ratio.
Data leakage occurs when the model is trained on information that would not be available when it is used for real predictions, such as information from the test set, making it appear to perform better than it actually does. A common example is including a variable that is only recorded after the outcome is known.
Missing data refers to observations where one or more values are absent. How you handle missing data, whether by removing rows, filling in values, or modelling the missingness, can affect the results significantly.
Class imbalance is a problem in classification tasks where one outcome is much more common than the other. For example, fraud detection datasets often have far more non-fraud cases than fraud cases, which can cause a model to perform poorly on the minority class.
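Two of the ideas above, filling in missing values and engineering features, can be sketched with the standard library alone. The `orders` records and their field names are hypothetical, and mean imputation is just one of the options mentioned, chosen here for brevity.

```python
from datetime import date
from statistics import mean

# Hypothetical raw records; None marks a missing value.
orders = [
    {"order_date": date(2024, 1, 1), "revenue": 120.0, "units": 4},
    {"order_date": date(2024, 1, 6), "revenue": None, "units": 2},
    {"order_date": date(2024, 1, 7), "revenue": 90.0, "units": 3},
]

# Missing data: fill absent revenue with the mean of the observed values,
# and keep a flag so the model can still see that the value was missing.
revenue_mean = mean(o["revenue"] for o in orders if o["revenue"] is not None)
for o in orders:
    o["revenue_was_missing"] = o["revenue"] is None
    if o["revenue"] is None:
        o["revenue"] = revenue_mean

# Feature engineering: day of the week from a date, and a ratio of two variables.
for o in orders:
    o["day_of_week"] = o["order_date"].weekday()  # 0 = Monday, 6 = Sunday
    o["revenue_per_unit"] = o["revenue"] / o["units"]

for o in orders:
    print(o)
```

In a real workflow the fill value should be computed on the training set only and then reused on the test set; computing it on all the data is a mild form of data leakage.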

What You Will Do

Data preparation

Before building any model, you need to get the data into a usable state. This involves loading data from one or more sources, checking for missing values and outliers, and deciding how to handle them. You will also engineer features, transforming raw variables into forms that are more useful for the model. This phase takes up more time than most people expect, and the quality of the data preparation has a large effect on the quality of the final model.

Exploratory analysis

Before modelling, you will spend time understanding the data. This means calculating summary statistics, plotting distributions, and looking for relationships between variables. The goal is to understand what the data contains, spot any remaining quality issues, and form hypotheses about which variables are likely to be useful. Exploratory analysis often leads back to data preparation, as you discover problems that need to be fixed.

Model building

Once the data is ready, you will select a modelling approach and fit it to the training data. This involves choosing the type of model, setting its hyperparameters (the settings fixed before fitting, such as the strength of regularisation), and deciding which features to include. In practice, you will build and compare several models rather than committing to one immediately. Each modelling choice involves a trade-off, and understanding why one model performs better than another is part of the job.

Model evaluation

After fitting a model, you need to assess how well it performs. This means evaluating it on the test set and checking whether the performance metrics are good enough for the intended use. You will also check for signs of overfitting, data leakage, and poor performance on specific subgroups. A model that performs well on average but badly on an important subgroup may not be fit for purpose.
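The subgroup check described above can be as simple as computing the same metric per group. The sketch below uses accuracy; the labels, predictions, and group names are made up for illustration, and `accuracy_by_group` is a hypothetical helper.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Return the share of correct predictions within each subgroup."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}

# Hypothetical test-set labels, predictions, and subgroup membership.
y_true = [1, 0, 1, 0, 1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 1, 1, 1, 0]
groups = ["A"] * 8 + ["B"] * 2

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_group = accuracy_by_group(y_true, y_pred, groups)

print(f"Overall accuracy: {overall:.2f}")
print(f"Accuracy by group: {per_group}")
```

Here the overall accuracy looks acceptable, yet every prediction for group B is wrong, which is exactly the situation an average hides.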

Communicating results

The final task is presenting your findings. This means summarising what you did, what the model does, how well it performs, and what its limitations are. You will often need to explain this to people with no technical background, which requires translating numbers and model outputs into clear, practical language. Visualisations are an important part of this, as a well-designed chart often communicates a result more clearly than a table of numbers.

Methods and Models

Supervised learning: regression and classification

Most applied data science work falls into one of two categories. Regression problems involve predicting a continuous outcome, such as a price or a quantity. Classification problems involve predicting a discrete outcome, such as whether a customer will leave or whether a transaction is fraudulent.

In both cases, the model learns by minimising a loss function, which measures the difference between the model's predictions and the actual outcomes. For regression, a common loss function is the mean squared error:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value for observation $i$, and $n$ is the number of observations. The model adjusts its parameters to make this as small as possible on the training data.
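To make "adjusts its parameters" concrete, here is a minimal sketch that minimises the MSE of a one-parameter model $\hat{y}_i = \beta x_i$ by gradient descent. Gradient descent is only one of several ways to minimise a loss, and the model, the data, the learning rate, and the number of steps are all assumptions chosen for the example.

```python
def mse(beta, x, y):
    """Mean squared error of the one-parameter model y_hat = beta * x."""
    return sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y)) / len(x)

def mse_gradient(beta, x, y):
    """Derivative of the MSE with respect to beta."""
    return -2 * sum(xi * (yi - beta * xi) for xi, yi in zip(x, y)) / len(x)

# Toy data generated by y = 2x, so the loss is minimised at beta = 2.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

beta = 0.0            # starting guess
learning_rate = 0.05  # assumed step size
initial_loss = mse(beta, x, y)

for _ in range(50):
    # Move beta a small step in the direction that reduces the loss.
    beta -= learning_rate * mse_gradient(beta, x, y)

final_loss = mse(beta, x, y)
print(f"beta = {beta:.4f}, loss went from {initial_loss:.2f} to {final_loss:.6f}")
```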

For classification, a common loss function is cross-entropy loss, which measures how well the model's predicted probabilities match the actual class labels:

$$L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]$$

where $y_i \in \{0, 1\}$ is the actual class label and $\hat{p}_i$ is the predicted probability that observation $i$ belongs to the positive class. This is the two-class (binary) form of the loss.
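A direct translation of the binary cross-entropy formula shows how it rewards confident correct probabilities and punishes confident wrong ones. This is a sketch: the function name, the example labels, and the small clipping constant `eps` (added to avoid taking the logarithm of zero) are assumptions.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

labels = [1, 0, 1]

confident_right = binary_cross_entropy(labels, [0.9, 0.1, 0.9])
uninformative = binary_cross_entropy(labels, [0.5, 0.5, 0.5])
confident_wrong = binary_cross_entropy(labels, [0.1, 0.9, 0.1])

print(f"confident and right: {confident_right:.3f}")
print(f"always 0.5:          {uninformative:.3f}")
print(f"confident and wrong: {confident_wrong:.3f}")
```

The loss grows from the first case to the last, so a model is penalised most heavily when it assigns high probability to the wrong class.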

Regularisation

Regularisation is a technique for reducing overfitting by adding a penalty to the loss function that discourages the model from fitting the training data too closely. Two common forms are L1 regularisation, which adds a penalty proportional to the sum of the absolute values of the parameters, and L2 regularisation, which adds a penalty proportional to the sum of their squares. With mean squared error as the base loss, the L2-regularised loss is:

$$L_{L2} = \text{MSE} + \lambda \sum_{j=1}^{k} \beta_j^2$$

where $\beta_j$ are the model's $k$ parameters and $\lambda \geq 0$ controls the strength of the penalty. A higher $\lambda$ means more regularisation and a simpler model. The value of $\lambda$ is usually chosen through cross-validation.
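The effect of $\lambda$ is easiest to see in the simplest possible case. The sketch below assumes a one-feature model with no intercept, $\hat{y}_i = \beta x_i$, for which minimising $\text{MSE} + \lambda \beta^2$ has the closed-form solution $\beta = \sum_i x_i y_i / (\sum_i x_i^2 + n\lambda)$. The data and the function name `ridge_slope` are made up for illustration.

```python
def ridge_slope(x, y, lam):
    """Slope minimising (1/n) * sum((y_i - beta * x_i)^2) + lam * beta^2."""
    n = len(x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    # Setting the derivative of the penalised loss to zero gives this expression.
    return sxy / (sxx + n * lam)

# Toy data lying roughly on the line y = 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

slopes = {lam: ridge_slope(x, y, lam) for lam in (0.0, 1.0, 10.0)}
for lam, slope in slopes.items():
    print(f"lambda = {lam:>4}: beta = {slope:.3f}")
```

With $\lambda = 0$ the result is the ordinary least-squares slope; as $\lambda$ grows the slope is pulled towards zero, which is the "simpler model" the penalty is designed to produce.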

Model evaluation metrics

For regression problems, common evaluation metrics include mean squared error and mean absolute error:

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

MAE is easier to interpret than MSE because it is in the same units as the outcome variable, whereas MSE is in squared units and penalises large errors more heavily.
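A small worked comparison, using made-up numbers, shows how the two metrics react to a single large error:

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)

# Three predictions are off by 1; the last one is off by 10.
y_true = [10.0, 12.0, 11.0, 13.0]
y_pred = [11.0, 11.0, 12.0, 23.0]

mae = mean_absolute_error(y_true, y_pred)  # (1 + 1 + 1 + 10) / 4
mse = mean_squared_error(y_true, y_pred)   # (1 + 1 + 1 + 100) / 4

print(f"MAE = {mae}, MSE = {mse}")
```

The single large miss accounts for most of the MAE but almost all of the MSE, so MSE is the stricter choice when large errors are especially costly.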

For classification problems, accuracy alone is often not enough, especially when classes are imbalanced. Two more informative metrics are precision and recall:

$$\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}$$

where $TP$ is the number of true positives, $FP$ is the number of false positives, and $FN$ is the number of false negatives. Precision measures how many of the predicted positives are correct. Recall measures how many of the actual positives the model correctly identified. There is usually a trade-off between the two, and which one matters more depends on the application.
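The sketch below computes these quantities for a hypothetical fraud dataset of 100 transactions, 5 of them fraudulent, to show why accuracy alone can mislead on imbalanced classes. The data and the helper names are assumptions made for the example.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives, true negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall(y_true, y_pred):
    """Return (precision, recall), using 0.0 when a denominator is zero."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# 5 fraud cases (label 1) followed by 95 legitimate ones (label 0).
y_true = [1] * 5 + [0] * 95
# The model catches 2 of the 5 frauds and raises 1 false alarm.
y_pred = [1, 1, 0, 0, 0] + [1] + [0] * 94

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision, recall = precision_recall(y_true, y_pred)

print(f"accuracy = {accuracy:.2f}, precision = {precision:.2f}, recall = {recall:.2f}")
```

Accuracy is 96% even though the model misses three of the five fraud cases, which only recall reveals.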

The bias-variance trade-off

The bias-variance trade-off is a central concept in model selection. Under squared-error loss, the expected prediction error on new data can be decomposed into three components:

$$\text{Total error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible error}$$

Simpler models tend to have high bias and low variance. Complex models tend to have low bias and high variance. The goal is to find a model that balances the two, minimising total error on new data. Regularisation, cross-validation, and careful feature selection are the main tools for managing this trade-off in practice.
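For readers who want the decomposition written out, one standard formulation is given below. It rests on assumptions not stated in the text: the outcome is generated as $y = f(x) + \varepsilon$, the noise $\varepsilon$ has mean zero and variance $\sigma^2$ and is independent of the fitted model $\hat{f}$, and the expectation is taken over both the noise and the random training sets used to fit $\hat{f}$.

```latex
% Assumes y = f(x) + \varepsilon, E[\varepsilon] = 0, Var(\varepsilon) = \sigma^2,
% with \varepsilon independent of the fitted model \hat{f}.
\mathbb{E}\left[ \left( y - \hat{f}(x) \right)^2 \right]
  = \underbrace{\left( \mathbb{E}[\hat{f}(x)] - f(x) \right)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\left[ \left( \hat{f}(x) - \mathbb{E}[\hat{f}(x)] \right)^2 \right]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible error}}
```

The first term measures how far the average fitted model is from the truth, the second how much the fitted model moves from one training set to another, and the third is noise that no model can remove.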

Good to Know

Data scientist, ML engineer, and data engineer

The data science field has several related roles that are easy to confuse. A data scientist focuses on building models and generating insights from data. A machine learning engineer focuses on taking models built by data scientists and deploying them into production systems, making sure they run reliably at scale. A data engineer focuses on building and maintaining the pipelines and infrastructure that move and store data, making it available for analysis and modelling.

In practice, the boundaries between these roles are not always clear. At smaller organisations, one person may do all three. At larger organisations, the roles are more distinct and each requires a different set of skills. As an econometrics graduate, your training aligns most closely with the data scientist role, but understanding what the other roles involve will help you work effectively with the people in them.

Tooling and programming languages

Python is the dominant programming language in data science. The most important libraries are pandas for data manipulation, NumPy for numerical computing, scikit-learn for machine learning, and matplotlib or seaborn for visualisation. SQL is also essential, as most data you will work with lives in relational databases and needs to be extracted and prepared before it can be used in Python.
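As a rough illustration of how SQL and Python fit together, the sketch below aggregates data with SQL and pulls the result into Python. It uses an in-memory SQLite database from the standard library so that it runs anywhere; the `transactions` table and its columns are hypothetical, and a real project would connect to the organisation's own database instead.

```python
import sqlite3

# An in-memory database stands in for a real relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, 20.0), (1, 35.0), (2, 12.5)],
)

# Do the aggregation in SQL, then bring only the summarised rows into Python.
rows = conn.execute(
    "SELECT customer_id, COUNT(*) AS n_transactions, SUM(amount) AS total_amount "
    "FROM transactions GROUP BY customer_id ORDER BY customer_id"
).fetchall()
conn.close()

for customer_id, n_transactions, total_amount in rows:
    print(customer_id, n_transactions, total_amount)
```

With pandas installed, the same query can be loaded straight into a DataFrame using `pandas.read_sql_query`.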

Version control using Git is expected in most data science roles. It allows you to track changes to your code, collaborate with others, and roll back to earlier versions if something breaks. Jupyter notebooks are widely used for exploratory analysis and model development, while more structured Python scripts or pipelines are used for production work. Cloud platforms are increasingly part of the toolkit as well, as data and models are often stored and run in cloud environments rather than on local machines.