Academic field · Deep dive

Econometrics

The Basics

Econometrics is built on a set of core concepts that run through almost every method and application in the field.

Variables and parameters

Dependent variable is the outcome you are trying to explain or predict. It is sometimes called the response variable or the left-hand side variable.
Independent variable is a variable you use to explain the dependent variable. It is sometimes called an explanatory variable or a regressor.
Parameter is a number that describes a relationship in a model. In a linear regression model, the parameter on an independent variable tells you how much the dependent variable is expected to change when that variable increases by one unit, holding the other variables in the model constant.
Estimation is the process of using data to calculate the values of unknown parameters. Because you never observe the true parameter, you estimate it from the data you have.

Inference and model fit

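Hypothesis testing is a formal procedure for deciding whether the data are consistent with a stated hypothesis, called the null hypothesis. A result is called statistically significant when a result that extreme would be unlikely to arise by chance alone if the null hypothesis were true. The most common test checks whether a parameter is significantly different from zero.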
Standard error measures the uncertainty around an estimated parameter. It is an estimate of how much the parameter estimate would vary from one sample to another. A small standard error means the estimate is precise. A large one means there is a lot of uncertainty.
p-value is the probability of observing a result at least as extreme as the one you found, assuming the null hypothesis is true. A low p-value is taken as evidence against the null hypothesis.
R-squared measures the share of the variation in the dependent variable that is explained by the model. For a linear regression that includes an intercept, it ranges from 0 to 1, where 1 means the model explains all the variation.
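The quantities above can be computed by hand for a one-regressor model. The following is a minimal sketch, not a substitute for a statistics package: the schooling and wage figures are made up for illustration, the standard error formula assumes homoskedastic errors, and the p-value uses a normal approximation (with only eight observations, a t distribution with n - 2 degrees of freedom would be the more usual choice).

```python
from math import sqrt
from statistics import NormalDist, mean

# Hypothetical toy data: years of schooling (x) and hourly wage (y).
x = [8, 10, 12, 12, 14, 16, 16, 18]
y = [9.0, 11.5, 12.0, 14.5, 15.0, 19.5, 17.0, 21.5]

n = len(x)
x_bar, y_bar = mean(x), mean(y)

# OLS slope and intercept for a model with a single regressor.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# R-squared = 1 - (sum of squared residuals) / (total sum of squares).
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
ssr = sum(e ** 2 for e in residuals)
sst = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - ssr / sst

# Standard error of the slope, assuming homoskedastic errors.
sigma2_hat = ssr / (n - 2)
se_slope = sqrt(sigma2_hat / s_xx)

# t-statistic for the null hypothesis slope = 0,
# with a two-sided p-value from the normal approximation.
t_stat = slope / se_slope
p_value = 2 * NormalDist().cdf(-abs(t_stat))

print(f"slope={slope:.3f} se={se_slope:.3f} t={t_stat:.2f} "
      f"p={p_value:.2g} R2={r_squared:.3f}")
```

The t-statistic is just the estimate divided by its standard error, which is why a small standard error makes it easier to reject the null hypothesis.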

Causation and endogeneity

Correlation means two variables move together. It does not mean one causes the other.
Causation means a change in one variable directly produces a change in another. Establishing causation is harder than measuring correlation and requires careful research design.
Endogeneity is a problem that arises when an independent variable is correlated with the error term in a model. This usually happens because of omitted variables, reverse causality, or measurement error. Endogeneity makes standard estimates biased and inconsistent, meaning they do not converge to the true parameter even in very large samples, and it is one of the central problems econometrics tries to solve.
Omitted variable bias occurs when a variable that affects the dependent variable is left out of the model and is also correlated with an included independent variable. The included variable then picks up part of the omitted variable's effect, so its estimated parameter is biased.
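A small numerical illustration may help make omitted variable bias concrete. The sketch below uses made-up data in which the outcome is generated exactly as y = 1 + 2x + 3z, so the true parameter on x is 2. The variable names and the interpretation (z as ability, x as education) are assumptions for illustration only. Regressing y on x alone gives a slope equal to 2 plus 3 times the slope from regressing z on x.

```python
from statistics import mean

def slope(x, y):
    """OLS slope from regressing y on x with an intercept."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    return s_xy / s_xx

# Hypothetical data: z (say, ability) is correlated with x (education)
# and also affects y (earnings) directly.
z = [0, 1, 2, 3, 4, 5]
x = [1, 1, 3, 2, 4, 4]
y = [1 + 2 * xi + 3 * zi for xi, zi in zip(x, z)]  # true effect of x is 2

short_slope = slope(x, y)   # regression of y on x with z omitted
delta = slope(x, z)         # how strongly z moves with x
bias = 3 * delta            # effect of z on y times delta

print(f"short regression slope: {short_slope:.3f}")
print(f"true effect + bias:     {2 + bias:.3f}")
```

Because z and x are positively correlated and z raises y, the short regression overstates the effect of x. If either link were absent (z unrelated to x, or z having no effect on y), the bias term would be zero.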

What You Will Do

Working with data

In practice, econometrics work starts with data, and the data is rarely clean or ready to use. You will spend time understanding where the data comes from, what each variable means, and what problems it has. This means checking for missing values, identifying outliers, and deciding how to handle them. You will also merge datasets from different sources, which requires careful attention to how observations are matched across datasets.
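As a rough sketch of what these checks look like, the example below merges two small hypothetical datasets on a (country, year) key using plain Python. The records, field names, and values are invented for illustration; in real work a data-frame library would normally handle this, but the same three checks apply: unique keys, unmatched observations, and missing values.

```python
# Hypothetical records from two sources, keyed by (country, year).
gdp = [
    {"country": "A", "year": 2020, "gdp": 100.0},
    {"country": "A", "year": 2021, "gdp": None},    # missing value
    {"country": "B", "year": 2020, "gdp": 250.0},
]
unemployment = [
    {"country": "A", "year": 2020, "unemp": 5.1},
    {"country": "A", "year": 2021, "unemp": 4.8},
    {"country": "C", "year": 2020, "unemp": 7.3},   # no match in gdp
]

def key(row):
    """Merge key identifying one observation."""
    return (row["country"], row["year"])

# Check that the merge key is unique in each source before joining.
for rows in (gdp, unemployment):
    keys = [key(r) for r in rows]
    assert len(keys) == len(set(keys)), "duplicate merge keys"

unemp_by_key = {key(r): r for r in unemployment}
gdp_keys = {key(g) for g in gdp}

# Inner join, keeping track of observations that fail to match on either side.
merged = [
    {**g, "unemp": unemp_by_key[key(g)]["unemp"]}
    for g in gdp
    if key(g) in unemp_by_key
]
unmatched_gdp = [key(g) for g in gdp if key(g) not in unemp_by_key]
unmatched_unemp = [k for k in unemp_by_key if k not in gdp_keys]

# Count missing values per variable instead of silently dropping rows.
missing = {col: sum(r[col] is None for r in merged) for col in ("gdp", "unemp")}

print(len(merged), unmatched_gdp, unmatched_unemp, missing)
```

Reporting the unmatched keys and missing counts, rather than only the merged result, makes it easier to notice when a join has quietly dropped part of the sample.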

Making modelling choices

Once the data is ready, you need to decide how to model the question you are trying to answer. This involves choosing which variables to include, deciding on the functional form of the model, and thinking about what assumptions are needed for the estimates to be valid. These choices are not mechanical. They require understanding the economic context and thinking carefully about what could go wrong. For example, if you are estimating the effect of education on earnings, you need to think about whether education is endogenous, meaning people with higher ability might both get more education and earn more regardless of education.

Estimating and interpreting results

After fitting the model, you will interpret the estimated parameters and assess whether the results make sense. This means checking the sign and magnitude of the estimates, testing whether they are statistically significant, and thinking about whether the size of the effect is economically meaningful. Statistical significance and economic significance are not the same thing. A very large dataset can produce statistically significant results that are too small to matter in practice.

Communicating findings

A large part of applied econometrics work is explaining results to people who are not econometricians. You will write reports and presentations that summarise what you did, what you found, and what it means. This requires translating technical results into plain language without losing the important nuance. For example, explaining the difference between a correlation and a causal estimate is something you will need to do regularly, and doing it clearly is a skill in itself.

Methods and Models

Ordinary least squares regression

Ordinary least squares (OLS) is the foundation of econometrics. It estimates the parameters of a linear model by minimising the sum of squared differences between the observed values of the dependent variable and the values predicted by the model.

The basic linear model is:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i$$

where $y_i$ is the dependent variable for observation $i$, $x_{i1}, \dots, x_{ik}$ are the independent variables, $\beta_0, \dots, \beta_k$ are the parameters to be estimated, and $\varepsilon_i$ is the error term, which captures everything that affects $y_i$ but is not included in the model.

OLS chooses the parameter estimates $\hat{\beta}$ that minimise:

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \dots - \hat{\beta}_k x_{ik})^2$$

For OLS estimates to be unbiased and consistent, a set of assumptions must hold. The most important is that the error term is uncorrelated with the independent variables. When this assumption fails, you have endogeneity, and OLS estimates are no longer reliable.
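To make the minimisation concrete, here is a minimal sketch of OLS computed from the normal equations (X'X)b = X'y, which are the first-order conditions of the sum of squares above. It uses only the standard library, so the small linear solver is written out by hand; the function names `solve` and `ols` and the data are illustrative assumptions, and the sketch does not guard against perfectly collinear regressors.

```python
def solve(a, b):
    """Solve the linear system a @ beta = b by Gauss-Jordan elimination
    with partial pivoting. Assumes a is square and non-singular."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(X, y):
    """Return [b0, b1, ..., bk] for regressor rows X (no constant) and outcomes y."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data generated exactly from y = 1 + 2*x1 - 0.5*x2 (no noise).
X = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
y = [1 + 2 * x1 - 0.5 * x2 for x1, x2 in X]

beta_hat = ols(X, y)
print([round(b, 6) for b in beta_hat])
```

Because the example data contain no error term, the estimates recover the generating parameters up to floating-point rounding. With noisy data the same code returns the least-squares fit rather than the true parameters.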

Time series methods

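Time series data is data collected at regular intervals over time, such as monthly inflation figures or daily stock returns. A key concept in time series analysis is stationarity. A time series is (weakly) stationary if its mean and variance do not change over time and the covariance between two observations depends only on how far apart they are, not on when they occur. Many economic time series, such as GDP or price levels, are not stationary, which causes problems for standard regression methods: regressing one trending series on another can produce relationships that look significant but are spurious.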

A common way to make a series stationary is to take first differences, meaning you model the change from one period to the next rather than the level:

$$\Delta y_t = y_t - y_{t-1}$$
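In code, first differencing is a one-line operation. The sketch below applies it to a made-up price index; the function name `first_difference` and the numbers are illustrative assumptions.

```python
def first_difference(series):
    """Return y_t - y_{t-1} for each t after the first observation.

    The result has one fewer element than the input.
    """
    return [curr - prev for prev, curr in zip(series, series[1:])]

# Hypothetical price index with a steady upward trend.
levels = [100.0, 102.0, 104.5, 106.5, 109.0, 111.0]
changes = first_difference(levels)

print(changes)  # [2.0, 2.5, 2.0, 2.5, 2.0]
```

The levels trend upward, so their mean depends on the time window, while the changes fluctuate around a constant value. Note that differencing costs one observation.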

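The ARIMA model is a standard tool for modelling and forecasting time series. ARIMA stands for autoregressive integrated moving average. The integrated part refers to differencing the series as many times as needed to make it stationary; the autoregressive and moving average parts are then fitted to the differenced series. The autoregressive part models the current value as a function of past values: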

$$y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \varepsilon_t$$

The moving average part models the current value as a function of past error terms:

$$y_t = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}$$

In practice, ARIMA models are used for short-term forecasting in areas like macroeconomics, finance, and demand planning.
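As a minimal sketch of the autoregressive part only, the example below simulates an AR(1) process, estimates its coefficient by least squares, and forms a one-step-ahead forecast. It is not a full ARIMA implementation: there is no differencing, no moving average term, and no constant. The function names, the value 0.6, the sample size, and the seed are all illustrative assumptions.

```python
import random

def simulate_ar1(phi, n, seed=0):
    """Simulate y_t = phi * y_{t-1} + eps_t with standard normal shocks."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(n - 1):
        y.append(phi * y[-1] + rng.gauss(0.0, 1.0))
    return y

def estimate_ar1(y):
    """Least-squares estimate of phi from regressing y_t on y_{t-1} (no constant)."""
    num = sum(curr * prev for prev, curr in zip(y, y[1:]))
    den = sum(prev ** 2 for prev in y[:-1])
    return num / den

y = simulate_ar1(phi=0.6, n=5000)
phi_hat = estimate_ar1(y)

# One-step-ahead forecast: the expected next value given the last observation.
forecast = phi_hat * y[-1]

print(f"phi_hat={phi_hat:.3f} forecast={forecast:.3f}")
```

With a coefficient below 1 in absolute value the simulated series is stationary, so the estimate should land close to the value used in the simulation. Forecasts further ahead shrink geometrically toward the mean, which is one reason these models are mainly used for short horizons.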

Causal inference

When you cannot run a randomised experiment, you need other methods to estimate causal effects. Two widely used approaches are difference-in-differences and instrumental variables.

Difference-in-differences (DiD) compares the change in outcomes over time for a group that was affected by a policy or event against the change for a group that was not. The key assumption, known as parallel trends, is that both groups would have followed the same trend in the absence of the intervention. The DiD estimator is:

$$\hat{\delta} = (\bar{y}_{\text{treated, after}} - \bar{y}_{\text{treated, before}}) - (\bar{y}_{\text{control, after}} - \bar{y}_{\text{control, before}})$$

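This removes any fixed differences in levels between the two groups, as well as any time trends that affect both groups equally. If the parallel trends assumption holds, what remains is an estimate of the causal effect of the intervention on the treated group.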
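The estimator is simple enough to compute directly from four group means. The sketch below does this for made-up outcome data; the group labels and values are hypothetical.

```python
from statistics import mean

# Hypothetical outcomes (for example, employment) by group and period.
outcomes = {
    ("treated", "before"): [20.0, 22.0, 21.0],
    ("treated", "after"): [25.0, 27.0, 26.0],
    ("control", "before"): [18.0, 19.0, 20.0],
    ("control", "after"): [20.0, 21.0, 22.0],
}
means = {cell: mean(values) for cell, values in outcomes.items()}

# Change over time within each group.
change_treated = means[("treated", "after")] - means[("treated", "before")]
change_control = means[("control", "after")] - means[("control", "before")]

# Difference-in-differences: the treated change net of the control change.
did = change_treated - change_control

print(change_treated, change_control, did)  # 5.0 2.0 3.0
```

Here the treated group's outcome rose by 5 and the control group's by 2, so the estimate attributes 3 to the intervention. The treated group started from a higher level, but that level difference cancels out and does not affect the estimate.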

Instrumental variables (IV) is used when an independent variable is endogenous. The idea is to find an instrument, a variable that is correlated with the endogenous variable (relevance) but uncorrelated with the error term, so that it affects the dependent variable only through the endogenous variable (exogeneity). The two-stage least squares (2SLS) estimator first regresses the endogenous variable on the instrument and any other exogenous regressors to get a predicted value, then uses that predicted value in place of the endogenous variable in the main regression. This keeps only the variation in the endogenous variable that is driven by the instrument and discards the part that is correlated with the error term.
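The two stages can be sketched on simulated data where the true effect is known. In the example below, all names and numbers are assumptions for illustration: `u` is an unobserved confounder, `z` is a valid instrument by construction, and the true effect of `x` on `y` is set to 2. Plain OLS is pulled away from 2 by the confounder, while the two-stage estimate should land close to it.

```python
import random
from statistics import fmean

def ols(x, y):
    """Return (intercept, slope) from regressing y on a single regressor x."""
    x_bar, y_bar = fmean(x), fmean(y)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

rng = random.Random(0)
n = 20000
z = [rng.gauss(0, 1) for _ in range(n)]  # instrument
u = [rng.gauss(0, 1) for _ in range(n)]  # unobserved confounder
# Endogenous regressor: driven by the instrument and by the confounder.
x = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
# Outcome: true effect of x is 2; u ends up in the error term.
y = [2 * xi + 3 * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]

# Naive OLS of y on x is biased because x is correlated with u.
_, beta_ols = ols(x, y)

# Stage 1: regress x on z and form the fitted values.
a0, a1 = ols(z, x)
x_hat = [a0 + a1 * zi for zi in z]

# Stage 2: regress y on the fitted values from stage 1.
_, beta_2sls = ols(x_hat, y)

print(f"OLS: {beta_ols:.2f}  2SLS: {beta_2sls:.2f}")
```

Running the second stage by hand like this gives the right point estimate but not the right standard errors, because the second-stage residuals are computed from the fitted values rather than from the original regressor. Dedicated IV routines correct for this.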

Machine learning applications

Machine learning methods are increasingly used alongside traditional econometrics. Methods like LASSO regression are useful for variable selection when there are many potential regressors. LASSO adds a penalty on the absolute size of the coefficients to the sum of squared residuals. The penalty shrinks all coefficients toward zero and sets some of them exactly to zero, effectively removing those variables from the model:

$$\min_{\beta} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_k x_{ik})^2 + \lambda \sum_{j=1}^{k} |\beta_j|$$

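where $\lambda \geq 0$ controls the strength of the penalty. With $\lambda = 0$ the problem reduces to OLS, and a higher $\lambda$ means more coefficients are set exactly to zero. In practice $\lambda$ is usually chosen by cross-validation.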
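One common way to solve this problem is coordinate descent with soft-thresholding, sketched below. This is a bare-bones illustration rather than a production solver: it assumes the outcome and every regressor have been centred so the intercept can be dropped, it runs a fixed number of passes instead of checking convergence, and the function names and data are made up. Because the objective above uses the raw sum of squares, the threshold works out to $\lambda / 2$.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso(X, y, lam, n_iter=100):
    """Minimise sum of squared residuals + lam * sum(|beta_j|).

    Uses coordinate descent. Assumes y and every column of X are centred,
    so no intercept is estimated.
    """
    n, k = len(X), len(X[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        for j in range(k):
            # Inner product of column j with the residual that leaves out
            # column j's own contribution.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][m] * beta[m] for m in range(k) if m != j))
                for i in range(n)
            )
            norm = sum(X[i][j] ** 2 for i in range(n))
            beta[j] = soft_threshold(rho, lam / 2) / norm
    return beta

# Hypothetical centred data with orthogonal columns: y = 3*x1 + 0.2*x2 + 0*x3.
X = [(1, 1, 1), (-1, 1, -1), (1, -1, -1), (-1, -1, 1)]
y = [3.2, -2.8, 2.8, -3.2]

beta_unpenalised = lasso(X, y, lam=0.0)  # no penalty: matches OLS
beta_lasso = lasso(X, y, lam=2.0)        # penalty removes the small coefficient

print([round(b, 4) for b in beta_unpenalised])
print([round(b, 4) for b in beta_lasso])
```

With the penalty switched on, the large coefficient is shrunk from 3 to 2.75 and the small one is set exactly to zero, which shows both effects of the penalty: shrinkage and selection. The shrinkage of retained coefficients is a reason estimates from LASSO should not be read as unbiased effect sizes.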

More recently, methods like causal forests have been developed to estimate heterogeneous treatment effects, meaning how the effect of an intervention varies across different groups. These methods combine the predictive power of machine learning with the causal reasoning of econometrics.