Data Analytics Foundations

Data analytics foundations begin with a clear understanding of the core terminology that underpins every analysis, model, and decision. This glossary‑style explanation is organized by thematic groups, each presenting a term, a concise definition, an illustrative example, a practical application, and common challenges that learners may encounter. The aim is to create a reference that can be consulted while studying the Professional Certificate in Data Analytics for Performance Evaluation, and to support the development of a robust analytical vocabulary.

Data and Data Types

A data point is a single piece of factual information that can be recorded, stored, and processed. When many data points are collected together they form a dataset, which can be thought of as a table where rows represent individual observations and columns represent variables. An observation might be a single customer transaction, while a variable could be the transaction amount, the date, or the customer’s age.

Variables are classified into two broad categories. A categorical variable (also called a qualitative variable) takes on values that represent groups or categories, such as “region” (North, South, East, West) or “product type” (A, B, C). A numerical variable (quantitative variable) holds numbers that can be measured or counted, such as “sales revenue” or “units sold”. Numerical variables are further divided into discrete (countable values like number of visits) and continuous (measurable values like time spent on a website).

Population, Sample, and Sampling Methods

The population is the entire set of items or individuals that an analyst wishes to study. In practice, accessing the whole population is often impossible, so a sample is drawn. A sample should be representative so that conclusions can be generalized. Common sampling techniques include simple random sampling, where each member has an equal chance of selection; stratified sampling, which divides the population into homogeneous sub‑groups (strata) and samples each stratum proportionally; and cluster sampling, where clusters of units (such as stores or schools) are randomly selected and all members within chosen clusters are surveyed.

Descriptive Statistics – Summarising Data

Descriptive statistics provide a snapshot of the data’s central tendency, spread, and shape. The mean (average) is calculated by summing all values of a variable and dividing by the number of observations. The median is the middle value when observations are ordered, and the mode is the most frequently occurring value.

The standard deviation measures the typical distance of observations from the mean (formally, the square root of the average squared deviation), indicating variability. Its square, the variance, is useful in statistical modelling. A low standard deviation suggests that data points are clustered close to the mean, whereas a high standard deviation indicates wide dispersion.
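
The measures of central tendency and spread described above can be computed with Python's standard `statistics` module. The following is a minimal sketch; the `units_sold` figures are invented for illustration, and the population forms of variance and standard deviation (dividing by n) are assumed.

```python
import statistics

# Hypothetical sample: daily units sold over nine days.
units_sold = [12, 15, 15, 18, 20, 22, 15, 30, 24]

mean = statistics.mean(units_sold)            # sum of values / number of values
median = statistics.median(units_sold)        # middle value of the sorted data
mode = statistics.mode(units_sold)            # most frequently occurring value
variance = statistics.pvariance(units_sold)   # population variance (divides by n)
std_dev = statistics.pstdev(units_sold)       # square root of the variance

print(mean, median, mode, round(variance, 2), round(std_dev, 2))
```

When the data is a sample drawn from a larger population, `statistics.variance` and `statistics.stdev` (which divide by n − 1) are usually the more appropriate choice.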

The shape of a distribution is described by skewness (asymmetry) and kurtosis (tailedness). A distribution with a long right tail is said to be positively skewed. The classic normal distribution (bell curve) is symmetric, with skewness of zero and kurtosis of three.

Outliers are observations that lie far from the rest of the data. They may signal data entry errors, measurement anomalies, or genuine extreme cases. Detecting outliers typically involves visual tools such as boxplots or statistical rules (e.g., flagging values more than 1.5 × IQR below the first quartile or above the third quartile, where IQR is the interquartile range).
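
As a rough sketch of the 1.5 × IQR rule, the function below flags values that fall outside the quartile-based fences. The function name, the `order_values` data, and the choice of the "inclusive" quartile method are assumptions made for illustration; other quartile conventions give slightly different fences.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical order values with one suspicious entry.
order_values = [52, 48, 55, 50, 47, 53, 49, 51, 250]
print(iqr_outliers(order_values))
```

A flagged value is only a candidate for review; whether it is an error or a genuine extreme case still needs a judgement based on the data source.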

Relationships Between Variables

The strength and direction of a linear relationship between two numerical variables are quantified by the correlation coefficient (Pearson’s r). A value close to +1 indicates a strong positive relationship; close to –1 indicates a strong negative relationship; and near 0 suggests little or no linear association.

Correlation does not imply causation. Causal inference requires additional evidence, such as experimental design, temporal precedence, or domain knowledge. For example, a rise in ice‑cream sales may be correlated with higher rates of sunburn, but buying ice‑cream does not cause sunburn; both are driven by a third factor, sunny weather.

Regression and Predictive Modelling

Regression analysis estimates the functional relationship between a dependent variable (target) and one or more independent variables (predictors). In simple linear regression, a single predictor is used, and the model takes the form y = β₀ + β₁x + ε. The coefficients β₀ (intercept) and β₁ (slope) are estimated from data, and ε represents random error.
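
To make the estimation of β₀ and β₁ concrete, here is a minimal sketch of ordinary least squares for a single predictor. The function name and the advertising and sales figures are hypothetical; least squares is assumed as the estimation method because the text does not name one.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) minimising the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel out.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: advertising spend (x) and sales (y).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]
beta0, beta1 = fit_simple_linear_regression(ad_spend, sales)
print(round(beta0, 2), round(beta1, 2))
```

The slope can then be read as the estimated change in the target for a one-unit increase in the predictor.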

When the target variable is categorical (e.g., churn yes/no), logistic regression is employed. It models the log‑odds of the event as a linear function of the predictors and converts them into a probability between 0 and 1, which can be thresholded to produce class predictions.
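
The link between log‑odds, probability, and a class prediction can be sketched as follows. The coefficients are hypothetical placeholders rather than fitted values, and the 0.5 threshold is an assumption that would normally be tuned to the cost of each error type.

```python
import math

def predict_probability(x, intercept, slope):
    """Convert the linear log-odds into a probability via the logistic function."""
    log_odds = intercept + slope * x
    return 1.0 / (1.0 + math.exp(-log_odds))

def predict_class(x, intercept, slope, threshold=0.5):
    """Return 1 (event) if the probability reaches the threshold, else 0."""
    return int(predict_probability(x, intercept, slope) >= threshold)

# Hypothetical churn model: x = number of support complaints.
INTERCEPT, SLOPE = -2.0, 0.8
for complaints in (0, 2, 5):
    p = predict_probability(complaints, INTERCEPT, SLOPE)
    print(complaints, round(p, 3), predict_class(complaints, INTERCEPT, SLOPE))
```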

Regression is a foundation for many advanced techniques, including multiple linear regression, where several predictors are included, and regularisation methods that penalise excessive complexity (e.g., ridge regression, lasso).

Supervised vs Unsupervised Learning

In supervised learning, the algorithm is trained on a labeled dataset, meaning each observation includes both features and a known target. Classification (predicting categories) and regression (predicting continuous values) are typical supervised tasks.

In contrast, unsupervised learning works with unlabeled data, aiming to discover hidden structure. Common unsupervised methods include clustering (e.g., k‑means, hierarchical clustering), which groups similar observations, and dimensionality reduction (e.g., Principal Component Analysis), which compresses the feature space while retaining most variance.

Feature Engineering and Data Preparation

A feature (or variable) is an input used by a model to make predictions. Feature engineering involves transforming raw data into informative attributes. Examples include extracting the day of week from a timestamp, creating a “total spend” variable by summing several monetary columns, or encoding categorical variables using one‑hot encoding.
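
The three examples above can be sketched on a single record. The field names (`timestamp`, `grocery_spend`, `fuel_spend`, `region`) and the fixed list of regions are hypothetical, chosen only to mirror the examples in the text.

```python
from datetime import datetime

# Hypothetical raw record and the known set of region categories.
record = {
    "timestamp": "2024-03-15T14:30:00",
    "grocery_spend": 42.5,
    "fuel_spend": 30.0,
    "region": "North",
}
REGIONS = ["North", "South", "East", "West"]

def engineer_features(raw):
    """Derive day of week, total spend, and one-hot region flags."""
    ts = datetime.fromisoformat(raw["timestamp"])
    features = {
        "day_of_week": ts.weekday(),  # Monday = 0 ... Sunday = 6
        "total_spend": raw["grocery_spend"] + raw["fuel_spend"],
    }
    # One-hot encoding: one 0/1 column per category.
    for region in REGIONS:
        features[f"region_{region}"] = int(raw["region"] == region)
    return features

features = engineer_features(record)
print(features)
```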

Effective feature engineering often improves model performance more than algorithm choice. However, it requires domain insight and careful handling to avoid leakage, where information that would not be available at prediction time (such as data from the test set, or a feature derived from the target) inadvertently influences the training process.

Data is seldom clean when first acquired. Data cleaning (or data cleansing) addresses missing values, duplicate records, inconsistent formats, and erroneous entries. Techniques include imputation (replacing missing values with mean, median, or model‑based estimates), deletion of rows with excessive missingness, and standardising units (e.g., converting all lengths to metres).
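
As one illustration of imputation, the sketch below fills missing entries with the median of the observed values. Representing a missing value as `None` and the function name `impute_median` are assumptions for this example.

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical customer ages with two missing entries.
ages = [10, None, 30, 20, None]
print(impute_median(ages))
```

To avoid leakage, the fill value would normally be computed on the training data only and then reused for validation and test data.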

Data Wrangling, ETL, and Pipelines

The process of moving data from source to analysis is often called ETL – Extract, Transform, Load. Extraction pulls data from operational systems (databases, APIs, flat files). Transformation applies cleaning, enrichment, and aggregation. Loading inserts the transformed data into a destination, such as a data warehouse or data lake, ready for analysis.

A data pipeline automates ETL steps, typically using scheduling tools (e.g., Airflow) or cloud services (e.g., AWS Glue). Pipelines ensure repeatability, version control, and error handling, which are crucial for production‑grade analytics.

Data Storage Concepts – Databases, Warehouses, and Lakes

Relational databases store data in tables with defined schemas, using Structured Query Language (SQL) for manipulation. Primary keys uniquely identify rows, while foreign keys establish relationships between tables, enabling joins (inner, left, right, full) to combine data across tables.
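
Primary keys, foreign keys, and joins can be tried out with an in‑memory SQLite database from Python's standard library. The `customers` and `orders` tables and their contents are hypothetical; the sketch contrasts an inner join with a left join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,   -- primary key: unique per row
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),  -- foreign key
        amount      REAL
    );
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
    INSERT INTO orders VALUES (101, 1, 25.0), (102, 1, 40.0), (103, 2, 15.0);
""")

# Inner join: only customers that have at least one matching order.
inner_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.customer_id
    ORDER BY o.order_id
""").fetchall()

# Left join: every customer, with an order count of 0 where nothing matches.
left_rows = conn.execute("""
    SELECT c.name, COUNT(o.order_id)
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id
    ORDER BY c.customer_id
""").fetchall()

print(inner_rows)
print(left_rows)
conn.close()
```

The customer without orders disappears from the inner join but is kept by the left join, which is the practical difference between the two.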

A data warehouse is a specialised relational system optimised for analytical queries. It often employs a star schema, which denormalises dimension data for faster reporting, or a snowflake schema, which normalises the dimensions into further sub‑tables. In contrast, a data lake stores raw data, whether structured, semi‑structured, or unstructured (e.g., log files, sensor streams), in its native format, typically on distributed file systems like HDFS or cloud object storage.

NoSQL databases (e.g., MongoDB, Cassandra) provide flexible schemas for document‑oriented or key‑value storage, supporting high‑velocity data ingestion and horizontal scaling. Understanding the trade‑offs between consistency, availability, and partition tolerance (the CAP theorem) guides technology selection.

Big Data Technologies

When data volume, velocity, or variety exceed the capacity of traditional tools, big data platforms are employed. Hadoop implements the MapReduce paradigm, splitting tasks into map (parallel processing) and reduce (aggregation) phases across a cluster.
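
The map and reduce phases can be illustrated with a word count run in a single Python process. This is only a conceptual sketch: it does not use the Hadoop API, the function names are invented, and the intermediate "shuffle" step that groups values by key is handled by the framework on a real cluster.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical input split into two documents; on a cluster each
# document could be mapped on a different node in parallel.
documents = ["sales up sales", "up again"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```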

Spark extends MapReduce with in‑memory computation, offering faster iterative algorithms for machine learning and graph processing. Spark’s DataFrame API provides a familiar, SQL‑like interface for large‑scale data manipulation.

Cloud Computing Layers – IaaS, PaaS, SaaS

Infrastructure as a Service (IaaS) supplies virtualised compute, storage, and networking resources (e.g., AWS EC2). Platform as a Service (PaaS) adds managed runtime environments, databases, and development tools (e.g., Azure Synapse). Software as a Service (SaaS) delivers complete applications over the web (e.g., Tableau for visual analytics).

Choosing the appropriate layer depends on control requirements, scalability, and organisational expertise.

Data Visualization and Reporting

Effective communication of analytical insights relies on visual representation. A dashboard aggregates key metrics, often using gauges, bar charts, line graphs, and heatmaps. Selecting appropriate chart types enhances comprehension; for instance, a line chart is ideal for illustrating trends over time, while a bar chart compares categorical performance.

Key Performance Indicators (KPIs) are quantifiable measures aligned with strategic objectives (e.g., conversion rate, average order value). Defining clear KPIs ensures that visualisations remain focused on business impact.

Time Series Analysis and Forecasting

When data is collected sequentially over time, it is termed a time series. Time‑series components include trend (long‑term direction), seasonality (regular, periodic fluctuations), and irregular noise.

Forecasting models such as ARIMA (AutoRegressive Integrated Moving Average) combine autoregressive and moving‑average terms to capture autocorrelation, and apply differencing to make the series stationary. Exponential smoothing methods (e.g., Holt‑Winters) weight recent observations more heavily, which is useful for rapidly changing demand.
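
The idea of weighting recent observations more heavily is easiest to see in simple exponential smoothing, sketched below. Note that this is the basic single‑parameter form, not Holt‑Winters, which adds trend and seasonal components; the function name, the smoothing factor `alpha`, and the demand figures are illustrative assumptions.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed series; a higher alpha reacts faster to recent values."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]  # initialise with the first observation
    for value in series[1:]:
        # New level = alpha * latest observation + (1 - alpha) * previous level.
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical weekly demand.
weekly_demand = [10, 20, 30]
print(simple_exponential_smoothing(weekly_demand, alpha=0.5))
```

The last smoothed value serves as the one‑step‑ahead forecast in this simple form.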

Practical application: A retailer may forecast weekly sales to optimise inventory levels, reducing stock‑outs and excess holding costs. Challenges include handling missing timestamps, detecting structural breaks, and selecting appropriate lag orders.

Hypothesis Testing and Inferential Statistics

Statistical inference allows analysts to draw conclusions about a population based on sample data. A null hypothesis (H₀) typically states that there is no effect or difference, while the alternative hypothesis (H₁) proposes the opposite.

The p‑value is the probability of observing data at least as extreme as the sample, assuming H₀ is true. If the p‑value falls below a pre‑specified significance level (commonly 0.05), H₀ is rejected and the result is described as statistically significant.
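
One way to see what "at least as extreme, assuming H₀ is true" means is an exact permutation test, sketched below. This is a different technique from the parametric tests usually taught first, and is used here only because it computes the p‑value directly from its definition; the function name and the two small groups are hypothetical.

```python
from itertools import combinations

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test for a difference in means."""
    pooled = group_a + group_b
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    extreme = total = 0
    # Under H0 the group labels are arbitrary, so every way of splitting
    # the pooled data into groups of the same sizes is equally likely.
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = abs(sum(a) / len(a) - sum(b) / len(b))
        extreme += diff >= observed - 1e-12  # tolerance for float rounding
        total += 1
    return extreme / total

# Hypothetical task-completion times under two processes.
process_a = [1, 2, 3, 4]
process_b = [11, 12, 13, 14]
print(permutation_p_value(process_a, process_b))
```

Enumerating every split is only feasible for small samples; with larger samples a random subset of permutations is typically drawn instead.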

Confidence intervals provide a range of plausible values for a population parameter, expressed with a confidence level (e.g., 95 %).
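
A confidence interval for a mean can be sketched with the standard library as follows. The sketch assumes a normal approximation (a z‑interval); for small samples a t‑based interval would be somewhat wider. The function name and the `waiting_times` data are invented for illustration.

```python
import math
import statistics
from statistics import NormalDist

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    # Two-sided critical value, e.g. about 1.96 for 95 % confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err

# Hypothetical waiting times in minutes.
waiting_times = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9]
low, high = mean_confidence_interval(waiting_times)
print(round(low, 2), round(high, 2))
```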

Common tests include the t‑test for comparing means, the chi‑square test for independence of categorical variables, and ANOVA for comparing means across multiple groups.

Errors and Power

A type I error occurs when a true H₀ is incorrectly rejected (false positive). A type II error occurs when H₀ is not rejected despite being false (false negative). The test’s power is the probability of correctly rejecting a false H₀; it increases with larger sample sizes, larger effect sizes, and a larger significance level α (at the cost of more type I errors).

Designing experiments with adequate power requires careful calculation of required sample size prior to data collection.

Data Ethics, Privacy, and Governance

Data analytics must respect legal and ethical standards. In the United Kingdom, the UK General Data Protection Regulation (UK GDPR), together with the Data Protection Act 2018, governs personal data processing, mandating a lawful basis, purpose limitation, data minimisation, and individuals’ rights (e.g., access, erasure).

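Anonymisation and related techniques (e.g., pseudonymisation, aggregation) reduce re‑identification risk, but must be applied carefully to preserve analytical utility. Pseudonymised data can still be linked back to individuals with additional information, so it continues to be treated as personal data under GDPR.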

Data governance frameworks define roles, responsibilities, policies, and procedures for data management. Core pillars include data quality, security, metadata management, and lifecycle stewardship.

Data Quality Dimensions

High‑quality data exhibits attributes such as accuracy (correctness of values), completeness (absence of missing data), consistency (uniform representation across sources), timeliness (availability when needed), and relevance (alignment with analytical purpose).

Assessing data quality often involves profiling tools that generate summary statistics, frequency counts, and anomaly detection reports.

Metadata and Data Lineage

Metadata is data about data, such as column definitions, data types, creation timestamps, and source system identifiers. Maintaining comprehensive metadata supports discoverability, impact analysis, and regulatory compliance.

Data lineage tracks the flow of data from origin through transformations to final consumption, enabling traceability and debugging of analytical pipelines.

Data Modeling – Schemas and Normalisation

An entity‑relationship diagram (ERD) visualises the logical structure of a relational database, illustrating entities (tables), attributes (columns), and relationships (cardinality).

Normalisation decomposes tables to eliminate redundancy and update anomalies, typically up to the third normal form (3NF). Conversely, denormalisation intentionally introduces redundancy to improve query performance in analytical workloads.

OLAP vs OLTP

Online Transaction Processing (OLTP) systems support high‑volume, short‑duration transactions (e.g., order entry). Online Analytical Processing (OLAP) systems enable complex queries and aggregations for reporting and decision support, often using multidimensional cubes.

Data Mining Techniques

Association rule mining discovers relationships between items in transactional data. The classic example is market‑basket analysis, where the Apriori algorithm identifies frequent itemsets and generates rules such as “customers who buy bread also buy butter”.

Evaluation metrics for association rules include support (frequency of the rule), confidence (conditional probability), and lift (strength relative to independence).
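
The three rule metrics can be worked through on a handful of baskets. The transactions below are invented, and the sketch evaluates one given rule directly; it does not implement the Apriori search for frequent itemsets.

```python
# Hypothetical shopping baskets.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"bread", "butter", "jam"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to the consequent's baseline frequency."""
    return confidence(antecedent, consequent) / support(consequent)

rule = ({"bread"}, {"butter"})
print(round(support(rule[0] | rule[1]), 3),
      round(confidence(*rule), 3),
      round(lift(*rule), 3))
```

A lift above 1 suggests the two items appear together more often than they would if purchases were independent; a lift near 1 suggests no association.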

Dimensionality Reduction and Feature Selection

High‑dimensional data can hinder model performance and interpretability. Principal Component Analysis (PCA) transforms correlated variables into orthogonal components that capture maximal variance.

t‑SNE (t‑Distributed Stochastic Neighbor Embedding) is a non‑linear technique for visualising high‑dimensional data in two or three dimensions, preserving local structure.

Feature selection methods (filter, wrapper, embedded) aim to retain only the most predictive variables, reducing overfitting and computational cost.

Regularisation and Penalisation

Regularisation adds a penalty term to the loss function to discourage overly complex models. L1 regularisation (lasso) promotes sparsity by forcing some coefficients to zero, effectively performing variable selection. L2 regularisation (ridge) shrinks coefficients towards zero but retains all predictors, improving stability.

Ensemble Learning – Bagging and Boosting

Ensemble methods combine multiple base learners to improve predictive accuracy. Bagging (Bootstrap Aggregating) builds independent models on bootstrapped samples and averages their predictions; Random Forests are a popular bagging technique for decision trees.

Boosting sequentially trains models, each focusing on errors of its predecessor. Gradient Boosting, XGBoost, and LightGBM are state‑of‑the‑art boosting algorithms that often win data‑science competitions.

Neural Networks and Deep Learning

A neural network consists of layers of interconnected nodes (neurons) that transform inputs through weighted sums and non‑linear activation functions (e.g., ReLU, sigmoid).

Training involves backpropagation, where gradients of the loss function with respect to weights are computed and used to update parameters via optimisation algorithms such as gradient descent.
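
The parameter update at the heart of training can be sketched on the smallest possible case: a single linear neuron with one weight and one bias, trained by gradient descent on mean squared error. This is not a full backpropagation implementation (there are no hidden layers, so the gradients are written out by hand); the function name, learning rate, and data are assumptions for illustration.

```python
def train_single_neuron(xs, ys, learning_rate=0.05, epochs=1000):
    """Fit y ~ w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every observation with the current parameters.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(2 * e for e in errors) / n
        # Step against the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical data generated from y = 2x + 1.
inputs = [0.0, 1.0, 2.0, 3.0]
targets = [1.0, 3.0, 5.0, 7.0]
weight, bias = train_single_neuron(inputs, targets)
print(round(weight, 3), round(bias, 3))
```

In a multi‑layer network, backpropagation applies the chain rule to obtain the same kind of gradient for every weight, after which the update step is identical.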

Key hyper‑parameters include learning rate (step size), number of hidden layers, number of neurons per layer, and regularisation techniques (dropout, weight decay).

Deep learning excels in image, text, and speech tasks, but requires large labelled datasets and substantial computational resources (GPUs or TPUs).

Model Evaluation – Metrics and Validation

Assessing model performance depends on the problem type. For classification, common metrics are accuracy (overall correct predictions), precision (positive predictive value), recall (sensitivity), and the F1‑score (harmonic mean of precision and recall).

The confusion matrix summarises true positives, false positives, true negatives, and false negatives, enabling calculation of the above metrics.
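
The relationship between the confusion matrix and the classification metrics can be sketched for a binary problem as follows. The labels are hypothetical, with 1 denoting the positive class, and the sketch assumes at least one predicted and one actual positive so that no division by zero occurs.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, fp, tn, fn

def classification_metrics(y_true, y_pred):
    """Derive accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical actual and predicted labels for ten observations.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(confusion_counts(actual, predicted))
print(classification_metrics(actual, predicted))
```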

The Receiver Operating Characteristic (ROC) curve plots true‑positive rate against false‑positive rate across thresholds; the area under the curve (AUC) quantifies discriminative ability.

For regression, evaluation metrics include Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R‑squared (proportion of variance explained).
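
The three regression metrics can be computed directly from their definitions. The sketch below uses invented actual and predicted values, and the function name is hypothetical.

```python
import math

def regression_metrics(actual, predicted):
    """Return MAE, RMSE, and R-squared for paired actual/predicted values."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(r) for r in residuals) / n
    rmse = math.sqrt(sum(r ** 2 for r in residuals) / n)
    mean_actual = sum(actual) / n
    ss_res = sum(r ** 2 for r in residuals)                # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total variation
    r_squared = 1 - ss_res / ss_tot
    return mae, rmse, r_squared

# Hypothetical observed values and model predictions.
observed = [3, 5, 7, 9]
forecast = [2, 5, 8, 10]
mae, rmse, r_squared = regression_metrics(observed, forecast)
print(round(mae, 3), round(rmse, 3), round(r_squared, 3))
```

Because RMSE squares the residuals before averaging, it penalises large errors more heavily than MAE does.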

Cross‑validation techniques (k‑fold, stratified, time‑series split) provide robust estimates of model generalisation by rotating training and validation sets.
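
The rotation of training and validation sets in k‑fold cross‑validation can be sketched as an index generator. The function name is hypothetical, and the folds are taken in order without shuffling; in practice rows are usually shuffled first, except for time series, where order must be preserved.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

for train_idx, val_idx in k_fold_indices(n_samples=10, k=3):
    print("train:", train_idx, "validation:", val_idx)
```

Each observation is used for validation exactly once, and the k validation scores are then averaged to estimate generalisation performance.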

Overfitting, Underfitting, and the Bias‑Variance Trade‑off

Overfitting occurs when a model captures noise in the training data, leading to poor performance on unseen data. Indicators include a large gap between training and validation accuracy.

Underfitting arises when a model is too simplistic to capture underlying patterns, reflected in low training accuracy.

The bias‑variance trade‑off describes the balance between error due to erroneous assumptions (bias) and error due to sensitivity to fluctuations in training data (variance). Regularisation, proper model complexity, and ample data help navigate this trade‑off.

Hyperparameter Tuning and Optimisation

Hyperparameters are configuration settings chosen before training rather than learned from the data (e.g., number of trees, tree depth, learning rate). Grid search exhaustively evaluates combinations across predefined ranges, while random search samples combinations at random, often finding good solutions more efficiently.
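
The difference between grid search and random search can be sketched with a stand‑in scoring function. Everything here is hypothetical: `validation_score` is a made‑up formula standing in for "train a model and measure validation performance", and the parameter names and ranges are chosen only for illustration.

```python
import itertools
import random

def validation_score(n_trees, max_depth):
    """Hypothetical stand-in for a real validation score (peaks at 200 trees, depth 6)."""
    return -((n_trees - 200) ** 2) / 1e4 - (max_depth - 6) ** 2

def grid_search(grid):
    """Evaluate every combination of the listed values and return the best."""
    names = list(grid)
    candidates = [dict(zip(names, combo)) for combo in itertools.product(*grid.values())]
    return max(candidates, key=lambda params: validation_score(**params))

def random_search(n_iter, seed=0):
    """Evaluate n_iter randomly drawn combinations and return the best."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    candidates = [
        {"n_trees": rng.randint(50, 400), "max_depth": rng.randint(2, 10)}
        for _ in range(n_iter)
    ]
    return max(candidates, key=lambda params: validation_score(**params))

param_grid = {"n_trees": [100, 200, 300], "max_depth": [4, 6, 8]}
print(grid_search(param_grid))
print(random_search(n_iter=9))
```

Both searches here evaluate nine candidates, but the grid only ever tries three distinct values per hyperparameter, whereas random search can try up to nine, which is one reason it is often more efficient when only a few hyperparameters matter.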

Advanced optimisation techniques such as Bayesian optimisation model the performance surface and propose promising hyperparameter sets iteratively.

Model Deployment and Monitoring

After validation, a model is deployed to production, where it must be integrated with existing systems (e.g., via REST APIs). Continuous monitoring tracks performance drift, data drift, and resource utilisation.

Model retraining can follow a fixed schedule (e.g., weekly, monthly) or be triggered when drift exceeds thresholds, ensuring the model remains aligned with evolving business conditions.

Challenges in Data Analytics Foundations

Data Silos and Integration

Organizations frequently store data in disparate systems, creating silos that hinder comprehensive analysis. Integrating data requires mapping disparate schemas, reconciling differing data types, and handling inconsistent identifiers. Data integration tools and master data management (MDM) strategies mitigate these challenges.

Data Quality Issues

Incomplete, inaccurate, or inconsistent data can bias analyses and lead to erroneous decisions. Systematic data profiling, automated validation rules, and feedback loops with data owners are essential for maintaining high data quality.

Scalability and Performance

Large datasets strain traditional analytical tools, resulting in long query times and resource bottlenecks. Leveraging distributed processing frameworks, indexing strategies, and materialised views can improve performance, but require careful planning and cost‑benefit analysis.

Interpretability vs Accuracy

Complex models (e.g., deep neural networks) often achieve higher predictive accuracy but are less interpretable. In performance evaluation contexts, stakeholders may demand transparent explanations for decisions. Techniques such as SHAP values, LIME, or rule‑based surrogate models provide local interpretability while preserving overall performance.

Ethical Considerations and Bias

Analytical models can unintentionally perpetuate or amplify biases present in historical data (e.g., gender bias in hiring predictions). Conducting bias audits, using fairness‑aware algorithms, and involving diverse stakeholders in model development help address ethical concerns.

Regulatory Compliance

Compliance with GDPR and sector‑specific regulations (e.g., financial reporting standards) imposes constraints on data collection, storage, and processing. Data protection impact assessments (DPIAs) and privacy‑by‑design principles should be embedded early in the analytics lifecycle.

Change Management and Stakeholder Engagement

Introducing data‑driven performance evaluation often requires cultural shifts. Effective communication of benefits, training programs, and involving end‑users in requirement gathering increase adoption and reduce resistance.

Summary of Core Vocabulary

The following list consolidates the essential terms introduced, serving as a quick‑reference cheat sheet.

- Data, Dataset, Observation, Variable
- Population, Sample, Simple random sampling, Stratified sampling, Cluster sampling
- Mean, Median, Mode, Standard deviation, Variance
- Skewness, Kurtosis, Normal distribution, Outlier
- Correlation coefficient, Causation
- Linear regression, Logistic regression, Multiple regression, Regularisation
- Supervised learning, Unsupervised learning, Classification, Clustering
- Feature, Feature engineering, Label, Target variable
- Training set, Test set, Validation set, Cross‑validation
- Overfitting, Underfitting, Bias‑variance trade‑off
- Data cleaning, Data wrangling, ETL, Data pipeline
- Relational database, SQL, Primary key, Foreign key, Join
- Data warehouse, Data lake, NoSQL
- Hadoop, Spark, MapReduce
- IaaS, PaaS, SaaS
- Dashboard, KPI, Metric
- Time series, Trend, Seasonality, ARIMA
- Null hypothesis, Alternative hypothesis, p‑value, Confidence interval
- Type I error, Type II error, Power, Sample size
- GDPR, Anonymisation, Data governance, Data quality
- Metadata, Data lineage
- ERD, Normalisation, Denormalisation
- OLTP, OLAP
- Association rule mining, Apriori algorithm
- PCA, t‑SNE, Feature selection
- L1 regularisation, L2 regularisation
- Bagging, Random Forest, Boosting, Gradient Boosting, XGBoost
- Neural network, Backpropagation, Activation function
- Accuracy, Precision, Recall, F1‑score, Confusion matrix, ROC curve, AUC
- MAE, RMSE, R‑squared
- Grid search, Random search, Bayesian optimisation
- Model deployment, Model monitoring, Model retraining

By mastering these terms, learners will be equipped to navigate the analytical workflow from data acquisition through model deployment, and to communicate findings effectively to stakeholders involved in performance evaluation. The depth of understanding required for the Professional Certificate in Data Analytics for Performance Evaluation rests on this shared vocabulary, which enables precise discussion of methods, challenges, and solutions across diverse business contexts.

Key takeaways

  • This glossary‑style explanation is organized by thematic groups, each presenting a term, a concise definition, an illustrative example, a practical application, and common challenges that learners may encounter.
  • When many data points are collected together they form a dataset, which can be thought of as a table where rows represent individual observations and columns represent variables.
  • A categorical variable (also called a qualitative variable) takes on values that represent groups or categories, such as “region” (North, South, East, West) or “product type” (A, B, C).
  • The population is the entire set of items or individuals that an analyst wishes to study.
  • Descriptive statistics provide a snapshot of the data’s central tendency, spread, and shape.
  • A low standard deviation suggests that data points are clustered close to the mean, whereas a high standard deviation indicates wide dispersion.
  • The classic normal distribution (bell curve) is symmetric, with skewness of zero and kurtosis of three.