Background to this EPC model and how to use it

EPCs and the home in Scotland

Energy performance certificates (EPCs) are often required when a property is to be sold. They report the property's energy efficiency rating (EER) and its band (A-G). EPCs are produced by specialist surveyors using the Standard Assessment Procedure (SAP) or its reduced-data variant (RdSAP), working with approved software that implements the underlying algorithms. The assessment covers the physical characteristics of a home as well as other aspects such as the type of fuel used, energy pricing and any renewable energy generation at the property, for example solar or wind. SAP is a complex methodology that relies on empirical data and approximations to derive ratings intended to give home buyers and renters a notional measure of a home's energy and environmental performance.
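As an illustration of how a numeric rating relates to its band, here is a minimal sketch using the standard published EPC band boundaries (A from 92, B from 81, C from 69, D from 55, E from 39, F from 21, G below that). The function and constant names are made up for this example and are not taken from the tool itself.

```python
# Lowest rating that qualifies for each band, highest band first.
BAND_FLOORS = [(92, "A"), (81, "B"), (69, "C"), (55, "D"), (39, "E"), (21, "F"), (1, "G")]


def eer_to_band(score: float) -> str:
    """Return the A-G band for an energy efficiency rating."""
    rating = round(score)  # ratings are reported as whole numbers
    for floor, band in BAND_FLOORS:
        if rating >= floor:
            return band
    return "G"  # anything below 1 is treated as band G


print(eer_to_band(72))  # C
```

A home rated 68 would sit in band D, so a predicted improvement of a single point can be enough to move it up a band.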

The Scottish Government, along with the rest of the UK, is in the process of developing a more advanced home energy model tool to address some of the concerns raised about the SAP methodology.

Adopting an AI/Machine learning approach

Over 3.5 million EPCs have been produced in Scotland since 2008, each containing a significant amount of information collected by a surveyor for an individual property. These EPCs are made available by the Scottish Government.

Instead of reproducing the engineering algorithms of SAP, the model presented here takes a simplified approach: an AI model is trained on the wide range of input descriptors (categories such as roof type) and reported numeric values (such as loft insulation thickness) in over 2.5 million EPC records, and is then validated against a separate, independent set. In effect, imagine an expert who has studied and memorised all of Scotland's home EPCs and, on that basis, predicts what the EER and banding of a given home might be.

The technical stuff

CatBoost, a gradient-boosting algorithm developed by Yandex, is particularly well-suited to the AI task because EPC datasets contain a rich mixture of categorical descriptors (e.g., dwelling type, wall construction, heating system) and numerical variables (e.g., floor area, heat-loss values, glazing efficiency). CatBoost's native ability to handle categorical features without manual encoding makes it an ideal choice for modelling EPC data efficiently and accurately.

EPC datasets typically include dozens of input variables describing the physical characteristics of a property, its heating systems, insulation levels, and environmental performance. Many of these fields are categorical with high cardinality, for example "main heating fuel," "property type," or "construction age band." Traditional machine-learning algorithms often require one-hot encoding or complex preprocessing to convert these categories into numerical form, which becomes computationally expensive at a dataset size of 3.5 million records. CatBoost avoids this overhead by applying ordered target encoding internally, reducing the risk of target leakage while preserving the natural structure of the data. When training on millions of EPCs, CatBoost's ordered boosting framework also helps minimise overfitting.

EPC ratings (A–G or numerical SAP scores) are influenced by subtle interactions between features, such as how insulation interacts with heating system efficiency or how property age correlates with wall type. CatBoost's symmetric tree structure and gradient-boosting approach allow it to capture these non-linear relationships effectively. The algorithm builds an ensemble of decision trees, each correcting the errors of the previous one, gradually improving predictive accuracy across the full dataset.

Another advantage is CatBoost's ability to handle missing values, which are common in EPC datasets due to inconsistent assessor inputs or legacy records. Missing numeric values do not need to be imputed: by default CatBoost treats them as smaller than every observed value for that feature, so a tree can separate missing entries from the rest. This preserves data integrity and reduces preprocessing time.
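A small sketch of how records with gaps might be prepared before being passed to CatBoost. It assumes numeric gaps are left as NaN for the library to deal with, while categorical gaps are given an explicit placeholder label, since CatBoost expects categorical values as strings or integers. The field names and the "Unknown" label are hypothetical.

```python
import math

MISSING_LABEL = "Unknown"  # assumed placeholder category for gaps


def prepare_row(row, cat_columns):
    """Return a copy of `row` with gaps encoded in a CatBoost-friendly way."""
    cleaned = {}
    for name, value in row.items():
        is_missing = value is None or value == "" or (
            isinstance(value, float) and math.isnan(value)
        )
        if name in cat_columns:
            # Categorical gap: use an explicit label rather than dropping the row.
            cleaned[name] = MISSING_LABEL if is_missing else str(value)
        else:
            # Numeric gap: keep as NaN so no invented value enters the data.
            cleaned[name] = float("nan") if is_missing else float(value)
    return cleaned


record = {
    "roof_description": None,
    "main_fuel": "mains gas",
    "loft_insulation_mm": None,
    "floor_area_m2": 84,
}
result = prepare_row(record, cat_columns={"roof_description", "main_fuel"})
print(result)
```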

How accurate is it?

The trained model indicates how strongly different factors influence its predicted EPC scores. The key ones, in terms of percentage contribution, are:

  • Roof Description (10.8%)
  • Construction Age Band (10.3%)
  • Second Heat System (9.6%)
  • Roof Insulation Thickness (8.1%)
  • Main Fuel (5.7%)

In simple terms the energy model is fairly accurate, but not perfect. On the validation set used after training, it is off by about 4.1 SAP points on average. About 74% of predictions are within ±5 points of the score reported by SAP/RdSAP and 90% are within ±10 points.
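For readers who want to see how figures like these are worked out, here is a minimal sketch. It assumes "off by ... on average" means the mean absolute error, and the five scores used are invented for the example rather than taken from the validation set.

```python
def accuracy_summary(actual, predicted):
    """Mean absolute error and the share of predictions within 5 and 10 points."""
    errors = [abs(p - a) for a, p in zip(actual, predicted)]
    n = len(errors)
    return {
        "mae": sum(errors) / n,
        "within_5": sum(e <= 5 for e in errors) / n,
        "within_10": sum(e <= 10 for e in errors) / n,
    }


# Invented example: certificate scores versus model predictions for five homes.
actual = [72, 65, 58, 80, 45]
predicted = [70, 71, 59, 68, 44]

summary = accuracy_summary(actual, predicted)
print(summary)
```

Here the individual errors are 2, 6, 1, 12 and 1 points, giving an average error of 4.4 points, with 60% of predictions within ±5 and 80% within ±10.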

An indication of accuracy is given by the figure below highlighting how close the model is to the actual score for 100 randomly chosen homes.

Prediction accuracy for 100 randomly chosen homes

Why use this tool?

The tool allows any user with some familiarity with their home's characteristics to estimate its EER and banding. Each EPC normally comes with selected recommendations for home enhancements (such as increasing loft insulation or installing a gas boiler).

The model is trained on the most common types of property. Unusual homes are less likely to be represented in the training data, so predictions for them are more likely to be in error.

A note of caution on the selection of enhancements: choose only what you consider to be practical. For example, if you have a solid wall property, adding cavity wall insulation is not practical, and if you have a flat roof, neither is adding pitched roof insulation. The model will still produce a score, but it is likely to be inaccurate because the model has not been trained on such combinations of characteristics and enhancements.
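A check of this kind could be sketched as follows. The two rules are taken from the examples above; the field names, enhancement names and function are hypothetical and do not describe how the tool itself works.

```python
# Each enhancement maps to the home characteristic that makes it impractical.
INCOMPATIBLE = {
    "cavity_wall_insulation": ("wall_type", "solid"),
    "pitched_roof_insulation": ("roof_type", "flat"),
}


def impractical_enhancements(home, enhancements):
    """Return the chosen enhancements that do not suit this home."""
    flagged = []
    for enhancement in enhancements:
        rule = INCOMPATIBLE.get(enhancement)
        if rule is None:
            continue  # no known conflict for this enhancement
        field, blocking_value = rule
        if home.get(field) == blocking_value:
            flagged.append(enhancement)
    return flagged


home = {"wall_type": "solid", "roof_type": "pitched"}
chosen = ["cavity_wall_insulation", "pitched_roof_insulation", "loft_insulation"]
flagged = impractical_enhancements(home, chosen)
print(flagged)  # ['cavity_wall_insulation']
```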