Bruno Petrungaro, Senior Econometrician, lead for machine learning training via the AnalystX platform on the FutureNHS Collaboration site, shares knowledge and best practice on building a predictive model.
A common misconception is that, to do machine learning, engineers only need to know how to build a variety of different models. While this knowledge is important, there are a number of additional factors to consider when building a predictive model.
These include understanding the business question, which is crucial to developing a good solution; if we do not, we might build something technically perfect that does not answer the question asked by the end user. Expertise in the domain you are working in is also extremely important; health is a particularly complex field where domain knowledge is crucial to understanding the data that will be used to build a predictive model.
Although the process of building a predictive model is very much iterative, we can attempt to establish a series of steps that need to be followed. This does not mean you will not go back and forth through these steps. You definitely will, but that is how we get things right.
There are many ways of splitting the steps required for building a predictive model, and I am sure you will find many different versions from other sources and most, if not all, will be correct.
What follows, however, is a series of steps presented in the shortest way possible to give an overview of what it is like to build a predictive model. The steps are:
- Problem formulation
- Data collection, processing and exploratory data analysis
- Model building
- Deployment
Problem formulation
The objective of this step may seem simple – understanding the problem. In reality it is not so simple. This is actually the most crucial step in building a predictive model.
Understanding what is needed to build a good solution means not only understanding the problem but how the potential solution should look. This includes understanding what data is needed to build the solution.
Data collection, processing and exploratory data analysis
Data collection means gathering the data you are going to use to build a predictive model. This generally involves using a database, but sometimes you will need to collect the data yourself. Even while you are doing this, you may already start noticing issues with the data, such as missing values or values that have not been entered properly. Identifying these types of issues is part of the purpose of exploratory data analysis (EDA).
Understanding your data is the key to exploratory data analysis. EDA is a series of numerical and visualisation techniques that allow us to understand our data, the variables in our data, and the relationship between these variables. It is useful to look at EDA using a set of questions to truly understand our data:
- Which types of variables are there in the dataset?
- Are there missing values?
- Are the variables related to each other?
- Are there outliers?
- Are the variables related to the variable you are trying to predict?
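As a rough illustration, each of the questions above can be mapped onto a few pandas calls. This is only a sketch: the dataset is made up, the column names (age, length_of_stay, ward, readmitted) are hypothetical, and the 1.5 x IQR rule is just one common rule of thumb for flagging outliers.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
df = pd.DataFrame({
    "age": [34, 51, 67, None, 45, 72, 29, 88],
    "length_of_stay": [2, 3, 5, 4, 3, 40, 1, 7],
    "ward": ["A", "B", "A", "C", "B", "A", "C", "B"],
    "readmitted": [0, 0, 1, 0, 0, 1, 0, 1],
})

# Which types of variables are there in the dataset?
variable_types = df.dtypes

# Are there missing values?
missing_counts = df.isna().sum()

# Are the variables related to each other? (numeric columns only)
correlations = df.select_dtypes("number").corr()

# Are there outliers? Flag values outside 1.5 * IQR of the quartiles.
q1, q3 = df["length_of_stay"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["length_of_stay"] < q1 - 1.5 * iqr) | (df["length_of_stay"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

# Are the variables related to the variable you are trying to predict?
target_correlation = correlations["readmitted"].drop("readmitted")

print(variable_types, missing_counts, outliers, target_correlation, sep="\n\n")
```

In practice these numerical summaries would usually sit alongside plots such as histograms and scatter plots, and the findings would be checked with someone who knows how the data was collected.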
In the data processing stage, you will have to find a way to deal with missing values and outliers. You will most likely have to transform your variables so they are in the format accepted by the Python or R libraries you will potentially use. Also, there will probably be a need to transform your variables so they are presented in the best way for the algorithm you will use.
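A minimal sketch of what this stage might look like in Python with pandas is shown below. The data and column names are hypothetical, and the specific choices (median imputation, capping at 1.5 x IQR, one-hot encoding, standardisation) are assumptions made for illustration; the right choices depend on the data and the algorithm.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
df = pd.DataFrame({
    "age": [34, 51, 67, None, 45, 72, 29, 88],
    "length_of_stay": [2, 3, 5, 4, 3, 40, 1, 7],
    "ward": ["A", "B", "A", "C", "B", "A", "C", "B"],
})
processed = df.copy()

# Missing values: fill gaps with the column median (one option among many).
processed["age"] = processed["age"].fillna(processed["age"].median())

# Outliers: cap extreme values at the 1.5 * IQR limits rather than dropping rows.
q1, q3 = processed["length_of_stay"].quantile([0.25, 0.75])
iqr = q3 - q1
processed["length_of_stay"] = processed["length_of_stay"].clip(
    lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr
)

# Format: most libraries expect numbers, so one-hot encode the categorical column.
processed = pd.get_dummies(processed, columns=["ward"])

# Presentation: standardise numeric columns to mean 0 and standard deviation 1,
# which helps algorithms that are sensitive to the scale of the inputs.
for column in ["age", "length_of_stay"]:
    processed[column] = (processed[column] - processed[column].mean()) / processed[column].std()

print(processed)
```

In a real project, values such as the median or the scaling parameters would normally be computed on the training data only and then applied to new data, so that information from the test data does not leak into the model.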
Model building
In this step we need to build a number of candidate predictive models. We are interested in selecting the best model, and to define which one is best, a metric or set of metrics will need to be chosen to assess model quality. The selection of these metrics depends on each project.
The objective is to build models that work well with new data, which means models that work well with data not used in training the model. Models need to be able to generalise. A model that performs well on training data but poorly on test data is said to be ‘overfitting’.
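To make the idea concrete, the sketch below trains an unrestricted decision tree on synthetic data with scikit-learn and compares accuracy on the training data with accuracy on held-out data. The dataset, the choice of a decision tree and accuracy as the metric are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with deliberately noisy labels (flip_y), for illustration only.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)

# Hold back 30% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A tree with no depth limit can memorise the training data, noise included.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_accuracy = deep_tree.score(X_train, y_train)
test_accuracy = deep_tree.score(X_test, y_test)

print(f"Training accuracy: {train_accuracy:.2f}")
print(f"Test accuracy:     {test_accuracy:.2f}")
```

A large gap between the two figures is the typical sign of overfitting: the model has learned the training data rather than the underlying pattern.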
However, we do not choose the best model by looking at the test data performance. We only use test data to assess the quality of a model after training and selection. Model validation provides different techniques to select our final model.
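One widely used validation technique is k-fold cross-validation. The sketch below, using scikit-learn on synthetic data, compares three candidate models by cross-validation on the training data only, and touches the test data once at the end. The candidate models, the five folds and the accuracy metric are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hypothetical set of candidate models to choose between.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "shallow_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "deep_tree": DecisionTreeClassifier(random_state=0),
}

# Model selection: 5-fold cross-validation on the training data only.
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)

# Final assessment: refit the chosen model and use the test data once.
final_model = candidates[best_name].fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)

print(cv_scores)
print(f"Selected: {best_name}, test accuracy: {test_accuracy:.2f}")
```

Because the test data plays no part in choosing the model, the final figure is a fairer estimate of how the model is likely to perform on new data.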
Deployment
As data scientists, it is important to remember that our models will be used by other members of staff in our organisations. They might be clinicians, administrative staff or other staff. This last step of the process requires us to think of the best way of getting the model or its conclusions to them. Sometimes a written report with conclusions is required, but mostly the model will need to be incorporated as part of a web or desktop-based software application, or the model itself may become an application.
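As one possible starting point for incorporating a model into an application, the sketch below saves a trained scikit-learn model to disk and wraps it in a small function that an application could call. The function name predict_risk, the file name model.pkl and the four input features are hypothetical, and pickle is only one of several ways to store a model.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a small model on synthetic data, for illustration only.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Save the trained model so an application can load it later.
model_path = Path(tempfile.mkdtemp()) / "model.pkl"  # hypothetical location
with open(model_path, "wb") as model_file:
    pickle.dump(model, model_file)


def predict_risk(features, path=model_path):
    """Hypothetical helper: load the saved model and return the
    predicted probability of the positive class for one record."""
    with open(path, "rb") as model_file:
        loaded_model = pickle.load(model_file)
    return float(loaded_model.predict_proba([features])[0, 1])


risk = predict_risk([0.1, -1.2, 0.5, 0.3])
print(f"Predicted risk: {risk:.2f}")
```

Pickle files should only be loaded from trusted sources, and a real deployment would also need the same data processing steps applied to incoming records before they reach the model.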
Conclusion
Building predictive models is a unique skill, but using a series of well-thought-out steps to guide the process can help the build run more smoothly and ensure every eventuality is considered.
At the Health Economics Unit, we have years of experience in developing these models to help implement the best patient services across the UK. To find out more about how we can help your organisation, please contact bruno.petrungaro@nhs.net or see our website: www.healtheconomicsunit.nhs.uk.
Author: Bruno Petrungaro
Helping the NHS use machine learning effectively makes Bruno Petrungaro happy.
Bruno brings his vast knowledge, experience and network to ensure we deliver outstanding products that achieve our clients’ aims.