Artificial Intelligence, the ability of a machine to demonstrate intelligence, was founded as an academic discipline in 1956. In the years since, it has experienced waves of optimism, periods of diminished funding around the 1980s, and new research, methodologies and approaches that led to its current success. Today AI is one of the hottest technologies and among the most promising to lead this new era of business digitalisation.

All Artificial Intelligence solutions go through a well-defined set of phases from inception and implementation to deployment and continuous improvement. It is an iterative process that can involve several cycles before the solution converges to a finalised pipeline infrastructure. This use case landing page and its collection of sub-landing pages aim to provide a set of reusable blueprints that can be utilised in each of the various phases of the AI lifecycle. The Artificial Intelligence pipeline is a sequence of technical phases that takes an AI project from the early data acquisition phase all the way up to production inference.

The AI pipeline is an integral part of the wider AI project lifecycle. We separate the AI pipeline space into four main phases that follow business understanding and initial designs.

Business understanding

This initiation phase within an AI project lifecycle focuses on understanding the business space, objectives, and requirements: what value the business expects the AI solution to deliver and how that value is expected to be quantified. Assessing the current situation is also very important: what process is currently in place, and what risks and contingencies derive from a move to an AI solution. The project scope needs to be determined, covering available tools and technologies, any new technology involved, migration or integration plans, and human resources. Project phases and timelines, as well as a cost-benefit analysis, also need to be produced. Essentially, the deliverable for this phase is a full-blown project plan, as with any new software solution.

Data engineering

The first technical phase within the AI project lifecycle is gathering the data and building the appropriate raw datasets. Entering this phase, the team has defined the business case and objectives and has a good understanding of the problem at hand, as well as first solid thoughts on how to model its digital twin. It is now time to start documenting potential data sources as well as the data schemas that will be collected from each data source. Once that is done, the team will start writing code to ingest the appropriate data from all the disparate sources into a centralised storage space, often called the data lake. An initial exploration of the data will reveal any early inconsistencies. Data tends to be noisy; all inconsistencies need to be addressed: timestamps may need to be synchronised, or metric units may need to be homogenised, e.g., stock prices might be reported in different currencies. These problems need to be investigated and dealt with. Depending on the situation, solutions can include synchronisation of data from disparate sources, outlier detection and noise reduction. After the data engineering step, the team should have a data lake that is free of any inconsistencies and has all the raw data needed to move forward to feature engineering.
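
As a minimal sketch of this step, the snippet below ingests two hypothetical price feeds, synchronises their timestamps to UTC and homogenises their currencies before landing the unified raw dataset. The feeds, the FX rates and the output file are illustrative assumptions, not part of any specific data platform.

```python
import pandas as pd

# Two hypothetical raw feeds: one quotes in USD with UTC timestamps,
# the other in EUR with local (naive) timestamps.
feed_a = pd.DataFrame({
    "timestamp": ["2024-01-02 14:30:00", "2024-01-02 14:31:00"],
    "symbol": ["ACME", "ACME"],
    "price": [101.5, 101.7],
    "currency": ["USD", "USD"],
})
feed_b = pd.DataFrame({
    "timestamp": ["2024-01-02 15:30:30", "2024-01-02 15:31:30"],
    "symbol": ["ACME", "ACME"],
    "price": [93.1, 93.3],
    "currency": ["EUR", "EUR"],
})

# Synchronise timestamps: feed_b is assumed to be in Europe/London local time.
feed_a["timestamp"] = pd.to_datetime(feed_a["timestamp"], utc=True)
feed_b["timestamp"] = (
    pd.to_datetime(feed_b["timestamp"])
    .dt.tz_localize("Europe/London")
    .dt.tz_convert("UTC")
)

# Homogenise units: convert every price to USD using an illustrative FX table.
fx_to_usd = {"USD": 1.0, "EUR": 1.09}  # hypothetical rates
for feed in (feed_a, feed_b):
    feed["price_usd"] = feed["price"] * feed["currency"].map(fx_to_usd)

# Land the unified raw dataset in the "data lake" (here, simply a local file).
raw = pd.concat([feed_a, feed_b], ignore_index=True).sort_values("timestamp")
raw.to_csv("prices_raw.csv", index=False)
```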

Feature engineering

One of the fundamental stages of a successful Artificial Intelligence solution is feature engineering. It is the process of transforming raw data into a dataset that can be used for inference. Feature engineering can be a very computationally intensive process and requires a lot of creativity and domain knowledge from the data scientist. The goal of the process is to maximise the predictive power of the AI by providing highly informative, well-designed feature sets. The algorithms involved in this stage of the AI pipeline range from simple decisions, such as record or feature removal and imputation, to complex machine learning algorithms that try to enhance the raw data.
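
To make the idea concrete, here is a minimal sketch, assuming pandas and a hypothetical cleaned price series, of hand-crafted features such as returns, rolling volatility and momentum. The column names and window lengths are illustrative choices, not a prescribed feature set.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned price series coming out of the data engineering step.
rng = np.random.default_rng(0)
prices = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-02", periods=250, freq="B", tz="UTC"),
    "price_usd": 100 * np.exp(np.cumsum(rng.normal(0, 0.01, 250))),
})

features = prices.set_index("timestamp")

# Simple imputation: forward-fill any missing prices before deriving features.
features["price_usd"] = features["price_usd"].ffill()

# Derived features a data scientist might hand-craft for a trading model:
features["return_1d"] = features["price_usd"].pct_change()            # daily return
features["volatility_20d"] = features["return_1d"].rolling(20).std()  # rolling risk
features["momentum_5d"] = features["price_usd"].pct_change(5)         # short-term momentum
features["ma_ratio"] = (
    features["price_usd"] / features["price_usd"].rolling(50).mean()  # trend signal
)

# Drop warm-up rows that do not yet have a complete feature set.
features = features.dropna()
```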

Modelling and evaluation

The output of the feature engineering step is a dataset that is not only considered sufficient to describe the problem space but is also AI ready. This means that we can now choose the AI cores that we will deploy to analyse the problem space. Usually, and depending on the AI learning genre, we will choose several cores that are able to solve the problem. Some of the most prominent AI learning genres from which we can choose the corresponding AI cores include:

  • Supervised & Unsupervised ML: These are the two typical scenarios of inference mapping. Supervised Learning means that the AI cores, learn from an existing pre-labelled dataset. For example, identification of a market microstructure incident we have seen and well documented in the past. Unsupervised Learning involves analysis and discovery of patterns in raw datasets to surface new insight. This could be for example a process of discovering trading clusters in a dataset that show similar behaviour.
  • Deep Learning: This is regarded as the new generation of AI algorithms, mainly consisting of different architectures of deep stacked Neural Networks. These AI structures have shown superior performance in solving even the most difficult problems and are capable of surfacing better insight from more complex feature sets.
  • Reinforcement Learning: This approach tries to imitate the way we humans learn by experience. An AI agent is allowed to explore the digital twin of an environment. As the agent takes decisions, it receives the appropriate rewards. The agent learns to navigate the problem space in a way that maximises its rewards. Any problem can be tackled using this methodology, provided there is sufficient environment modelling.
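
As an illustration of the unsupervised genre mentioned above, the sketch below clusters hypothetical per-account trading behaviour with scikit-learn's KMeans. The behavioural features and the number of clusters are assumptions made purely for demonstration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-account trading behaviour features: average trade size,
# trades per day, and the share of aggressive (market) orders.
rng = np.random.default_rng(1)
behaviour = np.column_stack([
    rng.lognormal(mean=3, sigma=1, size=500),   # average trade size
    rng.poisson(lam=20, size=500),              # trades per day
    rng.uniform(0, 1, size=500),                # share of aggressive orders
])

# Standardise the features so no single scale dominates the distance metric.
scaled = StandardScaler().fit_transform(behaviour)

# Unsupervised learning: group accounts into clusters of similar behaviour.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scaled)
```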

After defining the appropriate learning genre, we have a pool of algorithms that can be used from within that genre, e.g., if the solution involves supervised regression, we could pick cores such as regression trees or regression Neural Networks. It is always useful to also pick some statistical methods to use as baselines against which to evaluate the AI cores. Complex AI cores are sometimes difficult to maintain or interpret and can be very computationally intensive. It is therefore important to be able to determine the performance increase achieved over the baseline models to justify the decision to deploy AI solutions. Next steps within this phase include generation of a training, testing, and evaluation environment as well as training and optimising the cores. Finally, we need to evaluate the trained AI to see how much business value it can deliver.
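
The sketch below shows one way such a comparison might look, assuming a synthetic feature matrix and target, a mean-prediction statistical baseline, and a gradient-boosted regression tree from scikit-learn as the AI core. Every name and number in it is illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix X (e.g., the engineered features above) and a
# regression target y (e.g., the next day's return); both are synthetic here.
rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 4))
y = 0.3 * X[:, 0] - 0.2 * X[:, 2] + rng.normal(scale=0.1, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False  # keep the time ordering intact
)

# Statistical baseline: always predict the mean of the training target.
baseline_pred = np.full(len(y_test), y_train.mean())
baseline_mae = mean_absolute_error(y_test, baseline_pred)

# Candidate AI core: a gradient-boosted regression tree ensemble.
core = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
core_mae = mean_absolute_error(y_test, core.predict(X_test))

# The uplift over the baseline is what justifies deploying the heavier model.
print(f"baseline MAE={baseline_mae:.4f}, AI core MAE={core_mae:.4f}")
```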

Machine learning operations

Once the team is satisfied with the choice of methodology and its results, we need to move forward to one of the steps of the AI pipeline closest to software engineering. Having an optimised AI at hand without deploying it is not particularly useful. The end-customer of the product needs to be able to access the results in a way that elevates her user experience. The AI team needs to have a plan at hand on how to deploy the models, what tech stacks will be used, and how the model results will be served and visualised. However, even after productionisation, the phase is far from over. In fact, MLOps is a process that governs the AI pipeline from start to finish. Moreover, in AI there is a concept described as CACE: Changing Anything Changes Everything. A small change at any step in the pipeline can cascade in a way that affects the entire pipeline. It could be data drift, a change in the distributions of the data, or even something at the infrastructure level, e.g., the introduction of a faster data provider API that cannot be handled by the existing ingestion process. Whatever the case, there needs to be a governing, automated lineage process able to redeploy the entire pipeline and apply the new changes without breaking the software solution. Luckily, these days we have many frameworks that do exactly that: they allow us to manage the entire AI pipeline, rapidly innovate and easily redeploy to production with high precision.
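
As one small example of such monitoring, the sketch below, assuming SciPy and two synthetic samples of a feature, uses a two-sample Kolmogorov–Smirnov test to flag data drift that would trigger retraining and redeployment. The threshold and the data are illustrative only.

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical monitoring check: compare the live distribution of a feature
# against the distribution the model was trained on, and flag drift.
rng = np.random.default_rng(3)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5000)  # snapshot at training time
live_feature = rng.normal(loc=0.3, scale=1.2, size=1000)      # recent production data

result = ks_2samp(training_feature, live_feature)

DRIFT_P_VALUE = 0.01  # illustrative threshold
if result.pvalue < DRIFT_P_VALUE:
    # In a real MLOps setup this would trigger the automated lineage process:
    # re-run feature engineering, retrain the cores, and redeploy the pipeline.
    print(f"Data drift detected (KS={result.statistic:.3f}, p={result.pvalue:.4f}); retraining required.")
else:
    print("No significant drift; pipeline remains valid.")
```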

Measuring business value

While this is not per se a separate step of the AI pipeline, we feel that it is one of the very important deliverables that the pipeline should contain. This phase belongs to the wider AI project lifecycle and follows successful deployment. It is very different from the evaluation of the cores on historical data, e.g., a back-test of a trading AI, which we know can contain quite a bit of bias before deployment. The idea here is that, as we are trying to enhance an existing process through the use of AI, we should be able to answer the question: have we succeeded in delivering enough value to justify the use of AI? The team should have quantifiable performance indicators that pre-date the AI and use A/B testing to prove that the AI has indeed delivered an increase in business value. It could be, for example, a scenario of paper trading an AI system while the prior trading process remains in place and comparing the P&L of the two directly.
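
A minimal sketch of such an A/B comparison, assuming SciPy and two synthetic daily P&L series recorded over the same period, might look as follows. The figures are invented, and the Welch t-test is just one possible way to quantify the uplift.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical daily P&L series recorded over the same A/B period:
# the incumbent process trades live while the AI system paper-trades in parallel.
rng = np.random.default_rng(4)
pnl_incumbent = rng.normal(loc=1000, scale=5000, size=120)  # existing process
pnl_ai = rng.normal(loc=1800, scale=5200, size=120)         # AI paper-trading

uplift = pnl_ai.mean() - pnl_incumbent.mean()
result = ttest_ind(pnl_ai, pnl_incumbent, equal_var=False)  # Welch's t-test

print(f"Average daily P&L uplift: {uplift:,.0f}")
print(f"Welch t-test p-value: {result.pvalue:.3f}")
# A small p-value alongside a positive uplift supports the claim that the AI
# delivered measurable business value over the prior process.
```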

For each of the AI lifecycle phases in the pipeline, you can find a dedicated sub-landing page containing numerous blueprints explaining techniques that can be used in that context and applying them to example financial datasets. These blueprints are also coupled with related, wider use case articles that you will be able to use as a starting sandbox environment for experimentation, depending on your business use case.