Overview

Data engineering is the natural first step after the team has defined the business scope and its expectations of the AI solution. One of the data strategy deliverables during the business understanding phase is the minimum data schema that the AI project will need to model the problem at hand. Given this schema, the team needs to design and implement a data gathering and ingestion strategy that will allow them to source the appropriate data from its various locations. Once the data is gathered, an exploration of the data space follows so that data scientists can understand some of the basic characteristics of the data, such as distributions and first estimates of quality. Thereafter, the data needs to go through a thorough data preparation process to ensure the delivery of a high-quality dataset. During this phase, an AI data engineer will build all the appropriate pipelines that ingest and synchronise the data from dispersed endpoints, then clean and prepare it for the next phases of the AI pipeline.

Data ingestion

These days companies focus on gathering large volumes of both structured and unstructured data. There are many sources of data, including databases, APIs, websites, social media, IoT devices, sensors, blogs, emails and more. During this step of the data engineering phase, the main task of the AI data engineer is to gather the data from all the disparate sources and store it in a single data store. Refinitiv already provides a single source of many different industry datasets, and using the available APIs an AI team can quickly tap into a rich ecosystem of well-structured data. Such consistency is the result of the complex preprocessing that our products have already applied to various disparate datasets. We will be exploring detailed ways of specifying, ingesting and structuring data from the platform using its various available APIs.
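The sketch below illustrates one minimal pattern for such an ingestion step: pulling JSON records from a REST endpoint and landing them in a columnar format inside the data lake. The endpoint URL, query parameters and target path are hypothetical placeholders rather than actual Refinitiv API details, and the example assumes the requests and pandas libraries (with a parquet engine such as pyarrow) are available.

```python
import requests
import pandas as pd

# Hypothetical source endpoint and data lake location; placeholders only,
# not actual Refinitiv API details.
SOURCE_URL = "https://api.example.com/v1/daily-prices"
DATA_LAKE_PATH = "datalake/raw/daily_prices.parquet"


def ingest(url: str, path: str) -> pd.DataFrame:
    """Pull JSON records from a source API and land them in the data lake."""
    response = requests.get(url, params={"instrument": "EXAMPLE.X"}, timeout=30)
    response.raise_for_status()

    # Flatten the JSON payload into a tabular structure.
    frame = pd.json_normalize(response.json())

    # Persist in a columnar format (requires a parquet engine such as pyarrow)
    # so that downstream exploration is cheap.
    frame.to_parquet(path, index=False)
    return frame


if __name__ == "__main__":
    raw = ingest(SOURCE_URL, DATA_LAKE_PATH)
    print(f"Ingested {len(raw)} rows into {DATA_LAKE_PATH}")
```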

Data Exploration

Once the team has a central storage area, often called the data lake, the AI data engineer is ready for an initial exploration of the data that will allow the team to reach certain conclusions about its quality and other important characteristics such as min-max statistics, data types and more. Data distributions are analysed within the parameter space, and data consistency and clarity are explored. Statistical distribution tests are usually performed in the form of hypothesis testing on the relevant features. What we are trying to establish with those tests is whether or not a feature follows a specific distribution. To do that, we test the feature against a null hypothesis we have set up, e.g., a statement such as “The feature follows the normal distribution”. To test the hypothesis, we need to decide on the most appropriate statistic. Once the descriptive statistic is determined, the statistical test will give us a metric such as the p-value, the probability value, and if that is above a certain level, we cannot reject the null hypothesis at the chosen confidence level. For example, if we are testing for normality, we can use the Shapiro-Wilk test, and if the p-value is above 0.05 we cannot reject the assumption that the feature follows the normal distribution. There exists a multitude of statistical tests for distributions, supporting different types of hypothesis testing such as one-sided or two-sided tests.
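As a minimal sketch of such a check, the snippet below computes basic descriptive statistics and runs a Shapiro-Wilk normality test with scipy. The daily_return column and its synthetic values are purely illustrative assumptions, not part of any real dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Illustrative data: "daily_return" is a hypothetical feature column
# populated with synthetic, normally distributed values.
rng = np.random.default_rng(42)
df = pd.DataFrame({"daily_return": rng.normal(loc=0.0, scale=0.01, size=500)})

# First-pass descriptive statistics: min/max, quartiles, data types.
print(df.describe())
print(df.dtypes)

# Shapiro-Wilk test: the null hypothesis is that the sample is drawn
# from a normal distribution.
statistic, p_value = stats.shapiro(df["daily_return"])
print(f"W statistic = {statistic:.4f}, p-value = {p_value:.4f}")

if p_value > 0.05:
    print("Cannot reject normality at the 5% significance level.")
else:
    print("Normality rejected at the 5% significance level.")
```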

During this phase, the AI data engineer will also reveal any problems that the datasets might have, and the team can start formulating strategies for solving or smoothing them out. For example, during time series analysis the team might be interested in whether stationarity exists, so they can test for it and, if it does not, conclude that differencing operations are required.
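A minimal sketch of such a stationarity check, assuming statsmodels is available, is shown below. It applies the Augmented Dickey-Fuller test to an illustrative random-walk series and differences the series if the unit-root null hypothesis cannot be rejected.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Illustrative series: a random walk, which is non-stationary by construction.
rng = np.random.default_rng(0)
prices = pd.Series(np.cumsum(rng.normal(size=500)), name="price")


def is_stationary(series: pd.Series, alpha: float = 0.05) -> bool:
    """Augmented Dickey-Fuller test; the null hypothesis is a unit root,
    i.e. the series is non-stationary."""
    p_value = adfuller(series.dropna())[1]
    return p_value < alpha


if not is_stationary(prices):
    # First-order differencing is a common remedy for a unit root.
    differenced = prices.diff().dropna()
    print("Series differenced; stationary now?", is_stationary(differenced))
else:
    print("Series is already stationary.")
```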

Data preparation

After the initial data exploration, the AI data engineer is ready to deploy initial data cleansing and synchronisation methodologies on the raw data. There is a multitude of techniques that can be applied to enhance the quality of the available dataset and prepare it for the next phase of the Artificial Intelligence pipeline: feature engineering. Amongst others, techniques should be applied during this phase to cleanse, transform, combine and synchronise the dataset. Many methodologies can be applied in the context of data cleansing, aiming to tackle problems such as missing values, outliers, null values, whitespace or special characters and more.

Data transformations ensure better quality and predictive power of the data and can sometimes be an expensive process. During this step we can be constructing or deconstructing the data, applying aesthetic transformations or structural ones. For example, the AI engineer might aggregate a decimal column into periodic averages; that would be a constructive transformation. Deleting a column is an example of a destructive, structural transformation. Renaming a column is an aesthetic transformation, whereas combining two columns into one is both constructive and structural. Other types of transformations include encryption, anonymisation, indexing, ordering, typecasting and more.

Finally, combination and synchronisation is the process of merging different data sources into a single dataset, providing a single source of all the significant features that can be used to solve the problem at hand. Oftentimes, when combining different data sources, discrepancies arise as there can be differences in timestamps or other attribute aggregations between the sources. The purpose of this engineering step is to smooth out such discrepancies and present a homogeneous dataset to the next AI pipeline phase.
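The following sketch illustrates several of these steps with pandas on two small, entirely hypothetical sources: cleansing missing values and a whitespace-polluted column name, an aesthetic rename, a constructive aggregation into one-minute averages, and a final synchronisation of the two sources onto a common time grid. The column names and values are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

# Two hypothetical sources with different timestamps and data quality issues.
trades = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-01-02 10:00:01", "2023-01-02 10:00:07",
                                 "2023-01-02 10:01:03", "2023-01-02 10:01:58"]),
    "price": [100.1, np.nan, 100.4, 100.6],
    "volume": [200, 150, None, 300],
})
news = pd.DataFrame({
    "time": pd.to_datetime(["2023-01-02 10:00:30", "2023-01-02 10:01:45"]),
    "sentiment ": [0.2, -0.4],  # note the stray whitespace in the column name
})

# Cleansing: strip whitespace from column names, fill or drop missing values.
news.columns = news.columns.str.strip()
trades["volume"] = trades["volume"].fillna(0)
trades = trades.dropna(subset=["price"])

# Aesthetic transformation: rename a column to match the other source.
news = news.rename(columns={"time": "timestamp"})

# Constructive transformation: aggregate prices into one-minute averages.
trades_1min = (trades.set_index("timestamp")
                     .resample("1min")
                     .agg({"price": "mean", "volume": "sum"}))

# Synchronisation: align both sources on a common one-minute grid.
news_1min = news.set_index("timestamp").resample("1min").mean()
combined = trades_1min.join(news_1min, how="left")
print(combined)
```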