Data 101

Data is the foundation for almost all digital business models. AIoT adds sensor-generated data to the picture. However, in its rawest form, data is usually not usable. Developers, data engineers, analytics experts, and data scientists work on creating information from data by linking relevant data elements and giving them meaning. By adding context, knowledge is created[1]. In the case of AIoT, knowledge is the foundation of actionable intelligence.

Data is a complex topic with many facets. Data 101 looks at it from different perspectives, including the enterprise perspective, the data management perspective, the statistician's perspective, the ML perspective, and finally the AIoT perspective. The AIoT Data Strategy section provides an overview of how to implement this in the context of an AIoT initiative.

Enterprise Data Perspective

Traditionally, enterprise data is divided into three main categories: master data, transactional data, and analytics data. Master data is data related to business entities such as customers, products, and financial structures (e.g. cost centers). Master Data Management (MDM) aims to provide a holistic view of all the master data in an enterprise, addressing redundancies and inconsistencies. Transactional data is data related to business events, e.g. the sale of a product or the payment of an invoice. Analytics data is related to business performance, e.g. sales performance of different products in different regions.

With AIoT, additional data categories usually play an important role: asset condition data, asset usage data, asset performance data, and data related to asset maintenance and repair. Assets in this context can be physical products, appliances or equipment. The data can come from interfacing with existing control systems, or from additional sensors. AIoT must ensure that this raw data is eventually converted into actionable intelligence.

Data - Enterprise Perspective

Database Perspective

Because of the need to efficiently manage large amounts of data, many different database and other data management systems have been developed. They differ in many ways, including scalability, performance, reliability, and ability to manage data consistency.

For decades, relational database management systems (RDBMS) were the de-facto standard. An RDBMS manages data in tabular form, i.e. as a collection of tables, with each table consisting of a set of rows and columns. RDBMS provide many tools and APIs (application programming interfaces) to query, read, create, and manipulate data. Most RDBMS support so-called ACID transactions. ACID stands for Atomicity, Consistency, Isolation, and Durability. ACID transactions guarantee the validity of data even in case of fatal errors, e.g. an error during a transfer of funds from one account to another. Most RDBMS support the Structured Query Language (SQL) for queries and updates.
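
As a minimal sketch of how an ACID transaction protects such a funds transfer, the following example uses Python's built-in sqlite3 module. The table layout and account names are illustrative assumptions, not part of any specific enterprise system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100.0), ("bob", 50.0)])
conn.commit()

try:
    # "with conn" opens a transaction: it commits if the block succeeds
    # and rolls back if any statement fails (atomicity).
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = ?", ("alice",))
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = ?", ("bob",))
except sqlite3.Error:
    pass  # neither update is applied if the transfer fails

print(dict(conn.execute("SELECT id, balance FROM accounts")))  # {'alice': 70.0, 'bob': 80.0}
```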

With the emergence of so-called NoSQL databases in the 2010s, the quasi-monopoly of the RDBMS/SQL paradigm ended. While RDBMS are still dominant for transactional data, many projects now rely on alternative or at least additional databases and data management systems for specific purposes. Examples of NoSQL databases include column databases, key-value databases, graph databases, and document databases.

Column (or wide-column) databases group and store data in columns instead of rows. Since they have neither predefined keys nor column names, they are very flexible and allow for storing large amounts of data within a single column. This allows them to scale easily, even across multiple servers.

Document-oriented databases store data in documents, which can also be interlinked. They are very flexible, because there is no dedicated schema required for the different documents. Also, they make development very efficient, since modern programming languages such as JavaScript provide native support for document formats such as JSON.

Key-value databases are very simple, but also very scalable. They use a dictionary data structure to store objects under a unique key. Objects are retrieved only via key lookup.

Finally, graph databases store complex graphs of objects, supporting very efficient graph operations. They are most suitable for use cases where many graph operations are required, e.g. in a social network.
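
To make the four data models a bit more tangible, the following sketch represents the same sensor-related information in each style using plain Python structures. This is purely illustrative and not tied to any specific database product; all keys, fields, and identifiers are made up.

```python
# Key-value: objects are retrieved only via key lookup.
kv_store = {"sensor:42": b'{"temp": 21.5}'}

# Document: a flexible, schema-free nested record, e.g. JSON.
document = {"_id": "sensor:42", "type": "temperature",
            "readings": [{"ts": "2021-06-01T12:00", "temp": 21.5}]}

# Wide-column: data grouped into column families per row key.
wide_column = {"sensor:42": {"metadata": {"type": "temperature"},
                             "readings": {"2021-06-01T12:00": 21.5}}}

# Graph: nodes and edges, suited for relationship-heavy queries.
graph = {"nodes": {"sensor:42": {}, "machine:7": {}},
         "edges": [("sensor:42", "attached_to", "machine:7")]}
```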

Data - DBMS Perspective


There is a mismatch between the different database types introduced here and the different programming languages used to work with the data. This mismatch is also referred to as the Impedance Mismatch. For example, while an RDBMS is organized in terms of rows and tables, this model is not found in most programming languages. The solution is to build custom APIs (e.g. for SQL) that help access the database entities from a specific programming language. However, this solution is not ideal: developers and architects now must deal with two models, the one in the database and the one in their application. This adds complexity, room for errors, and potential inconsistencies. There is no silver bullet for this problem, and architects and developers have to work together to deal with it in the best way possible.
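
The following minimal sketch illustrates the mismatch, assuming a hypothetical customers table and a Customer class in the application. The manual row-to-object mapping in the last lines is the kind of extra layer that object-relational mappers try to automate.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Customer:          # the application's object model
    id: int
    name: str

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'ACME Corp')")

# Manual mapping from the relational model (a row tuple) to the object model.
row = conn.execute("SELECT id, name FROM customers WHERE id = ?", (1,)).fetchone()
customer = Customer(*row)
print(customer)  # Customer(id=1, name='ACME Corp')
```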

Statistician's Perspective

In the context of AI and ML, another important perspective on data is that of the statistician. Statistical computations assume that variables have specific levels of measurement. Stanley Smith Stevens first introduced the "theory of scales of measurement", which distinguishes four types or levels of data: nominal, ordinal, interval, and ratio.

At the top level, a distinction is made between qualitative and quantitative data, as shown in the diagram below. Qualitative data measures types and is usually represented by a name, label, or number code. It includes:

  • Categorical data (sometimes called nominal data) has two or more categories without any intrinsic ordering. For example, color is a categorical variable with multiple categories (red, green, blue, etc.), but there is no intrinsic ordering of the categories.
  • Ordinal data is similar to categorical data. The difference is that there is a clear ordering of the categories. For example, suppose you have a variable customer satisfaction with three categories: low, medium, and high. In addition to classifying customers by their satisfaction, you can also order the categories.

Quantitative data has numeric variables (e.g. how many, how much, or how often). It includes:

  • Discrete data: countable, with a finite number of possibilities, e.g. the number of students in a class
  • Continuous data: not countable, with an infinite number of possibilities, e.g. age, weight, or height
Data - Statistics Perspective

Understanding the different types of variables is important because not all statistical analyses can be performed on all variable types. For example, it is not possible to compute the mean of the variable hair color, as you cannot sum red and blond hair. Another example that won't work well is trying to find the mode (i.e. the value that appears most often in a set of data values) of a continuous variable, since almost no two values of a continuous variable are exactly the same.
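
The following short sketch illustrates these variable types and which statistics make sense for them, assuming pandas is available. The column names and values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "hair_color": pd.Categorical(["red", "blond", "red"]),           # nominal
    "satisfaction": pd.Categorical(["low", "high", "medium"],
                                   categories=["low", "medium", "high"],
                                   ordered=True),                    # ordinal
    "num_purchases": [3, 1, 7],                                      # discrete
    "weight_kg": [71.2, 65.0, 80.4],                                 # continuous
})

print(df["hair_color"].mode()[0])   # the mode is fine for nominal data
print(df["satisfaction"].min())     # an ordering is defined for ordinal data
print(df["weight_kg"].mean())       # the mean only makes sense for numeric data
# df["hair_color"].mean() would raise an error: you cannot sum hair colors.
```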

Machine Learning Perspective

Data obviously is also the foundation of every ML project. AI expert Andrew Ng has gone so far as to launch a campaign to shift the focus of AI practitioners from ML model development to the quality of the data used to train the models.

ML data can take many different forms, including text (e.g. for auto-correction), audio (e.g. for natural language processing), images (e.g. for optical inspection), video (e.g. for security surveillance), time series data (e.g. electricity metering), event series data (e.g. machine events) and even spatio-temporal data (describing a phenomenon in a particular location and period of time, e.g. for traffic predictions).

Many ML use cases require that the raw data is labeled. Labels provide additional context information for the ML algorithm, e.g. labels assigned to images for image classification.

In ML projects, we need data sets to train and test the model. A data set is a collection of data, e.g. a set of files or a specific table in a database. In the latter case, the rows of the table correspond to the members of the data set, while every column represents a particular variable.

The data set is usually split into training (approx. 60%), validation (approx. 20%), and test (approx. 20%) data sets. The training data set is used to train the model. The validation data set is used to select and tune the final ML model, by estimating the skill of the tuned model for comparison with other models. Finally, the test data set is used to evaluate how well the model was trained.
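
A minimal sketch of such a 60/20/20 split, assuming scikit-learn is available, is to apply train_test_split twice; X and y are placeholders for the features and labels of an actual data set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50)  # 50 placeholder samples

# First split off 60% for training, then split the remaining 40% in half.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```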

Data - ML Perspective

AIoT Perspective

From an AIoT point of view, data plays a central role in making products and services 'smart'. In the early stages of an AIoT initiative, the data domain needs to be analysed (see Data Domain Model) in order to understand the big picture: which data is required or available, and where it resides from a physical and organizational point of view. Depending on the specifics, some aspects of the data domain should also be modeled in more detail in order to ensure a common understanding. A high-level data architecture should govern how data is collected, stored, integrated, and used. For all data, it must be understood how it can be accessed and secured. A data-centric integration architecture completes the big picture.

The general setup of data management for an AIoT initiative will probably differentiate between online and offline use of data. Online refers to data coming from live systems or assets in the field, and sometimes from a dedicated test lab. Offline refers to data (usually data sets) made available to data engineers and data scientists to create the ML models.

The online work with data will have to follow the usual enterprise rules of data management, including dealing with data storage at scale, data compaction, data retirement, and so on.

The offline work with data (from an ML perspective) usually follows a number of steps, including data ingestion, data exploration, and data preparation. In parallel, data cataloging, data versioning and lineage, and metadata management have to be addressed.

Data ingestion means the collection of the required data from different sources, including batch data import and data stream ingestion. Typically, this can already include some basic filtering and cleansing. Finally, for data set generation, the data needs to be routed to the appropriate data stores.
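
As a small, self-contained sketch of batch ingestion with basic filtering, the following example reads raw sensor records, drops implausible values, and routes the result to a data store (here simply a local file). The field names, the plausibility range, and the output file are illustrative assumptions; pandas is assumed to be available.

```python
import io
import pandas as pd

# Simulated batch import of raw sensor readings (would normally come from a file or API).
raw = io.StringIO("""device_id,ts,temp_c
m1,2021-06-01T12:00,21.5
m1,2021-06-01T12:01,-999.0
m2,2021-06-01T12:00,22.1
""")
batch = pd.read_csv(raw, parse_dates=["ts"])

# Basic filtering/cleansing during ingestion: drop physically implausible values.
batch = batch[batch["temp_c"].between(-40, 125)]

# Route the cleaned batch to the appropriate data store (a local file here).
batch.to_csv("ingested_readings.csv", index=False)
```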

The ingested data then must be explored. Initial data exploration will focus on the quality of the data and the measurements. Data quality can be assessed in several different ways, including frequency counts, descriptive statistics (mean, standard deviation, median), normality (skewness, kurtosis, frequency histograms), etc. Exploratory data analysis helps in understanding the main characteristics of the data, often using statistical graphics and other data visualization methods.
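
The following sketch shows what these checks can look like with pandas on a small synthetic data set; the column names and generated values are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"device_id": rng.choice(["m1", "m2"], size=200),
                   "temp_c": rng.normal(22.0, 1.5, size=200)})

print(df["device_id"].value_counts())            # frequency counts
print(df["temp_c"].describe())                   # mean, std, median (50%), ...
print(df["temp_c"].skew(), df["temp_c"].kurt())  # skewness and kurtosis
df["temp_c"].hist(bins=20)                       # frequency histogram (requires matplotlib)
```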

Based on the findings of the data exploration, the data needs to be prepared for further analysis and processing. Data preparation includes data fusion, data cleaning, data augmentation, and finally the creation of the required data sets. Important data cleaning and preparation techniques include basic cleaning (e.g. "color" vs. "colour"), entity resolution (finding out whether multiple records reference the same real-world entity), de-duplication (eliminating redundancies), and imputation. In statistics, imputation describes the process of replacing missing data with substituted values. This is important because missing data can introduce a substantial amount of bias, make the handling and analysis of the data more arduous, and reduce efficiency.
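
A minimal sketch of some of these preparation steps with pandas follows. The column names and the mean-imputation strategy are illustrative assumptions; real projects will typically use more sophisticated entity resolution and imputation techniques.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "color": ["Red ", "red", "blue", "blue"],
    "weight_kg": [71.2, 71.2, np.nan, 80.4],
})

# Basic cleaning: harmonize spelling/formatting variants of the same value.
df["color"] = df["color"].str.strip().str.lower()

# De-duplication: after cleaning, the first two rows are redundant; keep one.
df = df.drop_duplicates()

# Imputation: replace missing values with a substituted value (here, the column mean).
df["weight_kg"] = df["weight_kg"].fillna(df["weight_kg"].mean())

print(df)
```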

One big caveat regarding data preparation: if the lab data used for AI model training differs too much from the production data against which the models are later used (inference), there is a danger that the models will not work properly in production. This needs to be carefully balanced.

Finally, the data needs to be split into training, validation and test data sets. This process is also referred to as data splitting.

Data - AIoT Perspective

Chicken vs Egg Perspective

In an AIoT initiative, a key question always is: what comes first, the data or the use case? In theory, any kind of data can be acquired via additional sensors to best support a given use case. In practice, the ability to add more sensors or other data sources is limited due to cost and other considerations. Coming back to the AIoT short tail vs long tail discussion from earlier, usually only greenfield, short tail AIoT initiatives will have the luxury of defining which data to use specifically for their use case. Most long tail AIoT initiatives will have to implement use cases based on already existing data.

Data - ML Long Tail

References

  1. Zins, Chaim (2007). "Conceptual Approaches for Defining Data, Information, and Knowledge". Journal of the American Society for Information Science and Technology.