Data 101

Data is the foundation for almost all digital business models. AIoT adds sensor-generated data to the picture. However, in its raw form, data is usually not usable. Developers, data engineers, analytics experts and data scientists work on creating information from data by linking relevant data elements and giving the data meaning. By putting information into context, knowledge is created. Knowledge is the foundation of actionable intelligence, which is a key element of many AIoT use cases.

Data is a complex topic with many facets. Data 101 looks at it from different perspectives, including the enterprise perspective, the developer perspective, the database perspective, the statistician's perspective, the ML perspective and finally the AIoT perspective. The AIoT_Data_Strategy section provides an overview of how to implement this in the context of an AIoT initiative.

Enterprise Data Perspective

Traditionally, enterprise data is divided into three main categories: master data, transactional data, and analytics data. Master data is data related to business entities such as customers, products and financial structures (e.g. cost centers). Master Data Management (MDM) aims to provide a holistic view of all the master data in an enterprise, addressing redundancies and inconsistencies. Transactional data is data related to business events, e.g. the sale of a product or the payment of an invoice. Analytics data is related to business performance, e.g. sales performance of different products in different regions.

With AIoT, additional data categories usually play an important role: asset condition data, asset usage data, asset performance data, and data related to asset maintenance and repair. Assets in this context can be physical products, appliances or equipment. The data can come from interfacing with existing control systems, or from additional sensors. AIoT must ensure that this raw data is eventually converted into actionable intelligence.

[Figure: Data - Enterprise Perspective]

Developer Perspective

Custom application logic - including Machine Learning algorithms - must be implemented using one of the many available programming languages, such as Python, JavaScript, Java, C++ or C. All programming languages have certain data types built in, so that developers can read and manipulate data. Typical built-in data types include characters (A-Z/a-z), booleans (true/false), integers (whole numbers), floating point (numbers with fractional parts), arrays ([1,2,4]) and structures ({ speed: 65, unit: mph}).
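
As a minimal sketch, here is how these built-in data types look in Python, one of the languages mentioned above (all names and values are made up for illustration):

  # Built-in data types in Python (illustrative values)
  initial = "A"                         # character (in Python, a string of length 1)
  is_moving = True                      # boolean
  speed = 65                            # integer
  temperature = 21.5                    # floating point
  readings = [1, 2, 4]                  # array (a Python list)
  state = {"speed": 65, "unit": "mph"}  # structure (a Python dictionary)
  print(type(speed), type(temperature), type(state))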

Depending on the programming language, additional data types are supported (either built-in or via libraries) that provide dynamic data structures. These data structures are dynamic in that their size is not known in advance, and they can grow or shrink at runtime. Examples include strings ("Hello, world!"), linked lists, trees, graphs, hash tables (key/value stores), stacks, sets, queues, dictionaries and maps.
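
A brief sketch of some of these dynamic data structures in Python; the standard library's collections module provides a queue-like deque:

  from collections import deque

  greeting = "Hello, world!"        # string: grows/shrinks as needed
  stack = []                        # stack: append/pop at the end (LIFO)
  stack.append("first")
  stack.append("second")
  top = stack.pop()                 # -> "second"
  queue = deque(["a", "b"])         # queue: append right, pop left (FIFO)
  queue.append("c")
  first = queue.popleft()           # -> "a"
  unique = {"red", "green", "red"}  # set: duplicates removed -> {"red", "green"}
  lookup = {"key1": 42, "key2": 7}  # hash table / dictionary / map
  print(top, first, unique, lookup["key1"])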

[Figure: Data - Developer Perspective]

These different data types can be used by the developer to implement algorithms. For example, a tree data structure can be used to efficiently sort a set of data by a given sort criterion. The issue with these data structures is that they are usually transient, i.e. the values stored in them are lost once the program (or operating system process) is terminated. This is why most programs must interact with a persistent data store in order to search, read, create and manipulate persistent data values. The data managed in the application is usually only a subset of the data managed by the persistent data store, which requires efficient algorithms on the data management side as well. Exceptions include in-memory databases and some so-called object databases.
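
As a sketch of the tree example: a simple (unbalanced) binary search tree whose in-order traversal yields the stored values in sorted order. This is an illustrative toy, not production code, and like all in-memory structures its contents are lost when the process terminates:

  class Node:
      def __init__(self, value):
          self.value, self.left, self.right = value, None, None

  def insert(root, value):
      # Smaller values go left, larger or equal values go right
      if root is None:
          return Node(value)
      if value < root.value:
          root.left = insert(root.left, value)
      else:
          root.right = insert(root.right, value)
      return root

  def in_order(root):
      # In-order traversal visits the values in ascending order
      if root is None:
          return []
      return in_order(root.left) + [root.value] + in_order(root.right)

  tree = None
  for v in [42, 7, 99, 23]:
      tree = insert(tree, v)
  print(in_order(tree))  # [7, 23, 42, 99]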

Database Perspective

Because of the need to efficiently manage large amounts of data, many different databases and other data management systems have been developed. They differ in many ways, including scalability, performance, reliability and their ability to manage data consistency.

For decades, relational database management systems (RDBMS) were the de facto standard. An RDBMS manages data in tabular form, i.e. as a collection of tables, with each table consisting of a set of rows and columns. RDBMS provide many tools and APIs (application programming interfaces) to query, read, create and manipulate data. Most RDBMS support so-called ACID transactions. ACID stands for Atomicity, Consistency, Isolation and Durability. ACID transactions guarantee the validity of data even in the case of fatal errors, e.g. an error during a transfer of funds from one account to another. Most RDBMS support the Structured Query Language (SQL) for queries and updates.
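
A minimal sketch of the funds transfer example, using Python's built-in sqlite3 module (the table and account names are made up). The transfer either commits as a whole or is rolled back, illustrating atomicity:

  import sqlite3

  con = sqlite3.connect(":memory:")
  con.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance REAL)")
  con.execute("INSERT INTO accounts VALUES ('alice', 100.0), ('bob', 50.0)")
  con.commit()

  try:
      # Both updates succeed together or not at all (atomicity)
      con.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 'alice'")
      con.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 'bob'")
      con.commit()
  except sqlite3.Error:
      con.rollback()  # undo any partial changes on error

  print(con.execute("SELECT id, balance FROM accounts").fetchall())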

With the emergence of so-called NoSQL databases in the 2010s, the quasi-monopoly of the RDBMS/SQL paradigm ended. While RDBMS are still dominant for transactional data, many projects now rely on alternative or at least additional databases and data management systems for specific purposes. Examples of NoSQL databases include column databases, key-value databases, graph databases and document databases.

Column (or wide-column) databases group and store data in columns instead of rows. Since they have neither predefined keys nor column names, they are very flexible and allow for storing large amounts of data within a single column. This allows them to scale easily, even across multiple servers.
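
A conceptual sketch (plain Python dictionaries, not a real database) contrasting row-oriented with column-oriented storage of the same made-up data:

  # Row-oriented: one record per row key
  rows = {
      "r1": {"city": "Berlin", "temp": 21.5},
      "r2": {"city": "Munich", "temp": 19.0},
  }
  # Column-oriented: all values of one column stored together
  columns = {
      "city": {"r1": "Berlin", "r2": "Munich"},
      "temp": {"r1": 21.5, "r2": 19.0},
  }
  # Analytical scans only need to touch the relevant column
  print(sum(columns["temp"].values()) / len(columns["temp"]))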

Document-oriented databases store data in documents, which can also be interlinked. They are very flexible, because no dedicated schema is required for the different documents. They also make development very efficient, since modern programming languages such as JavaScript provide native support for document formats such as JSON.
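
As a sketch of how naturally document formats map onto modern languages, here is a small example using Python's json module (the document content is made up):

  import json

  machine_event = {
      "machine_id": "M-4711",
      "event": "overheating",
      "sensor": {"temp_c": 92.3, "rpm": 1450},
      "tags": ["urgent", "line-2"],
  }
  doc = json.dumps(machine_event)  # serialize to a JSON document
  restored = json.loads(doc)       # parse it back into native types
  print(restored["sensor"]["temp_c"])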

Key-value databases are very simple, but also very scalable. They use a dictionary data structure to store objects with a unique key. Objects are retrieved only via key lookup.
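
A minimal sketch of the key-value idea, modeled with a plain Python dictionary (a real key-value store would add persistence and distribution; key and value are made up):

  store = {}
  store["session:1234"] = {"user": "alice", "cart": ["item-1"]}  # put
  value = store.get("session:1234")                              # get: key lookup only
  store.pop("session:1234", None)                                # delete
  print(value)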

Finally, graph databases store complex graphs of objects and support very efficient graph operations. They are most suitable for use cases where many graph operations are required, e.g. in a social network.
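
A conceptual sketch of a social graph as an adjacency structure, with a "friends of friends" query of the kind graph databases optimize for (the names are made up):

  friends = {
      "alice": {"bob", "carol"},
      "bob": {"alice", "dave"},
      "carol": {"alice"},
      "dave": {"bob"},
  }

  def friends_of_friends(person):
      # Collect friends of each direct friend, excluding the person
      # and their direct friends
      direct = friends.get(person, set())
      result = set()
      for f in direct:
          result |= friends.get(f, set())
      return result - direct - {person}

  print(friends_of_friends("alice"))  # {'dave'}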

[Figure: Data - DBMS Perspective]


There is a mismatch between the different database types introduced here and the programming language models introduced before. This mismatch is also referred to as the Impedance Mismatch. For example, while an RDBMS is organized in terms of rows and tables, this model is not found in most programming languages. The solution is to build custom APIs (e.g. for SQL) that help access the database entities from a specific programming language. However, this solution is not ideal: developers and architects now must deal with two models, the one in the database and the one in their application. This adds complexity, room for errors and potential inconsistencies. There is no silver bullet for this problem, and architects and developers have to work together to deal with it in the best way possible.
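
A minimal sketch of this dual-model problem, using Python's built-in sqlite3 module and a dataclass; the table, class and example data are made up. The same customer exists as a table row in the database and as an object in the application, and someone has to write (and maintain) the conversion between the two models, which is exactly what object-relational mappers automate:

  import sqlite3
  from dataclasses import dataclass

  @dataclass
  class Customer:  # application-side model
      id: int
      name: str

  con = sqlite3.connect(":memory:")
  con.execute("CREATE TABLE customers (id INTEGER, name TEXT)")  # database-side model
  con.execute("INSERT INTO customers VALUES (1, 'ACME Corp')")

  # Hand-written mapping between the two models
  row = con.execute("SELECT id, name FROM customers WHERE id = 1").fetchone()
  customer = Customer(id=row[0], name=row[1])
  print(customer)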

Statisticians Perspective

In the context of AI and ML, another important perspective on data is the perspective of the statistician. Statistical computations assume that variables have specific levels of measurement. Stanley Smith Stevens first introduced the "theory of scales of measurement", which distinguishes four types or levels of data: nominal, ordinal, interval and ratio.

At the top level, the differentiation is made between qualitative and quantitative data, as shown in the diagram below. Qualitative data measures types and is usually represented by a name, label or number code. It includes:

  • Categorical data (sometimes called nominal) has two or more categories, but there is no intrinsic ordering. For example, color is a categorical variable with multiple categories (red, green, blue, etc.) but there is no intrinsic ordering to the categories.
  • Ordinal data is similar to categorical data. The difference is that there is a clear ordering of the categories. For example, suppose you have a variable customer satisfaction with three categories: low, medium and high. In addition to classifying customers by their satisfaction, you can also order the categories.

Quantitative data has numeric variables (e.g. how many, how much, or how often). It includes:

  • Discrete data: countable and have a finite number of possibilities, e.g. number of students in a class
  • Continuous data: not countable and have an infinite number of possibilities, e.g. age, weight or height
[Figure: Data - Statistics Perspective]

Understanding the different types of variables is important because not all statistical analyses can be performed on all variable types. For example, it is not possible to compute the mean of the variable hair color, as you cannot sum red and blond hair. Another example that won't work well is trying to find the mode (i.e. the value that appears most often in a set of data values) of a continuous variable, since almost no two values of a continuous variable are exactly the same.
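
A small sketch of why the variable type matters, using only Python's standard library (the sample values are made up): the mean is defined for the continuous variable but not for hair color, while the mode is meaningful for the categorical variable:

  from statistics import mean, mode

  hair_color = ["red", "blond", "brown", "blond"]  # categorical (nominal)
  height_cm = [172.1, 180.4, 165.8, 172.1]         # continuous

  print(mode(hair_color))  # 'blond': the mode works for categorical data
  print(mean(height_cm))   # the mean works for quantitative data
  # mean(hair_color) would raise an error: you cannot sum 'red' and 'blond'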

Machine Learning Perspective

Data is obviously also the foundation of every ML project. AI expert Andrew Ng has gone as far as launching a campaign (https://www.forbes.com/sites/gilpress/2021/06/16/andrew-ng-launches-a-campaign-for-data-centric-ai/?sh=6ca9870a74f5) to shift the focus of AI practitioners from ML model development to the quality of the data used to train the models.

ML data can take many different forms, including text (e.g. for auto-correction), audio (e.g. for natural language processing), images (e.g. for optical inspection), video (e.g. for security surveillance), time series data (e.g. electricity metering), event series data (e.g. machine events) and even spatio-temporal data (describing a phenomenon in a particular location and period of time, e.g. for traffic predictions).

Many ML use cases require that the raw data is labeled. Labels provide additional context information for the ML algorithm, e.g. labels assigned to images for image classification.

In ML projects, we need data sets to train and test the model. A data set is a collection of data, e.g. a set of files or a specific table in a database. In the latter case, each row in the table corresponds to a member of the data set, while every column represents a particular variable.

The data set (https://towardsdatascience.com/how-to-build-a-data-set-for-your-machine-learning-project-5b3b871881ac) is usually split into training (approx. 60%), validation (approx. 20%) and test (approx. 20%) data sets. The training data set is used to train a model. The validation set is used to select and tune the final ML model. The test data set is used to evaluate how well the model was trained.
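
A sketch of the 60/20/20 split described above, using scikit-learn's train_test_split (an assumption; any splitting utility works) applied twice, since the function only splits two ways at a time. The feature and label arrays are stand-ins:

  from sklearn.model_selection import train_test_split

  X = list(range(100))    # stand-in for 100 feature rows
  y = [i % 2 for i in X]  # stand-in labels

  # First split off 20% as the test set
  X_rest, X_test, y_rest, y_test = train_test_split(
      X, y, test_size=0.20, random_state=42)
  # Then split the remaining 80% so that 0.25 * 0.8 = 0.2 becomes validation
  X_train, X_val, y_train, y_val = train_test_split(
      X_rest, y_rest, test_size=0.25, random_state=42)

  print(len(X_train), len(X_val), len(X_test))  # 60 20 20

The fixed random_state makes the split reproducible, which helps when comparing model variants against the same validation and test sets.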

[Figure: Data - ML Perspective]

AIoT Perspective

[Figure: Data - AIoT Perspective]

Chicken vs Egg Perspective

[Figure: Data - ML Long Tail]