Data 101

Data is the foundation for almost all digital business models. AIoT adds sensor-generated data to the picture. However, in its rawest form, data is usually not usable. Developers, data engineers, analytics experts, and data scientists work on creating information from data by linking relevant data elements and giving them meaning. By adding context, knowledge is created[1]. In the case of AIoT, knowledge is the foundation of actionable intelligence.

Data is a complex topic with many facets. Data 101 looks at it through different perspectives, including the enterprise perspective, the Data Management, Data Engineering, Data Science, and Domain Knowledge perspectives, and finally the AIoT perspective. Later on, the AIoT Data Strategy section will provide an overview of how to implement this in the context of an AIoT initiative.

Enterprise Data

Traditionally, enterprise data is divided into three main categories: master data, transactional data, and analytics data. Master data is data related to business entities such as customers, products, and financial structures (e.g. cost centers). Master Data Management (MDM) aims to provide a holistic view of all the master data in an enterprise, addressing redundancies and inconsistencies. Transactional data is data related to business events, e.g. the sale of a product or the payment of an invoice. Analytics data is related to business performance, e.g. sales performance of different products in different regions.

From the product perspective, PLM (Product Lifecycle Management) data plays an important role. Traditionally, this includes design data (including construction models, maintenance instructions, etc.), as well as the generic Engineering Bill of Material (EBOM) and, for each product instance, a Manufacturing Bill of Material (MBOM).

With AIoT, additional data categories usually play an important role, representing data captured from the assets in the field: asset condition data, asset usage data, asset performance data, and data related to asset maintenance and repair. Assets in this context can be physical products, appliances or equipment. The data can come from interfacing with existing control systems, or from additional sensors. AIoT must ensure that this raw data is eventually converted into actionable intelligence.

Data - Enterprise Perspective

Data Management

Because of the need to efficiently manage large amounts of data, many different databases and other data management systems have been developed. They differ in many ways, including scalability, performance, reliability, and the ability to manage data consistency.

For decades, relational database management systems (RDBMS) were the de-facto standard. RDBMS manage data in tabular form, i.e. as a collection of tables, with each table consisting of a set of rows and columns. They provide many tools and APIs (application programming interfaces) to query, read, create and manipulate data. Most RDBMS support so-called ACID transactions. ACID stands for Atomicity, Consistency, Isolation, and Durability. ACID transactions guarantee the validity of data even in case of fatal errors, e.g. an error during a transfer of funds from one account to another. Most RDBMS support the Structured Query Language (SQL) for queries and updates.
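To make the ACID guarantee concrete, here is a minimal sketch using Python's built-in sqlite3 module (the accounts table and amounts are purely illustrative): if any statement inside the transaction fails, the whole transfer is rolled back and no funds are lost.

<syntaxhighlight lang="python">
import sqlite3

# In-memory database purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL NOT NULL)")
conn.execute("INSERT INTO accounts (id, balance) VALUES (1, 100.0), (2, 50.0)")
conn.commit()

def transfer(conn, from_id, to_id, amount):
    """Move funds atomically: either both updates happen, or neither does."""
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, from_id))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, to_id))

transfer(conn, 1, 2, 25.0)
print(conn.execute("SELECT id, balance FROM accounts").fetchall())
</syntaxhighlight>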

With the emergence of so-called NoSQL databases in the 2010s, the quasi-monopoly of the RDBMS/SQL paradigm ended. While RDBMS are still dominant for transactional data, many projects now rely on alternative or at least additional databases and data management systems for specific purposes. Examples of NoSQL databases include column databases, key-value databases, graph databases, and document databases.

Column (or wide-column) databases group and store data in columns instead of rows. Since they have neither predefined keys nor column names, they are very flexible and allow for storing large amounts of data within a single column. This allows them to scale easily, even across multiple servers. Document-oriented databases store data in documents, which can also be interlinked. They are very flexible, because there is no dedicated schema required for the different documents. Also, they make development very efficient, since modern programming languages such as JavaScript provide native support for document formats such as JSON. Key-value databases are very simple, but also very scalable. They have a dictionary data structure for storing objects with a unique key. Objects are retrieved only via key lookup. Finally, graph databases store complex graphs of objects, supporting very efficient graph operations. They are most suitable for use cases where many graph operations are required, e.g. in a social network.
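As a rough, product-agnostic illustration of the difference in data shape, the same information might be held as a flexible JSON document in a document store and as an opaque value retrieved purely by key in a key-value store (all names and values below are made up):

<syntaxhighlight lang="python">
import json

# Document-style record: nested, schema-free, can differ from record to record.
customer_doc = {
    "id": "c-1001",
    "name": "ACME Pumps",
    "sites": [{"city": "Stuttgart", "assets": 12}, {"city": "Lyon", "assets": 5}],
}
document_store = {"customers/c-1001": json.dumps(customer_doc)}

# Key-value style: opaque value, retrieved only via exact key lookup.
kv_store = {"session:42": "user=c-1001;expires=1699999999"}

print(json.loads(document_store["customers/c-1001"])["sites"][0]["city"])
print(kv_store["session:42"])
</syntaxhighlight>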

Data - DBMS Perspective

Analytics Platforms

In addition to the operational systems utilizing the different types of data management systems, analytics has always been an important use case. In the 1990s, Data Warehousing systems emerged. They aggregated data from different operational and external systems, ingesting the data via a so-called "Extract/Transform/Load" (ETL) process. The result was data marts, which were optimized for efficient data analytics using specialized BI (Business Intelligence) and reporting tools. Most Data Warehousing platforms were very much focused on the relational data model.
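A minimal Extract/Transform/Load sketch, assuming a hypothetical sales.csv export from an operational system and an SQLite file standing in for the data mart (all file, table, and column names are illustrative):

<syntaxhighlight lang="python">
import csv
import sqlite3

def etl(csv_path="sales.csv", mart_path="data_mart.db"):
    # Extract: read raw rows from the operational export.
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))

    # Transform: normalize types and filter out incomplete records.
    cleaned = [
        (r["region"].strip().upper(), r["product"].strip(), float(r["revenue"]))
        for r in rows
        if r.get("revenue")
    ]

    # Load: write into a reporting-friendly table in the data mart.
    mart = sqlite3.connect(mart_path)
    mart.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, product TEXT, revenue REAL)")
    mart.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)
    mart.commit()
    mart.close()
</syntaxhighlight>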

In the 2010s, Data Lakes emerged. The basic idea was to aggregate all relevant data in one place, including structured (usually relational), semi-structured and unstructured data. Data lakes can be accessed using a number of different tools, including ML/Data Science tools as well as more traditional BI/reporting tools.

Data lakes were usually designed for batch processing. Many IoT use cases require near real-time processing of streaming and time series data. A number of specialized tools and stream data management platforms have emerged to support this.
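A simple tumbling-window aggregation over a simulated sensor stream illustrates the kind of processing involved; dedicated stream processing platforms provide this at scale and with delivery guarantees, but the core idea is the same (window size and readings below are made up):

<syntaxhighlight lang="python">
from statistics import mean

def tumbling_windows(stream, size=5):
    """Yield the average of every consecutive block of `size` readings."""
    window = []
    for reading in stream:
        window.append(reading)
        if len(window) == size:
            yield mean(window)
            window = []

simulated_stream = [20.1, 20.3, 20.2, 25.7, 20.4, 20.2, 20.1, 20.3, 20.2, 20.5]
for avg in tumbling_windows(simulated_stream, size=5):
    print(f"window average: {avg:.2f}")
</syntaxhighlight>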

From an AIoT point of view, the goal is to eventually merge big data/batch processing and real-time streaming analytics into a single platform, to reduce overhead and minimize redundancies.

Data Analytics Architecture Evolution

Data Engineering

Data is the key ingredient for AI. AI expert Andrew Ng has gone as far as [https://www.forbes.com/sites/gilpress/2021/06/16/andrew-ng-launches-a-campaign-for-data-centric-ai/?sh=6ca9870a74f5 launching a campaign] to shift the focus of AI practitioners from ML model development to the quality of the data they use to train the models. In his presentations, he defines the split of work between data-related activities and actual ML model development as 80:20 - meaning that 80% of the time and resources are spent on data sourcing and preparation. Building a data pipeline based on a robust and scalable set of data processing tools and platforms is key for success.

Data vs Model Development

Data Pipeline

From an AIoT point of view, data will play a central role in making products and services 'smart'. In the early stages of the AIoT initiative, the data domain needs to be analysed (see Data Domain Model) in order to understand the big picture: which data is required or available, and where it resides from a physical and organizational point of view. Depending on the specifics, some aspects of the data domain should also be modeled in more detail in order to ensure a common understanding. A high-level data architecture should govern how data is collected, stored, integrated, and used. For all data, it must be understood how it can be accessed and secured. A data-centric integration architecture will complete the big picture.

The general setup of the data management for an AIoT initiative will probably differentiate between online and offline use of data. Online refers to data coming from live systems or assets in the field, and sometimes also from a dedicated test lab. Offline refers to data (usually data sets) made available to the data engineers and data scientists to create the ML models.

The online work with data will have to follow the usual enterprise rules of data management, including dealing with data storage at scale, data compaction, data retirement, and so on.

The offline work with data (from an ML perspective) usually follows a number of different steps, including data ingestion, data exploration and data preparation. In parallel to all of this, data cataloging, data versioning and lineage, and metadata management have to be addressed.

Data ingestion means the collection of the required data from different sources, including batch data import and data stream ingestion. Typically, this can already include some basic filtering and cleansing. Finally, for data set generation the data needs to be routed to the appropriate data stores.

The ingested data then must be explored. Initial data exploration will focus on the quality of the data and measurements. Data quality can be assessed in several different ways, including frequency counts, descriptive statistics (mean, standard deviation, median), normality (skewness, kurtosis, frequency histograms), etc. Exploratory data analysis helps to understand the main characteristics of the data, often using statistical graphics and other data visualization methods.
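A small sketch of such an initial quality check, assuming pandas is available and using a hypothetical vibration_mm_s column (the values are made up):

<syntaxhighlight lang="python">
import pandas as pd

# Illustrative data set; in practice this would be loaded from the data store.
df = pd.DataFrame({"vibration_mm_s": [0.8, 0.9, 1.1, 0.7, 4.2, 0.8, None, 0.9]})

col = df["vibration_mm_s"]
print(col.describe())           # count, mean, std, quartiles
print("missing:", col.isna().sum())
print("skewness:", col.skew())  # asymmetry of the distribution
print("kurtosis:", col.kurtosis())
print(col.value_counts(bins=4, sort=False))  # coarse frequency histogram
</syntaxhighlight>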

Based on the findings of the data exploration, the data needs to be prepared for further analysis and processing. Data preparation includes data fusion, data cleaning, data augmentation, and finally the creation of the required data sets. Important data cleaning and preparation techniques include basic cleaning ("color" vs. "colour"), entity resolution (finding out whether multiple records reference the same real-world entity), de-duplication (eliminating redundancies) and imputation. In statistics, imputation describes the process of replacing missing data with substituted values. This is important because missing data can introduce a substantial amount of bias, make the handling and analysis of the data more arduous, and reduce efficiency.
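A small pandas sketch of these preparation steps on made-up records: basic cleaning ("colour" normalized to "color"), de-duplication, and mean imputation of a missing value:

<syntaxhighlight lang="python">
import pandas as pd

df = pd.DataFrame({
    "finish": ["colour", "color", "color", "colour"],
    "weight_kg": [1.2, None, 1.4, 1.2],
})

# Basic cleaning: normalize spelling variants to a canonical value.
df["finish"] = df["finish"].replace({"colour": "color"})

# De-duplication: drop records that are now identical.
df = df.drop_duplicates()

# Imputation: replace missing weights with the column mean.
df["weight_kg"] = df["weight_kg"].fillna(df["weight_kg"].mean())
print(df)
</syntaxhighlight>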

One big caveat regarding data preparation: if the lab data used for AI model training differs too much from the production data against which the models are used later on (inference), there is a danger that the models will not work properly in production. This is why, in the figure shown here, automated data preparation happens online, before the data extraction for data set creation.

Data - AIoT Perspective

Edge vs Cloud

In AIoT, a big concern from the data engineering perspective is the distribution of the data flow and data processing logic between edge and cloud. Sensor-based systems that attempt to apply a cloud-only intelligence strategy need to send all data from all sensors to the cloud for processing and analytics. The advantage of this approach is that no data gets lost, and the analytics algorithms can be applied to the full set of data. However, the disadvantages are potentially quite severe: massive consumption of bandwidth, storage capacity and power, as well as high latency (with respect to reacting to the analytics results).

This is why most AIoT designs combine edge intelligence with cloud intelligence. On the edge, the sensor data is pre-processed and filtered. This can result in triggers and alerts, e.g. if thresholds are exceeded or critical patterns in the data stream are detected. Local decisions can be made, allowing the system to react in near real-time - important in critical situations, or where UX is key. Based on the learnings from the edge intelligence, the edge nodes can make selected data available to the cloud. This can include semantically rich events (e.g. an interpretation of the sensor data), as well as selected rich sample data for further processing in the cloud. In the cloud, more advanced analysis (e.g. predictive or prescriptive) can be applied, taking additional context data into consideration.

The benefits are clear: significant reductions in bandwidth, storage capacity and power consumption, plus faster response times. The intelligent edge-cloud continuum takes traditional signal chains to a higher level. However, the basic analog signal chain circuit design philosophy should still be taken into consideration. In addition, the combination of cloud/edge and distributed systems engineering expertise with deep domain and application expertise must be ensured for success.

Edge Intelligence

In the example below, an intelligent sensor node is monitoring machine vibration. A threshold has been defined. If this threshold is exceeded, a trigger event will notify the backend, including sample data to provide more insight into the current situation. This data will allow the backend to analyze the status quo. An important question is: will this be sufficient for root cause analysis? Most likely, the system will also have to store vibration data for a given period of time, so that in the event of a threshold breach some data preceding the event can be provided as well, enabling root cause analysis.
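The edge logic described above might look roughly as follows, assuming a fixed vibration threshold and a ring buffer that retains the most recent samples so they can be attached to the trigger event (threshold, buffer length, and the send_to_backend stub are illustrative assumptions):

<syntaxhighlight lang="python">
from collections import deque

THRESHOLD = 3.0          # illustrative vibration limit (mm/s)
PRE_EVENT_SAMPLES = 100  # how much history to keep for root cause analysis

recent = deque(maxlen=PRE_EVENT_SAMPLES)  # ring buffer of the latest samples

def send_to_backend(event):
    print("event to cloud:", event["type"], "- samples attached:", len(event["samples"]))

def on_sample(value):
    """Called for every new vibration sample on the edge node."""
    recent.append(value)
    if value > THRESHOLD:
        # Attach the preceding samples so the backend can analyze the lead-up.
        send_to_backend({"type": "vibration_threshold_exceeded",
                         "value": value,
                         "samples": list(recent)})

for v in [0.8, 0.9, 1.1, 3.4, 0.9]:
    on_sample(v)
</syntaxhighlight>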

Threshold event and sample data

Data Science

Data scientists need clean data in order to build and train predictive models. Of course, ML data can take many different forms, including text (e.g. for auto-correction), audio (e.g. for natural language processing), images (e.g. for optical inspection), video (e.g. for security surveillance), time series data (e.g. electricity metering), event series data (e.g. machine events) and even spatio-temporal data (describing a phenomenon in a particular location and period of time, e.g. for traffic predictions).

Many ML use cases require that the raw data is labeled. Labels can provide additional context information for the ML algorithm, e.g. labeling of images (image classification).

Data Sets

In ML projects, we need data sets to train and test the model. A data set is a collection of data, e.g. a set of files or a specific table in a database. For the latter, the rows of the table correspond to members of the data set, while every column of the table represents a particular variable.

The data set is usually split into training (approx. 60%), validation (approx. 20%), and test data sets (approx. 20%). The training data set is used to train the model. The validation data set is used to select and tune the final ML model by estimating the skill of the tuned model for comparison with other models. Finally, the test data set is used to evaluate how well the model was trained.
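A minimal sketch of such a 60/20/20 split on an illustrative data set; in practice a library routine (e.g. scikit-learn's train_test_split) is typically used, and the split may need to be stratified or time-aware:

<syntaxhighlight lang="python">
import random

records = list(range(1000))  # stand-in for the members of the data set
random.seed(42)
random.shuffle(records)      # randomize before splitting

n = len(records)
train = records[: int(0.6 * n)]
validation = records[int(0.6 * n): int(0.8 * n)]
test = records[int(0.8 * n):]

print(len(train), len(validation), len(test))  # 600 200 200
</syntaxhighlight>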

Data - ML Perspective

In the article "From model-centric to data-centric"[2], Fabiana Clemente provides the following guiding questions:

  • Is the data complete?
  • Is the data relevant for the use case?
  • If labels are available, are they consistent?
  • Is the presence of bias impacting the performance?
  • Do I have enough data?

In order to succeed in the adoption of a data-centric approach to ML, focusing on these questions will be key.

Data Labeling

Data labeling is required for supervised learning. It usually means that human data labelers are manually reviewing training data sets, tagging relevant data with specific labels. For example, this can mean manually reviewing pictures and tagging objects in them, like cars, people, and traffic signs. A data labeling platform can help to support and streamline the process.

Is data labeling the job of a data scientist? Probably not directly. But the data scientist has to be involved in ensuring that the process is set up properly, including the relevant QA processes to avoid poor label quality or labeled data with a strong bias. Depending on the task at hand, data labeling can be done in-house, outsourced, or via crowd-sourcing. This will heavily depend on the data volumes as well as the required skill set. For example, correct labeling of data related to medical diagnostics, building inspection or manufacturing product quality will require input from highly skilled experts.

Data Labeling Example

Take, for example, building inspection using data generated from drone-based building scans. This is described in detail in the TÜV SÜD building façade inspection case study. Problems detected in such an example can vary widely, depending on the many different materials and components used for building façades. Large building inspection companies like TÜV SÜD have many experts for the different combinations of materials and problem categories. Building up a training data set with labeled data for all possible problem categories will probably take quite some time, and require a process which combines AI-based automation where there is sufficient training data, and manual labeling where there is not. The example shown here assumes that the system will first attempt to automatically filter out the material from the building scan that indicates potential problems. This data is then submitted for automatic classification. Depending on the confidence ("%?") of the system, the data is either automatically labeled or submitted to a human expert for analysis and manual labeling. The results of this process are then used to further enhance the training data set, as well as to create the problem report for the customer. This example shows a type of labeling process which requires close collaboration between data engineers, data scientists and domain experts.
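The confidence-based routing described above can be sketched as follows; the confidence threshold, the classify stub, and the label names are assumptions for illustration, not part of the case study:

<syntaxhighlight lang="python">
CONFIDENCE_THRESHOLD = 0.9  # assumed cut-off for accepting automatic labels

def classify(sample):
    """Stand-in for the trained classifier; returns (label, confidence)."""
    return "crack_in_render", 0.72

def route_for_labeling(sample):
    label, confidence = classify(sample)
    if confidence >= CONFIDENCE_THRESHOLD:
        # High confidence: accept the automatic label and add it to the training set.
        return {"sample": sample, "label": label, "source": "auto"}
    # Low confidence: queue for review and manual labeling by a domain expert.
    return {"sample": sample, "label": None, "source": "expert_review_queue"}

print(route_for_labeling("facade_scan_0471.png"))
</syntaxhighlight>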

Domain Knowledge

One of the biggest challenges in many AI/ML projects is access to the required domain knowledge. Domain knowledge is usually a combination of general business acumen, vertical knowledge, and an understanding of the data lineage. Domain knowledge is essential for creating the right hypotheses, which data science can then either prove or disprove. It is also important for interpreting the results of the analyses and modeling work.

One of the most challenging parts of machine learning is feature engineering. Understanding domain-specific variables and how they relate to particular outcomes is key for this. Without a certain level of domain knowledge, it will be difficult to direct the data exploration and support the feature engineering process. Even after the features are generated, it is important to understand the relationships between different variables to effectively perform plausibility checks. Being able to look at the outcome of a model to determine if the result makes sense will be difficult without domain knowledge, which will make quality assurance very difficult.
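A small illustration of a domain-driven feature and plausibility check on made-up machine data: both the feature (vibration energy per operating hour) and the valid range encode domain knowledge that a data scientist alone might not have:

<syntaxhighlight lang="python">
import pandas as pd

machines = pd.DataFrame({
    "machine_id": ["m1", "m2", "m3"],
    "total_vibration_energy": [1200.0, 80.0, 45000.0],
    "operating_hours": [300.0, 20.0, 310.0],
})

# Domain-driven feature: vibration energy normalized by usage.
machines["energy_per_hour"] = machines["total_vibration_energy"] / machines["operating_hours"]

# Plausibility check based on domain expertise (assumed valid range).
valid = machines["energy_per_hour"].between(0.5, 50.0)
print(machines[~valid])  # records a domain expert should review
</syntaxhighlight>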

There have been many discussions about how much domain knowledge the data scientists themselves need, and how much can come from domain experts in the field. The general consensus is that a certain amount of domain knowledge on the part of the data scientist is required, but that a team effort with experienced domain experts usually works well. This will also heavily depend on the industry. An internet start-up which is all about "clicks" and related concepts will make it easy for the data scientist to build up the domain knowledge. In other industries like finance, healthcare or manufacturing, this can be more difficult.

The case study AIoT in High-Volume Manufacturing Network describes an organizational setup which always aims to team up data science experts with domain experts in the factories (referred to as "tandem teams"). Another trend here is "Citizen Data Science", which aims to make easy-to-use data science tools available directly to the domain experts.

In many projects, close alignment between the data science experts and the domain experts is also a prerequisite for trust in the project outcomes. Given that it is often difficult in data science to make the results "explainable", this level of trust is key.

Chicken vs Egg

Finally, a key question for AIoT initiatives is: what comes first, the data or the use case? In theory, any kind of data can be acquired via additional sensors to best support a given use case. In practice, the ability to add more sensors or other data sources is limited due to cost and other considerations. Usually, only greenfield, short tail AIoT initiatives have the luxury of defining which data to use specifically for their use case. Most long tail AIoT initiatives have to implement use cases based on already existing data.

For example, the building inspection use case from earlier is a potential short tail opportunity, which allows the system designers to specify exactly which sensors to deploy on the drone used for the building scans, derived from the use cases which need to be supported. This type of luxury will not be available in many long tail use cases, e.g. in manufacturing optimization, as outlined in the AIoT and high volume manufacturing case study.

Data - ML Long Tail

References

  1. Zins, Chaim (2007). Conceptual Approaches for Defining Data, Information, and Knowledge. Journal of the American Society for Information Science and Technology.
  2. Clemente, Fabiana (2021). From model-centric to data-centric. https://towardsdatascience.com/from-model-centric-to-data-centric-4beb8ef50475