Leveraging SAP’s Enterprise Data Management Tools to Enable ML/AI Success

Background

In our previous blog post, “Master Your ML/AI Success with Enterprise Data Management”, we outlined the need for Enterprise Data Management (EDM) and ML/AI initiatives to work together in order to deliver the full business value and expectations of ML/AI. We made a set of high-level recommendations to increase EDM maturity and in turn enable higher value from ML/AI initiatives. A graphical summary of these recommendations is shown below:

 


Figure 1 – High level recommendations to address EDM challenges for ML/AI initiatives

 

In this post, we will present a specific instantiation of technology for bringing those concepts to life. There are countless examples that could be shown, but for the purposes of this post, we will present a solution within the SAP toolset. The end result is an implementation environment where the EDM technologies work hand-in-hand with ML/AI tools to help automate and streamline both these processes.

SAP’s preferred platform for ML/AI is SAP Data Intelligence (DI). When it comes to EDM, SAP has a vast suite of tools that store, transfer, process, harness, and visualize data. We will focus on four tools that we believe provide the most significant impact for mastering ML/AI initiatives implemented on DI. These are SAP Master Data Governance (MDG), the SAP Data Intelligence (DI) Metadata Explorer component, and, to a smaller extent, SAP Information Steward (IS). SAP Data Warehouse Cloud (DWC) can also be used to bring all the mastered and cleansed data together and to store and visualize the ML outputs.

Architecture

As with any other enterprise data solution, the challenge is to effectively integrate a set of tools to deliver the needed value, without adding the cost overhead of data being moved and stored in multiple places, as well as the added infrastructure, usage and support costs. For enterprises that run on SAP systems, a high-level architecture and descriptions of the tools that would achieve these benefits is shown below.

 

Figure 2 – High-level MDG/DI architecture and data flow

 

1. SAP MDG (Master Data Governance) with MDI (Master Data Integration)

SAP MDG and MDI go hand in hand. MDI is provided with the SAP Cloud Platform and enables communication across various SAP applications by establishing the One Domain Model (ODM), providing a consistent view of master data across end-to-end scenarios.

SAP MDG is available in S/4HANA-based and ERP-based editions. The tool helps ensure high-quality, trusted master data, both at initial load and on an ongoing basis, and can become a key part of the enterprise MDM and data governance program. Both active and passive governance are supported. Based on business needs, certain domains are prioritized out of the box in MDG. MDG provides capabilities such as Consolidation, Mass Processing, and Central Governance, coupled with governance workflows for Create-Read-Update-Delete (CRUD) processes.

SAP has recently announced SAP MDG, cloud edition. While it is not a replacement for MDG on S/4HANA, MDG cloud edition is planned to include core MDG capabilities like Consolidation, Centralization, and Data Quality Management to centrally manage core attributes of Business Partner data. This is a useful “very quick start” option for customers who have never used MDG, but it can also help customers already using MDG on S/4HANA to build out their landscape into a federated MDG approach that better balances centralized and decentralized master data.

 

2. Data Intelligence (with Metadata Explorer component)

SAP IS and MDG are the pathways that make enriched, trusted data available to Data Intelligence, which is used to actually build the ML/AI models. SAP IS rules and metadata terms can be reused directly in SAP DI, which is achieved through DI’s data integration, orchestration, and streaming capabilities. DI’s Metadata Explorer component also facilitates the flow of business rules, metadata, glossaries, catalogs, and definitions to tools like IS (on-prem), ensuring consistency and governance of data. Metadata Explorer is geared towards the discovery, movement, and preparation of data assets that are spread across diverse and disparate enterprise systems, including cloud-based ones.

 

3. Information Steward (IS) – Information Steward is an optional tool, useful for profiling data, especially in on-prem situations. The data quality effort can be initiated by creating the required data quality business rules, then profiling the data and running Information Steward to assess data quality. This is the first step towards initial data cleansing, and thereby data remediation, using a passive governance approach via quality dashboards and reports. (Many of these features are also available in MDG and DI.) SAP IS helps an enterprise address general data quality issues before specialized tools like SAP MDG are used to address master data issues, and it can be an optional part of any ongoing data quality improvement initiative.

 

4. Data Warehouse Cloud (DWC) – Data Warehouse Cloud is used in this architecture to bring all the mastered and cleansed data together into the cloud, to perform any other data preparation or transformations needed, and to model the data into the format needed by the ML models in DI. DWC is also used to store the results of the ML models and to create visualizations of these results for data consumers.

 

Figure 3 – Summary of Functionality of SAP tools used for EDM

 

While there is some overlap in functionality between these tools, Data Intelligence is more focused on the automation aspects of these capabilities. DI is primarily intended as an ML platform, and therefore has functionality such as the ability to create data models and organize the data in a format that facilitates the ML/AI process (ML Data Manager). This architecture capitalizes on the EDM strengths of MDG and IS. It is also consistent with the strategic direction of SAP: providing a comprehensive “Business Transformation as a Service” approach, leading with cloud services. Together, these tools work in a complementary way (for hybrid on-prem plus cloud scenarios), and in combination they work hand in hand to make trusted data available to AI/ML.

Conclusion

In summary, the SAP ecosystem has several EDM tools that can help address the data quality and data preparation challenges of the ML/AI process. SAP tools like MDG and the DI Metadata Explorer component have features and integration capabilities that can easily be leveraged during, or even before, ML/AI use cases are underway. Used in conjunction with the general EDM maturity recommendations summarized above, these tools will help deliver the full business value and expectations of ML/AI use cases.

In our next post, we will continue our discussion on EDM tools, some of their newer features, how they have evolved, and how ML/AI has been part of their own evolution. As a reminder, if you missed the first post in this series, you can find it here: “Master Your ML/AI Success with Enterprise Data Management”.

 

 


Inspired Intellect is an end-to-end service provider of data management, analytics and application development. We engage through a portfolio of offerings ranging from strategic advisory and design, to development and deployment, through to sustained operations and managed services.

 

Learn how Inspired Intellect’s EDM and ML/AI strategy and solutions can help bring greater value to your analytics initiatives by contacting us at marketing@inspiredintellect-us.com.


Master Your ML/AI Success With Enterprise Data Management

Background:

Advanced analytics, in particular analytics that take advantage of Machine Learning and Artificial Intelligence (ML/AI), have become established mainstream business initiatives. Several published reports and surveys confirm the rapid growth of ML and AI projects at enterprises of all sizes and in all industries. A Gartner global survey of CIOs found that AI implementations grew by 270% in the prior four-year period. According to Forbes, 93% of executives expect to get some value from AI investments. Algorithmia’s “2020 State of Enterprise Machine Learning” survey found that budgets for ML initiatives are growing by 25% annually, with the Banking, Manufacturing, and IT industries seeing the largest growth.

 

Businesses are leveraging ML and AI for many different capabilities – at their core, these technologies allow businesses to uncover deeper insights, make better business predictions, and take actions on these predictions. Some of the business use cases for ML/AI that we have most commonly seen in our work with clients are:

  • generating customer insights
  • reducing churn
  • improving customer experience
  • recommendation engines
  • fraud detection
  • demand forecasting
  • supply chain optimization
  • internal process automation to reduce costs

Across industries, ML and AI not only provide competitive advantages but have become must-have capabilities, necessary to remain viable and competitive. With the rapid decline in the cost of ML/AI platforms and technologies, the ROI for ML/AI initiatives has reached a level that makes them attainable for an increasing number of businesses.

 

Challenges:

Despite this tremendous growth, many businesses have faced significant challenges in fulfilling the high expectations of ML/AI and actually realizing business value. In the Forbes report, 65% of executives reported that they are not yet seeing the expected value from their AI investments. In a TransUnion survey of finance, risk and marketing executives, 76% indicated that one of their biggest challenges was the data cleansing and prep work required to derive the expected value. Based on our experience with multiple ML engagements across a range of industries, one of the most significant challenges on ML projects is the poor quality of the data. In fact, 80% of time on ML/AI projects is spent on data understanding and preparation – cleaning poor quality data, determining how to fill data gaps, blending data from different sources, standardizing data definitions across various data sets, and other data prep activities.

 

Figure 1 below shows the steps and timeline of an ideal ML/AI project, versus what typically happens during an actual project that has a limited budget and timeline.

 

Figure 1 – Ideal vs Typical ML/AI Project Timeline

 

This illustrates the lack of maturity in Enterprise Data Management that is common in most organizations. Enterprise Data Management (EDM) is the discipline that strives to continually increase the overall data maturity of an organization. It spans data governance, master data management (MDM), data quality, metadata management, data engineering, data security, and data risk management. EDM maturity is important not only for general reporting needs but especially for ML/AI.

 

In many ML/AI projects, poor input data leads to less insightful ML/AI models, which result in limited business value! Some of the key impacts of this are highlighted below.

  • Even though data scientists have tools to detect data discrepancies, poor data necessitates several iterations to deliver higher-performing models, challenging the business case for future investments in ML
  • Early investments in ML could result in models with limited economic value due to prevalent data issues, or an inability to scale the ML models’ benefits due to data quality and data governance concerns
  • The long-term ROI potential of ML/AI elevates the importance of leveraging capabilities like DataOps and MLOps to ease deployment and operationalization of these models

 

Real-world Examples:

Shown below in Figures 2 and 3 are some real-world examples of how poor master data, data quality, and metadata lead to outcomes that do not deliver the full potential of ML.

 


Figure 2 – Master data issues limit business value of ML outcomes

 

 


Figure 3 – Data quality (DQ) & metadata issues limit the business value of ML outcomes

 

As seen above, it is critically important that Enterprise Data Management be an integral part of ML/AI initiatives; mature EDM will help to ensure that the overall quality of the input data fed into the models is sufficiently high.

 

Bottom line – Enterprise Data Management and analytics initiatives must work together in order to deliver the full business value and expectations of ML/AI!

 

Recommendations to Address these Challenges:

Inspired Intellect makes four high level recommendations to help address the challenges described above. These are shown in Figure 4.

 


Figure 4 – High level recommendations to address EDM challenges for ML/AI initiatives

 

1. Develop and maintain an effective data governance and MDM program, ensuring ownership of all data assets. Specifically,

  • address standardization of definitions for key enterprise master and reference data such as customer, supplier, product, vendor, and location, since these will typically play a significant part in most ML use cases
  • prioritize data assets that are significant for high-value ML/analytics use cases
  • prioritize active governance that helps to proactively enhance data quality at the source systems, before data is fed to ML/AI and other use cases
  • include passive governance that helps to standardize data and address existing data quality issues prior to conducting ML/advanced analytics use cases

2. Implement a continuous data quality improvement initiative which includes the use of intelligent tools. Specifically,

  • prioritize the critical master, reference, and transactional data assets that provide the greatest value for ML
  • develop and automate data quality business rules
  • use intelligent tools like ML/AI anomaly detection to discover data quality issues and provide recommendations and automated fixes (a brief sketch follows this list)
  • analyze ML/AI models to determine which data assets are most influential on the models’ predictions to consequently help rationalize your data quality initiatives
  • In effect, the two preceding bullets use ML/AI outcomes within the EDM process to improve business value from the ML use cases – a virtuous circle
  • develop data quality metrics that drive the enterprise towards higher quality
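
To make the anomaly-detection bullet concrete, here is a minimal sketch pairing an explicit business rule with an ML-based screen, using scikit-learn’s IsolationForest as one plausible detector. The file and column names are illustrative placeholders, not the schema of any specific tool mentioned above.

```python
# Hedged sketch: a declared business rule plus an ML anomaly screen for data quality.
# "customer_master.csv" and its columns are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import IsolationForest

records = pd.read_csv("customer_master.csv")

# Explicit, automatable business rule: order quantities must be positive.
rule_violations = records[records["order_qty"] <= 0]

# Statistical screen: flag records whose numeric profile looks anomalous.
numeric = records[["order_qty", "unit_price", "discount_pct"]].fillna(0)
detector = IsolationForest(contamination=0.01, random_state=0)
records["anomaly"] = detector.fit_predict(numeric)  # -1 marks likely anomalies

# Route both sets to data stewards for review and remediation.
suspects = records[records["anomaly"] == -1]
```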

3. Develop an enterprise data and features catalog which includes

  • a business glossary with standard enterprise definitions and metrics, prioritizing those needed for ML/AI initiatives
  • a features catalog to standardize and govern features that are developed during the ML/AI process
  • a metadata catalog with both technical information and data lineage
  • a user-friendly way for both business and technical data consumers to access, share and collaborate on this information

4. Incentivize data owners and ML business users based on governance and data quality metrics, as well as business value of the ML insights derived

  • Enterprise data quality metrics should be part of the overall incentives for data owners as well as ML and other data consumers, which will drive behavior of the enterprise towards higher data quality
  • Business users should be incentivized based on the actual value that ML initiatives are bringing to the business, which will help promote valuable ML/AI initiatives (enabled by high quality data) and weed out low value ML/AI initiatives that could be hampered by poor quality data

Conclusion

As ML/AI initiatives continue to grow in importance to business execution, they must be accompanied by strong Enterprise Data Management in order to increase their delivered value. By following the recommendations above, ML/AI project teams can focus on developing the best models rather than dealing with data quality issues. Each of these recommendations requires prioritization, investment, and strategic focus driven by the Chief Data Officer (CDO) or equivalent C-suite executive, along with an integration partner like Inspired Intellect that can drive the organizational, business, and technical workstreams. When executed well, EDM initiatives will enable higher business value from ML/AI investments.

 

In our next blog, we will discuss how several EDM technologies and tools work hand-in-hand with ML/AI tools to help automate and streamline both these processes.

 




Machine Learning + Human Intelligence vs COVID-19: Part 2

Introduction:

Globalization has brought countries together in more ways than ever before. Consumers, corporations, and governments alike now have generally unfettered access to innovations, markets, products, and services. While the benefits associated with globalization are many, it also brings associated risks, as we have seen with the recent SARS-CoV-2 virus. Infectious disease specialists have been raising the alarm about the need for an effective and uniform response to these threats, due to the speed at which an infectious disease can spread as a result of our global connectedness.

 

COVID-19 (the disease caused by the SARS-CoV-2 virus) has completely taken over our lives, with a material effect on the lives of countless global citizens. The question at the top of everyone’s mind is: “How do we adjust to this new normal?”

 

Specifically, what can we learn about patterns and prevention as we analyze how an infectious disease like COVID-19 migrates and assess how industries are impacted by its spread? This understanding can help inform public health directives that aim to control the migration of the disease, while at the same time alleviating resulting strains on the economy.

 

Study Objectives:       

 

To develop a better understanding of these patterns, our Data Science teams at Inspired Intellect and WorldLink initiated an R&D project with the hypothesis that advanced analytics could uncover insights to address the above questions. We were also looking for pragmatic applications for deploying our findings to help our clients understand how their businesses would need to adapt to survive in the rapidly evolving new normal.

 

We focused our research efforts into 4 distinct tracks:

  1. Creating a data lake of information as a foundational pillar for our research
  2. Collating and categorizing experimental treatments, therapeutics and vaccine research into a semantic search-driven library of knowledge to support frontline healthcare workers and medical practitioners as they keep up with trending research in these domains (here)
  3. Social listening and associated unstructured text analysis to identify and surface trending topics and concerns people were talking about
  4. Machine learning and insight generation to identify the factors influencing the spread of the virus and to predict the waxing and waning of virus epicenters over time

 

This article is Part 2 of a two-part blog series focused on the fourth track above: machine learning and insight generation. The series focuses on answering the following questions:

  • Why are certain counties/cities more affected than others?
  • Why is there variation in mortality rates among the most infected counties?
  • What are the underlying patterns and factors for virus spread and mortality?

 

In Part 1, we provided recommendations on how to mitigate the spread of infectious diseases, based on our work using county-level data and machine learning techniques. In Part 2, we will explore model data, features and insights.

 

We feel that a data-driven scientific approach can help answer these questions and, more importantly, inform decision making for a range of stakeholders:

  • Policy Makers: Have sufficient measures been taken to ensure that the infection spread can be controlled? If not, how do we mitigate the risks?
  • Business Owners: Is my business a potential contributing vector to the spread of the virus? What measures should we consider implementing relative to operating the business in a manner that is safe for employees and customers?
  • Individuals: What measures can we as individuals take to help stem the spread of the virus?

 

Editor’s Note: This blog post was authored to highlight Inspired Intellect’s perspective on how the latest advanced analytics techniques could examine driving factors behind the COVID-19 pandemic and garner recommendations to inform officials in their policy responses. To do this, I co-authored this blog with my colleague, Prashanth Nayak, who serves as a Senior Data Scientist for our partner organization, WorldLink. There were several others across Inspired Intellect involved in the data sourcing and model development necessary to deliver these insights related to the pandemic and potential actions to mitigate its impact.

 

How We Explored the Data:

 

We began our research efforts with a data exploration exercise guided by a quantitative risk score, designed at the county-level. Our risk score design included county-level reported statistics such as:

  • Rolling 14-day infection rates
  • Mortality rates
  • Population density (defined as population per habitable square mile)

 

Figure 1 - Data exploration risk score design

 

Other relevant attributes such as county-specific mobility, adjacent-county mobility and social stringency could also be included in the risk score design. However, we decided to keep our initial design simple with a view to helping us better understand the insights we encountered. The attributes were passed through a clustering algorithm to arrive at a categorization of counties that exhibited a similarity in infection rates, mortality rates and population density.
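
The post does not name the clustering algorithm, so the sketch below uses k-means over standardized, equally weighted attributes as one plausible reading; the file and column names are illustrative.

```python
# Illustrative county risk-score clustering; k-means and the column names are
# assumptions, since the original post does not specify them.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

counties = pd.read_csv("county_stats.csv")
features = counties[["infection_rate_14d", "mortality_rate", "population_density"]]

# Standardize so no attribute is implicitly weighted more than the others.
scaled = StandardScaler().fit_transform(features)
counties["cluster"] = KMeans(n_clusters=5, random_state=0).fit_predict(scaled)

# k-means labels are arbitrary, so rank clusters by average severity such that
# Cluster 5 is the highest risk and Cluster 1 the lowest.
severity = counties.groupby("cluster")[["infection_rate_14d", "mortality_rate"]].mean().mean(axis=1)
counties["risk_cluster"] = counties["cluster"].map(severity.rank().astype(int))
```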

 

The risk score for the week of May 9th is shown below. Counties exhibiting the highest risk (i.e. high infection rate, high mortality rate, and high population density) collected into Cluster 5. In contrast, counties exhibiting the lowest risk collected into Cluster 1. To keep the design simple, and due to a general lack of insight into COVID-19’s pathology, we did not weight the infection or mortality rate variables differently when clustering. Therefore, we label a county as risky if it has either a high infection rate or a high mortality rate.

 

Figure 2 - COVID-19 risk score by data cluster

 

We examine a few counties within Cluster 5 to understand why they were categorized as highest risk. Data for relevant portions of the prior 14-day period (Apr 25 – May 8) that illustrate our arguments are shown below. We can see that infections spiked from 2 to 14 cases in Tillman County, Oklahoma between May 2nd and May 3rd, and from 31 to 74 in Jackson County, Florida between May 7th and May 8th.

 

Figure 3 - Table depicting the spike in COVID-19 Cluster 5 counties

 


 

These counties demonstrated a noticeable and sudden rise in infections, indicating emerging virus hotspots and signaling the need for allocated resources. We use these two counties as prime examples of early signaling for an escalating hotspot, by using variables that are true of all counties, regardless of size. Note that many high-density counties were already subject to a mandatory shelter-in-place order during our 14-day evaluation period and were therefore experiencing decreasing infection rates throughout our study.

 

Next, we examine the migration of risk scores across counties when compared with the risk scores from the week of April 24th. The week of April 24th was chosen as our baseline based on anecdotal evidence that the virus has an incubation period of up to 14 days. As seen from the color-coded map on the left, the areas of primary concern (as of April 24th) were concentrated in the southwest and northeast of the country, along with some pockets of higher risk in the south around Georgia. Over the next 14-day period, from April 25th to May 8th, the virus traversed the country. It is also interesting to note that the counties that were previously highest risk appear to have gained some measure of control over the virus spread. Although these were largely reactive measures, there is much that can be learned from the success of the measures these high-risk counties put into place in response to their situation.

 

Risk score map over time, showing a wider spread from April 24 to May 9

 

In summary, we see that the risk score can be a helpful tool to guide public health policy decision-making. The downside is that it reflects what has already happened, and the best public health policy makers can hope to do is intervene to prevent the situation from getting worse. The truest value-add from data analytics lies in surfacing the factors that influence the risk score, so that decision makers can be more proactive in their approach to controlling the spread of the virus (e.g. implementing sweeping safety procedures in and around airports, where people are highly mobile across counties).

 

Feature Engineering:

 

We began our predictive modeling exercise by formulating hypotheses to be investigated. Before the first COVID-19 cases were recorded in the US, medical researchers across the globe were already providing valuable anecdotal evidence published through trusted medical channels. Parsed via text analytics, these findings were tremendously useful in guiding our hypothesis design.

 

The following data were of primary interest to our research:

  • Age
  • Gender
  • Health equity
  • Travel exposure
  • Social mobility
  • Healthcare supply/availability
  • Adherence to public health policy directives

 

Of course, some attributes, such as health equity, presented data acquisition challenges that required creative data engineering. We therefore had to scale back our expectations or rephrase the hypotheses in terms of viable data proxies in a few specific areas. As an example of the latter, we re-phrased hypotheses related to health equity through the lens of county-level demographic data attributes. Similarly, due to inconsistent data reporting on adherence to public health policy directives, we eliminated it from our consideration set.

 

The resultant inventory of hypotheses influenced the data collection and associated data enrichment efforts. These are succinctly illustrated within the accompanying graphic, through a layered feature list.

 

Figure: Layered feature chart. The three umbrella factors are Exposure Risk, Demographics & Health, and Healthcare; the Healthcare layer indicates work still to be done in the future.

 

Feature Importance:

 

The representation of an analytic challenge as a machine learning algorithm, and the richness of the features feeding the algorithm, have a direct relationship with the insights gleaned from the model. Accordingly, we now examine the machine learning algorithmic construct we selected, and the features that dominated our models:

 

  1. Target Variable: The target variable describes the analytic objective to be pursued. The viable alternatives in this context were to estimate the number of infections or mortalities (i.e. regression) or to predict an increase/decrease outcome in the infection rate or mortality rate (i.e. classification). To keep it simple, we settled on the classification approach and predicted whether the rates would increase or decrease over the next two weeks. Two separate models were developed: one that predicted the increase/decrease outcome in the infection rate, the other in the mortality rate.
  2. Independent Variables (or Features): The independent variables explain the variance in the target variable, and the degree to which they do so is what yields the insights derived from the models, which in turn guide decision makers. The table below arranges the independent variables in order of their significance in the models. (A brief sketch of this modeling setup follows the table.)

 

Figure 4 - Independent variables and model insights and decision guidance

 

Monitoring all these factors can help policy makers formulate and evaluate strategies to contain COVID-19 spread and develop preventative measures for those counties most at risk.
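
As a hedged illustration of the setup described above, the sketch below trains a classifier on hypothetical county-week features. The post does not name the learner, so a random forest is used here because it exposes the kind of feature-importance ranking shown in Figure 4; the feature and file names are assumptions.

```python
# Illustrative only: feature names, file name, and the random forest choice are
# assumptions, not details confirmed by the original study.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = pd.read_csv("county_week_features.csv")
X = data[["population_density", "airport_proximity", "mobility_index",
          "pct_over_65", "icu_beds_per_capita"]]
y = data["infection_rate_increased"]  # 1 if the rate rose over the next two weeks

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# Rank the independent variables by their influence on the prediction.
importance = pd.Series(model.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False))
```

The mortality model would follow the same pattern, trained against a mortality-rate target instead.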

 

Model Validation:

 

While there are some exceptional factors responsible for the infection spread, the above features collectively provide a holistic explanation for the spread of COVID-19 across the US.

 

Figure 5 - New York Times interactive COVID-19 map

 

Source: https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html

 

As an example of an outlying factor that contributes heavily to infection spread and could create inconsistencies in the model if not monitored, the correctional institutions in two Ohio counties (Marion and Pickaway) led to these two counties becoming hotspots. The visualizations provided by the NY Times (here) for the week of May 11th provide visual confirmation of several key features in our models (population density, proximity to major airports, prevalence of seniors, and regions dominated by a high degree of mobility or a disposition towards underlying health conditions due to various socio-economic or demographic factors).

 

“…while the individual man is an insoluble puzzle, in the aggregate he becomes a mathematical certainty.”

                                                                         ― Arthur Conan Doyle, The Sign of Four

 

In general, week over week, our predictive models corresponded well with the county-level COVID-19 trends. To demonstrate the relative accuracy of the models, validated against emergent data, we now pick two counties from different states to analyze the effect of the features that our models deemed important. Not all counties in these states have been, or are, emerging epicenters.

 

  1. Harris County in Texas (left): Harris County had considerably lower infection and mortality rates at the start of the pandemic. Stay-at-home orders were limited, and we see the virus spread increasing in the county in the following weeks.
  2. Suffolk County in New York (right): Suffolk County was a virus epicenter in March and April. The state undertook strict measures requiring face coverings, and we can see the resulting impact in the drastically reduced infection and mortality rates.

 

Figure 6 - COVID-19 infection and mortality rates in Suffolk County, New York

 

For both counties, we can see that increases in mobility are followed by increases in infection and mortality. In the case of Suffolk County, the decrease in mobility is followed by a decrease in infections, whereas in Harris County every increase in mobility is followed by an increase in virus spread. The lag between the two charts can be explained by the 14-day incubation period of the virus.

 

Figure 7 - Approximate timelines for COVID-19-related deaths

 

Conclusion:

 

Understanding the driving factors that affect infection and mortality rates yields critical insights that can lead to both preventative and prescriptive actions. We note in this article the effects generated by factors such as:

  • Age
  • Gender
  • Health equity
  • Travel exposure
  • Social mobility
  • Healthcare supply/availability
  • Adherence to public health policy directives

Using Advanced Analytics techniques, our objective is to equip policy makers, businesses, and individual citizens alike with the insights to minimize the spread of infectious disease and create data-driven, actionable guidelines that will help us emerge on the other side of this pandemic.

 



Machine Learning + Human Intelligence vs COVID-19: Part 1

Introduction:

 

Globalization has brought countries together in more ways than ever before. Consumers, corporations, and governments alike now have generally unfettered access to innovations, markets, products, and services. While the benefits associated with globalization are many, it also brings associated risks, as we have seen with the recent SARS-CoV-2 virus. Infectious disease specialists have been raising the alarm about the need for an effective and uniform response to these threats, due to the speed at which an infectious disease can spread as a result of our global connectedness.

 

COVID-19 (the disease caused by the SARS-CoV-2 virus) has completely taken over our lives, with a material effect on the lives of countless global citizens. The question at the top of everyone’s mind is: “How do we adjust to this new normal?”

 

Specifically, what can we learn about patterns and prevention as we analyze how an infectious disease like COVID-19 migrates and assess how industries are impacted by its spread? This understanding can help inform public health directives that aim to control the migration of the disease, while at the same time alleviating resulting strains on the economy.

Study Objectives:

To develop a better understanding of these patterns, our Data Science teams at Inspired Intellect and WorldLink initiated an R&D project with the hypothesis that advanced analytics could uncover insights to address the above questions. We were also looking for pragmatic applications for deploying our findings to help our clients understand how their businesses would need to adapt to survive in the rapidly evolving new normal.

 

We focused our research efforts into 4 distinct tracks:

  1. Creating a data lake of information as a foundational pillar for our research
  2. Collating and categorizing experimental treatments, therapeutics and vaccine research into a semantic search-driven library of knowledge to support frontline healthcare workers and medical practitioners as they keep up with trending research in these domains (here)
  3. Social listening and associated unstructured text analysis to identify and surface trending topics and concerns people were talking about
  4. Machine learning and insight generation to identify the factors influencing the spread of the virus and to predict the waxing and waning of virus epicenters over time

 

This article is Part 1 of a two-part blog series focused on the fourth track above: machine learning and insight generation. The series focuses on answering the following questions:

  • Why are certain counties/cities more affected than others?
  • Why is there variation in mortality rates among the most infected counties?
  • What are the underlying patterns and factors for virus spread and mortality?

 

In this first installment, we will provide recommendations on how to mitigate the spread of infectious diseases, based on our work using county-level data and machine learning techniques. In Part 2, we will explore model data, features and insights.

 

We feel that a data-driven scientific approach can help answer these questions and, more importantly, inform decision making for a range of stakeholders:

  • Policy Makers: Have sufficient measures been taken to ensure that the infection spread can be controlled? If not, how do we mitigate the risks?
  • Business Owners: Is my business a potential contributing vector to the spread of the virus? What measures should we consider implementing relative to operating the business in a manner that is safe for employees and customers?
  • Individuals: What measures can we as individuals take to help stem the spread of the virus?

 

Editor’s Note: This blog post was authored to highlight Inspired Intellect’s perspective on how the latest advanced analytics techniques could examine driving factors behind the COVID-19 pandemic and garner recommendations to inform officials in their policy responses. To do this, I co-authored this blog with my colleague, Prashanth Nayak, who serves as a Senior Data Scientist for our partner organization, WorldLink. There were several others across Inspired Intellect involved in the data sourcing and model development necessary to deliver these insights related to the pandemic and potential actions to mitigate its impact.

 

Our Findings: Guidelines for a Pandemic Playbook

 

To garner our final recommendations, the Inspired Intellect team ran several machine learning models across a broad intersection of data sets at the local, regional, and national levels. The results were surprising and represent actionable steps that stakeholders can follow when seeking to mitigate the negative impact of a pandemic.

 

Specifically, several learnings from our models serve as primary considerations in the context of developing a pandemic response playbook.

  1. The Need for Data Granularity and Capture Standards: The data we employed was captured at the county level and released for public consumption through the COVID-19 Tracking Project. In the early days of the project’s data reporting, it was clear that data capture standards were not mandated across states and counties, which restricted what was possible via machine learning. Secondly, as was evident in the public domain, it also handicapped public health policy decision makers. Finally, data richness was a constant challenge during this research study. The fact that externally sourced overlays of census tract demographic statistics surfaced to the top of our important features demonstrates the value of capturing demographic, psychographic, and socio-economic data, as well as pre-existing/underlying health conditions, at the case level of detail. Together with a robust contact tracing methodology, these data can provide valuable insights that permit balancing containment measures with keeping the nation’s economy afloat during a pandemic.
  2. A Positive Correlation with Increased Local Decision-Making Autonomy: Our models implicitly demonstrate that local county-level (or perhaps even city-level) autonomy over public health policies may be more effective at preventing the spread and, as some of our key independent variables have illustrated, a county may also need the cooperation of neighboring counties (or cities) in order to succeed. It is true that a broad-brush approach may be appropriate in the initial weeks to give first responders and public health policy makers an opportunity to organize, determine action plans, and deploy resources. As we have seen, however, if that time is not adequately utilized to mechanize a credible pandemic response, the county will likely see infection rates escalate, along with an increased likelihood of a shelter-in-place/shutdown order from authorities. Naturally, this has an adverse effect on the health of its people and will eventually cause a drag on the economy.
  3. The Importance of State and Federal Support: Surprisingly significant independent variables, such as a county’s proximity to major airports, illustrate that state and federal support may be better directed at containing international and interstate travel to mitigate the spread. Additionally, we saw that state and federal support was highly effective in mitigating the spread of the disease when it was deployed to ensure adequate access to healthcare facilities. This materialized as ICU beds in our models, but it could easily be extrapolated to everything else that is needed to keep hospitals and ICU facilities operational (from personal protective equipment, oxygen and ventilator equipment to funding virus testing, treatments and vaccine research). Lastly, state and federal resources should be directed towards defining data collection standards, providing recommendations and best practices for the analysis of the collected data as local county (or city) administrators may not have the resources to recognize patterns beyond their local geographies.
  4. Addressing the Health-Equity Gap: Health equity, defined as the ability of citizens across different social stratifications to receive equal healthcare, has emerged as one of the most revealing aspects of COVID-19. While our models captured its effect indirectly through county-level demographic proxies, population density, and net migration data, it nevertheless brings to the forefront the health risk faced by the underprivileged. Not only are the populace in these geographies more prone to underlying health conditions because of occupational or lifestyle characteristics, but they often also lack access to adequate medical care, or the financial means to avail themselves of it, should they be infected. Based on our analysis, programs to address this socioeconomic gap in healthcare access would prove a valuable investment in slowing the spread and fatality rates associated with a pandemic.

 

Taken together, these points effectively capture the reasons behind the current “state of the COVID-19 battle” in the US. It is certainly not one person, one agency or one thing, but a perfect storm of unpreparedness in the context of recognizing the “who”, “what”, “why”, “when”, “where” and “how” to beat COVID-19 effectively.

 

Our research demonstrated how machine learning can be a powerful tool in aiding policy makers as they develop appropriate action plans to counter the threat of a pandemic. When interpreting each independent variable separately, it is easy to lose sight of the bigger picture of what the models are telling us.

 

Behind the Quantitative and Predictive Models That We Used:

 

Given the increasing volume of data related to the COVID-19 virus, we had a plethora of options as to how to construct our model. In discovering these insights, the Inspired Intellect team used the following attributes and models:

Data Attributes

  • COVID-19 daily cases and deaths data for every county within the United States (US), captured and published by the New York Times between January 1, 2020 and May 31, 2020 (Coronavirus (COVID-19) data in the United States, 2020)
  • Socioeconomic and health equity characteristics such as population sizes, unemployment rate, occupation, household income, household size, and ICU beds for every county within the US (county-level socio-economic data in the United States, 2019)
  • Land area in square miles, population, domestic and international migration data, gender proportions, and age groups for every county within the US, captured and published by the United States Census Bureau (county-level census data in the United States, 2019)
  • Mobility data reporting movement trends over time by geography, across different categories of places such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential (Google mobility data)
  • Airport data: proximity of airports to each county and importance of the airports (openflights.org and the Bureau of Transportation Statistics)

 

Models

  1. Quantitative risk score model: A risk score is assigned to each county based on the county’s rate of change in infection rates, mortality rates, and population density. This model’s purpose is to aid our clients in preparing action plans for relevant counties.
  2. Infection rate machine learning model: Predicts whether a given county’s infection rate will increase from the previous week. The model also produces the drivers (feature importances) behind infection rate increases and helps analyze hotspot changes over time.
  3. Mortality rate machine learning model: Predicts whether a given county’s mortality rate will increase from the previous week. The model also produces the drivers (feature importances) behind mortality rate increases and helps identify vulnerable demographics.

 

Enabling Technologies That We Used:

 

It is a well-known fact that analytical models are only as good as the data they employ. The importance of consistent data reporting standards, incremental expansion of the data assets considered, and periodic retraining of the models against the latest data cannot be overstated. To deliver such capabilities consistently, and at scale, we must acknowledge the role of a strong data management foundation.

 

As we began our research initiative during the early stages of the pandemic reaching US territory, we were faced with significant data challenges. As noted earlier, COVID-19 data reporting standards were nascent, if they existed at all, and they were evolving. Consequently, acquiring data that was collected in a consistent manner across all counties was difficult and required a large amount of transformation. Another challenge was the lack of a historical audit trail of day-over-day statistics on infections, mortalities, and recoveries. Yet another hurdle was that data formats were not consistent in the early days. Eventually, open-source initiatives such as the COVID-19 Tracking Project emerged and alleviated some of our initial challenges, but even those required experience with semi-structured and unstructured data management to extract, store, and transform data within JSON files or PDF reports.

 

The table below summarizes our technology choices for what began as an internal R&D effort but is rapidly expanding into an offering at the request of several customers. For the opportunities presented by the latter, several other technologies present themselves as viable alternatives within our offering’s technology stack.

 

Figure 1 - Technology stack

 

Technology choices will ultimately be guided by a variety of factors – from the most obvious, such as alignment with your enterprise architecture strategy and ease of integration with other enterprise business applications, to the less obvious, such as performance scalability of a solution that is consistently evolving over time or the flexibility to adapt to a variety of data and analytics needs as your business evolves. To ensure success in this initiative, the Inspired Intellect team relied on its vast experience in analyzing the solution landscape to align enterprise/data fit with best-in-breed tools.

 

Conclusion:

 

We discovered during our research that it is nearly impossible to earn an “A” grade while trying to mitigate the effect of a global pandemic, but that it is rather easy to earn a “B”. COVID-19 has upended the lives of numerous individuals, families, businesses, and countries, and our goal is to use the latest advanced analytics techniques to raise the floor for our global citizens and improve our chances of being successful, now and in the future of increased globalization.

 


 

Inspired Intellect is part of the Adi Group. The Adi Group is a collection of companies that, collectively, advises on and implements enhanced technological capabilities for enterprises along their digital transformation journey. Members of the Adi Group include:

  • WorldLink
  • Inspired Intellect
  • ADI Family Office
  • The ADI Foundation

Inspired Intellect’s membership in the Adi Group allows its clients access to a broader portfolio of digital transformation solutions and multiple touch points with innovation.

Unveiling COVID-19 Research & Development Insights with Navigational Analytics

Amidst the uncertainty brought by COVID-19, the team at Inspired Intellect, under the direction of our Chief Data Scientist Brian Monteiro, PhD, initiated a data analytics project to help address the pandemic. We analyzed data across three tracks: quantitative forecasting, Twitter sentiment, and Research & Development studies. Our goal is to support front-line healthcare workers, healthcare product distributors, and pharmaceutical companies while showcasing our team’s expertise with turning data into actionable information. After aggregating the data, we discovered:

  • Mobility and Airport locations have a high correlation to virus hotspot areas
  • Personal Protection Equipment is a long-term trending topic of discussion on Twitter with opportunities for businesses to address shortfalls of equipment with donations and distribution expertise (search for “PPE Donation” on Twitter)
  • R&D for virus vaccines is highly concentrated in the United States and China, while Italy and India continue to investigate best practices for containing the virus through other medical procedures

While this initiative was focused on healthcare and the COVID-19 pandemic, our methodology can be applied to address issues in several other industries. For more on how these insights were generated within each track, the following sections contain data and charts that led to their discovery.

 

Research & Development Navigation

The World Health Organization (WHO) maintains a database that stores references to research papers focused on the Coronavirus pandemic. It is a CSV file that lists the title, authors, an abstract, and, most importantly, Document Object Identifiers (DOIs) for each paper. This database is an attempt to aggregate global C19 research efforts so that the world may better collaborate on discoveries and treatments and ultimately accelerate the process of finding a cure.

 

When we first downloaded the CSV file on May 12, it contained over 16K papers. To run meaningful analytics, the most useful data point across these documents is the DOI.

 

DOI example: “10.1001/amajethics.2020.344” – “Cohesion in Distancing”

 

With this unique code, I was able to enrich the data set with information from other data sources on authors, cover dates, and hyperlinks for each paper. For example, the CSV file doesn’t contain a URL linked to each research paper. Using a REST API from DOI.org, I was able to find a hyperlink on the web where each paper can be read. I used another service to find information about authors, their affiliations, and the cover date of the associated journal.
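
The post doesn’t name the exact endpoint; one plausible option is the doi.org handle REST API, which resolves a DOI to its registered landing-page URL, as in this minimal sketch:

```python
# Hedged sketch: resolve a DOI to its registered URL via the doi.org handle API.
import requests

def resolve_doi_url(doi):
    """Return the landing-page URL registered for a DOI, or None if absent."""
    resp = requests.get(f"https://doi.org/api/handles/{doi}", timeout=10)
    resp.raise_for_status()
    for value in resp.json().get("values", []):
        if value.get("type") == "URL":
            return value["data"]["value"]
    return None

print(resolve_doi_url("10.1001/amajethics.2020.344"))  # the example DOI above
```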

 

Once the data was enriched, I set up boundaries and dimensions for exploring the data, yielding this interesting timeline on publication.

 

Figure 1 - Timeline of COVID-19-related publications

 

It is interesting to note the number of papers published in January of 2020, despite the fact that the WHO did not declare C19 a pandemic until March 11, 2020. Around 48% of those papers are from China and the United States.

 

You can see a long tail on this timeline, representing a sustained interest from researchers on the topic, and it will most likely continue in this manner until a vaccine is discovered.

 

For reference, the enriched data looks like the table shown below.

 

Figure 2 - Enriched COVID-19 publication data

 

Enriching the data enables several intriguing navigation scenarios, such as developing the timeline shown above or segmenting by geography.

 

Figure 3 - Enriched COVID-19 data segmented by geography

 

As one might expect, China, as ground zero for the outbreak, makes up a significant majority of research papers published.

 

Figure 4 - COVID-19 data published in China

 

When looking into China with greater granularity, we see that the epicenter, Wuhan, has 168 research papers published about the virus.

 

Finally, the enriched data set contains a metric denoted “Cited By”. Like “upvoting” in popular forums, this measurement indicates a value associated with a paper. It means that people are reading this paper and actually “citing” it in other research.

 

During my analysis, I discovered several keywords popular in the abstracts. Using keywords, along with “Cited By” metrics, I built the navigational tool shown below. The small pink bubble on the top left indicates that there is one research paper that has been cited 1,062 times and contains the keyword “treatment”.

 

Figure 5 - Navigational tool for COVID-19-related publications

 

You can use this interface to pick keyword(s), select high-performing papers that have been heavily cited, and then get more specific using a title search. As an example, four research papers contain the word “herb” in the title, and one has been cited 5 times with the keyword “test” (see the sketch after the figure below).

 

Figure 6 - Patterns across research & development papers on COVID-19
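
Outside the dashboard, the same filters can be expressed in a few lines of pandas; the file and column names here ("title", "keyword", "cited_by") are hypothetical stand-ins for the enriched data set’s actual schema.

```python
# Hypothetical pandas equivalent of the dashboard filters described above.
import pandas as pd

papers = pd.read_csv("enriched_who_papers.csv")  # assumed export of the enriched data

# Papers with "herb" in the title, ranked by citation count.
herb = papers[papers["title"].str.contains("herb", case=False, na=False)]
print(herb.sort_values("cited_by", ascending=False)[["title", "cited_by"]])

# Heavily cited papers tagged with the keyword "treatment".
top_treatment = papers[(papers["keyword"] == "treatment") & (papers["cited_by"] >= 100)]
```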

 

Beneficiaries of these Insights

 

With these tools, we identified patterns across Research & Development papers that could triangulate useful information to enable patient treatment and cure discovery. These tools are engineered to serve the following parties:

Frontline Workers

These research papers contain information for healthcare workers looking for therapy best practices, along with promising vaccine treatments. A colleague recently told me that his spouse, who is a caregiver, spends hours each night reading papers to understand how best to treat patients. By setting up navigation scenarios, combined with search capabilities, front-line workers can quickly identify emerging trends and navigate a wealth of information, segmented by content type.

Pharmaceutical Companies

This information can bridge silos among global pharmaceutical companies. It encourages collaboration where teams of researchers could combine efforts to quickly produce a vaccine. The data contains information on equipment suggestions for hospitals and pharmacies. The lack of personal protection equipment has been a significant supply chain issue in the United States and some of these research papers list helpful products that pharmacies should stock to better prepare for future outbreaks.

 

As a quick note on methodology: the goal of enriching the data and designing navigation scenarios is to enable people with different interests to sift through all these research papers and find actionable information. You could simply put this data into a search engine and have a text box as the only interface. I wanted to make it more of a guided navigational and data discovery experience by thinking in terms of boundaries such as timeline, geography, authors, keywords, and “Cited by” metrics.

 

The navigation dashboards were made with Microsoft Power BI and are exposed on the public web, so anyone can use the tool and explore the data: R&D Navigation

 

Twitter Sentiment Analysis

 

For our second analytics track, we tapped into the wisdom of the crowds by leveraging Social Media. To derive insights, we focused on Twitter by analyzing tweets for trending hashtags, sentiment analysis, and user network effects. We used Twitter’s public API and focused on three distinct topics:

  • COVID-19
  • Distributor
  • Manufacturer

 

Over the course of 26 days, we analyzed 1.8M tweets and 110,000 unique hashtags. There are approximately 500M tweets on Twitter per day. While this is a small sample, it does reveal important trends.

 

To ascertain the critical mass and momentum of a trending topic, we calculated the slope of each hashtag’s observation counts over time and focused on hashtags with more than 100 observations and a slope greater than 10. These boundaries reduced the number of hashtags we analyzed from 110,000 to 17.
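
Below is a minimal sketch of that filter, assuming one observation row per hashtag per day; the DataFrame layout, file name, and exact slope calculation are illustrative, not the team’s actual pipeline.

```python
# Hedged sketch of the hashtag-trend filter: count > 100 and slope > 10.
import numpy as np
import pandas as pd

tweets = pd.read_csv("hashtag_observations.csv", parse_dates=["date"])  # assumed: one row per hashtag occurrence

daily = tweets.groupby(["hashtag", "date"]).size().rename("count").reset_index()

def trend_slope(group):
    """Least-squares slope of daily counts against day index."""
    x = (group["date"] - group["date"].min()).dt.days
    return float(np.polyfit(x, group["count"], 1)[0])

stats = daily.groupby("hashtag").agg(observations=("count", "sum"))
stats["slope"] = daily.groupby("hashtag").apply(trend_slope)

# Keep only hashtags with critical mass and sustained momentum.
trending = stats[(stats["observations"] > 100) & (stats["slope"] > 10)]
print(trending.sort_values("slope", ascending=False))
```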

 

Below is a screenshot with three popular hashtags highlighted:

  • #PPE
  • #Hydroxychloroquine
  • #lockdown

 

Figure 7 - Screenshot with three popular hashtags PPE - Hydroxychloroquine - lockdown

 

The positive slope generated by thousands of observations indicates a long-term trend. Most hashtags on Twitter are short-lived: they bubble up and become popular quickly, but they can disappear just as quickly. The parameters above allowed us to separate a durable signal from that noise.

 

While the topics above showed staying power with the general public, we could also identify topics that proved transient in the COVID-19 race for a cure. Using the following charts, compare the trends above to the negative slope generated by the following hashtags, which visibly fell out of favor:

  • #Covidtracking
  • #lockdown4guidelines
  • #chloroquin
  • #ENECOVID.

 

Using #chloroquin as an example, we can see a pattern where a potential COVID-19 silver bullet fell out of favor with the scientific community and lost momentum with the general public. Compare this to the rise of hydroxychloroquine, which sustained interest in both the scientific and global communities as a potential treatment.

 

The team analyzing Twitter data is working on a detailed paper on their findings and plans to publish their report over the next few weeks. Look for more insights into how analyzing Twitter keywords, hashtags, and user analytics can benefit any business trying to meet demand and realize opportunities through the immediacy of this social network.

 

Quantitative Analysis and Risk Score

 

The final analytics track targeted understanding the factors that contributed to the rapid migration of the pandemic. This research is useful to public health policy administrators, healthcare providers, and pharmaceutical suppliers (wholesalers and manufacturers alike), all seeking to mitigate expansion while distributing equipment to meet dynamic needs. Employing machine learning against a variety of externally sourced data assets, ranging from COVID-19 infection and mortality statistics and health equity attributes to county-level mobility and airport data, we produced a model that was highly correlated with observed COVID-19 migration.
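The post does not name the specific algorithm, but a gradient-boosted regressor over county-level features is one plausible shape for such a model. The sketch below is purely illustrative, with made-up feature names and values.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Made-up county-level features of the kind described above.
counties = pd.DataFrame({
    "infection_rate":  [0.02, 0.10, 0.05, 0.01],
    "mortality_rate":  [0.001, 0.008, 0.003, 0.0005],
    "mobility_index":  [0.7, 1.2, 0.9, 0.5],
    "airport_traffic": [120, 900, 300, 15],
    "migration_risk":  [0.2, 0.9, 0.5, 0.1],  # target: observed migration
})

X = counties.drop(columns="migration_risk")
y = counties["migration_risk"]

# Fit and score each county; in practice you would train on one date window
# (e.g. through Apr 24) and validate against a later one (e.g. May 9).
model = GradientBoostingRegressor(random_state=42).fit(X, y)
counties["risk_score"] = model.predict(X)
print(counties[["risk_score"]])
```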

 

Figure 8 - COVID-19 quantitative analysis and risk score

 

The charts above illustrate the outcomes produced by our models, showcasing how closely they replicated the actual migration behavior of COVID-19. Regions that were high-risk as of April 24 had declined in severity by May 9, while other regions that were previously lower risk increased in severity over time.

 

Follow our series of blog posts on our COVID-19 study to learn more about the insights we garnered while developing the machine learning models.

 

Final Considerations

The COVID-19 pandemic has captured the attention of the entire world. The concerted response of global organizations and citizens is equally unprecedented. With the incredible resources dedicated to finding a cure, it is possible to pool these data assets to leverage the work being performed across the world and develop meaningful applications. Using this data, we can predict infection hotspots and the key indicators that contribute to the spread, such as mobility. We can also analyze R&D studies to find promising therapies and equipment to help patients. Lastly, we can analyze sentiment in real time on Twitter, triangulate trends and needs to specific geographic locations, and determine whether there are any remediations to help, such as donating personal protective equipment locally.

 

We have an opportunity to set a historical precedent for how we use technology to collaborate on solutions and potential remedies. As a global society, we can address and mitigate the risks of infection and death while examining events at both the local and global levels. At Inspired Intellect, we are leveraging our deep data and analytics expertise to develop a meaningful solution for our global stakeholders. Our team has broad expertise in the technology tools needed to develop and deliver data assets that generate insights akin to those in this article. There are many ways that we all can contribute to addressing this global pandemic, even if it is simply wearing a mask or “staying at home”. At Inspired Intellect, we wanted to use analytics to support the decisions made by those taking this disease head-on.

 

Watch for more in the coming weeks explaining the machine learning behind our hotspot forecasting model, along with a more in-depth discussion of Twitter insights.

 

Methodology, Technology and Data

This project was focused on healthcare and the COVID-19 pandemic, but our methodology mirrors projects we have completed in various other industries. In summary, we (a minimal sketch of the extract-and-load steps follows the list):

  • Searched and found datasets
  • Enriched data with several complementary data services
  • Transformed and loaded enriched data in multiple database technologies
  • Explored the data
  • Developed navigational tools, along with predictive algorithms for future observations
  • Automated the entire process while paying attention to future data changes (also referred to as Change Data Capture)
  • Deployed all code assets to the cloud
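The original ingestion code was written in Java; for brevity, here is a minimal Python sketch of the first steps, with a hypothetical endpoint URL standing in for the open data sources.

```python
import json
import requests

# Hypothetical open-data endpoint; the post does not list the exact sources.
SOURCE_URL = "https://example.org/api/covid19/papers"

def extract(url):
    """Search/extract step: call the REST source and return JSON records."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json()

def load(records, path):
    """Load step: stage records as JSON lines for the document store."""
    with open(path, "w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec) + "\n")

if __name__ == "__main__":
    load(extract(SOURCE_URL), "papers.jsonl")
```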

 

For this project, we developed and reused Java code assets to make REST calls to open data sources. The REST responses were in JSON format, which was perfect for storing data in a NoSQL database. We used Couchbase because of N1QL, its SQL-like query language for JSON. Couchbase, along with its Spark connector, enabled quick keyword analysis of the R&D papers. The Couchbase Full-Text Search capability let us quickly assign keyword flags to specific articles and surface this analysis in Power BI; a sketch of such a query is shown below.
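For illustration, a N1QL query using the SEARCH() full-text function might look like the sketch below (Couchbase Python SDK 3.x style; imports vary slightly across SDK versions, and the host, credentials, bucket, and field names are assumptions).

```python
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster, ClusterOptions

# Illustrative connection; replace with your cluster details.
cluster = Cluster(
    "couchbase://localhost",
    ClusterOptions(PasswordAuthenticator("user", "password")),
)

# N1QL plus Full-Text Search: find papers whose text mentions a keyword.
query = """
    SELECT META(p).id AS doc_id, p.title
    FROM papers AS p
    WHERE SEARCH(p, "treatment")
"""
for row in cluster.query(query):
    print(row["doc_id"], row["title"])
```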

 

For other quantitative data sources, such as infection rates, the data arrived as CSV files. For this data, we used Postgres, a traditional relational database. The machine learning models, along with some data preparation code assets, were built in Jupyter notebooks with Python.
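Loading those CSV sources is straightforward with pandas and SQLAlchemy. A minimal sketch follows; the connection string, file name, and date column are assumptions.

```python
import pandas as pd
from sqlalchemy import create_engine

# Illustrative connection string; database and credentials are assumptions.
engine = create_engine("postgresql://user:password@localhost:5432/covid")

# Load a quantitative CSV source (e.g. infection rates) into Postgres.
infections = pd.read_csv("infection_rates.csv", parse_dates=["report_date"])
infections.to_sql("infection_rates", engine, if_exists="replace", index=False)
```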

 

All our code assets and databases were hosted on an Amazon AWS Free Tier cluster. Hardware costs were minimal because we leveraged AWS Free Tier machines and only had to pay for storage. The Power BI reports are hosted in Microsoft’s cloud, where we have a 60-day free usage limit. Power BI connects to the AWS databases and refreshes the dashboards frequently.

 

Company Description

Inspired Intellect is an end-to-end service provider of data management, analytics and application development. We engage through a portfolio of offerings ranging from strategic advisory and design, to development and deployment, through to sustained operations and managed services.

Inspired Intellect is part of the ADI Group. The ADI Group is a collection of companies that, collectively, advises on and implements enhanced technological capabilities for enterprises along their digital transformation journey. Members of the ADI Group include:

  • Worldlink
  • Inspired Intellect
  • ADI Family Office
  • The ADI Foundation

Inspired Intellect’s membership in the ADI Group allows its clients access to a broader portfolio of digital transformation solutions and multiple touch points with innovation.

Increasing Customer Value with Unstructured Data and MDM

Analytics Arms Race

According to a 2017 report from Dimension Data, customer analytics is the second-highest rated factor in driving positive customer experience and is projected to be the leading factor in coming years; the findings remain valid today. In the same report, only 48% of respondents said their organizations currently have analytics systems, and only 36% possess big data analytics solutions that are delivering real value. What we have found is that market leaders, regardless of vertical, are almost always in that 36%, and that the existence of systems and tools is only a starting point for maximizing the value this data paradigm offers.

 

Fortunately, most companies already capture much of their customers’ transactional data (e.g., product purchases, website usage, campaign returns, focus group surveys). Having this structured data shared across the organization as a “single source of truth” is essential to your success.

 

Advanced use of structured data is now table stakes, however, and is not sufficient to compete in today’s market. Leaders of every industry are unlocking valuable insights through the infusion of unstructured data via internal and external sources.

 

The unfiltered feelings, thoughts, emotions, and underlying decision-making processes of your customers are best captured in the moment they occur: in their tweets, call center notes, online reviews, phone calls, emails, and posts. A mature MDM strategy and initiative serve to link these disparate data sources and types and consolidate them into a set of single “golden” records, so that organizations can fully understand their customers’ behaviors and motivations and tailor individualized customer engagement accordingly. With MDM tied to transactional, historical, and unstructured data, imagine the impact you can have on a current or potential customer within moments of them telling you what they are thinking! In our experience, most organizations lack a clear strategy or capability for doing this and end up hemorrhaging insights that are critical to their business objectives.

 

Let us review a typical customer journey and life cycle, shown below.

Figure 1 - Typical Customer Lifecycle / Journey, where data is being generated at each activity

 

Customers can follow many paths in their journey. Throughout their experience, there are large volumes of data being generated. These include product recommendations, tweets, posts, thumbs up/down, raves, complaint emails, voice calls, and others, which instantly capture authentic feedback from customers and offer valuable insights.

 

The goal of modern MDM initiatives and tools is to synthesize all this structured and unstructured customer information into a unique, tailored profile, as a “single source of truth”. Then, you can incrementally grow your knowledge and understanding of your customers in order to provide a better experience. This holistic view will impact your ability to increase retention or drive new acquisition strategies. MDM, coupled with a well-planned data governance process, enables this by serving to link a company’s internal customer knowledge with what the customer is saying in other domains. This allows companies to extract deeper insights, create more effective analytics, and build richer artificial intelligence/machine learning (AI/ML) models that can understand, predict, and influence this behavior.
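To make the match-and-merge idea behind golden records concrete, here is a deliberately naive Python sketch using simple string similarity; real MDM tools such as SAP MDG apply far richer, configurable match rules across many attributes. The records and threshold are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical customer records from two source systems.
records = [
    {"id": 1, "name": "Jon A. Smith", "email": "jsmith@example.com"},
    {"id": 2, "name": "John A Smith", "email": "jsmith@example.com"},
    {"id": 3, "name": "Maria Garcia", "email": "mgarcia@example.com"},
]

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_match(r1, r2, threshold=0.85):
    """Match rule: same email, or names similar above a threshold."""
    return (r1["email"] == r2["email"]
            or similarity(r1["name"], r2["name"]) >= threshold)

# Naive pairwise matching into golden-record groups.
golden = []
for rec in records:
    for group in golden:
        if any(is_match(rec, member) for member in group):
            group.append(rec)
            break
    else:
        golden.append([rec])

print(f"{len(records)} source records -> {len(golden)} golden records")
```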

 

In order to benefit from the accumulated value, the “trusted” data must be consumable by people and processes as quickly and easily as possible. This is best accomplished using a cloud-based, scalable solution such as SAP Data Warehouse Cloud (DWC), which provides an out-of-the-box, enterprise-ready data warehouse (SaaS) that is elastic, cost-effective, and easily consumable, with strong integration with S/4HANA and non-SAP systems.

 

Recommended Approach

At a high level, we make the following recommendations when you undertake a customer-focused MDM initiative.

 

Use the Customer’s Initiatives as a North Star: Develop and execute an MDM strategy that aligns with customer business initiatives and contains multiple coordinated workstreams. This will ensure that meaningful value is created as workstreams are delivered.

 

Align Tools & Techniques: Deploy a leading MDM and data governance tool, such as SAP Master Data Governance (MDG) on SAP S/4HANA, which provides comprehensive capabilities to master all customer-related data. The tool must:

  • Consolidate and create golden customer records by incorporating unstructured data
  • Allow the creation of data quality and validation rules
  • Offer enrichment of name, address, and identification information from external third-party data providers like D&B, Melissa, etc., along with SAP’s S/4HANA Cloud for Data Enrichment tool
  • Feature a role-based governance workflow engine
  • Provide mass-processing and match/merge functionality
  • Harmonize and replicate data back to corporate applications
  • Contain standard Search/Create/Read/Update/Delete/Block processes

Beyond the tool itself, we also recommend that you:

  • Lead with the incorporation and mastering of unstructured customer data, using the MDM tool along with modern (preferably cloud-based) data architectures to speed deployment
  • Focus each workstream on a measurable business use case, such as increasing customer loyalty based on repeat purchases or increasing retention based on churn metrics

 

Each of these recommendations obviously contains much more detail, and we will continue to elaborate on them in future articles. The concept is simplified below in Figure 2.

Figure 2 - High level concept of Unstructured Data and MDM to generate business value

 

We have been fortunate to partner with many clients who have transformed their customer engagement by prioritizing their Master Data Management strategy and combining it with a complete end-to-end SaaS solution like SAP DWC to deliver impressive business impact. Two of these case studies are summarized below.

  • A leading mid-sized specialty hydrocarbon products company needed to improve operational excellence. This required consistent, clean, well-governed data early in their S/4HANA migration journey toward digital transformation. We helped them by implementing the S/4HANA MDG tool, which increased sales effectiveness through a consolidated customer view and improved supply chain efficiency and procurement decisions.
  • A large event management client launched a customer-focused initiative, needing timely access to customer data to drive revenue growth opportunities. The goal of this initiative was to provide the sales team with insights across customer preferences, demographics, usage, and behavior. Using the MDM tool, we dramatically reduced duplicates, from over 1 million customer records to 400 thousand unique customers, and decreased the possibility of creating new duplicates to virtually zero. The client also used the MDM tool to:
    • Identify high value customers
    • Identify customer behavior and purchase patterns
    • Incorporate customer feedback into their marketing and sales processes.

 

This resulted in an increase in profitable customers, greater revenues from the sale of related products, and increased customer loyalty.

 

* Note that with the recent availability of DWC, this could be re-implemented as a Customer Data Mart.

 


Fostering Connection in a Socially Distanced World

“Building purpose in a group is not about generating a brilliant moment of breakthrough, but rather about building systems that can churn lots of ideas in order to help unearth the right choices.” – Daniel Coyle (2017). The Culture Code: The Secrets to Highly Successful Groups

Highly successful organizations do one thing exceptionally well: they find novel and powerful ways to connect with their employees. This is true in the most normal of times, but we are not in normal times. The COVID-19 pandemic has created tremendous disruption, forcing many organizations like Clio to scramble to establish continuity in a virtual, “socially distanced” world. The greatest risk this presents is the loss of the powerful connection that occurs in an office environment.


During this time of extreme disruption, organizational alignment is imperative. Employees’ work lives are disrupted as they battle personal challenges such as finding adequate childcare, home schooling children, and coping with an inability to socialize normally. Stress levels are high, and organizations need to unite and develop strategies on how to solve difficult business challenges efficiently, while fighting the alignment constraints presented by “social distancing.” Combined, these challenges put a strain on employee connection and threaten productivity, direction, and buy-in for key initiatives.


Let’s take a step away from the current crisis for a second. I work with organizations on creating alignment for a very specific purpose: ensuring the success of digital transformation initiatives. I’ve drawn upon my organizational psychology expertise to help these firms recognize that driving high return on investment and adoption is a fruitless endeavor without strong organizational alignment to the desired transformation. (Hint: digital transformation is less about technology and more about people transforming the way they work.) Research by McKinsey & Company indicates that alignment is not a nice-to-have; it has a direct impact on an organization’s bottom line. Highly aligned organizations are twice as likely to have above-median earnings margins.


To meet the needs of these clients, we created a formal Organizational Transformation Alignment (OTA) service targeting data and digital transformation initiatives. Over the last six months, we have observed a marked acceleration in technology adoption and implementation among clients that have adopted this alignment framework. While the offering is specific to data and digital transformation, its foundation is readily adaptable to other use cases. The methodology we established enables organizational alignment in a completely virtual way: by coupling change management thought leadership with technology that anonymously crowdsources employee input from across the organization, businesses can navigate digital transformation more efficiently.


Now, stepping back into the current crisis…as we at Inspired Intellect have engaged broadly with our clients over the past week, one of the common themes we’ve consistently heard relates to the challenges in keeping employees engaged and connected. In modeling how to overcome these challenges, our experience shows us that the most effective mechanisms use the best of both people and technology to drive meaningful results. In that context, we realize that the OTA service provides a basis for delivering solutions to overcome the chaos. The foundation was already in place, and we have simply adapted it to fit this current need.

Let’s take a closer look at how the process works. 


Leveraging virtual and anonymous crowdsourcing technology (we have chosen thoughtexchange), we work with organizations to identify their business challenge, create a custom alignment roadmap, and then crowdsource at ALL levels of the organization to develop a compelling transformational vision. Crowdsourcing at strategic times of the process ensures that we are creating the highest level of engagement and buy-in across the enterprise. We have turned the typical top-down planning process upside down, engaging employees early so leaders can:

  • Leverage diversity and inclusion to create superior solutions to business challenges (research on diversity in the workplace demonstrates that this really works!)
  • Collect powerful business intelligence on what the organization values most and on things leaders may not have considered, and hear from the silent majority.
  • Engage employees in the planning process to promote engagement and create buy-in early in the transformation lifecycle.
  • Develop feedback loops that encourage inputs from employees dealing with internal and external parties at every level, so that the enterprise is aligned with a true stakeholder model.

This approach also removes hierarchies and power structures from the ideation process, enabling people to communicate their ideas without worrying about group dynamics. The silent majority can arm your organization with powerful insights and engaging them early and equitably sets the stage in building effective alignment.

 

Figure 1 below outlines our process for developing OTA.

In both prosperous and challenging times, market leaders are defined by their ability to collectively evolve to pursue opportunities and mitigate business disruption. Particularly in this era of unpredictability, the winners will be those firms whose leaders recognize that success is the result of every employee knowing the company’s vision and their role in helping achieve it. Firms should already be applying the concepts I’ve laid out here, but COVID-19 social distancing *demands* that they do.

 

If you see the value in this process and your organization is wrestling today with the disruption introduced by COVID-19, please reach out to me directly. I am passionate about this topic and would love to discuss it interactively with you!
