The National Defense Agency required improved mission-capable readiness of its air fleet across partner defense services, global Air Force bases, and naval aircraft carriers, while reducing overall program sustainment costs.
Objective
Provide an advanced analytics solution to manage and govern the various OEMs and support providers’ performance against mission-capable readiness criteria
Approach
Advisory, implementation, and AI/ML innovation services to provide visibility and traceability for:
Mission-capable availability
Material reliability
Operational support costs across the warfighter enterprise
Outcomes
Delivered a command dashboard providing oversight of performance against goals
Data integration to connect, manage, and govern relevant streaming IoT assets that power enterprise visibility/traceability
Prognostics and predictive analytics capabilities across demand (spare parts, training, maintenance, and repair) and supply (logistics, distribution, and inventory/replenishment)
Simulation and modeling of prescriptive recommendations and what-if scenario analysis against modified targets: availability, reliability and costs
Impact
Estimated 30% reduction in operations and support costs while achieving the same availability and reliability
Globalization has brought countries together in more ways than ever before. Consumers, corporations and governments alike now have generally unfettered access to innovations, markets, products and services. While the benefits associated with globalization are many, it also brings associated risks, as we have seen with the recent SARS-CoV-2 virus. Infectious disease specialists have been raising the alarm about the need for an effective and uniform response to these threats, due to the speed at which an infectious disease can spread as a result of our global connectedness.
COVID-19 (the disease caused by the SARS-CoV-2 virus) has completely taken over our lives, with a material effect on the lives of countless global citizens. The question at the top of everyone’s mind is: “How do we adjust to this new normal?”
Specifically, what can we learn about patterns and prevention as we analyze how an infectious disease like COVID-19 migrates and assess how industries are impacted by its spread? This understanding can help inform public health directives that aim to control the migration of the disease, while at the same time alleviating resulting strains on the economy.
Study Objectives:
To develop a better understanding of these patterns, our Data Science teams at Inspired Intellect and WorldLink initiated an R&D project with the hypothesis that advanced analytics could uncover insights to address the above questions. We were also looking for pragmatic applications for deploying our findings to help our clients understand how their businesses would need to adapt to survive in the rapidly evolving new normal.
We focused our research efforts into 4 distinct tracks:
Creating a data lake of information as a foundational pillar for our research
Collating and categorizing experimental treatments, therapeutics and vaccine research into a semantic search-driven library of knowledge to support frontline healthcare workers and medical practitioners as they keep up with trending research in these domains
Social listening and associated unstructured text analysis to identify and surface trending topics and concerns people were talking about
Machine learning and insight generation to identify the factors influencing the spread of the virus and to predict the waxing and waning of virus epicenters over time.
This article is Part 2 of a 2-part blog series focused on the 4th track above: Machine Learning and Insight Generation. This blog series is focused on answering the following questions:
Why are certain counties/cities more affected than others?
Why is there variation in mortality rates among the most infected counties?
What are the underlying patterns and factors for virus spread and mortality?
In Part 1, we provided recommendations on how to mitigate the spread of infectious diseases, based on our work using county-level data and machine learning techniques. In Part 2, we will explore model data, features and insights.
We feel that a data-driven scientific approach can help answer these questions and, more importantly, inform decision making for a range of stakeholders:
Policy Makers: Have sufficient measures been taken to ensure that the infection spread can be controlled? If not, how do we mitigate the risks?
Business Owners: Is my business a potential contributing vector to the spread of the virus? What measures should we consider implementing relative to operating the business in a manner that is safe for employees and customers?
Individuals: What measures can we as individuals take to help stem the spread of the virus?
Editor’s Note: This blog post was authored to highlight Inspired Intellect’s perspective on how the latest advanced analytics techniques could examine driving factors behind the COVID-19 pandemic and garner recommendations to inform officials in their policy responses. To do this, I co-authored this blog with my colleague, Prashanth Nayak, who serves as a Senior Data Scientist for our partner organization, WorldLink. There were several others across Inspired Intellect involved in the data sourcing and model development necessary to deliver these insights related to the pandemic and potential actions to mitigate its impact.
How We Explored the Data:
We began our research efforts with a data exploration exercise guided by a quantitative risk score designed at the county level. Our risk score design included county-level reported statistics such as:
Rolling 14-day infection rates
Mortality rates
Population density (defined as population per habitable square mile)
Other relevant attributes such as county-specific mobility, adjacent-county mobility and social stringency could also be included in the risk score design. However, we decided to keep our initial design simple with a view to helping us better understand the insights we encountered. The attributes were passed through a clustering algorithm to arrive at a categorization of counties that exhibited a similarity in infection rates, mortality rates and population density.
The Risk Score for the week of May 9th is shown below. Counties exhibiting the highest risk (i.e. high infection rate, high mortality rate and high population density) collected into Cluster 5. In contrast, counties exhibiting the lowest risk collected into Cluster 1. To keep the design simple, and due to a general lack of insight into COVID-19’s pathology, we did not weight the infection or mortality rate variables differently when clustering. Therefore, a county can be labeled risky if it has either a high infection rate or a high mortality rate.
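For illustration only, here is a minimal Python sketch of this clustering step, assuming a county-level table with the three risk-score inputs; the file and column names are our assumptions, not the production pipeline:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical county-level table with the three risk-score inputs
counties = pd.read_csv("county_stats.csv")  # assumed columns: fips, infection_rate_14d, mortality_rate, pop_density
features = ["infection_rate_14d", "mortality_rate", "pop_density"]

# Standardize so each attribute carries equal weight, as described above
X = StandardScaler().fit_transform(counties[features])

# Five clusters, mirroring the Cluster 1 (lowest risk) to Cluster 5 (highest risk) grouping
km = KMeans(n_clusters=5, n_init=10, random_state=42)
counties["cluster"] = km.fit_predict(X)

# Relabel clusters 1..5 by average standardized risk so the numbering reads low to high
risk_by_cluster = pd.DataFrame(X, index=counties.index).groupby(counties["cluster"]).mean().mean(axis=1)
counties["risk_cluster"] = counties["cluster"].map(risk_by_cluster.rank().astype(int))
```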
We examine a few counties within Cluster 5 to understand why they were categorized as highest risk. Data for relevant portions of the prior 14-day period (Apr 25 – May 8) that illustrate our arguments are shown below. We can see that infections spiked from 2 to 14 cases in Tillman County, Oklahoma between May 2nd and May 3rd, and from 31 to 74 in Jackson County, Florida between May 7th and May 8th.
Table depicting spike in Cluster 5 counties
These counties demonstrated a noticeable and sudden rise in infections, indicating emerging virus hotspots and signaling the need to allocate resources. We use these two counties as prime examples of early signaling for an escalating hotspot, using variables that apply to all counties, regardless of size. Note that many high-density counties were already subject to a mandatory shelter-in-place order during our 14-day evaluation period and were therefore experiencing decreasing infection rates throughout our study.
Next, we examine the migration of risk scores across counties when compared with the risk scores from the week of April 24th. The week of April 24th was chosen as our baseline based on anecdotal evidence that the virus has an incubation period of up to 14 days. As seen from the color-coded map on the left, the areas of primary concern (as of April 24th) were concentrated in the southwest and northeast of the country, along with some pockets of higher risk in the south around Georgia. Over the next 14-day period, from April 25th to May 8th, the virus traversed the country. What is also interesting to note is that the counties that were previously highest risk appear to have gained some measure of control over the virus spread. Although these were largely reactive measures, there is much that can be learned from the success of the measures these high-risk counties put into place in response to their situation.
In summary, we see that the risk score can be a helpful tool to guide public health policy decision-making. The downside is that it reflects what has already happened, and the best public health policy makers can hope to do is intervene to prevent the situation from getting worse. The truest value-add from data analytics lies in surfacing the factors that influence the risk score, so that decision makers can be more proactive in their approach to controlling the spread of the virus (e.g. implementing sweeping safety procedures in and around airports, where people are highly mobile across counties).
Feature Engineering:
We began our predictive modeling exercise by formulating hypotheses to be investigated. Before the first COVID-19 cases were recorded in the US, medical researchers across the globe were already providing valuable anecdotal evidence published through trusted medical channels. Parsed via text analytics, these findings were tremendously useful in guiding our hypothesis design.
The following data were of primary interest to our research:
Age
Gender
Health equity
Travel exposure
Social mobility
Healthcare supply/availability
Adherence to public health policy directives
Of course, some attributes, such as health equity, presented data acquisition challenges that required creative data engineering. We therefore had to scale back our expectations or rephrase the hypotheses in terms of viable data proxies in a few specific areas. As an example of the latter, we re-phrased hypotheses related to health equity through the lens of county-level demographic data attributes. Similarly, due to inconsistent data reporting on adherence to public health policy directives, we eliminated it from our consideration set.
The resultant inventory of hypotheses influenced the data collection and associated data enrichment efforts. These are succinctly illustrated within the accompanying graphic, through a layered feature list.
Feature Importance:
The representation of an analytic challenge as a machine learning algorithm, and the richness of the features feeding the algorithm, have a direct relationship with the insights gleaned from the model. Accordingly, we now examine the machine learning algorithmic construct we selected, and the features that dominated our models:
Target Variable: The target variable describes the analytic objective to be pursued. The viable alternatives in this context would be to estimate the number of infections or mortalities (i.e. regression) or to predict whether the infection rate or mortality rate will increase or decrease (i.e. classification). To keep it simple, we settled on the classification approach and predicted whether the rates would increase or decrease over the next two weeks. Two separate models were developed: one predicted the increase/decrease outcome for the infection rate, the other for the mortality rate.
Independent Variables (or Features): The independent variables explain the variance in the target variable, each to a different degree. Examining them provides the insights derived from the models, which in turn guide decision makers. The table below arranges the independent variables in order of their significance in the models.
Monitoring all these factors can help policy makers formulate and evaluate strategies to contain COVID-19 spread and develop preventative measures for the counties most at risk.
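To make this construct concrete, here is a minimal sketch of the infection-rate classifier described above, assuming a weekly county-level feature panel; the file, columns and feature list are illustrative stand-ins, not our production feature set:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Hypothetical panel: one row per county per week, with engineered features
df = pd.read_csv("county_weekly_features.csv").sort_values(["fips", "week"])

# Binary target: 1 if the infection rate increases over the next two weeks
df["rate_2wk_ahead"] = df.groupby("fips")["infection_rate"].shift(-2)
df["target"] = (df["rate_2wk_ahead"] > df["infection_rate"]).astype(int)
df = df.dropna(subset=["rate_2wk_ahead"])

features = ["pop_density", "pct_seniors", "mobility_index", "airport_proximity", "icu_beds_per_capita"]
X_train, X_test, y_train, y_test = train_test_split(df[features], df["target"], test_size=0.2, random_state=0)

clf = GradientBoostingClassifier().fit(X_train, y_train)
print("holdout accuracy:", clf.score(X_test, y_test))

# Rank the independent variables by their contribution, mirroring the significance table above
print(pd.Series(clf.feature_importances_, index=features).sort_values(ascending=False))
```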
Model Validation:
While there are some exceptional factors responsible for the infection spread, the above features collectively provide a holistic explanation for the spread of COVID-19 across the US.
As an example of an outlying factor that contributes heavily to infection spread and could create inconsistencies in the model if not monitored, the correctional institutions in two Ohio counties (Marion and Pickaway) led to those counties becoming hotspots. The visualizations published by the NY Times from the week of May 11th provide visual confirmation of several key features in our models (population density, proximity to major airports, prevalence of seniors, and regions dominated by a high degree of mobility or a disposition towards underlying health conditions due to various socio-economic or demographic factors).
“…while the individual man is an insoluble puzzle, in the aggregate he becomes a mathematical certainty.”
― Arthur Conan Doyle, The Sign of Four
In general, week over week, our predictive models corresponded well with the county-level COVID-19 trends. To demonstrate the relative accuracy of the models, validated against emergent data, we now pick two counties from different states to analyze the effect of the features that were deemed important by our models. Not every county in these states has been, or is, an emerging epicenter.
Harris County in Texas (Left): Harris County had considerably lower infection and mortality rates at the start of the pandemic. Stay-at-home orders were limited, and we see the virus spread increasing in the county in the following weeks.
Suffolk County in New York (Right): Suffolk County was a virus epicenter in March and April. The state undertook strict measures requiring face coverings, and we can see the resulting impact in the drastically reduced infection and mortality rates.
For both counties, we can see that an increase in mobility is followed by an increase in infection and mortality. In the case of Suffolk County, the decrease in mobility is followed by a decrease in infection, whereas in the case of Harris County, every increase in mobility is followed by an increase in virus spread. The lag between the two charts can be explained by the 14-day incubation period of the virus.
Conclusion:
Understanding the driving factors that affect infection and mortality rates yields critical insights that can lead to both preventative and prescriptive actions. We note in this article the effects generated by factors such as:
Age
Gender
Health equity
Travel exposure
Social mobility
Healthcare supply/availability
Adherence to public health policy directives
Using Advanced Analytics techniques, our objective is to equip policy makers, businesses, and individual citizens alike with the insights to minimize the spread of infectious disease and create data-driven, actionable guidelines that will help us emerge on the other side of this pandemic.
Inspired Intellect is an end-to-end service provider of data management, analytics and application development. We engage through a portfolio of offerings ranging from strategic advisory and design, to development and deployment, through to sustained operations and managed services.
Inspired Intellect is part of the ADI Group, a collection of companies that, collectively, advises on and implements enhanced technological capabilities for enterprises along their digital transformation journey.
Inspired Intellect’s membership in the ADI Group allows its clients access to a broader portfolio of digital transformation solutions and multiple touch points with innovation.
Globalization has brought countries together in more ways than ever before. Consumers, corporations and governments alike now have generally unfettered access to innovations, markets, products and services. While the benefits associated with globalization are many, it also brings associated risks, as we have seen with the recent SARS-CoV-2 virus. Infectious disease specialists have been raising the alarm about the need for an effective and uniform response to these threats, due to the speed at which an infectious disease can spread as a result of our global connectedness.
COVID-19 (the disease caused by the SARS-CoV-2 virus) has completely taken over our lives, with a material effect on the lives of countless global citizens. The question at the top of everyone’s mind is: “How do we adjust to this new normal?”
Specifically, what can we learn about patterns and prevention as we analyze how an infectious disease like COVID-19 migrates and assess how industries are impacted by its spread? This understanding can help inform public health directives that aim to control the migration of the disease, while at the same time alleviating resulting strains on the economy.
Study Objectives:
To develop a better understanding of these patterns, our Data Science teams at Inspired Intellect and WorldLink initiated an R&D project with the hypothesis that advanced analytics could uncover insights to address the above questions. We were also looking for pragmatic applications for deploying our findings to help our clients understand how their businesses would need to adapt to survive in the rapidly evolving new normal.
We focused our research efforts into 4 distinct tracks:
Creating a data lake of information as a foundational pillar for our research
Collating and categorizing experimental treatments, therapeutics and vaccine research into a semantic search-driven library of knowledge to support frontline healthcare workers and medical practitioners as they keep up with trending research in these domains
Social listening and associated unstructured text analysis to identify and surface trending topics and concerns people were talking about
Machine learning and insight generation to identify the factors influencing the spread of the virus and to predict the waxing and waning of virus epicenters over time.
This article is Part 1 of a 2-part blog series focused on the 4th track above: Machine learning and insight generation. This blog series is focused on answering the following questions:
Why are certain counties/cities more affected than others?
Why is there variation in mortality rates among the most infected counties?
What are the underlying patterns and factors for virus spread and mortality?
In this first installment, we will provide recommendations on how to mitigate the spread of infectious diseases, based on our work using county-level data and machine learning techniques. In Part 2, we will explore model data, features and insights.
We feel that a data-driven scientific approach can help answer these questions and, more importantly, inform decision making for a range of stakeholders:
Policy Makers: Have sufficient measures been taken to ensure that the infection spread can be controlled? If not, how do we mitigate the risks?
Business Owners: Is my business a potential contributing vector to the spread of the virus? What measures should we consider implementing relative to operating the business in a manner that is safe for employees and customers?
Individuals: What measures can we as individuals take to help stem the spread of the virus?
Editor’s Note: This blog post was authored to highlight Inspired Intellect’s perspective on how the latest advanced analytics techniques could examine driving factors behind the COVID-19 pandemic and garner recommendations to inform officials in their policy responses. To do this, I co-authored this blog with my colleague, Prashanth Nayak, who serves as a Senior Data Scientist for our partner organization, WorldLink. There were several others across Inspired Intellect involved in the data sourcing and model development necessary to deliver these insights related to the pandemic and potential actions to mitigate its impact.
Our Findings: Guidelines for a Pandemic Playbook
To garner our final recommendations, the Inspired Intellect team ran several machine learning models across a broad intersection of data sets at the local, regional and national levels. The results were surprising and represent actionable steps that stakeholders can follow when seeking to mitigate the negative impact of a pandemic.
Specifically, several learnings from our models serve as primary considerations in the context of developing a pandemic response playbook.
The Need for Data Granularity and Capture Standards: The data we employed was captured at the county level and released for public consumption through the COVID-19 Tracking Project. In the early days of the project’s data reporting, it was clear that data capture standards were not mandated across states and counties – which restricted what was possible via machine learning. Secondly, as was evident in the public domain, it also handicapped public health policy decision makers. Finally, data richness was a constant challenge during this research study. The fact that externally sourced overlays of census tract demographic statistics surfaced to the top of our important features demonstrates the value of capturing demographic, psychographic, socio-economic and pre-existing/underlying health condition data at the case level of detail. Together with a robust contact tracing methodology, these data can provide valuable insights that permit balancing containment measures with keeping the nation’s economy afloat during a pandemic.
A Positive Correlation with Increased Local Decision-Making Autonomy: Our models implicitly demonstrate that local county-level (or perhaps even city-level) autonomy over public health policies may be more effective at preventing the spread and, as some of our key independent variables have illustrated, a county may also need the cooperation of neighboring counties (or cities) in order to succeed. It is true that a broad-brush approach may be appropriate in the initial weeks to give first responders and public health policy makers an opportunity to organize, determine action plans and deploy resources. As we have seen, however, if that time is not adequately utilized to mechanize a credible pandemic response, the county will likely see infection rates escalate, along with an increased likelihood of a shelter-in-place/shutdown order from authorities. Naturally, this has an adverse effect on the health of its people and will eventually cause a drag on the economy.
The Importance of State and Federal Support: Surprisingly significant independent variables, such as a county’s proximity to major airports, illustrate that state and federal support may be better directed at containing international and interstate travel to mitigate the spread. Additionally, we saw that state and federal support was highly effective in mitigating the spread of the disease when it was deployed to ensure adequate access to healthcare facilities. This materialized as ICU beds in our models, but it could easily be extrapolated to everything else that is needed to keep hospitals and ICU facilities operational (from personal protective equipment, oxygen and ventilator equipment to funding virus testing, treatments and vaccine research). Lastly, state and federal resources should be directed towards defining data collection standards, providing recommendations and best practices for the analysis of the collected data as local county (or city) administrators may not have the resources to recognize patterns beyond their local geographies.
Addressing the Health-Equity Gap: Health equity, defined as the ability of citizens across different social stratifications to receive equal healthcare, has emerged as one of the most revealing aspects of COVID-19. While our models indirectly captured its effect through county-level demographic proxies, population density and net migration data, it nevertheless brings to the forefront the health risk faced by the underprivileged. Not only are the populations in these geographies more prone to underlying health conditions because of occupational or lifestyle characteristics, but they often also lack access to adequate medical care or the financial means to afford it, should they become infected. Based on the analysis, programs to address this socioeconomic gap in healthcare access would prove a valuable investment in slowing the spread and fatality rates associated with a pandemic.
Taken together, these points effectively capture the reasons behind the current “state of the COVID-19 battle” in the US. It is certainly not one person, one agency or one thing, but a perfect storm of unpreparedness in the context of recognizing the “who”, “what”, “why”, “when”, “where” and “how” to beat COVID-19 effectively.
Our research demonstrated how machine learning can be a powerful tool in aiding policy makers as they develop appropriate action plans to counter the threat of a pandemic. When interpreting each independent variable separately, it is easy to lose sight of the bigger picture of what the models are telling us.
Behind the Quantitative and Predictive Models That We Used:
Given the increasing volume of data related to the COVID-19 virus, we had a plethora of options for constructing our model. In discovering these insights, the Inspired Intellect team used the following attributes and models:
Data Attributes
COVID-19 daily cases and deaths data for every county within the United States (US), captured and published by the New York Times between January 1, 2020 and May 31, 2020 (Coronavirus (COVID-19) data in the United States, 2020); a loading sketch follows this list
Socioeconomic and health equity characteristics such as population size, unemployment rate, occupation, household income, household size, and ICU beds for every county within the US (county-level socio-economic data in the United States, 2019)
Land area in square miles, population, domestic and international migration data, gender proportions, and age groups for every county within the US, captured and published by the United States Census Bureau (county-level census data in the United States, 2019)
Mobility data reporting movement trends over time by geography, across different categories of places such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential (Google mobility data)
Airport data: Proximity of airports for each county, importance of the airports (openflights.org and the Bureau of Transportation Statistics)
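For the first of these sources, here is a minimal loading sketch; the raw-file path is the one the New York Times publishes in its public covid-19-data GitHub repository, and should be verified before use:

```python
import pandas as pd

URL = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
nyt = pd.read_csv(URL, parse_dates=["date"])  # columns: date, county, state, fips, cases, deaths

# The NYT series is cumulative; difference it into daily new cases and deaths per county
nyt = nyt.sort_values(["fips", "date"])
nyt["new_cases"] = nyt.groupby("fips")["cases"].diff().clip(lower=0)
nyt["new_deaths"] = nyt.groupby("fips")["deaths"].diff().clip(lower=0)

# Rolling 14-day case totals feed the risk score and modeling work described earlier
nyt["cases_14d"] = nyt.groupby("fips")["new_cases"].transform(lambda s: s.rolling(14, min_periods=1).sum())
```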
Models
Quantitative risk score model: A risk score is assigned to each county based on the county’s rate of change in infection rates, mortality rates, and population density. This model’s purpose is to aid our clients in preparing action plans for relevant counties.
Infection rate machine learning model: Model predicting if a given county’s infection rate will increase from the previous week. Model also produces the drivers (feature importance) behind infection rate increase and helps analyze hotspot changes over time.
Mortality rate machine learning model: Model predicting if a given county’s mortality rate will increase from the previous week. Model also produces the drivers (feature importance) behind mortality rate increases and helps identify vulnerable demographics.
Enabling Technologies That We Used:
It is a well-known fact that analytical models are only as good as the data they employ. The importance of consistent data reporting standards, incremental expansion of the data assets considered, and periodic retraining of the models against the latest data cannot be overstated. In order to deliver such capabilities consistently, and at scale, we must acknowledge the role of a strong data management foundation.
As we began our research initiative during the early stages of the pandemic reaching US territory, we faced significant data challenges. As we noted earlier, COVID-19 data reporting standards were nascent, if they existed at all, and they were evolving. Consequently, acquiring data that was collected in a consistent manner across all counties was difficult and required a large amount of transformation. Another challenge we encountered was the lack of a historical audit trail of day-over-day statistics related to infections, mortalities and recoveries. Yet another hurdle centered on the fact that data formats were not consistent in the early days. Eventually, open-source initiatives such as the COVID-19 Tracking Project emerged that alleviated some of our initial challenges, but even those required experience with semi-structured and unstructured data management capabilities to extract, store and transform data within JSON files or PDF reports.
The table below summarizes our technology choices for what began as an internal R&D effort but is rapidly expanding into an offering at the request of several customers. For the opportunities presented by the latter, several other technologies present themselves as viable alternatives within our offering's technology stack.
Technology choices will ultimately be guided by a variety of factors – from the most obvious, such as alignment with your enterprise architecture strategy and ease of integration with other enterprise business applications, to the less obvious, such as performance scalability of a solution that is consistently evolving over time or the flexibility to adapt to a variety of data and analytics needs as your business evolves. To ensure success in this initiative, the Inspired Intellect team relied on its vast experience in analyzing the solution landscape to align enterprise/data fit with best-in-breed tools.
Conclusion:
We discovered during our research that it is nearly impossible to earn an “A” grade while trying to mitigate the effect of a global pandemic, but that it is rather easy to earn a “B”. COVID-19 has upended the lives of numerous individuals, families, businesses, and countries, and our goal is to use the latest advanced analytics techniques to raise the floor for our global citizens and improve our chances of being successful, now and in the future of increased globalization.
Amidst the uncertainty brought by COVID-19, the team at Inspired Intellect, under the direction of our Chief Data Scientist Brian Monteiro, PhD, initiated a data analytics project to help address the pandemic. We analyzed data across three tracks: quantitative forecasting, Twitter sentiment, and Research & Development studies. Our goal is to support front-line healthcare workers, healthcare product distributors, and pharmaceutical companies while showcasing our team’s expertise in turning data into actionable information. After aggregating the data, we discovered:
Mobility and airport locations correlate strongly with virus hotspot areas
Personal protective equipment (PPE) is a long-term trending topic of discussion on Twitter, with opportunities for businesses to address equipment shortfalls through donations and distribution expertise (search for “PPE Donation” on Twitter)
R&D for virus vaccines is highly concentrated in the United States and China, while Italy and India continue to investigate best practices for containing the virus through other medical procedures
While this initiative was focused on healthcare and the COVID-19 pandemic, our methodology can be applied to address issues in several other industries. For more on how these insights were generated within each track, the following sections contain data and charts that led to their discovery.
Research & Development Navigation
The World Health Organization (WHO) maintains a database that stores references to research papers focused on the Coronavirus pandemic. It is a CSV file that lists the title, authors, an abstract, and, most importantly, Document Object Identifiers (DOIs) for each paper. This database is an attempt to aggregate global C19 research efforts so that the world may better collaborate on discoveries and treatments and ultimately accelerate the process of finding a cure.
When first downloaded on May 12, the CSV file contained over 16K papers. To run meaningful analytics, the most useful data point across these documents is the DOI.
With this unique code, I was able to enrich the data set with information from other data sources on authors, cover dates, and hyperlinks for each paper. For example, the CSV file doesn’t contain a URL address linked to research papers. Using a REST API from DOI.org, I was able to find a hyperlink on the web to read the papers. I used another service to find information about authors, their affiliations, and the cover date related to the associated journal.
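The specific enrichment services are not detailed here, but DOI.org supports standard content negotiation, which returns citation metadata (title, URL, authors, dates) for a DOI. A minimal sketch of that call:

```python
import requests

def enrich_doi(doi: str) -> dict:
    """Fetch citation metadata for a DOI via DOI.org content negotiation."""
    resp = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
        timeout=30,
    )
    resp.raise_for_status()
    meta = resp.json()
    return {
        "title": meta.get("title"),
        "url": meta.get("URL"),  # hyperlink at which the paper can be read
        "authors": [f"{a.get('given', '')} {a.get('family', '')}".strip() for a in meta.get("author", [])],
        "issued": meta.get("issued", {}).get("date-parts"),  # cover date
    }

# Example call with an illustrative DOI:
# print(enrich_doi("10.1000/182"))
```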
Once the data was enriched, I set up boundaries and dimensions for exploring the data, yielding this interesting publication timeline.
It is interesting to note the number of papers already published in January of 2020, despite the fact that the WHO did not declare C19 a pandemic until March 11, 2020. Around 48% of those papers are from China and the United States.
You can see a long tail on this timeline, representing a sustained interest from researchers on the topic, and it will most likely continue in this manner until a vaccine is discovered.
For reference, the enriched data looks like the table shown below.
Enriching the data enables several intriguing navigation scenarios, such as developing the timeline shown above or segmenting by geography.
As one might expect, China, as ground zero for the outbreak, accounts for a significant share of the research papers published.
When looking into China with greater granularity, we see that the epicenter, Wuhan, has 168 research papers published about the virus.
Finally, the enriched data set contains a metric denoted “Cited By”. Like “upvoting” in popular forums, this measurement indicates the value associated with a paper: people are reading it and citing it in other research.
During my analysis, I discovered several keywords popular in the abstracts. Using keywords, along with “Cited By” metrics, I built the navigational tool shown below. The small pink bubble on the top left indicates that there is one research paper that has been cited 1,062 times and contains the keyword “treatment”.
You could use this interface to pick keyword(s), select high-performing papers that have been heavily cited, and then narrow further with a title search. As an example, four research papers contain the word “herb” in the title, and one has been “Cited by” 5 times with the keyword “test”.
Beneficiaries of these Insights
With these tools, we identified patterns across Research & Development papers that could triangulate useful information to enable patient treatment and cure discovery. These tools are engineered to serve the following parties:
Frontline Workers
These research papers contain information for healthcare workers looking for therapy best practices, along with promising vaccine treatments. A colleague recently told me that his spouse, who is a caregiver, spends hours each night reading papers to understand how best to treat patients. By setting up navigation scenarios, combined with search capabilities, front-line workers can quickly identify emerging trends and navigate a wealth of information, segmented by content type.
Pharmaceutical Companies
This information can bridge silos among global pharmaceutical companies. It encourages collaboration where teams of researchers could combine efforts to quickly produce a vaccine. The data contains information on equipment suggestions for hospitals and pharmacies. The lack of personal protective equipment has been a significant supply chain issue in the United States, and some of these research papers list helpful products that pharmacies should stock to better prepare for future outbreaks.
As a quick note on methodology: the goal of enriching the data and thinking about potential navigation scenarios is to enable people with different interests to sift through all these research papers and find actionable information. You could simply put this data into a search engine and have a text box as the only interface. I wanted to make it more of a guided navigational and data discovery experience by thinking of boundaries such as timeline, geography, authors, keywords, and “Cited by” metrics.
The navigation dashboards were built with Microsoft Power BI and are exposed on the public web; anyone can use the tool and explore the data.
Twitter Sentiment Analysis
For our second analytics track, we tapped into the wisdom of the crowds by leveraging Social Media. To derive insights, we focused on Twitter by analyzing tweets for trending hashtags, sentiment analysis, and user network effects. We used Twitter’s public API and focused on three distinct topics:
COVID-19
Distributor
Manufacturer
Over the course of 26 days, we analyzed 1.8M tweets and 110,000 unique hashtags. There are approximately 500M tweets on Twitter per day. While this is a small sample, it does reveal important trends.
To ascertain the critical mass and momentum of a trending topic, we calculated the slope of individual hashtags and focused on hashtags with greater than 100 observations and a slope greater than 10. These boundaries reduced the number of hashtags we analyzed from 110,000 to 17.
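A minimal sketch of that filter, assuming a table of daily tweet counts per hashtag (the file and column names are assumptions):

```python
import numpy as np
import pandas as pd

daily = pd.read_csv("hashtag_daily_counts.csv")  # assumed columns: hashtag, day_index, count

trending = {}
for tag, grp in daily.groupby("hashtag"):
    if grp["count"].sum() <= 100:  # critical mass: more than 100 observations
        continue
    slope = np.polyfit(grp["day_index"], grp["count"], deg=1)[0]  # least-squares trend
    if slope > 10:  # momentum: sustained growth, not a short-lived spike
        trending[tag] = slope

for tag, slope in sorted(trending.items(), key=lambda kv: -kv[1]):
    print(tag, round(slope, 1))
```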
Below is a screenshot with three popular hashtags highlighted:
#PPE
#Hydroxychloroquine
#lockdown
The positive slope generated by thousands of observations indicates a long-term trend. Most hashtags on Twitter are short-lived: they bubble up and become popular quickly, but they can also disappear just as quickly. The parameters above allowed us to find a valuable signal in these points.
While the above topics exhibited tenure among the general public, we were also able to ascertain which topics were temporal in the C19 race for a cure. Using the following charts, compare the above trends to the negative slope generated by the following hashtags, which visibly fell out of favor:
#Covidtracking
#lockdown4guidelines
#chloroquin
#ENECOVID
Using #chloroquin as an example, we can see a pattern where a potential C19 silver bullet fell out of favor with the scientific community and lost momentum with the general public. We can also compare this to the rise of hydroxychloroquine, which saw sustained interest in both the scientific and global communities as a potential treatment.
The team analyzing Twitter data is working on a detailed paper on their findings and plans to publish their report over the next few weeks. Look for more insights into how analyzing Twitter with keywords, hashtags, and user analytics can benefit any business trying to meet demand and realize opportunities through the immediacy of this social network.
Quantitative Analysis and Risk Score
The final analytic opportunity targeted understanding the factors that contributed to the rapid migration of the pandemic. This research is useful to public health policy administrators, healthcare providers and pharmaceutical suppliers (wholesalers and manufacturers alike), all seeking to mitigate expansion while also distributing equipment to meet dynamic needs. Employing machine learning against a variety of externally sourced data assets, ranging from COVID-19 infection and mortality statistics and health equity data attributes to county-level mobility and airport data, our resultant model was highly correlated with COVID-19’s migration.
The charts above illustrate the outcomes produced by our models, showcasing how they closely replicated the actual migration behavior of COVID-19. Regions that were high-risk (as of Apr 24) had declined in severity by May 9, while other regions that were previously lower risk increased in severity over time.
Follow our series of blog posts on our COVID-19 study to learn more about the insights we garnered while developing the machine learning models.
Final Considerations
The COVID-19 pandemic has captured the attention of the entire world. The concerted response of global organizations and citizens is equally unprecedented. With the incredible resources dedicated to finding a cure, it is possible to pool these data assets to leverage the work being performed across the world and develop meaningful applications. Using this data, we can predict infection hotspots and the key indicators that contribute to spread, such as “mobility”. We can also analyze R&D studies to find promising therapies and equipment to help patients. Lastly, we can analyze sentiment in real-time on Twitter, triangulate trends and needs to specific geographical locations, and determine whether there are any remediations to help, such as donating personal protective equipment locally.
We have an opportunity to set a historical precedent on how we use technology to collaborate on solutions and potential remedies. As a global society, we can address and mitigate the risks of infection and death while examining events at both the local and global levels. At Inspired Intellect, we are leveraging our deep data and analytics expertise to develop a meaningful solution for our global stakeholders. Our team has a wide breadth of understanding of the technology tools needed to develop and deliver data assets that generate insights akin to those presented in this article. There are many ways that we all can contribute to addressing this global pandemic, even if it is simply wearing a mask or “staying at home”. At Inspired Intellect, we wanted to use analytics to support the decisions made by those taking this disease head-on.
Watch for more in the coming weeks explaining the Machine Learning for our hotspot forecasting model along with a more in-depth discussion on Twitter insights.
Methodology, Technology and Data
This project was focused on healthcare and the COVID-19 pandemic, but our methodology resembles that of projects we have completed in various other industries. In summary, we:
Searched and found datasets
Enriched data with several complementary data services
Transformed and loaded enriched data in multiple database technologies
Explored the data
Developed navigational tools, along with predictive algorithms for future observations
Automated the entire process while paying attention to future data changes (also referred to as Change Data Capture)
Deployed all code assets to the cloud
For this project, we developed and reused Java code assets to make REST calls to open data sources. The REST call responses were in JSON format, which was perfect for storing data in a NoSQL database. We used Couchbase because of N1QL, its advanced SQL-like query language for JSON. Couchbase, along with the Spark connector, enabled quick keyword analysis for the R&D papers. The Couchbase Full-Text Search capability enabled us to quickly assign keyword flags to specific articles and surface this analysis in Power BI.
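Our pipeline code was Java; purely as an illustration, here is roughly what such a keyword-flagging query looks like through the Couchbase Python SDK (4.x) and N1QL’s SEARCH() function. The connection details, bucket and field names are assumptions:

```python
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions  # Couchbase Python SDK 4.x layout assumed

# Illustrative connection; credentials and bucket/field names are assumptions
cluster = Cluster(
    "couchbase://localhost",
    ClusterOptions(PasswordAuthenticator("analyst", "password")),
)

# N1QL with SEARCH() leans on the Full-Text Search index to flag papers
# whose abstract mentions a keyword of interest
result = cluster.query(
    """
    SELECT META(p).id AS doc_id, p.title, p.cited_by
    FROM research_papers AS p
    WHERE SEARCH(p.abstract, "treatment")
    ORDER BY p.cited_by DESC
    LIMIT 20
    """
)
for row in result:
    print(row["title"], row["cited_by"])
```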
For other quantitative data sources, such as infection rates, the format was CSV. For this data, we used a traditional Postgres ER database. The machine learning models, along with some data preparation code assets, were built in Jupyter notebooks and Python.
All our code assets and databases were hosted on an Amazon AWS Free Tier cluster. The hardware costs were minimal because we leveraged AWS Free Tier machines and only had to pay for storage. The Power BI reports are stored in Microsoft’s cloud, where we have a 60-day free usage limit. Power BI connects to the AWS databases and refreshes the dashboards frequently.
According to a 2017 report from Dimension Data, customer analytics is the second-highest rated factor in driving positive customer experience and is projected to be the leading factor in coming years. The findings in this report remain valid today. In the same report, only 48% of respondents say their organizations currently have analytics systems, and only 36% possess big data analytics solutions that are delivering real value. What we have found is that market leaders, regardless of vertical, are almost always in that 36%, and the existence of systems and tools is only a starting point for maximizing the value that this data paradigm offers.
Fortunately, most companies already capture much of their customers’ transactional data (e.g. product purchases, website usage, campaign returns, focus group surveys). Having this structured data shared across the organization as a “single source of truth” is essential to your success.
Advanced use of structured data is now table stakes, however, and is not sufficient to compete in today’s market. Leaders of every industry are unlocking valuable insights through the infusion of unstructured data via internal and external sources.
The unfiltered feelings, thoughts, emotions, and underlying decision-making processes of your customers are best captured in the moment that they occur – that is, in their tweets, call center notes, online reviews, phone calls, emails, and posts. A mature MDM strategy and initiative serves to link these disparate data sources and types and then consolidate them into a set of single “golden” records, so that organizations can fully understand their customers’ behaviors and motivations, and tailor individualized customer engagement accordingly. With MDM tied to transactional, historical, and unstructured data, imagine the impact you can have with a current or potential customer within moments of them telling you what they are thinking! In our experience, most organizations lack a clear strategy or capability for doing this and end up hemorrhaging insights that are critical to their business objectives.
Let us review a typical customer journey and life cycle, shown below.
Figure 1: Typical Customer Lifecycle / Journey, where data is being generated at each activity
Customers can follow many paths in their journey. Throughout their experience, there are large volumes of data being generated. These include product recommendations, tweets, posts, thumbs up/down, raves, complaint emails, voice calls, and others, which instantly capture authentic feedback from customers and offer valuable insights.
The goal of modern MDM initiatives and tools is to synthesize all this structured and unstructured customer information into a unique, tailored profile, as a “single source of truth”. Then, you can incrementally grow your knowledge and understanding of your customers in order to provide a better experience. This holistic view will impact your abilities to increase retention or drive new acquisition strategies. MDM, coupled with a well-planned data governance process, enables this by serving to link a company’s internal customer knowledge with what the customer is saying in other domains. This allows companies to extract deeper insights, create more effective analytics, and build richer artificial intelligence/machine learning (AI/ML) models that can understand, predict, and influence this behavior.
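As a toy illustration of that consolidation step (a real MDM tool such as SAP MDG applies far richer match, merge and survivorship rules), the sketch below collapses duplicate customer rows into golden records keyed on normalized identifiers; the column names are assumptions:

```python
import pandas as pd

# Hypothetical customer extracts from several source systems
customers = pd.read_csv("customer_extracts.csv", parse_dates=["updated_at"])
# assumed columns: name, email, phone, source, updated_at

# Normalize identifying fields so trivially different duplicates collide on the same key
customers["match_key"] = (
    customers["email"].str.strip().str.lower().fillna("")
    + "|"
    + customers["phone"].str.replace(r"\D", "", regex=True).fillna("")
)

# Simple survivorship rule: within each match group, keep the most recently updated record
golden = (
    customers.sort_values("updated_at")
    .groupby("match_key", as_index=False)
    .last()
)
print(len(customers), "source rows ->", len(golden), "golden records")
```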
In order to benefit from the accumulated value, the “trusted” data must be consumable by people and processes as quickly and easily as possible. This is best accomplished using a cloud-based, scalable solution such as SAP Data Warehouse Cloud (DWC), which provides an out-of-the-box, enterprise-ready data warehouse (SaaS) that is elastic, cost-effective and easily consumable, and that integrates tightly with S/4HANA and other non-SAP systems.
Recommended Approach
At a high level, we make the following recommendations when you undertake a customer-focused MDM initiative.
Use the Customer’s Initiatives as a North Star: Develop and execute an MDM strategy that aligns with customer business initiatives and contains multiple coordinated workstreams. This will ensure that meaningful value is created as workstreams are delivered.
Align Tools & Techniques: Deploy a leading MDM and data governance tool, such as SAP Master Data Governance (MDG) on SAP S/4HANA, which provides comprehensive capabilities to master all customer-related data. The tool must:
Consolidate and create golden customer records by incorporating unstructured data
Allow the creation of data quality and validation rules
Offer enrichment of name, address and identification-information from external 3rd party data providers like D&B, Melissa, etc., along with SAP’s S/4HANA Cloud for Data Enrichment tool
Feature a role-based governance workflow engine
Provide mass-processing, and match/merge functionality
Harmonize and replicate data back to corporate applications
Contain standard Search/Create/Read/Update/Delete/Block processes
Lead with incorporation and mastering of unstructured customer data using the MDM tool along with modern (preferably cloud-based) data architectures to speed deployment
Focus each workstream on a measurable business use case. Examples include increasing customer loyalty based on repeat purchases, or increasing retention based on churn metrics
Each of these recommendations obviously contains much more detail, and we will continue to elaborate on them in future articles. The concept is simplified below in Figure 2.
Figure 2: High level concept of Unstructured Data and MDM to generate business value
We have been fortunate to partner with many clients who have transformed their customer engagement by prioritizing their Master Data Management strategy and combining it with a complete end-to-end SaaS solution like SAP DWC, to deliver impressive business impact. Three of these case studies are summarized below.
A leading mid-sized specialty hydrocarbon products company needed to improve operational excellence. This required consistent and clean data with good governance early in their S/4HANA migration journey towards digital transformation. We helped them by implementing the SAP S/4HANA MDG tool. This resulted in increased sales effectiveness through a consolidated customer view and improved supply chain efficiency and procurement decisions.
A large event management client launched a customer-focused initiative, needing timely access to customer data to drive revenue growth opportunities. The goal of this initiative was to provide insights to the sales team across customer preferences, demographics, usage, and behavior. Using the MDM tool, we were able to dramatically reduce duplication, from over 1 million customer records to 400 thousand unique customers, and decrease the possibility of creating new duplicates to virtually zero. The client also used the MDM tool to:
Identify high value customers
Identify customer behavior and purchase patterns
Incorporate customer feedback into their marketing and sales processes.
This resulted in an increase in profitable customers, greater revenues from the sale of related products, and increased customer loyalty.
* Note that with the recent availability of DWC, this could be re-implemented as a Customer Data Mart.
A mid-tier destination airline sought to drive revenue growth by better activating their customer base and reducing customer attrition. The airline recognized attrition as a critical lever for growth and was experiencing attrition rates of nearly 60%. Retaining customers was paramount for achieving strategic business objectives, considering that customers were only able to book travel on the company website, not through third-party outlets.
Objective:
Develop a predictive model that will enable the airline to grow their customer base by:
Reducing revenue leakage resulting from customer attrition
Illustrating how attrition varies by customer category (new vs. returning vs. win-backs)
Proactively communicating with customers at-risk of attrition
Approach:
Build a Predictive Analytics Attrition Model that:
Identifies customer characteristics that explain attrition likelihood
Scores customers’ likelihood to attrite within the subsequent year
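As a minimal sketch of such a model (not the client’s actual implementation), attrition likelihood could be scored with a logistic regression over customer-level features; the file, column and feature names here are illustrative:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical table: one row per customer; attrited = 1 if no booking in the following year
customers = pd.read_csv("customer_features.csv")
features = ["days_since_last_booking", "bookings_last_year", "avg_fare", "customer_category"]

# One-hot encode the customer category (new vs. returning vs. win-back)
X = pd.get_dummies(customers[features], columns=["customer_category"])
y = customers["attrited"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# Score every customer's likelihood to attrite within the subsequent year
customers["attrition_score"] = model.predict_proba(X)[:, 1]
```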
Outcome
Working with the client, Inspired Intellect:
Advised complementing the attrition model with predictive retention and conditional spend models
Suggested overlaying demographic data to provide visibility into customer personas and to design personalized offers and targeted communications for at-risk customers
Advocated data management initiatives to permit employing other internal assets to enhance the model’s predictive power
Impacts:
Identified leading indicators of customer attrition – inclusive of actionable business rules logic that could be leveraged ahead of full model operationalization
Established ~$70.5M in incremental revenue retention opportunity (over the business as usual/random model)
Corroborated the predictive value of the airline’s internal and external data sources to refine the roadmap and next steps
Applied model outcomes to fuel hedging and other capacity planning efforts
An F100 client recognized that its journey to a data-empowered organization must begin with understanding its customers. The client sought to more effectively identify their customer personas and, beyond that, their customers’ journeys and preferences.
Objective:
Develop customer analytics, inclusive of purchase behaviors, store vs. eCommerce engagement, and channel preferences
Demonstrate the scalability and elasticity of MS Azure cloud services for data and analytics
Approach:
Employ customer transaction data, customer payment data and associated customer identities captured in interactions, to isolate a business ontology of the customer data domain
Adopt the Cross-Industry Standard Process for Data Mining (CRISP-DM) for navigating through the use-case analysis – thus defining specific touchpoints for reviews and feedback
Migrate the client’s customer data assets to MS Azure data science virtual machine (DSVM) instance, employing Azure SQL for data management and Azure ML services for advanced analytics
Insights:
A majority of transactions contained PII, which allowed us to consolidate duplicate customer IDs
Predictability of incremental revenue increase as a result of addressing the top 20% and 80% of customers, respectively
Clustering on RFM and NADR features yielded a Customer Segmentation that indirectly differentiated customers by other characteristics that were not exposed to the algorithm – and were inferred through profiling the segments
Among customers who made 2+ purchases in the first half of the month, Recency and Frequency features explained most of the variance in their likelihood to return in the second half of the month
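The NADR features and the client’s actual segmentation logic are not reproduced here, but as a sketch of the RFM side, recency/frequency/monetary features can be derived from a transaction log and clustered; the names below are assumptions:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical transaction log: one row per purchase
tx = pd.read_csv("transactions.csv", parse_dates=["date"])  # assumed columns: customer_id, date, amount

# Recency / Frequency / Monetary features per customer
snapshot = tx["date"].max()
rfm = tx.groupby("customer_id").agg(
    recency=("date", lambda d: (snapshot - d.max()).days),
    frequency=("date", "count"),
    monetary=("amount", "sum"),
)

# Segment customers by clustering on the standardized RFM features
X = StandardScaler().fit_transform(rfm)
rfm["segment"] = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Profile the segments to surface characteristics the algorithm never saw directly
print(rfm.groupby("segment").mean())
```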
Impacts:
Established customer identification as a critical step in the client’s data and analytics journey
Provided a roadmap for operationalizing the customer identification methodology developed as part of this POC, and leveraging the same to develop the client’s customer analytics capabilities
Defined an add-on initiative that further scaled learnings from this POC and advanced the client’s operationalization of advanced analytics
The data needs of a global pharmaceutical firm required a central data repository across multiple Electronic Data Capture systems to tease greater insights out of clinical drug trials and to meet regulatory requirements.
Objectives:
IT Objective: Implement a new data system to allow for structured and unstructured data to be extracted, loaded and processed.
Business Objective: Reduce cost and time-to-market while increasing data quality and accuracy
Approach:
Inspired Intellect implemented a Data Capture Hub (DCH) to:
Ingest raw data from a diverse variety of enterprise systems
Harmonize, map and transform data into consistent structures
Provide APIs and services to access and consume data for operational and regulatory compliance
Outcomes:
Through our engagement, the client modernized their infrastructure to a production ready system capable of managing the capture, transformation and consumption of clinical trial data. This data hub solution:
Enabled SMEs to define and configure data processing rules
Produced adverse event reporting for regulatory compliance
Established a foundation for building additional data services in the future
Impacts:
Significant increase in accuracy and timeliness of the reporting of Adverse Events – one of the most important features of the clinical trial process, as it impacts the safety of the patient.
Planned retirement of obsolete & costly systems, with minimal business disruption during the transition
Increased access to critical data for mission-critical business processes
Data can be a tremendous source of value – but tapping into it is proving to be increasingly challenging for the enterprise:
How do you engage with the portfolio of advanced analytics opportunities and stakeholders across the organization?
How efficiently can you produce actionable insight from multiple source systems?
How do you prioritize competing analytics requests from your business?
Do you have sufficient analytics expertise to meet business needs?
Objective:
A Fortune 10 pharmaceutical wholesaler carried a wealth of data sources across their enterprise and business units but wasn’t capitalizing on their data assets to inform their operations and identify market-related trends.
Approach:
The shift towards transforming into a digital enterprise warrants a fresh approach in the context of analytics and data management.
An advanced analytics center of excellence (AA CoE) consolidates the right skills, strategies, processes, and technology to allow the enterprise to increase competitive advantage and accelerate innovation.
Inspired Intellect developed and executed an AA CoE roadmap that delivered the following:
Vision, Charter and Goals of the AA CoE
Standards and Methods for Use-Case Analysis & Prioritization, Adoption Measurement and more
AA CoE Organizational Structure
Governance and Ethical AI framework
A technology platform that could scale to support the analytics needs of a distributed Fortune 10 data team
Outcomes:
Defined an optimal CoE structure and the processes required to support it – including the business leaders, data scientists, and technical specialists.
Advised approach for prioritizing use cases and maintaining available, secure, and usable data.
Identified appropriate tools and technologies.
Recommended governance considerations to ensure ethical use of data and ML/AI.
Impacts:
Powered by the right skills, processes, strategy and tools, the AA CoE enables the enterprise to do more with its volumes of data:
Improve business agility to respond to market/competitive threats and regulatory demands.
Respond effectively to evolving customer needs and supplier/procurement risks.
Reduce logistics & distribution operational costs.
Create greater value for stakeholders.
More significantly, an AA CoE will enable all of these to be done better, faster and more efficiently, while driving continuous insight loops back to the enterprise.
A Global Payments Services client was looking to scale their AI/ML capabilities beyond serving the enterprise. Motivating factors for creating new data assets included the potential to monetize their data and AI/ML models across their customers and to deliver ethical (not-for-profit) AI/ML for society.
Objective:
Assess the client’s AI/ML platform needs and gaps to recommend solutioning approaches that would enable their strategic goals of data and AI/ML model monetization.
Approach:
Deployed surveys, conducted user workshops and researched the vendor/provider landscape to provide:
Needs and Gap Assessment
Solution Recommendations that would support both current AI/ML capabilities and future needs.
Outcomes:
Inspired Intellect delivered an AI/ML reference architecture that reflected the client’s nuanced innovation and regulatory needs while also serving as the framework of choice for socializing gaps and building consensus on prioritization of the next steps.
Inspired Intellect also provided a robust and reusable mechanism to evaluate and rank AI/ML solutions from third-party solution providers.
Impact(s):
Inspired Intellect’s solution roadmap formed the cornerstone for capital funding for gap closure initiatives