Saturday 4 September 2021

Moving ML from “best guess” to best data-based decisions

IBM Causal Inference 360 Toolkit offers access to multiple tools that can move the decision-making processes from “best guess” to concrete answers based on data.

The application of machine learning (ML) models by data scientists has paved the way for our current era of big data. Such traditional ML models have become highly successful in predicting outcomes based on the data. For example, they're good at answering: “What is my likelihood of developing a specific health condition?”

ML models, however, are not designed to answer the question of what can be done to change that likelihood. This is the concept of causal inference. And until recently, there have been few tools available to help data scientists train and apply causal inference models, choose between the models, and determine which parameters to use.

At IBM Research, we wanted to change this. Enter the open source IBM Causal Inference 360 Toolkit. Released in 2019, the toolkit is the first of its kind to offer a comprehensive suite of methods, all under one unified API, that aids data scientists in the application and understanding of causal inference in their models.

Today, we’re excited to unveil the latest in these efforts—a new, customized website for the Causal Inference Toolkit, complete with tutorials, background information, and multiple demos showcasing the package’s abilities in multiple domains, including healthcare, agriculture, and marketing in the financial and banking sectors. Concurrent with this new website, we’re also releasing a new version of the open-source Python library with additional functionalities.

What is causal inference?


All decision-making involves asking questions and trying to get the best answer possible. Take the question: “What happens if I eat eggs every day for breakfast?”

Depending on what is being measured and what additional factors are involved, the answer could vary widely. What if the people who tend to eat eggs for breakfast every morning are also people who work out every morning? Perhaps the difference that we see in the outcome is driven by the exercise and not by eating eggs.

This is called a confounding variable—affecting both the decision and the outcome. And that’s what causal inference tries to resolve. What is the answer to the question after controlling (as much as possible from the data) for the confounding variable?
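To make the egg-and-exercise example concrete, here is a minimal, purely illustrative sketch with simulated data and made-up effect sizes (not from any real study): exercise drives both the decision (eating eggs) and the outcome (weight change), so a naive comparison is misleading, while comparing within exercise groups removes most of the bias.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Simulated confounder: whether a person exercises regularly.
exercises = rng.binomial(1, 0.4, size=n)

# People who exercise are also more likely to eat eggs for breakfast.
eats_eggs = rng.binomial(1, 0.2 + 0.5 * exercises)

# Outcome (weight change): exercise helps a lot, eggs have no effect here.
weight_change = -2.0 * exercises + rng.normal(0, 1, size=n)

df = pd.DataFrame({"exercises": exercises, "eats_eggs": eats_eggs,
                   "weight_change": weight_change})

# Naive comparison: egg eaters appear to lose weight...
naive = (df.loc[df.eats_eggs == 1, "weight_change"].mean()
         - df.loc[df.eats_eggs == 0, "weight_change"].mean())

# ...but comparing within each exercise group (controlling for the confounder)
# the apparent effect essentially vanishes.
by_group = df.groupby(["exercises", "eats_eggs"])["weight_change"].mean().unstack()
adjusted = (by_group[1] - by_group[0]).mean()

print(f"naive difference:    {naive:+.2f}")
print(f"adjusted difference: {adjusted:+.2f}")
```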

Next, we try to account for how the outcome is influenced by different parameters (e.g., how many eggs are eaten, what is eaten with them, whether the person is overweight, and so on). We also need to define exactly what we are looking for (e.g., are we interested in whether the person will gain weight, sleep better, eat less during the day, or lower their cholesterol?).

In short, it might be easy to start off with one question that can be answered using data. But to get a reliable answer, we need to fine-tune the parameters involved, and the type of model being used.

Figure 1:
A schematic of the pipeline that guides model selection and cohort definition in causal inference. The pipeline is an iterative process in which (a) the causal question is defined and a data matrix is extracted; (b) the causal method is chosen; (c) the underlying machine learning models are chosen; and (d) model performance is evaluated. If the models perform well, the causal prediction can be drawn to estimate outcomes and effects; otherwise, the process is reiterated with modifications to steps a-c.

Help from the IBM Causal Inference 360 Toolkit


Causal inference consists of a set of methods that attempt to estimate the effect of some intervention on some outcome from observational data. With the IBM Causal Inference 360 Toolkit, individuals have access to multiple tools that can move their decision-making processes from a “best guess” scenario to concrete answers based on data.

The IBM Causal Inference 360 library is an open-source Python library that uses ML models internally and, unlike most packages, lets users seamlessly plug in almost any ML model they want. It also provides methodologies for selecting the best ML models and their parameters, based on ML paradigms such as cross-validation and on well-established as well as novel causal-specific metrics.
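To give a flavor of what this looks like in practice, the sketch below follows the library's documented pattern for inverse probability weighting (IPW) on the observational dataset bundled with the Python package (published as causallib). Exact class and function names may differ between versions, so treat this as an illustration rather than a reference.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

from causallib.datasets import load_nhefs
from causallib.estimation import IPW

# Bundled observational study: covariates X, treatment assignment a, outcome y.
data = load_nhefs()

# Inverse probability weighting, with a plain scikit-learn classifier
# plugged in as the propensity model.
ipw = IPW(LogisticRegression(max_iter=1000))
ipw.fit(data.X, data.a)

# Estimated average outcome had everyone been treated (a=1) vs. untreated (a=0),
# and the difference between the two.
outcomes = ipw.estimate_population_outcome(data.X, data.a, data.y)
print(ipw.estimate_effect(outcomes[1], outcomes[0]))

# Because the learner is just a parameter, swapping in a different model
# keeps the rest of the causal workflow unchanged.
ipw_boosted = IPW(GradientBoostingClassifier())
ipw_boosted.fit(data.X, data.a)
outcomes_boosted = ipw_boosted.estimate_population_outcome(data.X, data.a, data.y)
print(ipw_boosted.estimate_effect(outcomes_boosted[1], outcomes_boosted[0]))
```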

Figure 2:
Examples of graphical evaluations of causal inference models available in the toolkit. These can help data scientists select better-performing models among several options or detect problems in the data.
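The sketch below is not the toolkit's own plotting API; it simply computes by hand two of the diagnostics that such plots typically visualize: the cross-validated discrimination of a candidate propensity model, and the covariate balance induced by its weights.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def standardized_mean_differences(X, a, w):
    """Per-covariate standardized mean difference between groups, using weights w."""
    a = np.asarray(a)
    out = {}
    for col in X.columns:
        x = X[col].to_numpy(dtype=float)
        m1 = np.average(x[a == 1], weights=w[a == 1])
        m0 = np.average(x[a == 0], weights=w[a == 0])
        pooled_sd = np.sqrt((x[a == 1].var() + x[a == 0].var()) / 2)
        out[col] = (m1 - m0) / pooled_sd if pooled_sd > 0 else 0.0
    return pd.Series(out)


def propensity_diagnostics(X, a):
    a = np.asarray(a)

    # Out-of-fold propensity scores from a candidate model.
    ps = cross_val_predict(LogisticRegression(max_iter=1000), X, a,
                           cv=5, method="predict_proba")[:, 1]

    # Discrimination of the propensity model; values close to 1.0 can signal
    # poor overlap between treated and untreated rather than a "good" model.
    auc = roc_auc_score(a, ps)

    # Covariate balance before and after inverse-probability weighting.
    w = np.where(a == 1, 1.0 / ps, 1.0 / (1.0 - ps))
    balance = pd.DataFrame({
        "before_weighting": standardized_mean_differences(X, a, np.ones(len(a))),
        "after_weighting": standardized_mean_differences(X, a, w),
    })
    return auc, balance
```

Run on covariates and a binary treatment (for example, the data.X and data.a loaded above), this yields the kind of numbers that the toolkit's graphical evaluations, such as ROC curves and covariate-balance ("love") plots, present visually.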

IBM Causal Inference 360 in the real world


At IBM’s research lab in Haifa, Israel, we have been using the causal inference toolkit as part of our work on drug repurposing. Drug repurposing, or repositioning, is a method for finding new therapeutic uses for approved drugs. Here, the question we sought to answer was: “What would happen if patient X took drug Y?”

The result? The discovery of two potential new treatments for the dementia that typically accompanies Parkinson’s disease. More detail on how the causal modeling in this research worked can be found in a blog post from April of this year by our colleague Michal Rosen-Zvi.

The team also used the toolkit in a collaboration with Assuta health services, the largest private network of hospitals in Israel, to analyze the impact of COVID-19 on access to care. Specifically, the team analyzed more than 300,000 invitations sent to women for breast screening exams, focusing on instances in which the women did not show up for their appointments.

The causal inference technology revealed that, while at first glance it seemed the government’s nonpharmaceutical interventions caused the no-shows, it was in fact the number of newly infected people that influenced whether or not the women showed up to their appointments.

In another example, we wanted to understand whether novel irrigation practices contribute to a desired reduction in pollution and nutrient runoff. To do this, we used a dataset in which multiple aspects of the agricultural use of the land, including its irrigation method, were captured, and in which the amount of runoff was measured.

What we saw was that the naïve data showed little effect. But after using the causal inference toolkit to correct for the fact that irrigation methods depend heavily on the type of land use and the type of crop, we showed that introducing these novel irrigation techniques does reduce runoff. That saves fertilizer and water and reduces pollution of the watershed, and the reduction can be further quantified to estimate the tradeoff between the savings and the initial investment.
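Here is a small, simulated sketch of the same idea, with hypothetical numbers rather than the project's data: crop type drives both the choice of irrigation method and the baseline runoff, so the naive comparison shows almost nothing, while an outcome-regression adjustment (using the toolkit's Standardization estimator as documented, with a scikit-learn regressor plugged in) recovers the reduction.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from causallib.estimation import Standardization

rng = np.random.default_rng(1)
n = 5_000

# Simulated fields: crop type drives both baseline runoff and how likely
# the field is to use the novel irrigation method.
crop = rng.choice(["orchard", "row_crop", "pasture"], size=n, p=[0.3, 0.5, 0.2])
baseline_runoff = pd.Series(crop).map(
    {"orchard": 4.0, "row_crop": 8.0, "pasture": 2.0}).to_numpy()
p_novel = pd.Series(crop).map(
    {"orchard": 0.4, "row_crop": 0.6, "pasture": 0.2}).to_numpy()

novel_irrigation = rng.binomial(1, p_novel)                # treatment a
runoff = baseline_runoff - 1.5 * novel_irrigation \
         + rng.normal(0, 1, size=n)                        # outcome y

X = pd.get_dummies(pd.DataFrame({"crop": crop}))           # confounders
a = pd.Series(novel_irrigation)
y = pd.Series(runoff)

# Naive comparison: close to zero, because the novel method is used
# mostly on high-runoff row crops.
print("naive difference:", y[a == 1].mean() - y[a == 0].mean())

# Outcome-regression ("standardization") adjustment for crop type
# reveals the roughly 1.5-unit reduction built into the simulation.
std = Standardization(LinearRegression())
std.fit(X, a, y)
outcomes = std.estimate_population_outcome(X, a)
print("adjusted effect:", std.estimate_effect(outcomes[1], outcomes[0]))
```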

Source: ibm.com
