After developing a machine learning model, you need a place to run your model and serve predictions. If your company is in the early stage of its AI journey or has budget constraints, you may struggle to find a deployment system for your model. Building ML infrastructure and integrating ML models with the larger business are major bottlenecks to AI adoption. IBM Db2 can help solve these problems with its built-in ML infrastructure. Anyone with knowledge of SQL and access to a Db2 instance where the in-database ML feature is enabled can quickly learn to build and use a machine learning model in the database.
In this post, I will show how to develop, deploy, and use a decision tree model in a Db2 database.
These are my major steps in this tutorial:
1. Set up Db2 tables
2. Explore ML dataset
3. Preprocess the dataset
4. Train a decision tree model
5. Generate predictions using the model
6. Evaluate the model
I implemented these steps in a Db2 Warehouse on-premises database. Db2 Warehouse on Cloud also supports these ML features.
The machine learning use case
I will use a dataset of historical flights in the US. For each flight, the dataset has information such as the flight’s origin airport, departure time, flying time, and arrival time. A column in the dataset also indicates whether each flight arrived on time or late. Using examples from the dataset, we’ll build a classification model with the decision tree algorithm. Once trained, the model can take unseen flight data as input and predict whether the flight will arrive on time or late at its destination.
1. Set up Db2 tables
The dataset I use in this tutorial is available here as a csv file.
Creating a Db2 table
I start the Db2 instance, connect to the database, and run the following SQL to create a table for storing the dataset.
db2start
db2 connect to <database_name>
db2 "CREATE TABLE FLIGHTS.FLIGHTS_DATA_V3 (
ID INTEGER NOT NULL GENERATED BY DEFAULT AS IDENTITY,
YEAR INTEGER ,
QUARTER INTEGER ,
MONTH INTEGER ,
DAYOFMONTH INTEGER ,
DAYOFWEEK INTEGER ,
UNIQUECARRIER VARCHAR(50 OCTETS) ,
ORIGIN VARCHAR(50 OCTETS) ,
DEST VARCHAR(50 OCTETS) ,
CRSDEPTIME INTEGER ,
DEPTIME INTEGER ,
DEPDELAY REAL ,
DEPDEL15 REAL ,
TAXIOUT INTEGER ,
WHEELSOFF INTEGER ,
CRSARRTIME INTEGER ,
CRSELAPSEDTIME INTEGER ,
AIRTIME INTEGER ,
DISTANCEGROUP INTEGER ,
FLIGHTSTATUS VARCHAR(1) )
ORGANIZE BY ROW";
After creating the table, I use the following command to load the data from the CSV file into the table:
db2 "IMPORT FROM 'FLIGHTS_DATA_V3.csv' OF DEL COMMITCOUNT 50000 INSERT INTO FLIGHTS.FLIGHTS_DATA_V3"
I now have the ML dataset loaded into the FLIGHTS.FLIGHTS_DATA_V3 table in Db2. I’ll copy a subset of the records from this table to a separate table for the ML model development and evaluation, leaving the original copy of the data intact.
SELECT count(*) FROM FLIGHTS.FLIGHTS_DATA_V3
— — —
1000000
Creating a separate table with sample records
Create a table with a roughly 10% random sample of rows from the above table, using Db2’s RAND function for random sampling.
CREATE TABLE FLIGHT.FLIGHTS_DATA AS (SELECT * FROM FLIGHTS.FLIGHTS_DATA_V3 WHERE RAND() < 0.1) WITH DATA
Count the number of rows in the sample table.
SELECT count(*) FROM FLIGHT.FLIGHTS_DATA
— — —
99879
Look into the schema definition of the table.
SELECT NAME, COLTYPE, LENGTH
FROM SYSIBM.SYSCOLUMNS
WHERE TBCREATOR = 'FLIGHT' AND TBNAME = 'FLIGHTS_DATA'
ORDER BY COLNO
FLIGHTSTATUS is the response or the target column. Others are feature columns.
Find the DISTINCT values in the target column.
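A simple DISTINCT query (a minimal sketch using the sample table created above) lists the values present in the target column:
SELECT DISTINCT FLIGHTSTATUS FROM FLIGHT.FLIGHTS_DATA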
From these values, I can see that it’s a binary classification task where each flight arrived either on time or late.
Find the frequencies of distinct values in the FLIGHTSTATUS column.
SELECT FLIGHTSTATUS, count(*) AS FREQUENCY, CAST(count(*) AS DOUBLE) / (SELECT count(*) FROM FLIGHT.FLIGHTS_DATA) AS FRACTION
FROM FLIGHT.FLIGHTS_DATA fdf
GROUP BY FLIGHTSTATUS
From the above, I see that the classes are imbalanced. At this point, I stop gathering insights from the full dataset, as doing so could leak information into the modeling phase.
Creating train/test partitions of the dataset
Before collecting deeper insights into the data, I’ll divide this dataset into train and test partitions using Db2’s RANDOM_SAMPLE SP. I apply stratified sampling to preserve the ratio between the two classes in the generated training dataset.
Create a TRAIN partition.
call IDAX.RANDOM_SAMPLE('intable=FLIGHT.FLIGHTS_DATA, fraction=0.8, outtable=FLIGHT.FLIGHTS_TRAIN, by=FLIGHTSTATUS')
Copy the remaining records to a test partition.
CREATE TABLE FLIGHT.FLIGHTS_TEST AS (SELECT * FROM FLIGHT.FLIGHTS_DATA FDF WHERE FDF.ID NOT IN(SELECT FT.ID FROM FLIGHT.FLIGHTS_TRAIN FT)) WITH DATA
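As a quick sanity check (a sketch using the two tables just created), I count the rows in each partition to confirm a roughly 80/20 split:
SELECT 'TRAIN' AS PART_NAME, count(*) AS N_ROWS FROM FLIGHT.FLIGHTS_TRAIN
UNION ALL
SELECT 'TEST', count(*) FROM FLIGHT.FLIGHTS_TEST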
2. Explore data
In this step, I’ll look at both sample records and the summary statistics of the training dataset to gain insights into the dataset.
Look into some sample records.
SELECT * FROM FLIGHT.FLIGHTS_TRAIN FETCH FIRST 10 ROWS ONLY
Some columns encode the time as numbers (hhmm):
— CRSDEPTIME: Computer Reservation System (scheduled) Departure Time (hhmm)
— DEPTIME: Departure Time (hhmm)
— CRSARRTIME: Computer Reservation System (scheduled) Arrival Time (hhmm)
Now, I collect summary statistics from the FLIGHTS_TRAIN table using the SUMMARY1000 SP to get a global view of the characteristics of the dataset.
CALL IDAX.SUMMARY1000('intable=FLIGHT.FLIGHTS_TRAIN, outtable=FLIGHT.FLIGHTS_TRAIN_SUM1000')
Here, intable is the name of the input table from which I want the SUMMARY1000 SP to collect statistics, and outtable is the name of the table where SUMMARY1000 will store the gathered statistics for the entire dataset. Besides the outtable, the SUMMARY1000 SP creates a few additional output tables — one table with statistics for each column type. Our dataset has two types of columns — numeric and nominal. So, SUMMARY1000 will generate two additional tables. These additional tables follow this naming convention: the name of the outtable + the column type. In our case, the column types are NUM, representing numeric, and CHAR, representing nominal. So, the names of these two additional tables will be as follows:
FLIGHTS_TRAIN_SUM1000_NUM
FLIGHTS_TRAIN_SUM1000_CHAR
Having the statistics available in separate tables for specific datatypes makes it easier to view the statistics that apply to a specific datatype and reduces the number of columns whose statistics are viewed together. This simplifies the analysis process.
Check the summary statistics of the numeric columns.
SELECT * FROM FLIGHT.FLIGHTS_TRAIN_SUM1000_NUM
For the numeric columns, SUMMARY1000 gathers the following statistics:
◉ Missing value count
◉ Non-missing value count
◉ Average
◉ Variance
◉ Standard deviation
◉ Skewness
◉ Excess kurtosis
◉ Minimum
◉ Maximum
Each of these statistics can help uncover insights into the dataset. For instance, I can see that the DEPDEL15 and DEPDELAY columns have 49 missing values. There are also large values in these columns: AIRTIME, CRSARRTIME, CRSDEPTIME, CRSELAPSEDTIME, DEPDELAY, DEPTIME, TAXIOUT, WHEELSOFF, and YEAR. Since I will create a decision tree model, I don’t need to deal with the large values or the missing values; Db2 handles both natively.
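To verify the missing-value counts directly, I can run a quick check against the training table (a sketch; the counts should match the 49 reported by SUMMARY1000):
SELECT SUM(CASE WHEN DEPDELAY IS NULL THEN 1 ELSE 0 END) AS MISSING_DEPDELAY,
SUM(CASE WHEN DEPDEL15 IS NULL THEN 1 ELSE 0 END) AS MISSING_DEPDEL15
FROM FLIGHT.FLIGHTS_TRAIN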
Next, I investigate the summary statistics of the nominal columns.
SELECT * FROM FLIGHT.FLIGHTS_TRAIN_SUM1000_CHAR
For nominal columns, SUMMARY1000 gathered the following statistics:
◉ Number of missing values
◉ Number of non-missing values
◉ Number of distinct values
◉ Frequency of the most frequent value
3. Preprocess data
From the above data exploration, I can see that the dataset needs only light preprocessing. The handful of missing values and the nominal values in all columns can stay as-is, since the decision tree implementation in Db2 deals with them natively. These four TIME-related columns, however, have large values: AIRTIME, CRSARRTIME, DEPTIME, WHEELSOFF.
Extract the hour part from the TIME columns — CRSARRTIME, DEPTIME, WHEELSOFF.
From looking up the description of the dataset, I see that the values in the CRSARRTIME, DEPTIME, and WHEELSOFF columns encode the time as hhmm. I extract the hour part of these values to create, hopefully, better features for the learning algorithm. For example, an hhmm value of 1530 (3:30 PM) integer-divided by 100 yields 15, the hour; since these columns are of type INTEGER, the divisions below are integer divisions.
Scale the CRSARRTIME column: dividing the value by 100 gives the hour of the scheduled arrival time:
UPDATE FLIGHT.FLIGHTS_TRAIN SET CRSARRTIME = CRSARRTIME / 100
Scale the DEPTIME column: dividing the value by 100 gives the hour of the departure time:
UPDATE FLIGHT.FLIGHTS_TRAIN SET DEPTIME = DEPTIME / 100
Scale the WHEELSOFF column: dividing the value by 100 gives the hour of the wheels-off time:
UPDATE FLIGHT.FLIGHTS_TRAIN SET WHEELSOFF = WHEELSOFF / 100
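As a quick check that the scaling worked as intended (a sketch; the scaled columns should now hold hour values instead of hhmm values):
SELECT MIN(CRSARRTIME), MAX(CRSARRTIME), MIN(DEPTIME), MAX(DEPTIME), MIN(WHEELSOFF), MAX(WHEELSOFF) FROM FLIGHT.FLIGHTS_TRAIN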
4. Train a decision tree model
Now the training dataset is ready for the decision tree algorithm.
I train a decision tree model using GROW_DECTREE SP.
CALL IDAX.GROW_DECTREE('model=FLIGHT.flight_dectree, intable=FLIGHT.FLIGHTS_TRAIN, id=ID, target=FLIGHTSTATUS')
I called this SP using the following parameters:
◉ model: the name I want to give to the decision tree model — FLIGHT_DECTREE
◉ intable: the name of the table where the training dataset is stored
◉ id: the name of the ID column
◉ target: the name of the target column
After completing the model training, the GROW_DECTREE SP generated several tables with metadata from the model and the training dataset; these tables can be queried directly, as shown after this list. Here are some of the key tables:
◉ FLIGHT_DECTREE_MODEL: this table contains metadata about the model. Examples of metadata include depth of the tree, strategy for handling missing values, and the number of leaf nodes in the tree.
◉ FLIGHT_DECTREE_NODES: this table provides information about each node in the decision tree.
◉ FLIGHT_DECTREE_COLUMNS: this table provides information on each input column and its role in the trained model. The information includes the importance of each column in generating predictions from the model.
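These metadata tables can be queried like any other Db2 table. For example, to look at the model summary and the per-column importance values (a minimal sketch, assuming the metadata tables are created under the same FLIGHT schema as the model):
-- assuming the metadata tables live in the FLIGHT schema, alongside the model
SELECT * FROM FLIGHT.FLIGHT_DECTREE_MODEL
SELECT * FROM FLIGHT.FLIGHT_DECTREE_COLUMNS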
5. Generate predictions from the model
Since the FLIGHT_DECTREE model is trained and deployed in the database, I can use it for generating predictions on the test records from the FLIGHTS_TEST table.
First, I preprocess the test dataset using the same preprocessing logic that I applied to the training dataset.
Scale the CRSARRTIME column: dividing the value by 100 gives the hour of the scheduled arrival time:
UPDATE FLIGHT.FLIGHTS_TEST SET CRSARRTIME = CRSARRTIME / 100
Scale the DEPTIME column: dividing the value by 100 gives the hour of the departure time:
UPDATE FLIGHT.FLIGHTS_TEST SET DEPTIME = DEPTIME / 100
Scale the WHEELSOFF column: dividing the value by 100 gives the hour of the wheels-off time:
UPDATE FLIGHT.FLIGHTS_TEST SET WHEELSOFF = WHEELSOFF / 100
Generating predictions
I use PREDICT_DECTREE SP to generate predictions from the FLIGHT_DECTREE model:
CALL IDAX.PREDICT_DECTREE('model=FLIGHT.flight_dectree, intable=FLIGHT.FLIGHTS_TEST, outtable=FLIGHT.FLIGHTS_TEST_PRED, prob=true, outtableprob=FLIGHT.FLIGHTS_TEST_PRED_DIST')
Here is the list of parameters I passed when calling this SP:
◉ model: the name of the decision tree model, FLIGHT_DECTREE
◉ intable: name of the input table to generate predictions from
◉ outtable: the name of the table that the SP will create and store predictions to
◉ prob: a Boolean flag indicating whether to include the probability of each prediction in the output
◉ outtableprob: the name of the output table where the probability of each prediction will be stored
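To take a quick look at the generated predictions and their probabilities, I can query the output tables named in the call above (a minimal sketch):
SELECT * FROM FLIGHT.FLIGHTS_TEST_PRED FETCH FIRST 10 ROWS ONLY
SELECT * FROM FLIGHT.FLIGHTS_TEST_PRED_DIST FETCH FIRST 10 ROWS ONLY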
6. Evaluate the model
Using generated predictions for the test dataset, I compute a few metrics to evaluate the quality of the model’s predictions.
Creating a confusion matrix
I use the CONFUSION_MATRIX SP to create a confusion matrix based on the model’s predictions on the test dataset.
CALL IDAX.CONFUSION_MATRIX('intable=FLIGHT.FLIGHTS_TEST, resulttable=FLIGHT.FLIGHTS_TEST_PRED, id=ID, target=FLIGHTSTATUS, matrixTable=FLIGHT.FLIGHTS_TEST_CMATRIX')
In calling this SP, here are some of the key parameters that I passed:
◉ intable: the name of the table that contains the dataset and the actual value of the target column
◉ resulttable: the name of the table that contains the column with predicted values from the model
◉ target: the name of the target column
◉ matrixTable: The output table where the SP will store the confusion matrix
After the SP completes its run, we have the following output table with statistics for the confusion matrix.
FLIGHTS_TEST_CMATRIX:
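The matrix itself can be viewed by querying this table directly:
SELECT * FROM FLIGHT.FLIGHTS_TEST_CMATRIX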
This table has three columns. The REAL column has the actual flight status, and the PREDICTION column has the predicted flight status. Since the flight status takes two values, 0 (on time) or 1 (delayed), there are four possible combinations of values in the REAL and PREDICTION columns:
1. TRUE NEGATIVE: REAL: 0, PREDICTION: 0 — The model accurately predicted the status of the flights that arrived on schedule. From the CNT column, we see that 11795 rows from the test table belong to this combination.
2. FALSE POSITIVE: REAL: 0, PREDICTION: 1 — these are the flights that actually arrived on time but the model predicted them to be delayed. 671 is the count of such flights.
3. FALSE NEGATIVE: REAL: 1, PREDICTION: 0 — these flights arrived late, but the model predicted them to be on time. From the CNT column, we find their count to be 2528.
4. TRUE POSITIVE: REAL: 1, PREDICTION: 1 — the model has accurately identified these flights that were late. The count is 4981.
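As a quick sanity check, these four counts already determine the overall accuracy: (11795 + 4981) / (11795 + 671 + 2528 + 4981) = 16776 / 19975 ≈ 0.8398, which matches the 83.98% reported by CMATRIX_STATS below. The same figure can be computed from the matrix table itself (a sketch, assuming the REAL, PREDICTION, and CNT columns described above):
-- column names REAL, PREDICTION, and CNT as described in the text
SELECT SUM(CASE WHEN "REAL" = PREDICTION THEN CNT ELSE 0 END) * 1.0 / SUM(CNT) AS ACCURACY FROM FLIGHT.FLIGHTS_TEST_CMATRIX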
I use these counts to compute a few evaluation metrics for the model. To do so, I use the CMATRIX_STATS SP as follows:
CALL IDAX.CMATRIX_STATS('matrixTable=FLIGHT.FLIGHTS_TEST_CMATRIX')
The only parameter this SP needs is the name of the table that contains the statistics generated by the CONFUSION_MATRIX SP in the previous step. CMATRIX_STATS SP generates two sets of output. The first one shows overall quality metrics of the model. The second one includes the model’s predictive performance for each class.
The first output shows overall model metrics: the number of correct predictions, the number of incorrect predictions, overall accuracy, and weighted accuracy. From this output, I see that the model has an overall accuracy of 83.98% and a weighted accuracy of 80.46%.
With classification tasks, it’s usually useful to view the model’s quality factors for each individual class. The second output from the CMATRIX_STATS SP includes these class level quality metrics.
For each class, this output includes the True Positive Rate (TPR), False Positive Rate (FPR), Positive Predictive Value (PPV) or Precision, and F-measure (F1 score).
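For reference, these metrics derive directly from the confusion-matrix counts: TPR = TP / (TP + FN), FPR = FP / (FP + TN), PPV = TP / (TP + FP), and F-measure = 2 * PPV * TPR / (PPV + TPR). For the delayed class (1), the counts above give TPR = 4981 / 7509 ≈ 0.66, FPR = 671 / 12466 ≈ 0.05, PPV = 4981 / 5652 ≈ 0.88, and an F-measure of roughly 0.76.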
Source: ibm.com