Saturday 29 April 2023

Why optimize your warehouse with a data lakehouse strategy


We pointed out that warehouses, known for high-performance data processing for business intelligence, can quickly become expensive for new data and evolving workloads. We also made the case that query and reporting, provided by big data engines such as Presto, need to work with the Spark infrastructure framework to support advanced analytics and complex enterprise data decision-making. To do so, Presto and Spark need to readily work with existing and modern data warehouse infrastructures. Now, let’s chat about why data warehouse optimization is a key value of a data lakehouse strategy.

Value of data warehouse optimization


Since its introduction over a century ago, the gasoline-powered engine has remained largely unchanged. It’s simply been adapted over time to accommodate modern demands such as pollution controls, air conditioning and power steering.

Similarly, the relational database has been the foundation for data warehousing for as long as data warehousing has been around. Relational databases were adapted to accommodate the demands of new workloads, such as the data engineering tasks associated with structured and semi-structured data, and for building machine learning models.

Returning to the analogy, there have been significant changes to how we power cars. We now have gasoline-powered engines, battery electric vehicles (BEVs), and hybrid vehicles. An August 2021 Forbes article referenced a 2021 Department of Energy Argonne National Laboratory publication indicating, “Hybrid electric vehicles (think: Prius) had the lowest total 15-year per-mile cost of driving in the Small SUV category beating BEVs”.

Just as hybrid vehicles help their owners balance the initial purchase price and cost over time, enterprises are attempting to find a balance between high performance and cost-effectiveness for their data and analytics ecosystem. Essentially, they want to run the right workloads in the right environment without having to copy datasets excessively.

Optimizing your data lakehouse architecture


Fortunately, the IT landscape is changing thanks to a mix of cloud platforms, open source and traditional software vendors. The rise of cloud object storage has driven the cost of data storage down. Open-data file formats have evolved to support data sharing across multiple data engines, like Presto, Spark and others. Intelligent data caching is improving the performance of data lakehouse infrastructures.

All these innovations are being adapted by software vendors and accepted by their customers. So, what does this mean from a practical perspective? What can enterprises do differently from what they are already doing today? Some use case examples will help. To be used effectively, raw data often needs to be curated within a data warehouse. Semi-structured data needs to be reformatted and transformed to be loaded into tables. And ML processes consume an abundance of capacity to build models.

Organizations running these workloads in their data warehouse environment today are paying a high run rate for engineering tasks that add no additional value or insight. Only the outputs from these data-driven models allow an organization to derive additional value. If organizations could execute these engineering tasks at a lower run rate in a data lakehouse while making the transformed data available to both the lakehouse and warehouse via open formats, they could deliver the same output value with low-cost processing.
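
As a simple illustration of this pattern (a minimal sketch; the catalog, schema and table names and the choice of Iceberg as the open table format are assumptions, not details from this article), a Spark SQL job could curate raw data into an open-format table that a Presto query then reads directly for reporting, with no second copy of the data:

-- Spark SQL: curate raw, semi-structured orders into an open-format (Iceberg) table
CREATE TABLE lakehouse.sales.daily_orders
USING iceberg
AS SELECT order_id, customer_id, CAST(order_ts AS DATE) AS order_date, amount
FROM lakehouse.raw.orders_json;

-- Presto/Trino: the warehouse-style reporting query reads the same table through its Iceberg catalog
SELECT order_date, SUM(amount) AS daily_revenue
FROM iceberg.sales.daily_orders
GROUP BY order_date;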

Benefits of optimizing across your data warehouse and data lakehouse


Optimizing workloads across a data warehouse and a data lakehouse by sharing data using open formats can reduce costs and complexity. This helps organizations drive a better return on their data strategy and analytics investments while also helping to deliver better data governance and security.

And just as a hybrid car allows car owners to get greater value from their car investment, optimizing workloads across a data warehouse and data lakehouse will allow organizations to get greater value from their data analytics ecosystem.

Discover how you can optimize your data warehouse to scale analytics and artificial intelligence (AI) workloads with a data lakehouse strategy.

Source: ibm.com

Friday 28 April 2023

Cloud scalability: Scale-up vs. scale-out


IT managers run into scalability challenges on a regular basis. It is difficult to predict growth rates of applications, storage capacity usage and bandwidth. When a workload reaches capacity limits, how do you maintain performance while preserving the efficiency to scale?

The ability to use the cloud to scale quickly and handle unexpected rapid growth or seasonal shifts in demand has become a major benefit of public cloud services, but it can also become a liability if not managed properly. Buying access to additional infrastructure within minutes has become quite appealing. However, there are decisions that must be made about what kind of scalability is needed to meet demand and how to accurately track expenditures.

Scale-up vs. Scale-out


Infrastructure scalability handles the changing needs of an application by statically adding or removing resources to meet changing application demands, as needed. In most cases, this is handled by scaling up (vertical scaling) and/or scaling out (horizontal scaling). There has been extensive study and architecture development around cloud scalability, addressing how it works and how to architect emerging cloud-native applications. In this article, we are going to focus first on comparing scale-up vs. scale-out.

What is scale-up (or vertical scaling)?


Scale-up is done by adding more resources to an existing system to reach a desired state of performance. For example, a database or web server needs additional resources to continue performance at a certain level to meet SLAs. More compute, memory, storage or network can be added to that system to keep the performance at desired levels.

When this is done in the cloud, applications often get moved onto more powerful instances and may even migrate to a different host and retire the server they were on. Of course, this process should be transparent to the customer.

Scaling up can also be done in software by adding more threads, more connections or, in cases of database applications, increasing cache sizes. These types of scale-up operations have been happening on-premises in data centers for decades. However, the time it takes to procure additional resources to scale up a given system could take weeks or months in a traditional on-premises environment, while scaling up in the cloud can take only minutes.

What is scale-out (or horizontal scaling)?


Scale-out is usually associated with distributed architectures. There are two basic forms of scaling out:

◉ Adding additional infrastructure capacity in pre-packaged blocks of infrastructure or nodes (i.e., hyper-converged)

◉ Using a distributed service that can retrieve customer information but be independent of applications or services

Both approaches are used by cloud service providers (CSPs) today, along with vertical scaling for individual components (compute, memory, network and storage), to drive down costs. Horizontal scaling makes it easy for service providers to offer “pay-as-you-grow” infrastructure and services.

Hyper-converged infrastructure has become increasingly popular for use in private cloud and even tier 2 service providers. This approach is not quite as loosely coupled as other forms of distributed architectures, but it does help IT managers who are used to traditional architectures make the transition to horizontal scaling and realize the associated cost benefits.

Loosely coupled distributed architecture allows for the scaling of each part of the architecture independently. This means a group of software products can be created and deployed as independent pieces, even though they work together to manage a complete workflow. Each application is made up of a collection of abstracted services that can function and operate independently. This allows for horizontal scaling at the product level as well as the service level. Even more granular scaling capabilities can be delineated by SLA or customer type (e.g., bronze, silver or gold) or even by API type if there are different levels of demand for certain APIs. This can promote efficient use of scaling within a given infrastructure.

IBM Turbonomic and the upside of cloud scalability


The way service providers design their infrastructures for maximum performance and efficiency at scale has been, and continues to be, driven by their customers’ ever-growing and shrinking needs. A good example is AWS auto-scaling. AWS couples scaling with an elastic approach so users can run resources that match what they are actively using and only be charged for that usage. There is a large potential cost savings in this case, but the complex billing makes it hard to tell exactly how much (if anything) is actually saved.

This is where IBM Turbonomic can help. It simplifies your cloud billing, letting you know up front where your expenditures lie and how to make quick, educated scale-up or scale-out decisions to save even more. Turbonomic can also take the complexity out of how IT management spends its human and capital budgets on on-premises and off-premises infrastructure by providing cost modeling for both environments, along with migration plans to ensure all workloads run where both their performance and efficiency are assured.

For today’s cloud service providers, loosely coupled distributed architectures are critical to scaling in the cloud, and coupled with cloud automation, this gives customers many options on how to scale vertically or horizontally to best suit their business needs. Turbonomic can help you make sure you’re picking the best options in your cloud journey.

Source: ibm.com

Thursday 27 April 2023

The Role of IBM C2090-619 Certification in a Cloud-based World

As the technology industry grows, certifications have become a crucial aspect of career development for IT professionals. The IBM C2090-619 certification exam is one such certification that focuses on IBM Informix 12.10 System Administrator. This article offers all the information about the IBM C2090-619 certification exam, including its benefits, exam format, and study resources.

IBM C2090-619 Certification Exam Format

The IBM C2090-619 certification exam evaluates the candidate's proficiency in IBM Informix 12.10 System Administration. The IBM C2090-619 certification exam is also known as the IBM Certified System Administrator - Informix 12.10 exam. The exam code for this certification is C2090-619. The cost to take this exam is $200 (USD). The exam lasts 90 minutes, and candidates will be presented with 70 multiple-choice questions. To pass the exam, candidates must obtain a minimum score of 68%. Here is a breakdown of the IBM C2090-619 certification exam content.

  • Installation and Configuration: 17%
  • Space Management: 11%
  • System Activity Monitoring and Troubleshooting: 13%
  • Performance Tuning: 16%
  • OAT and Database Scheduler: 4%
  • Backup and Restore: 10%
  • Replication and High Availability: 16%
  • Warehousing: 4%
  • Security: 9%

Prerequisites for the Exam

To take the IBM C2090-619 Certification Exam, candidates must meet the following prerequisites:

1. Knowledge of SQL

Candidates should have a solid understanding of SQL (Structured Query Language) and be able to write basic SQL statements.

2. Knowledge of IBM DB2

Candidates should have experience working with IBM DB2 database software and be familiar with its features and functionality.

3. Basic knowledge of Linux

Candidates should have a basic understanding of Linux operating system commands and functionality.

4. Experience with IBM Data Studio

Candidates should be familiar with IBM Data Studio, which is a tool used to manage IBM databases.

5. Familiarity with data warehousing concepts

Candidates should have a basic understanding of data warehousing concepts such as data modeling, data integration, and ETL (Extract, Transform, Load) processes.

It is important to note that meeting these prerequisites does not guarantee success in passing the exam. Candidates should also have hands-on experience working with IBM DB2 database software and other related technologies. Additionally, it is suggested that candidates have experience working on real-world projects involving IBM DB2 and data warehousing to ensure they have a practical understanding of the exam topics.

Benefits of IBM C2090-619 Certification

The IBM C2090-619 certification is designed for professionals who work with IBM Informix 12.10 system administration. This certification exam helps professionals validate their expertise and knowledge in the field, enhancing their chances of career advancement. Here are some of the benefits of obtaining the IBM C2090-619 certification.

1. Validation of Expertise in IBM Informix 12.10 System Administration

The IBM Informix System Administrator Exam validates a candidate's proficiency and knowledge in IBM Informix 12.10 System Administration. This certification demonstrates that the candidate has the necessary skills and expertise to manage and maintain an IBM Informix 12.10 database system.

2. Increased Career Opportunities and Salary

Obtaining the IBM C2090-619 certification can increase a candidate's career opportunities and earning potential. Employers value certified experts and are more likely to offer promotions, salary increases, and better job opportunities to accredited candidates.

3. Enhancement of Professional Credibility

The IBM C2090-619 certification enhances a candidate's professional credibility in the IT industry. This certification demonstrates that the candidate is committed to their professional development and has invested time and effort in acquiring new skills and knowledge.

4. Access to a Community of Certified Professionals

Certified IBM Informix 12.10 System Administrators can access a community of certified professionals. This community provides a platform for networking, sharing knowledge, and collaborating on projects.

5. Opportunity to Work on Challenging Projects

The IBM C2090-619 certification can allow candidates to work on challenging projects. Certified professionals are often assigned to high-profile projects that require advanced skills and expertise in IBM Informix 12.10 System Administration.

Study Resources for IBM C2090-619 Certification Exam

The IBM Informix System Administrator Exam requires thorough preparation to achieve success. IBM offers various study resources to help candidates prepare for the exam, including:

1. IBM C2090-619 Exam Objectives

The IBM C2090-619 exam objectives outline the exam content and the skills required to pass the exam. Candidates should use this resource as the basis for their study plan.

2. IBM Knowledge Center

The IBM Knowledge Center is an online resource that provides comprehensive documentation on IBM products and solutions. It offers valuable information on IBM Informix 12.10 System Administration, which is the focus of the IBM Informix System Administrator Exam.

3. IBM Training Courses

IBM offers various training courses to help candidates prepare for the IBM Informix System Administrator Exam. These courses cover different topics related to IBM Informix 12.10 System Administration and are delivered online or in person.

4. Practice Tests

IBM provides practice tests that simulate the IBM Informix System Administrator Exam. These practice tests help candidates evaluate their readiness for the exam and identify areas that need improvement.

Conclusion

The IBM Informix System Administrator Exam is an excellent opportunity for professionals working with IBM Informix 12.10 System Administration to validate their skills and enhance their career prospects. With the proper preparation, candidates can pass the exam and gain the benefits of certification. This article overviews the IBM C2090-619 certification exam, including its uses, exam format, and study resources.

Tuesday 25 April 2023

Why companies need to accelerate data warehousing solution modernization


Unexpected situations like the COVID-19 pandemic and the ongoing macroeconomic atmosphere are wake-up calls for companies worldwide to exponentially accelerate digital transformation. During the pandemic, when lockdowns and social-distancing restrictions transformed business operations, it quickly became apparent that digital innovation was vital to the survival of any organization.

The dependence on remote internet access for business, personal, and educational use elevated the data demand and boosted global data consumption. Additionally, the increase in online transactions and web traffic generated mountains of data. Enter the modernization of data warehousing solutions.

Companies realized that their legacy or enterprise data warehousing solutions could not manage the huge workload. Innovative organizations sought modern solutions to manage larger data capacities and attain secure storage solutions, helping them meet consumer demands. One of these advances included the accelerated adoption of modernized data warehousing technologies. Business success and the ability to remain competitive depended on it.

Why data warehousing is critical to a company’s success

Data warehousing is the secure electronic storage of information by a company or organization. It creates a trove of historical data that can be retrieved, analyzed, and reported on to provide insight into, or predictive analysis of, an organization’s performance and operations.

Data warehousing solutions drive business efficiency, build future analysis and predictions, enhance productivity, and improve business success. These solutions categorize and convert data into readable dashboards that anyone in a company can analyze. Data is reported from one central repository, enabling management to draw more meaningful business insights and make faster, better decisions.

By running reports on historical data, a data warehouse can clarify which systems and processes are working and which methods need improvement. The data warehouse is also the base architecture for artificial intelligence and machine learning (AI/ML) solutions.

Benefits of new data warehousing technology

Everything is data, regardless of whether it’s structured, semi-structured, or unstructured. Most enterprise or legacy data warehouses support only structured data through relational database management system (RDBMS) databases. Companies require additional resources and people to process enterprise data. It is nearly impossible to achieve business efficiency and agility with legacy tools that create inefficiency and elevate costs.

Managing, storing, and processing data is critical to business efficiency and success. Modern data warehousing technology can handle all data forms. Significant developments in big data, cloud computing, and advanced analytics created the demand for the modern data warehouse.

Today’s data warehouses are different from antiquated single-stack warehouses. Instead of focusing primarily on data processing, as legacy or enterprise data warehouses did, the modern version is designed to store tremendous amounts of data from multiple sources in various formats and produce analysis to drive business decisions.

Data warehousing solutions

A superior solution for companies is the integration of existing on-premises data warehousing with data lakehouse solutions using data fabric and data mesh technology. Doing so creates a modern data warehousing solution for the long term.

A data lakehouse contains an organization’s data in unstructured, structured, and semi-structured forms, which can be stored indefinitely for immediate or future use. This data is used by data scientists and engineers who study data to gain business insights. Data lake or data lakehouse storage costs are lower than those of an enterprise data warehouse. Further, data lakes and data lakehouses are less time-consuming to manage, which reduces operational costs. IBM has a next-generation data lakehouse solution to address these business needs.

Data fabric is the next-generation data analytics platform that solves advanced data security challenges through decentralized ownership. Typically, organizations have multiple data sources from different business lines that must be integrated for analytics. A data fabric architecture effectively unites disparate data sources and links them through centrally managed data sharing and governance guidelines.

Many enterprises seek a flexible, hybrid, and multi-cloud solution based on cloud providers. The data mesh solution pushes down structured query language (SQL) queries to the related RDBMS or data lakehouse by managing the data catalog, giving users virtualized tables and data. Under data mesh principles, it never stores business data locally, which is an advantage for the business. A successful data mesh solution will reduce a company’s capital and operational expenses.

IBM Cloud Pak for Data is an excellent example of a data fabric and data mesh solution for analytics. Cloud technology has emerged as the preferred platform for artificial intelligence (AI) capabilities, intelligent edge services, and advanced wireless connectivity. Many companies will leverage a hybrid, multi-cloud strategy to improve business performance and success and thrive in the business world.

Best practices for adopting data warehousing technology

Data warehouse modernization includes extending the infrastructure without compromising security. This allows companies to reap the advantages of new technologies, bringing speed and agility to data processes, meeting changing business requirements, and staying relevant in this age of big data. The growing variety and volume of current data make it essential for businesses to modernize their data warehouses to remain competitive in today’s market. Businesses need valuable insights and reports in real time, and enterprise or legacy data warehouses cannot keep pace with modern data demands.

Data warehouses are at an exciting point of evolution. With the global data warehousing market estimated to grow by more than 250% over the next five years, companies will rely on new data warehouse solutions and tools that make them easier to use than ever before.

Cutting-edge technology to keep up with constant changes

AI and other breakthrough technologies will propel organizations into the next decade. Data consumption and load will continue to grow and provoke companies to discover new ways to implement state-of-the-art data warehousing solutions. The prevalence of digital technologies and connected devices will help organizations remain afloat, an unimaginable feat 20 years ago.

Essential lessons arise from an organization’s efforts to optimize its enterprise or legacy data warehousing technology. One vital lesson is the importance of making specific changes to modernize technology, processes, and organizational operations to evolve. As the rate of change will only continue to increase, this knowledge—and the capability to accelerate modernization—will be critical going forward.

No matter where you are in your data warehouse modernization today, IBM experts are here to help you find the right approach to fit your needs. It’s time to get started with your data warehouse modernization journey.

Source: ibm.com

Saturday 22 April 2023


Five recommendations for federal agencies to use the cloud to accelerate modernization


One of the most distinguishing things about our federal government is its broad scope of services. No other institution is responsible for doing so much for so many, so quickly, in an ever-changing landscape. No other institution must respond simultaneously to such a breadth of challenges that have only been amplified over the last few years.

In response to the COVID-19 crisis, many federal agencies kicked their digital transformations into high gear to help enhance public services, embrace a remote workforce and better secure data with trust and transparency. While positive momentum on modernization grows, so does external pressure as citizen expectations rise and new threats multiply.

However, it’s difficult for federal agencies to harness the potential of technology and data within a legacy IT infrastructure that struggles with data fragmentation, a lack of interoperability and vulnerability to cyber attacks. Uneven budget cycles and challenges focusing dollars on modernization compound these factors.

The US government’s IT modernization efforts are still lagging, especially when considering the improved productivity, speed and lowered risk that cloud computing offers to help address the security and citizen service challenges impacting trust in government today.

Congress and successive administrations have acknowledged these challenges with mandates like the recent Customer Service Executive Order. Last year’s Quantum Computing Cybersecurity Preparedness Act (P.L. 117-260), the recently released National Cybersecurity Strategy and the federal zero trust mandate require agencies to prioritize system and data security.

IBM has built a diverse ecosystem of partners to help government effectively use the cloud to address these challenges. Bringing government solutions derived from an ecosystem of partners is critical because no single IT provider can solve today’s government challenges alone. We continually evolve our partner ecosystem in response to public sector challenges, bringing forward collaborative teams from a full spectrum of industry players from global cloud service providers to small businesses.

From this perspective of partnership, IBM offers five recommendations for federal agencies to use the cloud to accelerate modernization:

1. Think hybrid multi-cloud first. Hybrid cloud is the only infrastructure and application development framework that’s flexible, adaptable and elastic enough to support the variety of programs and services needed today. Therefore, agencies should not orient modernization toward a single cloud service provider, nor should they always rely entirely on the cloud.

2. Support the mission out to the edge. Edge computing is a strategy for securely extending a digital environment out to the user. Americans expect to engage with government via their phones and tablets. The military needs global access to data and intelligence systems in remote locations. Government workers must deliver services in remote health clinics, far-flung national parks and border control stations. Government accomplishes its mission “on the edge,” and we must secure applications and data where the mission is happening. Cybersecurity should be baked in from the initial design to maximize seamless risk mitigation and to minimize the end user burden.

3. Reorient incentives to modernize business processes, not infrastructure. We must deemphasize counting data centers closed each year, or which legacy applications shift to the cloud. It’s not just about technology; it’s about improving citizen services and security, and enabling the mission. Agencies should prioritize optimizing business processes that impact service and how work is done. Federal IT budgets and score cards should incentivize this.

4. Apply an open ecosystem approach to improving how work is done. The challenges facing government can’t be met with just one company’s tools. Federal agencies must work with multiple cloud and infrastructure vendors to demand interoperability. Agencies should focus on solutions by challenging vendor teams to help redesign how work is done. To emphasize this, cloud infrastructure contracts should be expanded to encourage partner ecosystems to deliver cloud native solutions as services. Whenever possible, build once and use everywhere.

5. Streamline FedRAMP certification. FedRAMP is the default federal information security requirement. Congress recently reaffirmed its importance by passing the FedRAMP Authorization Act (P.L. 117-263). However, it remains far too difficult to move cloud solutions needed to modernize through FedRAMP certification. In fact, some see FedRAMP as a major hurdle. FedRAMP must become fully automated, the sponsorship burden reduced or eliminated, approvals must reciprocate between agencies, and the FedRAMP Program Office must be funded on par with its role supporting modernization. 

IBM looks forward to continuing to expand our collaborations within our partner ecosystem to support the digital transformation of government even better through connectivity, partnerships and open technologies. Success is a team sport. We are confident that working together as a collaborative ecosystem of partners, there is no challenge to which we cannot rise together.

Source: ibm.com

Thursday 20 April 2023


How a water technology company overcame massive data problems with ActionKPI and IBM


Access to clean water is essential for the survival and growth of humans, animals and crops. Water technology companies worldwide provide innovative solutions to supply, conserve and protect water throughout the highly complex and technical water cycle of collection, treatment, distribution, reuse and disposal.

After a series of international acquisitions, a leading water technology company formed an Assessment Services division to provide water infrastructure services to their customers. Through the formation of this group, the Assessment Services division discovered multiple enterprise resource planning instances and payroll systems, a lack of standard reporting, and siloed budgeting and forecasting processes residing within a labyrinth of spreadsheets. It was chaotic.

These redundant manual processes slowed the organization and resulted in inaccuracies and lack of clarity around what data could be trusted enough to use. Internal and external stakeholders of the publicly traded company felt the impact via late and misguided profit and loss forecasts. As a result, the new department was under heavy pressure to produce timely and accurate financial reports and projections for stakeholders. They were also expected to improve and deliver upon earning targets.

How IBM and ActionKPI improved financial reporting and streamlined operations


The Assessment Services division turned to ActionKPI and IBM to help solve their massive data problems. First, the partnership developed an integrated business planning roadmap, including a comprehensive strategy to address data and organizational challenges. Once the partnership completed that work, IBM, ActionKPI and the Assessments Services team kicked off the first phase of their two-phase project.

The water company first needed to standardize its monthly financials and management reporting for the solution to work. This work involved creating a single set of definitions and procedures for collecting and reporting financial data. The water company also needed to develop reporting for a data warehouse, financial data integration and operations. 

Phase One


Once the partnership established a data warehouse, the water company had a central repository for the organization’s data. They could store financial data from every business unit and create reports showing the organization’s financial performance from those various business units.

The next step of this phase was to create a system for integrating financial data from different sources and automating financial operations, making it possible to collect and report financial data more quickly and accurately.

Phase One resulted in standardized financial definitions, procedures and reports, as well as a data warehouse and a system that allowed the organization to improve its financial reporting and make better decisions.

Phase Two


Phase Two focused on standardizing their budgeting and forecasting process, moving from a complicated, interwoven one-hundred-and-forty-tab Excel model into IBM Planning Analytics. Part of this process involved refining and creating new strategies to align with the CFO’s vision and to establish a monthly rolling forecast process owned by the business.

The partnership created several models to transition Excel business logic and calculations to IBM Planning Analytics. These models included employee-level workforce planning, high-level project forecasting and the organization’s driver-based and line-item detailed operating expenses, all integrated directly into the P&L to provide real-time forecast updates.

With the forecast model established, the next step was to integrate financial and payroll data into the forecasting model from the data warehouse. The partnership set up the data integration to update actuals from the ERP system to the data warehouse to IBM Planning Analytics in near real time and on demand, which is critical for month-end reporting deadlines. This integration enabled the water technology company to reduce month-end reporting cycles, free up time to conduct better analysis and provide a well-thought-out forecast to the business, all within essential reporting timelines.

By centralizing actuals, budget and forecasting information in Planning Analytics, finance business partners could analyze the business faster, allowing the company to make decisions mid-month and influence month-end results (instead of just reporting results without any actionable recommendations) and moving decision making from hindsight to foresight.

As the finance teams began providing information analysis and insights, the whole organization started to see the value of their investment. Hitting this project goal was an essential milestone, and the momentum began to build and prepare the organization for the next phase.

Outcomes of this phase included a higher degree of transparency and forecast accuracy. In addition, they moved from being the last division to submit budgets and forecast results to corporate to the first.

Phase Three


Phase Three of the project involved transitioning core sales, operational processes and weekly forecast reporting from Excel to IBM Planning Analytics.

The new system required integrating data from multiple sources, such as Salesforce, ERP, Workday and Excel spreadsheets. In addition, a change management strategy was implemented to ensure everyone involved in the project was on board with the transformations.

Using IBM Planning Analytics and Cognos® Analytics, the team transformed the core workings of the Assessment Services division, bringing a sense of order and a single source of truth to multiple departments throughout the organization.

The Vice President of the water services company reports, “Integrated Business Planning has allowed us to dive in and signal risk … resulting in proactive action plans that steer our bottom line. We no longer debate who has the right information, because we operate on a single source of truth, instilling trust in our process and underlying data.”

Better decision-making through increased transparency and trust in data


The partnership’s resulting technology and process optimization improved forecast accuracy and reduced the time to forecast from 4 months to 1 week. Decision makers can now consider the potential risk and rewards of different options to make choices with the best chance of achieving their desired business outcome. Greater transparency and trust in data provide a central project management solution for project forecasting and P&L Reporting.

This process optimization also produced the following significant results:

◉ A deeper understanding of the financial performance of projects and their associated project managers
◉ Transparent quarterly targets and forecasts for sales reps to achieve
◉ An easier way to set client and forecast risk expectations at the corporate level
◉ A better way to determine resource capacity, risk and revenue backlog
◉ An efficient method to guide sales efforts and client negotiations based on capacity and insight into margin details

By combining integrated business planning with smarter analytics, the water technology company was able to operate more efficiently and unlock profit potential, all while protecting one of the world’s most precious resources.

Source: ibm.com

Saturday 15 April 2023

How to build a decision tree model in IBM Db2


After developing a machine learning model, you need a place to run your model and serve predictions. If your company is in the early stage of its AI journey or has budget constraints, you may struggle to find a deployment system for your model. Building ML infrastructure and integrating ML models with the larger business are major bottlenecks to AI adoption. IBM Db2 can help solve these problems with its built-in ML infrastructure. Someone with the knowledge of SQL and access to a Db2 instance, where the in-database ML feature is enabled, can easily learn to build and use a machine learning model in the database.

In this post, I will show how to develop, deploy, and use a decision tree model in a Db2 database.


These are my major steps in this tutorial:

1. Set up Db2 tables
2. Explore ML dataset
3. Preprocess the dataset
4. Train a decision tree model
5. Generate predictions using the model
6. Evaluate the model

I implemented these steps in a Db2 Warehouse on-premises database. Db2 Warehouse on Cloud also supports these ML features.

The machine learning use case


I will use a dataset of historical flights in the US. For each flight, the dataset has information such as the flight’s origin airport, departure time, flying time, and arrival time. Also, a column in the dataset indicates if each flight arrived on time or late. Using examples from the dataset, we’ll build a classification model with the decision tree algorithm. Once trained, the model can receive unseen flight data as input and predict if the flight will arrive on time or late at its destination.

1. Set up Db2 tables


The dataset I use in this tutorial is available here as a csv file.

Creating a Db2 table

I use the following SQL for creating a table for storing the dataset.

db2start
db2 connect to <database_name>

db2 "CREATE TABLE FLIGHTS.FLIGHTS_DATA_V3  (
ID INTEGER NOT NULL GENERATED BY DEFAULT AS IDENTITY,
YEAR INTEGER ,
QUARTER INTEGER ,
MONTH INTEGER ,
DAYOFMONTH INTEGER ,                       
DAYOFWEEK INTEGER ,                       
UNIQUECARRIER VARCHAR(50 OCTETS) ,               
ORIGIN VARCHAR(50 OCTETS) ,                      
DEST VARCHAR(50 OCTETS) ,                       
CRSDEPTIME INTEGER ,                       
DEPTIME INTEGER ,                       
DEPDELAY REAL ,                       
DEPDEL15 REAL ,                       
TAXIOUT INTEGER ,                       
WHEELSOFF INTEGER ,                       
CRSARRTIME INTEGER ,                       
CRSELAPSEDTIME INTEGER ,                       
AIRTIME INTEGER ,                       
DISTANCEGROUP INTEGER ,                       
FLIGHTSTATUS VARCHAR(1) )
ORGANIZE BY ROW";

After creating the table, I use the following SQL to load the data, from the csv file, into the table:

db2 "IMPORT FROM 'FLIGHTS_DATA_V3.csv' OF DEL COMMITCOUNT 50000 INSERT INTO FLIGHTS.FLIGHTS_DATA_V3"

I now have the ML dataset loaded into the FLIGHTS.FLIGHTS_DATA_V3 table in Db2. I’ll copy a subset of the records from this table to a separate table for the ML model development and evaluation, leaving the original copy of the data intact. 

SELECT count(*) FROM FLIGHTS.FLIGHTS_DATA_V3
 — — — 
1000000

Creating a separate table with sample records

Create a table with 10% sample rows from the above table. Use the RAND function of Db2 for random sampling.

CREATE TABLE FLIGHT.FLIGHTS_DATA AS (SELECT * FROM FLIGHTS.FLIGHTS_DATA_V3 WHERE RAND() < 0.1) WITH DATA

Count the number of rows in the sample table.

SELECT count(*) FROM FLIGHT.FLIGHTS_DATA
— — — 
99879

Look into the schema definition of the table.

SELECT NAME, COLTYPE, LENGTH
FROM SYSIBM.SYSCOLUMNS
WHERE TBCREATOR = 'FLIGHT' AND TBNAME = 'FLIGHTS_DATA'
ORDER BY COLNO


FLIGHTSTATUS is the response or the target column. Others are feature columns.

Find the DISTINCT values in the target column.

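The query itself is straightforward (shown here as a sketch, since only its result appears in the original):

SELECT DISTINCT FLIGHTSTATUS
FROM FLIGHT.FLIGHTS_DATA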

From these values, I can see that it’s a binary classification task where each flight arrived either on time or late. 

Find the frequencies of distinct values in the FLIGHTSTATUS column.

SELECT FLIGHTSTATUS, count(*) AS FREQUENCY, CAST(count(*) AS DECIMAL(12,6)) / (SELECT count(*) FROM FLIGHT.FLIGHTS_DATA) AS FRACTION -- cast avoids integer division truncating FRACTION to 0
FROM FLIGHT.FLIGHTS_DATA fdf
GROUP BY FLIGHTSTATUS


From the above, I see the classes are imbalanced. I’ll stop here and not draw further insights from the entire dataset, as doing so could leak information into the modeling phase.

Creating train/test partitions of the dataset

Before collecting deeper insights into the data, I’ll divide this dataset into train and test partitions using Db2’s RANDOM_SAMPLE SP. I apply stratified sampling to preserve the ratio between the two classes in the generated training dataset.

Create a TRAIN partition.

call IDAX.RANDOM_SAMPLE('intable=FLIGHT.FLIGHTS_DATA, fraction=0.8, outtable=FLIGHT.FLIGHTS_TRAIN, by=FLIGHTSTATUS')


Copy the remaining records to a TEST partition.

CREATE TABLE FLIGHT.FLIGHTS_TEST AS (SELECT * FROM FLIGHT.FLIGHTS_DATA FDF WHERE FDF.ID NOT IN(SELECT FT.ID FROM FLIGHT.FLIGHTS_TRAIN FT)) WITH DATA

2. Explore data


In this step, I’ll look at both sample records and the summary statistics of the training dataset to gain insights into the dataset.

Look into some sample records.

SELECT * FROM FLIGHT.FLIGHTS_TRAIN FETCH FIRST 10 ROWS ONLY


Some columns have encoded the time as numbers (the small query sketch after this list illustrates the encoding):

◉ CRSDEPTIME: Computer Reservation System (scheduled) departure time (hhmm)
◉ DEPTIME: departure time (hhmm)
◉ CRSARRTIME: Computer Reservation System (scheduled) arrival time (hhmm)
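
To make the hhmm encoding concrete, a small query (a sketch; the aliases are mine) splits a scheduled departure time into its hour and minute parts using integer division and MOD:

SELECT CRSDEPTIME,
       CRSDEPTIME / 100 AS CRSDEP_HOUR,
       MOD(CRSDEPTIME, 100) AS CRSDEP_MINUTE
FROM FLIGHT.FLIGHTS_TRAIN
FETCH FIRST 5 ROWS ONLY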

Now, I collect summary statistics from the FLIGHTS_TRAIN using SUMMARY1000 SP to get a global view of the characteristics of the dataset.

CALL IDAX.SUMMARY1000('intable=FLIGHT.FLIGHTS_TRAIN, outtable=FLIGHT.FLIGHTS_TRAIN_SUM1000')

Here the intable has the name of the input table from which I want SUMMARY1000 SP to collect statistics. outtable is the name of the table where SUMMARY1000 will store gathered statistics for the entire dataset. Besides the outtable, SUMMARY1000 SP creates a few additional output tables — one table with statistics for each column type. Our dataset has two types of columns — numeric and nominal. So, SUMMARY1000 will generate two additional tables. These additional tables follow this naming convention: the name of the outtable + column type. In our case, the column types are NUM, representing numeric, and CHAR, representing nominal. So, the names of these two additional tables will be as follows:

FLIGHTS_TRAIN_SUM1000_NUM

FLIGHTS_TRAIN_SUM1000_CHAR

Having the statistics available in separate tables for specific datatypes makes it easier to view the statistics that apply to a specific datatype and reduces the number of columns whose statistics are viewed together. This simplifies the analysis process.

Check the summary statistics of the numeric columns.

SELECT * FROM FLIGHT.FLIGHTS_TRAIN_SUM1000_NUM


For the numeric columns, SUMMARY1000 gathers the following statistics:

◉ Missing value count
◉ Non-missing value count
◉ Average
◉ Variance
◉ Standard deviation
◉ Skewness
◉ Excess kurtosis
◉ Minimum
◉ Maximum

Each of these statistics can help uncover insights into the dataset. For instance, I can see that the DEPDEL15 and DEPDELAY columns have 49 missing values. There are large values in these columns: AIRTIME, CRSARRTIME, CRSDEPTIME, CRSELAPSEDTIME, DEPDELAY, DEPTIME, TAXIOUT, WHEELSOFF, and YEAR. Since I will create a decision tree model, I don’t need to deal with the large values or the missing values. Db2 will deal with both issues natively.

Next, I investigate the summary statistics of the nominal columns.

SELECT * FROM FLIGHT.FLIGHTS_TRAIN_SUM1000_CHAR


For nominal columns, SUMMARY1000 gathered the following statistics:

◉ Number of missing values
◉ Number of non-missing values
◉ Number of distinct values
◉ Frequency of the most frequent value

3. Preprocess data


From the above data exploration, I can see that apart from the 49 missing values in DEPDEL15 and DEPDELAY noted earlier, the dataset has no missing values. These four TIME columns have large values: AIRTIME, CRSARRTIME, DEPTIME, WHEELSOFF. I’ll leave the nominal columns as-is, as the decision tree implementation in Db2 can deal with them natively.

Extract the hour part from the TIME columns — CRSARRTIME, DEPTIME, WHEELSOFF.

From looking up the description of the dataset, I see the values in the CRSARRTIME, DEPTIME, and WHEELSOFF columns are encoding of hhmm of the time values. I extract the hour part of these values to create, hopefully, better features for the learning algorithm. 

Scale the CRSARRTIME column: dividing the value by 100 gives the hour of the scheduled arrival time:

UPDATE FLIGHT.FLIGHTS_TRAIN SET CRSARRTIME = CRSARRTIME / 100

Scale the DEPTIME column: dividing the value by 100 gives the hour of the departure time:

UPDATE FLIGHT.FLIGHTS_TRAIN SET DEPTIME = DEPTIME / 100

Scale the WHEELSOFF column: dividing the value by 100 gives the hour of the wheels-off time:

UPDATE FLIGHT.FLIGHTS_TRAIN SET WHEELSOFF = WHEELSOFF / 100

4. Train a decision tree model


Now the training dataset is ready for the decision tree algorithm. 

I train a decision tree model using GROW_DECTREE SP. 

CALL IDAX.GROW_DECTREE('model=FLIGHT.flight_dectree, intable=FLIGHT.FLIGHTS_TRAIN, id=ID, target=FLIGHTSTATUS')

I called this SP using the following parameters:

◉ model: the name I want to give to the decision tree model — FLIGHT_DECTREE
◉ intable: the name of the table where the training dataset is stored
◉ id: the name of the ID column
◉ target: the name of the target column

After completing the model training, the GROW_DECTREE SP generated several tables with metadata from the model and the training dataset. Here are some of the key tables:

◉ FLIGHT_DECTREE_MODEL: this table contains metadata about the model. Examples of metadata include depth of the tree, strategy for handling missing values, and the number of leaf nodes in the tree. 
◉ FLIGHT_DECTREE_NODES: this table provides information about each node in the decision tree. 
◉ FLIGHT_DECTREE_COLUMNS: this table provides information on each input column and its role in the trained model. The information includes the importance of a column in generating a prediction from the model (a quick query against this table is sketched below).
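
For example, a quick look at the columns table shows how each feature contributed to the model (a sketch; I select everything because the exact metadata column names may vary by Db2 version):

SELECT *
FROM FLIGHT.FLIGHT_DECTREE_COLUMNS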

5. Generate predictions from the model


Since the FLIGHT_DECTREE model is trained and deployed in the database, I can use it for generating predictions on the test records from the FLIGHTS_TEST table.

First, I preprocess the test dataset using the same preprocessing logic that I applied to the TRAINING dataset. 

Scale the CRSARRTIME column: dividing the value by 100 gives the hour of the scheduled arrival time:

UPDATE FLIGHT.FLIGHTS_TEST SET CRSARRTIME = CRSARRTIME / 100

Scale the DEPTIME column: dividing the value by 100 gives the hour of the departure time:

UPDATE FLIGHT.FLIGHTS_TEST SET DEPTIME = DEPTIME / 100

Scale the WHEELSOFF column: dividing the value by 100 gives the hour of the wheels-off time:

UPDATE FLIGHT.FLIGHTS_TEST SET WHEELSOFF = WHEELSOFF / 100

Generating predictions

I use PREDICT_DECTREE SP to generate predictions from the FLIGHT_DECTREE model:

CALL IDAX.PREDICT_DECTREE('model=FLIGHT.flight_dectree, intable=FLIGHT.FLIGHTS_TEST, outtable=FLIGHT.FLIGHTS_TEST_PRED, prob=true, outtableprob=FLIGHT.FLIGHTS_TEST_PRED_DIST')

Here is the list of parameters I passed when calling this SP:

◉ model: the name of the decision tree model, FLIGHT_DECTREE
◉ intable: name of the input table to generate predictions from
◉ outtable: the name of the table that the SP will create and store predictions to
◉ prob: a boolean flag indicating if we want to include in the output the probability of each prediction
◉ outtableprob: the name of the output table where the probability of each prediction will be stored (a quick spot-check of the prediction output is sketched after this list)
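
With the predictions written to FLIGHT.FLIGHTS_TEST_PRED, a quick spot-check joins them back to the actual labels (a sketch; the name of the prediction column, CLASS here, is an assumption and may differ in your environment):

SELECT T.ID, T.FLIGHTSTATUS AS ACTUAL, P.CLASS AS PREDICTED -- CLASS is assumed; check the output table's schema
FROM FLIGHT.FLIGHTS_TEST T
JOIN FLIGHT.FLIGHTS_TEST_PRED P ON T.ID = P.ID
FETCH FIRST 10 ROWS ONLY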

6. Evaluate the model


Using generated predictions for the test dataset, I compute a few metrics to evaluate the quality of the model’s predictions.

Creating a confusion matrix

I use CONFUSION_MATRIX SP to create a confusion matrix based on the model’s prediction on the TEST dataset. 

CALL IDAX.CONFUSION_MATRIX('intable=FLIGHT.FLIGHTS_TEST, resulttable=FLIGHT.FLIGHTS_TEST_PRED, id=ID, target=FLIGHTSTATUS, matrixTable=FLIGHT.FLIGHTS_TEST_CMATRIX')

In calling this SP, here are some of the key parameters that I passed:

◉ intable: the name of the table that contains the dataset and the actual value of the target column
◉ resulttable: the name of the table that contains the column with predicted values from the model
◉ target: the name of the target column
◉ matrixTable: The output table where the SP will store the confusion matrix

After the SP completes its run, we have the following output table with statistics for the confusion matrix. 

FLIGHTS_TEST_CMATRIX:

REAL    PREDICTION    CNT
0       0             11795
0       1             671
1       0             2528
1       1             4981

This table has three columns. The REAL column has the actual flight status. PREDICTION column has the predicted flight status. Since flight status takes two values – 0 (on time) or 1 (delayed), we have four possible combinations between values in the REAL and the PREDICTION columns: 

1. TRUE NEGATIVE: REAL: 0, PREDICTION: 0 — The model accurately predicted the status of the flights that arrived on schedule. From the CNT column, we see that 11795 rows from the TEST table belong to this combination.
2. FALSE POSITIVE: REAL: 0, PREDICTION: 1 — these flights actually arrived on time, but the model predicted them to be delayed. 671 is the count of such flights.
3. FALSE NEGATIVE: REAL: 1, PREDICTION: 0 — these flights arrived late, but the model predicted them to be on time. From the CNT column, we find their count to be 2528.
4. TRUE POSITIVE: REAL: 1, PREDICTION: 1 — the model accurately identified these flights that were late. The count is 4981.

I use these counts to compute a few evaluation metrics for the model. For doing so, I use CMATRIX_STATS SP as follows:

CALL IDAX.CMATRIX_STATS('matrixTable=FLIGHT.FLIGHTS_TEST_CMATRIX')

The only parameter this SP needs is the name of the table that contains the statistics generated by the CONFUSION_MATRIX SP in the previous step. CMATRIX_STATS SP generates two sets of output. The first one shows overall quality metrics of the model. The second one includes the model’s predictive performance for each class.

The first output — overall model metrics — includes correct predictions, incorrect predictions, overall accuracy, and weighted accuracy. From this output, I see that the model has an overall accuracy of 83.98% and a weighted accuracy of 80.46%.
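
As a quick sanity check, the overall accuracy can be reproduced by hand from the confusion-matrix counts above, and the weighted accuracy is approximately the average of the two per-class true rates (my own arithmetic, not additional SP output):

Overall accuracy = (11795 + 4981) / (11795 + 671 + 2528 + 4981) = 16776 / 19975 ≈ 0.8398
Weighted accuracy ≈ (11795/12466 + 4981/7509) / 2 ≈ (0.946 + 0.663) / 2 ≈ 0.805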


With classification tasks, it’s usually useful to view the model’s quality factors for each individual class. The second output from the CMATRIX_STATS SP includes these class level quality metrics. 


For each class, this output includes the True Positive Rate (TPR), False Positive Rate (FPR), Positive Predictive Value (PPV) or Precision, and F-measure (F1 score). 
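
For reference, these per-class metrics derive directly from the confusion-matrix counts. Taking the delayed class (1) as the positive class, the values below are my own arithmetic from the counts reported earlier, not output copied from the SP:

TPR (recall) = TP / (TP + FN) = 4981 / (4981 + 2528) ≈ 0.663
FPR = FP / (FP + TN) = 671 / (671 + 11795) ≈ 0.054
PPV (precision) = TP / (TP + FP) = 4981 / (4981 + 671) ≈ 0.881
F-measure = 2 * PPV * TPR / (PPV + TPR) ≈ 0.757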

Source: ibm.com