
Tuesday, 7 March 2023

Real-time analytics on IoT data


Why real-time analytics matters for IoT systems


IoT systems connect millions of devices that generate large amounts of streaming data. For some equipment, a single event may prove critical to understanding and responding to the health of the machine in real time, increasing the importance of accurate, reliable data. While real-time data remains important, storing and analyzing the historical data also creates opportunities to improve processes, decision-making and outcomes.

Smart grids, which include components like sensors and smart meters, produce a wealth of telemetry data that can be used for multiple purposes, including:

◉ Identifying anomalies such as manufacturing defects or process deviations
◉ Predictive maintenance on devices (such as meters and transformers)
◉ Real-time operational dashboards
◉ Inventory optimization (in retail)
◉ Supply chain optimization (in manufacturing)

Considering solutions for real-time analytics on IoT data


One way to achieve real-time analytics is to combine a time-series database (such as InfluxDB or TimescaleDB) or a NoSQL database (such as MongoDB) with a data warehouse and a BI tool.


This architecture raises a question: why would one use an operational database and still need a data warehouse? Architects consider such a separation so they can choose a special-purpose database — such as a NoSQL database for document data — or a time-series (key-value) database for low cost and high performance.

However, this separation also creates a data bottleneck — data can’t be analyzed without moving it from an operational data store to the warehouse. Additionally, NoSQL databases are not great at analytics, especially when it comes to complex joins and real-time analytics.

Is there a better way? What if you could get all of the above with a general-purpose, high-performance SQL database? You’d need this type of database to support time-series data, streaming data ingestion, real-time analytics and perhaps even JSON documents.


Achieving a real-time architecture with SingleStoreDB + IBM Cognos


SingleStoreDB supports fast ingestion with Pipelines (a native, first-class feature) and concurrent analytics on IoT data, enabling real-time analytics. On top of SingleStoreDB, you can use IBM® Cognos® Business Intelligence to help you make sense of all of this data. The previously described architecture then simplifies into:

Real-time analytics with SingleStoreDB & IBM Cognos

Pipelines in SingleStoreDB allow you to continuously load data at blazing-fast speeds. Millions of events can be ingested each second, in parallel, from data sources such as Kafka, cloud object storage or HDFS. This means you can stream in both structured and unstructured data for real-time analytics.
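
As a rough illustration, a minimal Pipeline that streams smart-meter events from Kafka into a table might look like the sketch below. The broker address, topic, table and column layout are assumptions invented for this example, not details from the webinar:

CREATE TABLE meter_readings (
    meter_id   BIGINT,
    reading_ts DATETIME,
    kwh        DOUBLE,
    SHARD KEY (meter_id)          -- distribute rows across the cluster by meter
);

-- Continuously ingest CSV events from a Kafka topic into the table
CREATE PIPELINE meter_events AS
LOAD DATA KAFKA 'kafka-broker.example.com:9092/meter-readings'
INTO TABLE meter_readings
FIELDS TERMINATED BY ','
(meter_id, reading_ts, kwh);

START PIPELINE meter_events;

Once started, the pipeline runs in the background and loads new Kafka messages in parallel across the cluster.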


But wait, it gets better…

1. Once data is in SingleStoreDB, it can also be used for real-time machine learning, or to safely run application code imported into a sandbox with SingleStoreDB’s Code Engine Powered by WebAssembly (Wasm).
2. With SingleStoreDB, you can also leverage geospatial data — for instance, to factor in site locations or to visualize material moving through your supply chains (a brief sketch follows this list).
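
As a brief, hypothetical sketch of the geospatial point above (the table, columns and coordinates are invented for illustration; distances are expressed in meters):

CREATE TABLE sites (
    site_id  BIGINT PRIMARY KEY,
    location GEOGRAPHYPOINT        -- SingleStoreDB geospatial point type
);

-- Find all sites within 50 km of a given longitude/latitude point
SELECT site_id
FROM sites
WHERE GEOGRAPHY_WITHIN_DISTANCE(location, 'POINT(-73.98 40.75)', 50000);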

Armis and Infiswift are just a couple of examples of how customers use SingleStoreDB for IoT applications:

◉ Armis uses SingleStoreDB to help enterprises discover and secure IoT devices. Armis originally started with PostgreSQL, migrated to Elasticsearch for better search performance and considered Google BigQuery before finally picking SingleStoreDB for its overall capabilities across relational, analytics and text search. The Armis Platform, of which SingleStoreDB now plays a significant part, collects an array of raw data (traffic, asset, user data and more) from various sources — then processes, analyzes, enriches and aggregates it.

◉ Infiswift selected SingleStoreDB after evaluating several other databases. Their decision was driven in part because of SingleStore’s Universal Storage technology (a hybrid table type that works for both transactional and analytical workloads).

Want to learn more about achieving real-time analytics?


Join IBM and SingleStore on September 21, 2022 for our webinar “Accelerating Real-Time IoT Analytics with IBM Cognos and SingleStore”. You will learn how real-time data can be leveraged to identify anomalies and create alarms by reading meter data and classifying unusual spikes as warnings.

We will demonstrate:

◉ Streaming data ingestion using SingleStoreDB Pipelines
◉ Stored procedures in SingleStoreDB to classify data before it is persisted on disk or in memory (sketched after this list)
◉ Dashboarding with Cognos
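
The classification step can be sketched roughly as follows: a stored procedure receives each pipeline batch and tags unusual readings before they are persisted. The threshold, names and columns below are illustrative assumptions, not the webinar's actual code:

DELIMITER //
-- Hypothetical procedure that classifies each incoming batch of meter readings
CREATE OR REPLACE PROCEDURE classify_readings(
    batch QUERY(meter_id BIGINT, reading_ts DATETIME, kwh DOUBLE)) AS
BEGIN
    INSERT INTO meter_readings_classified (meter_id, reading_ts, kwh, status)
    SELECT meter_id, reading_ts, kwh,
           CASE WHEN kwh > 100 THEN 'WARNING' ELSE 'NORMAL' END
    FROM batch;
END //
DELIMITER ;

-- Point a pipeline at the procedure instead of directly at a table
CREATE PIPELINE meter_events_classified AS
LOAD DATA KAFKA 'kafka-broker.example.com:9092/meter-readings'
INTO PROCEDURE classify_readings
FIELDS TERMINATED BY ',';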

These capabilities enable companies to:

◉ Provide better quality of service through quickly reacting to or predicting service interruptions due to equipment failures
◉ Identify opportunities to increase production throughput as needed
◉ Quickly and accurately invoice customers for their utilization

Source: ibm.com

Tuesday, 28 February 2023

How to use Netezza Performance Server query data in Amazon Simple Storage Service (S3)


In this example, we will demonstrate using current data within a Netezza Performance Server as a Service (NPSaaS) table combined with historical data in Parquet files to determine if flight delays have increased in 2022 due to the impact of the COVID-19 pandemic on the airline travel industry. This demonstration illustrates how Netezza Performance Server (NPS) can be extended to access data stored externally in cloud object storage (Parquet format files).

Background on the Netezza Performance Server capability demo


Netezza Performance Server (NPS) has recently added the ability to access Parquet files by defining a Parquet file as an external table in the database. This allows data that exists in cloud object storage to be easily combined with existing data warehouse data without data movement. The advantage to NPS clients is that they can store infrequently used data in a cost-effective manner without having to move that data into a physical data warehouse table.

To make it easy for clients to understand how to utilize this capability within NPS, a demonstration was created that uses flight delay data for all commercial flights from United States airports, collected by the United States Department of Transportation (Bureau of Transportation Statistics). This data will be analyzed using Netezza SQL and Python code to determine whether flight delays in the first half of 2022 have increased compared to earlier periods within the current data (January 2019 – December 2021).

This demonstration then compares the current flight delay data (January 2019 – June 2022) with historical flight delay data (June 2003 – December 2018) to understand if the flight delays experienced in 2022 are occurring with more frequency or simply following a historical pattern.

For this data scenario, the current flight delay data (2019 – 2022) is contained in a regular, internal NPS database table residing in an NPS as a Service (NPSaaS) instance within the U.S. East2 region of the Microsoft Azure cloud and the historical data (2003 – 2018) is contained in an external Parquet format file that resides on the Amazon Web Services (AWS) cloud within S3 (Simple Storage Service) storage.

All SQL and Python code is executed against the NPS database using Jupyter notebooks, which capture query output and graphing of results during the analysis phase of the demonstration. The external table capability of NPS makes it transparent to a client that some of the data resides externally to the data warehouse. This provides a cost-effective data analysis solution for clients that have frequently accessed data that they wish to combine with older, less frequently accessed data. It also allows clients to store their different data collections using the most economical storage based on the frequency of data access, instead of storing all data using high-cost data warehouse storage.

Prerequisites for the demo


The data set used in this example is a publicly available data set that is available from the United States Department of Transportation, Bureau of Transportation Statistics website at this URL: https://www.transtats.bts.gov/ot_delay/ot_delaycause1.asp?qv52ynB=qn6n&20=E

Using the default settings will return the most recent flight delay data for the last month of data available (for example, in late November 2022, the most recent data available was for August 2022). Any data from June 2003 up until the most recent month of data available can be selected.

The data definition


For this demonstration of the NPS external table capability to access AWS S3 data, the following tables were created in the NPS database.

Figure 1 – NPS database table definitions

The primary tables that will be used in the analysis portion of the demonstration are the AIRLINE_DELAY_CAUSE_CURRENT table (2019 – June 2022 data) and the AIRLINE_DELAY_CAUSE_HISTORY (2003 – 2018 data) external table (Parquet file). The historical data is placed in a single Parquet file to improve query performance versus having to join sixteen external tables in a single query.
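
As a sketch of what this looks like in practice, a single query can read the internal table and the external Parquet table together; the YEAR and ARR_DELAY15 column names are assumptions based on the descriptions later in this post:

-- Delayed flights per year across both the internal and the external table
SELECT year, SUM(arr_delay15) AS delayed_flights
FROM airline_delay_cause_current
GROUP BY year
UNION ALL
SELECT year, SUM(arr_delay15) AS delayed_flights
FROM airline_delay_cause_history        -- external table over the S3 Parquet file
GROUP BY year
ORDER BY year;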

The following diagram shows the data flows:

Figure 2 – Data flow for data analysis

Brief description of the flight delay data


Before the actual data analysis is discussed, it is important to understand the data columns tracked within the flight delay information and what the columns represent.

A flight is not counted as a delayed flight unless the delay is over 15 minutes from the original departure time.

There are five types of delays that are reported by the airlines participating in flight delay tracking:

◉ Air Carrier – the reason for the flight delay was within the airline’s control such as maintenance or flight crew issues, aircraft cleaning, baggage loading, fueling, and related issues.

◉ Extreme Weather – the flight delay was caused by extreme weather factors such as a blizzard, hurricane, or tornado.

◉ National Aviation System (NAS) – delays attributed to the national aviation system which covers a broad set of conditions such as non-extreme weather, airport operations, heavy traffic volumes, and air traffic control.

◉ Late arriving aircraft – a previous flight using the same aircraft arrived late, causing the present flight to depart late.

◉ Security – delays caused by an evacuation of a terminal or concourse, reboarding of an aircraft due to a security breach, inoperative screening equipment, and/or long lines more than 29 minutes in screening areas.

Since a flight delay can result from more than one of the five reasons, the delays are captured using several different columns of information. The first column, ARR_DELAY15, contains the number of delayed flights (those arriving more than 15 minutes late). Five columns correspond to the flight delay types: CARRIER_CT, WEATHER_CT, NAS_CT, SECURITY_CT, and LATE_AIRCRAFT_CT. The sum of these five columns equals the value in the ARR_DELAY15 column.

Because multiple factors can contribute to a flight delay, the individual components can each account for a fractional portion of the overall delay count. For example, an overall value of 4.00 (ARR_DELAY15) might be composed of 2.67 attributed to CARRIER_CT and 1.33 to LATE_AIRCRAFT_CT. This allows for further analysis to understand all the factors that contributed to the overall flight delay.
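
A quick way to confirm this relationship in the data is a check like the following sketch, using the column names described above (a small tolerance absorbs rounding in the fractional counts):

-- Count rows where the five cause columns do not add up to ARR_DELAY15
SELECT COUNT(*) AS mismatched_rows
FROM airline_delay_cause_current
WHERE ABS(arr_delay15 -
          (carrier_ct + weather_ct + nas_ct + security_ct + late_aircraft_ct)) > 0.01;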

Here is an excerpt of the flight delay data to illustrate how the ARR_DELAY15 and flight delay reason columns interact:

Figure 3 – Portion of the flight delay data highlighting the column relationships

Flight delay data analysis


In this final section, the actual data analysis and results of the flight delay data analysis will be highlighted.

After the flight delay tables and external files (Parquet format files) were created and the data loaded, several queries were executed to validate that each table covered the correct date range and that valid data had been loaded into all the tables (internal and external).
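
A representative sanity check might look like the sketch below, using the table names from Figure 1 (the YEAR column name is an assumption):

-- Confirm the date range and row count of the internal (current) table
SELECT MIN(year) AS first_year, MAX(year) AS last_year, COUNT(*) AS row_count
FROM airline_delay_cause_current;

-- The external Parquet table is queried in exactly the same way
SELECT MIN(year) AS first_year, MAX(year) AS last_year, COUNT(*) AS row_count
FROM airline_delay_cause_history;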

Once this data validation and table verification was complete, the data analysis of the flight delay data began.

The initial data analysis was performed on the data in the internal NPS database table to look at the current flight delay data (2019 – June 2022) using this query.

Figure 4 – Initial analysis on current flight delay data

The data was displayed using a bar graph as well to make it easier to understand.

Figure 5 – Bar graph of current flight delay data (2019 – June 2022)

In looking at this graph, it appears that 2022 has fewer flight delays than the other recent years of flight delay data, with the exception of 2020 (the height of the COVID-19 pandemic). However, the flight delay data for 2022 covers only six months (January – June) versus the 12 months of data for each of the years 2019 through 2021. Therefore, the data must be normalized to provide a true comparison of flight delays between 2019 through 2021 and the partial-year data of 2022.

After the data is normalized by comparing the number of delayed flights to the total number of flights, it provides a valid comparison across the 2019 through June 2022 time period.

Figure 6 – There is a higher ratio of delayed flights in 2022 than in the period from 2019 – 2021

As Figure 6 highlights, when looking at the number of delayed flights compared to the total flights for the period, the flight delays in 2022 have increased over the prior years (2019 – 2021).
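
The normalization behind Figure 6 can be sketched as a ratio of delayed flights to total flights per year. The ARR_FLIGHTS column name is borrowed from the BTS source data and is an assumption about this schema:

-- Delay ratio per year: delayed flights divided by total flights
SELECT year,
       SUM(arr_delay15)                          AS delayed_flights,
       SUM(arr_flights)                          AS total_flights,
       SUM(arr_delay15) * 1.0 / SUM(arr_flights) AS delay_ratio
FROM airline_delay_cause_current
GROUP BY year
ORDER BY year;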

The next step in the analysis is to look at the historical flight delay data (2003 – 2018) to determine if the 2022 flight delays follow a historical pattern or if the flight delays have increased in 2022 due to the results of the pandemic period (airport staffing shortages, pilot shortages, and related factors).

Here is the initial query result on the historical flight delay data using a line graph output.

Figure 7 – Initial query using the historical data (2003 – 2018)

Figure 8 – Flight delays increased early in the historical years

After looking at the historical flight delay data from 2003–2018 at a high level, it was determined that the historical data should be separated into two separate time periods: 2003–2012 and 2013–2018. This separation was determined by analyzing the flight delays for each month of the year (January through December) and comparing the data for each of the historical years of data (2003–2018). With this flight delay comparison, the period from 2013–2018 had fewer flight delays for each month than the flight delay data for the period from 2003–2012.

The result of this query was output in a bar graph format to highlight the lower number of flight delays for the years from 2013–2018.

Figure 9 – Flight delays were lower during 2013 through 2018

The final analysis combines the historical flight delay data and illustrates the benefit of joining data from external AWS S3 Parquet files with local Netezza tables: it performs a monthly analysis of the 2022 flight delay data (local Netezza) and graphs it alongside the two historical periods (Parquet): 2003–2012 and 2013–2018.

Figure 10 – The query to calculate monthly flight delays for 2022

Figure 11 – Flight delay comparison of 2022 (red) with historical period #1 (2003-2012) (blue) and historical period #2 (2013-2018) (green)

As the flight delay data graph indicates, the flight delays for 2022 are higher for every month from January through June (remember, the 2022 flight delay data runs only through June) than in historical period #2 (2013–2018). Only the oldest historical data (2003–2012) had flight delays comparable to 2022. Since the earlier analysis of current data (2019–June 2022) showed that 2022 had more flight delays than the period from 2019 through 2021, flight delays have increased in 2022 versus the last 10 years of flight delay data. This seems to indicate that the increased flight delays are caused by factors related to the COVID-19 pandemic’s impact on the airline industry.

A solution for quicker data analysis


The capabilities of NPS, along with the ability to perform data analysis using Jupyter notebooks and integration with IBM Watson Studio as part of Cloud Pak for Data as a Service (with a free tier of usage), allow clients to quickly perform data analysis on a data set that spans the data warehouse and external Parquet format files in the cloud. This combination provides clients flexibility and cost savings by allowing them to host data in a storage medium based on application performance requirements, frequency of data access, and budgetary constraints. By not requiring a client to move their data into the data warehouse, NPS can provide an advantage over other vendors such as Snowflake.

Supplemental section with additional details


The SQL used to create the native Netezza table with current data (2019-June 2022)

The SQL to define a database source in Netezza for the cloud object storage bucket

The SQL to create external table for 2003 through 2018 from parquet files

The SQL to ‘create table as select’ from the parquet file

Source: ibm.com

Saturday, 25 February 2023

5 misconceptions about cloud data warehouses


In today’s world, data warehouses are a critical component of any organization’s technology ecosystem. They provide the backbone for a range of use cases such as business intelligence (BI) reporting, dashboarding, and machine-learning (ML)-based predictive analytics, that enable faster decision making and insights.

The rise of cloud has allowed data warehouses to provide new capabilities such as cost-effective data storage at petabyte scale, highly scalable compute and storage, pay-as-you-go pricing and fully managed service delivery. Companies are shifting their investments to cloud software and reducing their spend on legacy infrastructure. In 2021, cloud databases accounted for 85% of the market growth in databases. These developments have accelerated the adoption of hybrid-cloud data warehousing; industry analysts estimate that almost 50% of enterprise data has been moved to the cloud.

What is holding back the other 50% of datasets on-premises? Based on our experience speaking with CTOs and IT leaders in large enterprises, we have identified the most common misconceptions about cloud data warehouses that cause companies to hesitate to move to the cloud.

Misconception 1: Cloud data warehouses are more expensive


When considering moving data warehouses from on-premises to the cloud, companies often get sticker shock at the total cost of ownership. However, a more detailed analysis is needed to make an informed decision. Traditional on-premises warehouses require a significant initial capital investment and ongoing support fees, as well as additional expenses for managing the enterprise infrastructure. In contrast, cloud data warehouses may carry a higher annual subscription fee, but that fee absorbs the upfront investment and much of the ongoing overhead. Cloud warehouses also provide customers with elastic scalability, cheaper storage, savings on maintenance and upgrade costs, and cost transparency, which allows customers to have greater control over their warehousing costs. Industry analysts estimate that organizations that implement best practices around cloud cost controls and cloud migration see an average savings of 21% when using a public cloud and a 13x revenue growth rate for adopters of hybrid-cloud through end-to-end reinvention.

Misconception 2: Cloud data warehouses do not provide the same level of security and compliance as on-premises warehouses


Companies in highly regulated industries such as finance, insurance, transportation and manufacturing have a complex set of compliance requirements for their data, often leading to an additional layer of complexity when it comes to migrating data to the cloud. In addition, companies have complex data security requirements. However, over the past decade, a vast array of compliance and security standards, such as SOC2, PCI, HIPAA, and GDPR, have been introduced and met by cloud providers. The rise of sovereign clouds and industry-specific clouds is addressing the concerns of governmental and industry-specific regulatory requirements. In addition, warehouse providers take on the responsibility of patching and securing the cloud data warehouse, ensuring that business users stay compliant with regulations as they evolve.

Misconception 3: All data warehouse migrations are the same, irrespective of vendors


While migrating to the cloud, CTOs often feel the need to revamp and “modernize” their entire technology stack – including moving to a new cloud data warehouse vendor. However, a successful migration usually requires multiple rounds of data replication, query optimization, application re-architecture and retraining of DBAs and architects.

To mitigate these complexities, organizations should evaluate whether a hybrid-cloud version of their existing data warehouse vendor can satisfy their use cases, before considering a move to a different platform. This approach has several benefits, such as streamlined migration of data from on-premises to the cloud, reduced query tuning requirements and continuity in SRE tooling, automations, and personnel. It also enables organizations to create a decentralized hybrid-cloud data architecture where workloads can be distributed across on-prem and cloud.

Misconception 4: Migration to cloud data warehouses needs to be 0% or 100%


Companies undergoing cloud migrations often feel pressure to migrate everything to the cloud to justify the investment of the migration. However, different workloads may be better suited for different deployment environments. With a hybrid-cloud approach to data management, companies can choose where to run specific workloads, while maintaining control over costs and workload management. It allows companies to take advantage of the benefits of the cloud, such as scale and elasticity, while also retaining the control and security of sensitive workloads in-house. For example, Marriott International built a decentralized hybrid-cloud data architecture while migrating from their legacy analytics appliances, and saw a nearly 90% increase in performance. This enabled data-driven analytics at scale across the organization.

Misconception 5: Cloud data warehouses reduce control over your deployment


Some DBAs believe that cloud data warehouses lack the control and flexibility of on-prem data warehouses, making it harder to respond to security threats, performance issues or disasters. In reality, cloud data warehouses have evolved to provide the same control maturity as on-prem warehouses. Cloud warehouses also provide a host of additional capabilities such as failover to different data centers, automated backup and restore, high availability, and advanced security and alerting measures. Organizations looking to increase adoption of ML are turning to cloud data warehouses that support new, open data formats to catalog, ingest, and query unstructured data types. This functionality provides access to data by storing it in an open format, increasing flexibility for data exploration and ML modeling used by data scientists, facilitating governed data use of unstructured data, improving collaboration, and reducing data silos with simplified data lake integration.

Additionally, some DBAs worry that moving to the cloud reduces the need for their expertise and skillset. However, in reality, cloud data warehouses only automate the operational management of data warehousing such as scaling, reliability and backups, freeing DBAs to work on high value tasks such as warehouse design, performance tuning and ecosystem integrations.

By addressing these five misconceptions of cloud data warehouses and understanding the nuances, advantages, trade-offs and total cost ownership of both delivery models, organizations can make more informed decisions about their hybrid-cloud data warehousing strategy and unlock the value of all their data.

Getting started with a cloud data warehouse


At IBM we believe in making analytics secure, collaborative and price-performant across all deployments, whether running in the cloud, hybrid, or on-premises. For those considering a hybrid or cloud-first strategy, our data warehousing SaaS offerings, including IBM Db2 Warehouse and Netezza Performance Server, are available across AWS, Microsoft Azure, and IBM Cloud and are designed to provide customers with the availability, elastic scaling, governance, and security required for SLA-backed, mission-critical analytics.

When it comes to moving workloads to the cloud, IBM’s Expert Labs migration services ensure 100% workload compatibility between on-premises workloads and SaaS solutions.

No matter where you are in your journey to cloud, our experts are here to help customize the right approach to fit your needs. See how you can get started with your analytics journey to hybrid cloud by contacting an IBM database expert today.

Source: ibm.com

Thursday, 16 February 2023

A step-by-step guide to setting up a data governance program


In our last blog, we delved into the seven most prevalent data challenges that can be addressed with effective data governance. Today we will share our approach to developing a data governance program to drive data transformation and fuel a data-driven culture.

Data governance is a crucial aspect of managing an organization’s data assets. The primary goal of any data governance program is to deliver against prioritized business objectives and unlock the value of your data across your organization.

Realize that a data governance program cannot exist on its own – it must solve business problems and deliver outcomes. Start by identifying business objectives, desired outcomes, key stakeholders, and the data needed to deliver these objectives. Technology and data architecture play a crucial role in enabling data governance and achieving these objectives.

Don’t try to do everything at once! Focus and prioritize what you’re delivering to the business, determine what you need, deliver and measure results, refine, expand, and deliver against the next priority objectives. A well-executed data governance program ensures that data is accurate, complete, consistent, and accessible to those who need it, while protecting data from unauthorized access or misuse.


Consider the following four key building blocks of data governance:

◉ People refers to the organizational structure, roles, and responsibilities of those involved in data governance, including those who own, collect, store, manage, and use data.

◉ Policies provide the guidelines for using, protecting, and managing data, ensuring consistency and compliance.

◉ Process refers to the procedures for communication, collaboration and managing data, including data collection, storage, protection, and usage.

◉ Technology refers to the tools and systems used to support data governance, such as data management platforms and security solutions.


For example, if the goal is to improve customer retention, the data governance program should focus on where customer data is produced and consumed across the organization, ensuring that the organization’s customer data is accurate, complete, protected, and accessible to those who need it to make decisions that will improve customer retention.

It’s important to coordinate and standardize policies, roles, and data management processes to align them with the business objectives. This will ensure that data is being used effectively and that all stakeholders are working towards the same goal.

Starting a data governance program may seem like a daunting task, but by starting small and focusing on delivering prioritized business outcomes, data governance can become a natural extension of your day-to-day business.

Building a data governance program is an iterative and incremental process


Step 1: Define your data strategy and data governance goals and objectives

What are the business objectives and desired results for your organization? You should consider both long-term strategic goals and short-term tactical goals and remember that goals may be influenced by external factors such as regulations and compliance.


A data strategy identifies, prioritizes, and aligns business objectives across your organization and its various lines of business. Across multiple business objectives, a data strategy will identify data needs, measures and KPIs, stakeholders, and required data management processes, technology priorities and capabilities.

It is important to regularly review and update your data strategy as your business and priorities change. If you don’t have a data strategy, you should build one – it doesn’t take a long time, but you do need the right stakeholders to contribute.

Once you have a clear understanding of business objectives and data needs, set data governance goals and priorities. For example, an effective data governance program may:

◉ Improve data quality, which can lead to more accurate and reliable decision making
◉ Increase data security to protect sensitive information
◉ Enable compliance and reporting against industry regulations
◉ Improve overall trust and reliability of your data assets
◉ Make data more accessible and usable, which can improve efficiency and productivity.

Clearly defining your goals and objectives will guide the prioritization and development of your data governance program, ultimately driving revenue, cost savings, and customer satisfaction.

Step 2: Secure executive support and essential stakeholders

Identify the key stakeholders and roles for the data governance program and who will need to be involved in its execution. This should include employees, managers, IT staff, data architects, line-of-business owners, and data custodians within and outside your organization.

An executive sponsor is crucial – an individual who understands the significance and objectives of data governance, recognizes the business value that data governance enables, and who supports the investment required to achieve these outcomes.

With key sponsorship in place, assemble the team to understand the compelling narrative, define what needs to be accomplished, how to raise awareness, and how to build the funding model that will be used to support the implementation of the data governance program.

The following is an example of typical stakeholder levels that may participate in a data governance program:


By effectively engaging key stakeholders, identifying and delivering clear business value, the implementation of a data governance program can become a strategic advantage for your organization.

Step 3: Assess, build & refine your data governance program

With your business objectives understood and your data governance sponsors and stakeholders in place, it’s important to map these objectives against your existing People, Process and Technology capabilities.


Data management frameworks such as the EDM Council’s DCAM and CDMC offer a structured way to assess your data maturity against industry benchmarks with a common language and set of data best practices.

Look at how data is currently being governed and managed within your organization. What are the strengths and weaknesses of your current approach? What is needed to deliver key business objectives?

Remember, you don’t have to (nor should you) do everything at once. Identify areas for improvement, in context of business objectives, to prioritize your efforts and focus on the most important areas to deliver results to the business in a meaningful way. An effective and efficient data governance program will support your organization’s growth and competitive advantage.

Step 4: Document your organization’s data policies

Data policies are a set of documented guidelines for how an organization’s data assets are consistently governed, managed, protected and used. Data policies are driven by your organization’s data strategy, align with business objectives and desired outcomes, and may be influenced by internal and external regulatory factors. Data policies may cover topics such as data collection, storage, and usage, as well as data quality and security.


Data policies ensure that your data is being used in a way that supports the overall goals of your organization and complies with relevant laws and regulations. This can lead to improved data quality, better decision making, and increased trust in the organization’s data assets, ultimately leading to a more successful and sustainable organization. 

Step 5: Establish roles and responsibilities

Define clear roles and responsibilities of those involved in data governance, including those responsible for collecting, storing, and using data. This will help ensure that everyone understands their role and can effectively contribute to the data governance effort.


The structure of data governance can vary depending on the organization. In a large enterprise, data governance may have a dedicated team overseeing it (as in the table above), while in a small business, data governance may be part of existing roles and responsibilities. A hybrid approach may also be suitable for some organizations. It is crucial to consider company culture and to develop a data governance framework that promotes data-driven practices. The key to success is to start small, learn and adapt, while focusing on delivering and measuring business outcomes.

Having a clear understanding of the roles and responsibilities of data governance participants can ensure that they have the necessary skills and knowledge to perform their duties.

Step 6: Develop and refine data processes

Data governance processes ensure effective decision making and enable consistent data management practices by coordinating teams across (and outside of) your organization. Additionally, data governance processes can also ensure compliance with regulatory standards and protect sensitive data.

Data processes provide formal channels for direction, escalation, and resolution. Data governance processes should be lightweight to achieve your business goals without adding unnecessary burden or hindering innovation.

Processes may be automated through tools, workflow, and technology.

It is important to establish these processes early to prevent issues or confusion that may arise later in the data management implementation.

Step 7: Implement, evaluate, and adapt your strategy

Once you have defined the components of your data governance program, it’s time to put them in action. This could include implementing new technologies or processes or making changes to existing ones.


It is important to remember that data governance programs can only be successful if they demonstrate value to the business, so you need to measure and report on the delivery of the prioritized business outcomes. Regularly monitoring and reviewing your strategy will ensure that it is meeting your goals and business objectives.

Continuously evaluate your goals and objectives and adjust as needed. This will allow your data governance program to evolve and adapt to the changing needs of the organization and the industry. An approach of continuous improvement will enable your data governance program to stay relevant and deliver maximum value to the organization.

Get started on your data governance program


In conclusion, by following an incremental structured approach and engaging key stakeholders, you can build a data governance program that aligns with the unique needs of your organization and supports the delivery of accelerated business outcomes.

Implementing a data governance program can present unique challenges such as limited resources, resistance to change and a lack of understanding of the value of data governance. These challenges can be overcome by effectively communicating the value and benefits of the program to all stakeholders, providing training and support to those responsible for implementation, and involving key decision-makers in the planning process.

By implementing a data governance program that delivers key business outcomes, you can ensure the success of your program and drive measurable business value from your organization’s data assets while effectively managing your data, improving data quality, and maintaining the integrity of data throughout its lifecycle.

Source: ibm.com

Wednesday, 15 February 2023

Smart approaches to allocations in enterprise financial and operational planning


An allocation is the process of shifting overhead costs throughout an organization. One company might want to distribute costs across business units or departments. Another might want to assign costs to individual products or projects. Fundamentally, the smartest approach to allocations is about properly assigning costs to the areas that benefit from those costs.

When organizations allocate costs, they benefit from more accurate financial reports that show a greater level of detail. By properly allocating the costs, we can see true profitability results and make strategic decisions based on those results.

These detailed results give organizations the ability to answer questions such as:

◉ Are we making money on this project?
◉ Should we stop selling this product?
◉ What personnel changes should we make?

Four key terms in the allocation process


There are four key terms associated with allocations. These are:

1. Source – The source is simply the original value. This is the value that will be allocated or moved somewhere.
2. Driver – This is the basis for allocation calculations. Drivers can be dollars, units sold, headcount, etc. These are tangible items that are used to determine how to spread the source costs.
3. Target – The target is where you want to move the cost to. The source is where you start from, the target is where you move it.
4. Offset – An offset is typically a negative value associated with that target. This is used to create a balanced accounting entry and ensure that your end result is the same as your starting value.

Performing allocations


How do you perform an allocation? First, you calculate the allocation amount. We use the driver to help determine a percentage to spread the cost. For example, a company has an office with two different departments. A very simple allocation takes the overall office costs, splits them in two, and allocates a piece to each department. In this example, the allocation percentage is half. Once the amount is defined, we post a journal entry to both move the allocated amount into the target and remove the amount from the source.
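
For readers who think in SQL rather than Planning Analytics cubes, the same mechanics can be sketched as below. Every table, column and account name here is invented purely for illustration:

-- Spread a source cost (office overhead) across departments using a headcount driver
SELECT d.department                                    AS target,
       d.headcount * 1.0 / t.total_headcount           AS allocation_pct,
       s.office_cost * d.headcount / t.total_headcount AS allocated_amount
FROM department_drivers d
CROSS JOIN (SELECT SUM(headcount) AS total_headcount
            FROM department_drivers) t
CROSS JOIN (SELECT SUM(amount) AS office_cost
            FROM gl_entries
            WHERE account = 'OFFICE_OVERHEAD') s;

A single offsetting (negative) entry equal to the total office cost would then be posted against the source account so that the overall result balances back to the starting value.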

This is a very simple explanation of allocations. In the real world, some companies take a complex approach to allocations.

A complex approach to allocating product costs


One Revelwood client in the healthcare industry asked us to create an allocation model for their IBM Planning Analytics environment. They wanted to allocate a series of product costs by territory and customer combinations. The organization has multiple product categories such as commercial products, Medicare products and Medicaid products.

We tackled this challenge by designing an allocation model that could be added into their existing planning and reporting model. The new model allows the company to calculate allocation percentages by utilizing a series of methods. These methods included:

◉ Using a standard allocation approach to calculate percentages via drivers that could be easily redefined
◉ Defining a fixed percentage for a single product, then allocating the remaining percentage to all other products
◉ Defining fixed percentages to existing subsets of products and then allocating those percentages into the specific products within each subset
◉ Defining a percentage to an ad-hoc subset of products and allocating only to those products
◉ Incorporating various combinations of these methods

This Planning Analytics allocation model uses a two-step process. First, it calculates the allocation percentages. Next, it uses the calculated percentages to allocate costs by entity, customer, product, territory and more. The model offers the financial team the flexibility to use an allocated series of independent expenses using a variety of drivers and approaches. This unique approach allows the team to separate the allocations into pieces and analyze the details throughout the process.

This approach saved the business a significant amount of time. Before Planning Analytics, the company spent days creating allocation calculations in Excel. With Planning Analytics, the company can now complete this process in approximately four minutes.

Source: ibm.com

Saturday, 4 February 2023

Unlocking the power of data governance by understanding key challenges


We introduced Data Governance: what it is and why it is so important. In this blog, we will explore the challenges that organizations face as they start their governance journey.

Organizations have long struggled with data management and understanding data in a complex and ever-growing data landscape. While operational data runs day-to-day business operations, gaining insights and leveraging data across business processes and workflows presents a well-known set of data governance challenges that technology alone cannot solve.

Every organization deals with the following challenges of data governance, and it is important to address these as part of your strategy:

Multiple data silos with limited collaboration


Data silos make it difficult for organizations to get a complete and accurate picture of their business. Silos exist naturally when data is managed by multiple operational systems. Silos may also represent the realities of a distributed organization. Breaking down these silos to encourage data access, data sharing and collaboration will be an important challenge for organizations in the coming years. The right data architecture to link and gain insight across silos requires the communication and coordination of a strategic data governance program.

Inconsistent or lacking business terminology, master data, hierarchies


Raw data without clear business definitions and rules is ripe for misinterpretation and confusion. Any use of data – such as combining or consolidating datasets from multiple sources – requires a level of understanding of that data beyond its physical formats. Combining or linking data assets across multiple repositories to gain greater analytics and insights requires alignment: linking with consistent master data, reference data, data lineage and hierarchies. Building and maintaining these structures requires the policies and coordination of effective data governance.

A need to ensure data privacy and data security


Data privacy and data security are major challenges when it comes to managing the increasing volume, usage, and complexity of new data. As more and more personal or sensitive data is collected and stored digitally, the risks of data breaches and cyber-attacks increase. To address these challenges and practice responsible data stewardship, organizations must invest in solutions that can protect their data from unauthorized access and breaches.

Ever-changing regulations and compliance requirements


As the regulatory landscape surrounding data governance continues to evolve, organizations need to stay up-to-date on the latest requirements and mandates. Organizations need to ensure that their enterprise data governance practices are compliant. They need to have the ability to:

◉ Monitor data issues
◉ Ensure data conformity with data quality
◉ Establish and manage business rules, data standards and industry regulations
◉ Manage risks associated with changing data privacy regulations

Lack of a 360-degree view of organization data


A 360-degree view of data refers to having a comprehensive understanding of all the data within an organization, including its structure, sources, and usage. Think about use cases like Customer 360, Patient 360 or Citizen 360 which provide organizational-specific views. Without these views, organizations will struggle to make data-driven business decisions, as they may not have access to all the information they need to fully understand their business and drive the right outcomes.

The growing volume and complexity of data


As the amount of data generated by organizations continues to grow, it will become increasingly challenging to manage and govern this data effectively. This may require implementing new technologies and data management processes to help handle the volume and complexity of data. These technologies and processes must be adopted to work within the data governance sphere of influence.

The challenges of remote work


The COVID-19 pandemic led to a significant shift towards remote work, which can present challenges for data governance initiatives. Organizations must find ways to effectively manage data and track compliance across data sources and stakeholders in a remote work environment. With remote work becoming the new normal, organizations need to ensure that their data is being accessed and used appropriately, even when employees are not physically present in the office. This requires a set of data governance best practices – including policies, procedures, and technologies – to control and monitor access to data and systems.

If any or all of these seven challenges feel familiar, and you need support with your data governance strategy, know that you aren’t alone. Our next blog will discuss the building blocks of a data governance strategy and share our point of view on how to establish a data governance framework from the ground up.

In the meantime, learn more about building a data-driven organization with The Data Differentiator guide for data leaders.

Source: ibm.com

Thursday, 2 February 2023

Data platform trinity: Competitive or complementary?


Data platform architecture has an interesting history. Towards the turn of the millennium, enterprises started to realize that reporting and business intelligence workloads required a different solution than their transactional applications. A read-optimized platform that could integrate data from multiple applications emerged: the data warehouse.

A decade later, the internet and mobile devices started to generate data of unforeseen volume, variety and velocity, which required a different data platform solution. Hence the data lake emerged, handling both unstructured and structured data at huge volume.

Yet another decade passed, and it became clear that the data lake and the data warehouse were no longer enough to handle the business complexity and new workloads of enterprises. They are too expensive. The value of data projects is difficult to realize. Data platforms are difficult to change. Time demanded a new solution, again.

Guess what? This time, at least three different data platform solutions are emerging: the data lakehouse, the data fabric, and the data mesh. While this is encouraging, it is also creating confusion in the market. The concepts and values overlap. At times, different interpretations emerge depending on who is asked.

This article endeavors to alleviate that confusion. The concepts will be explained, and then a framework will be introduced that shows how these three concepts may lead to one another or be used together.

Data lakehouse: A mostly new platform


The concept of the lakehouse was made popular by Databricks, who define it as: “A data lakehouse is a new, open data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID transactions of data warehouses, enabling business intelligence (BI) and machine learning (ML) on all data.”

While traditional data warehouses made use of an Extract-Transform-Load (ETL) process to ingest data, data lakes instead rely on an Extract-Load-Transform (ELT) process. Extracted data from multiple sources is loaded into cheap BLOB storage, then transformed and persisted into a data warehouse, which uses expensive block storage.

This storage architecture is inflexible and inefficient. Transformation must be performed continuously to keep the BLOB and data warehouse storage in sync, adding costs. And continuous transformation is still time-consuming. By the time the data is ready for analysis, the insights it can yield will be stale relative to the current state of transactional systems.

Furthermore, data warehouse storage cannot support workloads like Artificial Intelligence (AI) or Machine Learning (ML), which require huge amounts of data for model training. For these workloads, data lake vendors usually recommend extracting data into flat files to be used solely for model training and testing purposes. This adds an additional ETL step, making the data even more stale.

Data lakehouse was created to solve these problems. The data warehouse storage layer is removed from lakehouse architectures. Instead, continuous data transformation is performed within the BLOB storage. Multiple APIs are added so that different types of workloads can use the same storage buckets. This is an architecture that’s well suited for the cloud since AWS S3 or Azure DLS2 can provide the requisite storage.

Data fabric: A mostly new architecture


The data fabric represents a new generation of data platform architecture. It can be defined as: A loosely coupled collection of distributed services, which enables the right data to be made available in the right shape, at the right time and place, from heterogeneous sources of transactional and analytical natures, across any cloud and on-premises platforms, usually via self-service, while meeting non-functional requirements including cost effectiveness, performance, governance, security and compliance.

The purpose of the data fabric is to make data available wherever and whenever it is needed, abstracting away the technological complexities involved in data movement, transformation and integration, so that anyone can use the data. Some key characteristics of data fabric are:

A network of data nodes

A data fabric is composed of a network of data nodes (e.g., data platforms and databases), all interacting with one another to provide greater value. The data nodes are spread across the enterprise’s hybrid and multicloud computing ecosystem.

Each node can be different from the others

A data fabric can consist of multiple data warehouses, data lakes, IoT/Edge devices and transactional databases. It can include technologies that range from Oracle, Teradata and Apache Hadoop to Snowflake on Azure, RedShift on AWS or MS SQL in the on-premises data center, to name just a few.

All phases of the data-information lifecycle

The data fabric embraces all phases of the data-information-insight lifecycle. One node of the fabric may provide raw data to another that, in turn, performs analytics. These analytics can be exposed as REST APIs within the fabric, so that they can be consumed by transactional systems of record for decision-making.

Analytical and transactional worlds come together

Data fabric is designed to bring together the analytical and transactional worlds. Here, everything is a node, and the nodes interact with one another through a variety of mechanisms. Some of these require data movement, while others enable data access without movement. The underlying idea is that data silos (and differentiation) will eventually disappear in this architecture.

Security and governance are enforced throughout

Security and governance policies are enforced whenever data travels or is accessed throughout the data fabric. Just as Istio applies security governance to containers in Kubernetes, the data fabric will apply policies to data according to similar principles, in real time.

Data discoverability

Data fabric promotes data discoverability. Here, data assets can be published into categories, creating an enterprise-wide data marketplace. This marketplace provides a search mechanism, utilizing metadata and a knowledge graph to enable asset discovery. This enables access to data at all stages of its value lifecycle.

The advent of the data fabric opens new opportunities to transform enterprise cultures and operating models. Because data fabrics are distributed but inclusive, their use promotes federated but unified governance. This will make the data more trustworthy and reliable. The marketplace will make it easier for stakeholders across the business to discover and use data to innovate. Diverse teams will find it easier to collaborate, and to manage shared data assets with a sense of common purpose.

Data fabric is an inclusive architecture in which some new technologies (e.g., data virtualization) play a key role, but it also allows existing databases and data platforms to participate in a network, where a data catalogue or data marketplace can help in discovering new assets. Metadata plays a key role in discovering the data assets.

Data mesh: A mostly new culture


Data mesh as a concept was introduced by Thoughtworks, who defined it as: “…An analytical data architecture and operating model where data is treated as a product and owned by teams that most intimately know and consume the data.” The concept stands on four principles: domain ownership, data as a product, self-serve data platforms, and federated computational governance.

Data fabric and data mesh overlap as concepts. For example, both recommend a distributed architecture – unlike centralized platforms such as the data warehouse, data lake, and data lakehouse. Both promote the idea of a data product offered through a marketplace.

Differences also exist. As is clear from the definition above, unlike data fabric, data mesh is about analytical data; it is narrower in focus than data fabric. Secondly, it emphasizes an operating model and culture, meaning it goes beyond architecture alone. The nature of a data product can be generic in a data fabric, whereas data mesh clearly prescribes domain-driven ownership of data products.

The relationship between data lakehouse, data fabric and data mesh


Clearly, these three concepts have their own focus and strength. Yet, the overlap is evident.

Lakehouse stands apart from the other two. It is a new technology, like its predecessors. It can be codified. Multiple products exist in the market, including Databricks, Azure Synapse and Amazon Athena.

Data mesh requires a new operating model and cultural change. Often such cultural changes require a shift in the collective mindset of the enterprise. As a result, data mesh can be revolutionary in nature. It can be built from ground up at a smaller part of the organization before spreading into the rest of it.

Data fabric does not have the prerequisites of data mesh; it does not expect such a cultural shift. It can be built up using existing assets in which the enterprise has invested over the years. Thus, its approach is evolutionary.

So how can an enterprise embrace all these concepts?

Address old data platforms by adopting a data lakehouse

An enterprise can embrace adoption of a lakehouse as part of its own data platform evolution journey. For example, a bank may retire its decade-old data warehouse and deliver all BI and AI use cases from a single data platform by implementing a lakehouse.

Address data complexity with a data fabric architecture

If the enterprise is complex and has multiple data platforms, if data discovery is a challenge, if data delivery at different parts of the organization is difficult – data fabric may be a good architecture to adopt. Along with existing data platform nodes, one or multiple lakehouse nodes may also participate there. Even the transactional databases may also join the fabric network as nodes to offer or consume data assets.

Address business complexity with a data mesh journey

To address business complexity, if the enterprise embarks upon a cultural shift towards domain-driven data ownership, promotes self-service in data discovery and delivery, and adopts federated governance – it is on a data mesh journey. If the data fabric architecture is already in place, the enterprise may use it as a key enabler of that journey. For example, the data fabric marketplace may offer domain-centric data products – a key data mesh outcome. The metadata-driven discovery already established through the data fabric can be useful in discovering the new data products coming out of the mesh.

Every enterprise can look at its respective business goals and decide which entry point suits it best. But even though entry points and motivations can differ, an enterprise may easily use all three concepts together in its quest for data-centricity.

Source: ibm.com