Cloud Pak for Data Solutions Explained

Jingdong Sun
Apr 26, 2021 · 8 min read


Resource Consumption Guardrails

The following scenario happened several times when I was an IBM Cloud Pak for Data SRE Architect:

An engineer from the IBM Cloud Pak for Data Client Experience team contacted me and asked me to join a customer session to help debug a Cloud Pak for Data functional stability issue. After we logged on to the client cluster and checked worker node resource usage, we found that the worker nodes were unbalanced: some worker nodes’ CPU usage was at a normal level, while others were at 95% or even 99%. The issue was clearly a resource imbalance and overload problem.

This blog offers suggestions, best practices, and troubleshooting advice for the above issue, based on my Cloud Pak for Data customer support experience.

Background

IBM Cloud Pak for Data is a cloud-native platform that hosts many services and add-ons. Because services differ in architecture and in their static and dynamic resource consumption behavior, real-time resource management can be challenging in customer environments.

To better understand how to troubleshoot resource issues, we first need to understand how Cloud Pak for Data services are deployed: Cloud Pak for Data runs on Red Hat OpenShift, which is a Kubernetes platform, and all Cloud Pak for Data services are deployed through the Kubernetes scheduler. So our story needs to start with the Kubernetes scheduler.

Kubernetes resource allocation and behavior

The Kubernetes scheduler places pods onto worker nodes based on each pod’s resource (CPU and memory) requests and on the worker nodes’ current resource allocations. To learn more, please read the following blog, where the detailed behaviors are documented:

Cloud Pak for Data 3.x service resource set up

Once pods are placed and deployed, one major factor that affects the workload is each pod’s resource request and limit settings. While pods are running workloads, their resource usage can grow beyond the requested amount, up to the limit setting, which can cause worker node resource overload issues.
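To check the effective settings on a running pod, a quick sketch (the pod and namespace names are placeholders):

# Print each container's name followed by its resources stanza
# (requests and limits) as stored in the pod spec.
oc get pod <pod-name> -n <namespace> -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'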

Here are some example CPU and memory settings for Cloud Pak for Data services/pods in v3.0.1:

From the above examples, you can see that some services have big differences between their CPU request and limit settings — the limit can be ten or even more than a hundred times the request. For example:

  • The Regulatory Accelerator IIRA-ml-prod pod has a CPU request of 0.1 but a limit of 8, giving a limit-to-request ratio of 80.
  • IBM Streams, an instance management service (statefulset), has a CPU request of 0.1 and a limit of 12, giving a limit-to-request ratio of 120.

For services with large limit-to-request ratios, even if the scheduler’s initial resource-reservation calculation looks fine when it places the pods, the worker node may become overloaded once workloads are running and services consume resources well beyond their requests. If CPU hits the maximum, service performance degrades significantly, although services will not crash because CPU is a compressible resource. If memory hits the maximum, services and pods will crash (OOMKilled) and restart. Both situations negatively affect Cloud Pak for Data instances and make them difficult to manage.
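If you want to audit these ratios and see how heavily a node is committed, a rough sketch (the namespace and node names are placeholders):

# List each pod's container CPU request and limit, so that large
# limit-to-request ratios stand out (blank columns mean the value is unset).
oc get pods -n <namespace> -o custom-columns='POD:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,CPU_LIM:.spec.containers[*].resources.limits.cpu'

# Show how much of a node's allocatable capacity is already committed
# to requests and limits.
oc describe node <worker-node-name> | grep -A 8 "Allocated resources"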

How to avoid this situation?

Estimating Cluster Size for Cloud Pak for Data

When planning a cluster for Cloud Pak for Data, provide enough resources based on the services you plan to deploy and the workload you expect to run.

Sizing calculation tool

IBM Cloud Pak for Data Sales Configurator

Cloud Pak for Data provides a resource sizing calculator: the Sales Configurator.

Note: The Sales Configurator is an IBM-internal tool; please contact the IBM Cloud Pak for Data sales team to use it.

Each Cloud Pak for Data service provides CPU and memory sizing information at different scale levels in the above tool, so the IBM sales team can help customers produce a good estimate of how big a cluster they need for their workload.

Build Guardrails

Monitoring the Cloud Pak for Data cluster to find resource usage trends is one of the most important duties of a Cloud Pak for Data cluster administrator: it gives the customer a good picture of what daily resource usage looks like and helps predict future trends.

Please see the following blogs for details on how to monitor Cloud Pak for Data clusters (a quick command-line check follows the list):

  1. Cloud Pak for Data: Performance Monitoring Best Practices (Part 1)
  2. Cloud Pak for Data: Performance Monitoring Best Practices (Part 2)
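Besides the approaches in those blogs, a quick snapshot of actual consumption (as opposed to requests) is available through the metrics API:

# Current CPU/memory consumption per worker node.
oc adm top nodes

# Current consumption per pod in the Cloud Pak for Data namespace.
oc adm top pods -n <cpd-namespace>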

Of course, monitoring alone is not enough. In certain cases, the administrator will need to adjust the cluster resource allocation, or move services and pods around, to get the best performance. The operation tools discussed in the next section help with these tasks.

Operation tools

When a Cloud Pak for Data administrator notices that cluster resource usage is abnormal, or that the cluster can be optimized to improve performance, they can use the following tools to adjust the cluster.

load balance tool

This is a tool that helps rebalance workloads across worker nodes, so that each worker node has balanced resource usage.

Worker node resource imbalance generally arises at Cloud Pak for Data service deployment time, and it has been observed to happen more often with the Watson Knowledge Catalog service. Depending on the severity of the load on a worker node, some services may not be healthy enough to function well. When running oc describe nodes, you will notice that some worker nodes are far more overloaded than others. In this situation, the tool helps bring the cluster’s resources back into balance.

Note: This tool is currently not available externally, but it is accessible to the IBM sales and support teams. Details of this tool can be found here: https://github.ibm.com/PrivateCloud-analytics/zen-cluster-balance-tool
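When the tool is not available, a manual approximation is possible, assuming the affected pods are managed by a Deployment or StatefulSet so that deleted pods are recreated by the scheduler:

# 1. Mark the overloaded node unschedulable so recreated pods land elsewhere.
oc adm cordon <overloaded-node>

# 2. Delete a heavy pod; its controller recreates it on a less loaded node.
#    Repeat for other heavy pods as needed.
oc delete pod <heavy-pod-name> -n <cpd-namespace>

# 3. Make the node schedulable again once the load has spread out.
oc adm uncordon <overloaded-node>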

scaling command

cpd-cli scale helps adjust the deployment sizing of Cloud Pak for Data and its services at run time, to reduce or increase a service’s CPU and memory footprint.

If some services are at the medium or large scaling size but carry a low workload, and the cluster is tight on resources, the customer has a couple of options: scale the services down to the small size, or even to a custom size, to release resources for other services with high demand.
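For illustration, scaling a service down might look like the following; the assembly name is just an example, and the exact flags vary between cpd-cli releases, so check cpd-cli scale --help for your version:

# Scale the Watson Studio (wsl) assembly to the "small" configuration
# to release CPU and memory for other services.
cpd-cli scale -a wsl -n <cpd-namespace> --config small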

service quiesce/unquiesce

If a Cloud Pak for Data cluster is overloaded and the cluster administrator cannot add more resources to the cluster in a timely manner, the administrator can evaluate the situation and reduce the resource load by quiescing services that are temporarily not needed.

Once more resources are available and added to the cluster, the administrator can unquiesce (resume) these services.

This page documents a tool that supports quiescing and unquiescing Cloud Pak for Data services: cpd quiesce/unquiesce
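If that tool is not at hand, quiescing can be roughly approximated by scaling a service’s deployments to zero replicas and restoring them later. A sketch, assuming the service’s deployments share a label such as app=<service-label> (label keys vary by service):

# Record the current replica counts first, so they can be restored later.
oc get deploy -l app=<service-label> -n <cpd-namespace>

# Quiesce: scale every matching deployment down to zero replicas.
oc scale deploy -l app=<service-label> --replicas=0 -n <cpd-namespace>

# Unquiesce: restore the recorded replica counts (1 here as an example).
oc scale deploy -l app=<service-label> --replicas=1 -n <cpd-namespace>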

Understand runtime behaviors that affect resource usage

Cloud Pak for Data supports Python notebooks, RStudio, and Spark applications. These runtimes consume resources dynamically: if end users run many applications that take too many resources, the cluster becomes loaded, and service functions may slow down or even fail.

By default, runtime environments are cleaned up after a predefined idle timeout:

  • All default CPU and GPU environment runtimes are shut down automatically after they have been idle for longer than 18 hours.
  • Stopping the notebook kernel doesn’t stop the environment runtime.
  • All Spark runtimes are automatically stopped after 30 minutes of inactivity.
  • An RStudio runtime is stopped after an idle time of 2 hours.

However, the recommended best practice is to stop all active runtimes when you no longer need them. Project users with the Administrator role can stop all runtimes in the project; users with the Editor role can stop the runtimes they started. Active runtimes across all projects for your account can be found by clicking Administration > Platform Management > Environments in the Cloud Pak for Data console, as shown below; a command-line alternative is sketched after the screenshot.

Cloud Pak for Data Runtime environments
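From the command line, these runtimes also appear as pods in the Cloud Pak for Data namespace. A rough way to spot long-running ones (the jupyter name filter is an assumption; pod name patterns differ per runtime type):

# List runtime pods sorted by start time; the oldest appear first.
oc get pods -n <cpd-namespace> --sort-by=.status.startTime | grep -i jupyter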

Put into practice

Here is a set of best practices for getting the most out of your Cloud Pak for Data services and resources.

Monitor for operation patterns

You should monitor your cluster for resource usage patterns; this helps predict future resource usage and allows you to plan workloads ahead of time.
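A very simple way to collect a usage baseline over time, assuming metrics are available via oc adm top (a cron job or a proper monitoring stack is the better long-term answer):

# Append a timestamped snapshot of node usage every 5 minutes.
while true; do
  date >> node-usage.log
  oc adm top nodes >> node-usage.log
  sleep 300
done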

Plan for predictable and plannable workloads

Customers can use the following steps to limit service instance resource consumption in Cloud Pak for Data:

  1. Remove the “Create Services Instances” permission from all default user roles other than Administrator.
  2. Only administrators can provision service instances, and only after confirming available resources.
  3. End users shall not provision service instances.
  4. Administrators of service instances need to add other users to those shared instances.
  5. Change the administrator password, don’t share it, and don’t assign administrator privileges to anyone other than actual administrators.

If customers have a multi-tenant Cloud Pak for Data setup, they can set up a quota for each tenant to provide resource guardrails.
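When each tenant runs in its own namespace (OpenShift project), a standard Kubernetes ResourceQuota can serve as that guardrail. A minimal sketch, with the namespace and sizes as placeholders:

# Cap the total CPU/memory that pods in the tenant namespace may request
# and consume; pods that would exceed the quota are rejected at creation.
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: <tenant-namespace>
spec:
  hard:
    requests.cpu: "16"
    requests.memory: 64Gi
    limits.cpu: "32"
    limits.memory: 128Gi
EOF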

For jobs that can be scheduled, such as notebook jobs and ML model deployments, the Cloud Pak for Data administrator can give end users suggestions and create a plan for job deployments, to distribute the workload evenly over time.

Adjust for unexpected workload peaks

Please use the tools listed above to adjust cluster workload and resources as necessary.

Clean up failed or hanging pods

Because of the nature of cloud-native platforms, the Cloud Pak for Data administrator also needs to clean up the cluster when pods are left hanging.

When a pod hangs in the “Error” or “Terminating” state, it does not release its resources. This can drain resources quickly, so customers need to manually clean up these pods to release those resources.

Generally, this command helps force-delete a hanging pod:

oc delete pod <pod-name> -n <namespace> --force --grace-period=0
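To find candidates before force-deleting anything:

# Pods stuck in Error or Terminating state, across all namespaces.
oc get pods --all-namespaces | grep -E 'Error|Terminating'

# Failed pods in a specific namespace, via the API field selector.
oc get pods -n <namespace> --field-selector=status.phase=Failed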

Conclusions

Cluster resource management is a challenging task, especially when running many services with different workloads.

This blog listed some best practices and tools to help customers:

  1. Know your deployed services and estimated workload.
  2. Plan the cluster size using the Cloud Pak for Data Sales Configurator tool.
  3. Monitor the cluster to learn its resource usage.
  4. Plan dynamic workloads and distribute them evenly when possible.
  5. Use the tools above to fix cluster workload issues when they happen.

Back to the story at the beginning of this blog: after the engineers ran the load balance tool and redistributed the services evenly across all worker nodes, Cloud Pak for Data and its services were stable.
