Tanzu Platform Self-Managed 10.1

Troubleshooting your Tanzu Platform Self-Managed deployment

Last Updated March 03, 2025

Where to find Logs

The installer logs provide vital information about what went wrong. Access the installer log file, logs_installer.log, in the directory from which the installer command was run.
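
To inspect the log while the installer is running, you can tail it from that directory (a minimal sketch, assuming the default file name logs_installer.log mentioned above):

tail -f ./logs_installer.log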

Service Status

Check the status of all the packages being installed by running the following commands.

kubectl -n tanzusm get packageInstalls

Describe a particular package install:

kubectl -n tanzusm describe packageInstalls <package name>

List all apps with the kapp command:

kapp ls -n tanzusm --column Name

List an app and its resource status:

kapp inspect -a <app name listed in the above command> -n tanzusm
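
To quickly spot packages that have not reconciled, you can filter the package list (a minimal sketch; the grep pattern assumes the standard "Reconcile succeeded" description that healthy PackageInstall resources report):

# Show only package installs that are not reporting "Reconcile succeeded"
kubectl -n tanzusm get packageinstalls --no-headers | grep -v "Reconcile succeeded"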

Prometheus Data Collection

Prometheus data can offer insights into the RED (rate, errors, duration) metrics of the TPSM services. The following steps describe how to collect this data.

Prerequisite: jq (https://jqlang.github.io/jq/download/). jq is usually available on the jump box; if it is not, install it and make it executable.

To create a dump of Prometheus data:

  1. Export the cluster kubeconfig file path to the KUBECONFIG environment variable
    export KUBECONFIG=<cluster kube config file location full path>
    
  2. Find the Prometheus pod name and store it in a variable
    TP_PROMETHEUS_POD="$(kubectl get pods -l app.kubernetes.io/part-of=prometheus -l app.kubernetes.io/component=server -n tanzusm -o json | jq -r '.items[0].metadata.name')"
    
  3. Port-forward the Prometheus admin API so that you can invoke the Prometheus data snapshot creation from localhost
    kubectl port-forward pods/$TP_PROMETHEUS_POD 9090:9090 -n tanzusm
    
  4. Open a new terminal, export KUBECONFIG as in step 1, and run the commands in the following steps (a consolidated sketch of steps 5 through 9 appears after this list)
  5. Create the Prometheus data snapshot in Prometheus server and store the snapshot name in a variable
    TP_PROMETHEUS_SNAPSHOT_NAME=$(curl -X POST -s http://localhost:9090/prometheus/api/v1/admin/tsdb/snapshot | jq -r .data.name)
    
  6. Create a local directory into which to copy the snapshot from the cluster
    mkdir -p prometheus-data/$TP_PROMETHEUS_SNAPSHOT_NAME
    
  7. Find the Prometheus pod name again in this terminal and store it in a variable
    TP_PROMETHEUS_POD="$(kubectl get pods -l app.kubernetes.io/part-of=prometheus -l app.kubernetes.io/component=server -n tanzusm -o json | jq -r '.items[0].metadata.name')"
    
  8. Copy the snapshot to the local directory
    kubectl cp -n tanzusm $TP_PROMETHEUS_POD:/bitnami/prometheus/data/snapshots/$TP_PROMETHEUS_SNAPSHOT_NAME prometheus-data/$TP_PROMETHEUS_SNAPSHOT_NAME
    
  9. Create a tar of the local directory and upload it
    tar -czvf prometheus-$TP_PROMETHEUS_SNAPSHOT_NAME.tar.gz prometheus-data/$TP_PROMETHEUS_SNAPSHOT_NAME
    
  10. Return to the previous terminal session and stop the port forwarding by pressing Ctrl+C
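
For convenience, steps 5 through 9 (run in the second terminal) can be combined into a short script. This is a minimal sketch that reuses the commands above and assumes the port forward from step 3 is still running and KUBECONFIG is exported:

# Trigger a Prometheus TSDB snapshot and capture its name
TP_PROMETHEUS_SNAPSHOT_NAME=$(curl -X POST -s http://localhost:9090/prometheus/api/v1/admin/tsdb/snapshot | jq -r .data.name)
# Create a local directory for the snapshot
mkdir -p prometheus-data/$TP_PROMETHEUS_SNAPSHOT_NAME
# Look up the Prometheus server pod
TP_PROMETHEUS_POD="$(kubectl get pods -l app.kubernetes.io/part-of=prometheus -l app.kubernetes.io/component=server -n tanzusm -o json | jq -r '.items[0].metadata.name')"
# Copy the snapshot out of the pod and package it
kubectl cp -n tanzusm $TP_PROMETHEUS_POD:/bitnami/prometheus/data/snapshots/$TP_PROMETHEUS_SNAPSHOT_NAME prometheus-data/$TP_PROMETHEUS_SNAPSHOT_NAME
tar -czvf prometheus-$TP_PROMETHEUS_SNAPSHOT_NAME.tar.gz prometheus-data/$TP_PROMETHEUS_SNAPSHOT_NAME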

Troubleshooting common issues

Use the following entries to troubleshoot common issues. Each entry lists the issue, its symptom and possible causes, and the solution.

Issue: After a restore, pods for a few of the services are stuck in CrashLoopBackOff

Symptom:
After restoring and waiting sufficiently for reconciliation, the pods of a few of the services are stuck in CrashLoopBackOff.

Solution:
Delete the affected pods and use kctrl to kick the app that encapsulates those pods:

kubectl -n tanzusm delete pods <pod name>
kctrl package installed kick -i <package install name> -n tanzusm
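
To identify the affected pods first, you can filter the pod list (a minimal sketch; the grep pattern assumes the standard CrashLoopBackOff status string shown by kubectl):

# List pods in the tanzusm namespace that are in CrashLoopBackOff
kubectl -n tanzusm get pods | grep -i crashloopbackoff
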
Issue: Velero storage backend misconfiguration

Symptom:
Backups are not stored correctly, or storage errors occur.

Possible Causes:
Incorrect bucket names or paths. Misconfigured storage provider credentials.

Solution:
Check the storage configuration. Verify that the bucket/container name and region are correctly specified in the Velero configuration:

velero backup-location get

Ensure that Velero's credentials have access to the storage backend.
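
You can also inspect the BackupStorageLocation resource directly to confirm that Velero reports it as Available (a minimal sketch; the secret name cloud-credentials is a common Velero default and may differ in your installation):

# Check the phase of the backup storage location
kubectl -n velero get backupstoragelocation -o wide
# Confirm that the credentials secret used by Velero exists
kubectl -n velero get secret cloud-credentials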

Issue: Pods going into the Evicted state

Symptom:
A few of the pods are observed in the Evicted state.

Possible Causes:
Insufficient cluster capacity or unhealthy nodes.

Solution:
This issue is due to cluster capacity. Run the following command to get the nodes' status and check the health of the nodes:

kubectl get nodes

Perform the necessary actions: increase the number of nodes in the cluster and ensure that the overall cluster capacity is sufficient for the installation.
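
To see which pods were evicted, you can filter on the Failed phase (a minimal sketch; evicted pods are reported with phase Failed and reason Evicted):

# List pods in the tanzusm namespace that are in the Failed phase, which includes evicted pods
kubectl -n tanzusm get pods --field-selector=status.phase=Failed
# Optionally clean them up once the capacity issue is resolved
kubectl -n tanzusm delete pods --field-selector=status.phase=Failed
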
Issue: Velero command fails

Symptom:
The velero backup create or velero restore command fails.

Solution:
Check the connectivity to the object store from within the workload cluster:

kubectl get pods -n velero
kubectl logs -n velero -f <pod name>

Check that none of the node agent and controller pods in the velero namespace are in an error state:

kubectl get pods -n velero
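
If a specific backup or restore failed, its own status and logs usually point to the cause (a minimal sketch using standard Velero CLI subcommands):

# Inspect the status and logs of a failed backup
velero backup describe <backup name>
velero backup logs <backup name>
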
Issue: After startup, the application dashboard, which shows vulnerabilities, takes time to show data

Symptom:
Due to multiple reconcilers in the system, the application takes time to sync.

Solution:
Wait for one or two hours after installation, because reconciliation of background information takes time.

Issue: Kafka coordinator issue; the consumers are not able to connect to Kafka

Symptom:
Due to an incorrect start/stop sequence, Kafka goes into an inconsistent state.

Solution:
Run the clean_kafka.sh script mentioned in the Post Restore Action section.

Issue: After adding or updating the certificate, the TPSM installer command fails

Symptom:
This happens when the certificate content is not formatted as a YAML string.

Solution:
The certificate in the `config.yaml` file must be provided as a YAML block scalar string literal. For example:

certificate: |
  -----BEGIN CERTIFICATE-----
  ....
  ....
  -----END CERTIFICATE-----

Issue: After updating the certificate, the user is not able to log in

Symptom:
Sometimes Carvel interrupts the Stakater Reloader, preventing it from restarting the deployments.

Solution:
Restart the following services after the certificate is updated:

kubectl rollout restart deployment/graphql-stitching-service deployment/uaa deployment/ucp-core-controllers -n tanzusm

Issue: After updating the certificate, the user is not able to log in

Symptom:
Sometimes the user updates the TLS certificate but does not update the private key. Logs from the Contour-Envoy pod show:

"Failed to load private key from , Cause: error:0b000074:X.509 certificate routines:OPENSSL_internal:KEY_VALUES_MISMATCH" code=13 connection=76 context=xds node_id=contour-envoy-775df8f468-rh7xj node_version=v1.28.2 response_nonce=62 version_info=

Solution:
Update the certificate and the private key together so that they match.
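
To confirm whether a certificate and private key match before applying them, you can compare their public keys (a minimal sketch; tls.crt and tls.key are placeholder file names for the certificate and key being configured):

# The two digests must be identical if the private key matches the certificate
openssl x509 -in tls.crt -noout -pubkey | openssl sha256
openssl pkey -in tls.key -pubout | openssl sha256
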
Issue: SeaweedFS reports a 'no writable space' error; use the following commands to resolve the issue

Symptom:
E1126 08:10:06.993985 s3api_object_handlers_put.go:161 upload to filer error: rpc error: code = Unknown desc = failed to find writable volumes for collection:daedalus replication:000 ttl: error: No more writable volumes!

Solution:

  1. Pause reconciliation
    kctrl package installed pause -i seaweedfs -n tanzusm
  2. Increase the storage size of the statefulset by editing or patching the PVC
    kubectl edit pvc data-seaweedfs-volume-0 -n tanzusm
  3. Port-forward the master pod to 9333
    kubectl port-forward seaweedfs-master-0 9333:9333 -n tanzusm
  4. Check the volume allocation status. The writable space allocated to the bucket should not be null.
    curl http://localhost:9333/dir/status | jq
  5. If no volume is allocated to any bucket and free space is available, use the following command to allocate volumes, then resume reconciliation as sketched after this list
    curl "http://localhost:9333/vol/grow?collection=&count=8"

Installer command-line help

To get help with the installer commands, use the --help flag. For example:

tanzu-sm-installer  --help  

You can also use this flag with sub-commands. For example:

tanzu-sm-installer install --help 

Installer sample commands

tanzu-sm-installer verify \
  -f config.yaml \
  -u "${ARTIFACTORY_USER}:${ARTIFACTORY_API_TOKEN}" \
  -r ${DOCKER_REGISTRY}/hub-self-managed/${TANZU_SM_VERSION}/repo \
  --kubeconfig ${KUBECONFIG}
tanzu-sm-installer install \
  -f config.yaml \
  -u "${ARTIFACTORY_USER}:${ARTIFACTORY_API_TOKEN}" \
  -r ${DOCKER_REGISTRY}/hub-self-managed/${TANZU_SM_VERSION}/repo \
  --yes
tanzu-sm-installer post-verify \
  --kubeconfig ${KUBECONFIG}
tanzu-sm-installer push collectors \
  -a "${REGISTRY_USERNAME}:{$REGISTRY_PASSWORD}" \
  -r "${REGISTRY_ENDPOINT}" \
  -f tanzusm-collector.tar -s
tanzu-sm-installer push tanzu-plugins \
  -u "${REGISTRY_USERNAME}:${REGISTRY_PASSWORD}" \
  -r "${REGISTRY_ENDPOINT}/${REPO_PATH}" \
  -i tanzu-bundle/tpsm-plugin-bundle.tar.gz
tanzu-sm-installer push tmc-extensions \
  -a "${REGISTRY_USERNAME}:${REGISTRY_PASSWORD}" \
  -r "${REGISTRY_ENDPOINT}/${REPO_PATH}" \
  -f agent-images.tar
tanzu-sm-installer log \
  --kubeconfig ${KUBECONFIG}
tanzu-sm-installer reset \
  --kubeconfig ${KUBECONFIG} -p
tanzu-sm-installer velero install \
  --provider aws \
  --image harbor.tanzu.io:8443/library/velero:v1.14.1 \
  --plugins harbor.tanzu.io:8443/library/velero/velero-plugin-for-aws:v1.10.0 \
  --bucket <BUCKET_NAME> \
  --secret-file <PATH_TO_CREDENTIAL_FILE> \
  --use-volume-snapshots=false \
  --features=EnableCSI \
  --use-node-agent \
  --backup-location-config region=<OBJECT_STORAGE_SERVICE_REGION>,s3ForcePathStyle="true",s3Url=<OBJECT_STORAGE_SERVICE_PATH>
tanzu-sm-installer velero backup create <KAPP_ARTIFACT_BACKUP_NAME> \
  --snapshot-move-data \
  --include-resources=apps.kappctrl.k14s.io,packageinstalls.packaging.carvel.dev \
  --include-namespaces tanzusm
tanzu-sm-installer velero restore create <SECRET_RESTORE_NAME> \
  --snapshot-move-data \
  --from-backup <FULL_BACKUP_NAME> \
  --include-resources=secrets
tanzu-sm-installer velero schedule create kapp-backup-schedule \
  --snapshot-move-data \
  --schedule="@every 4h" \
  --include-resources=apps.kappctrl.k14s.io,packageinstalls.packaging.carvel.dev \
  --include-namespaces tanzusm

Create an installation log bundle

To create an installation log bundle, run the following command:

tanzu-sm-installer log

By default, this command creates the log bundle in /tmp/tanzusm.

The following options are available for the tanzu-sm-installer log command.

-h, --help: Help for the log command. Default: NA
-k, --kubeconfig: Absolute path of the kubeconfig file that connects to the Kubernetes cluster on which Tanzu Platform is running. Default: NA
-n, --namespace: Name of the namespace in which Tanzu Platform is running. Default: tanzusm
-o, --outdir: Location in which to create the log bundle if not the default location. Default: /tmp/tanzusm
-w, --workdir: Working directory for Crashd and Starlark. Default: /tmp/tanzusm/work
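
For example, to collect the log bundle from a specific cluster into a custom directory (a minimal sketch using only the documented options; the paths are placeholders):

tanzu-sm-installer log --kubeconfig <cluster kubeconfig file path> -n tanzusm -o <output directory>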