Tanzu Platform Self-Managed 10.1

Troubleshooting your Tanzu Platform Self-Managed deployment

Last Updated March 03, 2025

Where to find Logs

The installer logs provide vital information about what went wrong. Access the installer log file, logs_installer.log, in the directory from which the installer command was run.
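
To inspect the log while the installer is running, you can tail it from that directory (a minimal sketch, assuming the default file name logs_installer.log mentioned above):

tail -f ./logs_installer.log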

Service Status

Check the status of all the packages being installed by running the following commands.

kubectl -n tanzusm get packageInstalls

Describe a particular package install:

kubectl -n tanzusm describe packageInstalls <package name>

List all apps with the kapp command:

kapp ls -n tanzusm --column Name

List an app and its resource status:

kapp inspect -a <app name listed in the above command> -n tanzusm
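
To quickly spot packages that have not reconciled, you can filter the package list (a minimal sketch; the grep pattern assumes the standard "Reconcile succeeded" description that healthy PackageInstall resources report):

# Show only package installs that are not reporting "Reconcile succeeded"
kubectl -n tanzusm get packageinstalls --no-headers | grep -v "Reconcile succeeded"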

Prometheus Data Collection

Prometheus data can offer insights into the RED (rate, errors, duration) metrics of the TPSM services. The following steps describe how to collect this data.

Prerequisite: jq (https://jqlang.github.io/jq/download/). jq is usually available on the jump box; if it is not, install it and make it executable.

To create a dump of Prometheus data:

  1. Export the cluster kubeconfig file path to the KUBECONFIG environment variable
    export KUBECONFIG=<cluster kube config file location full path>
    
  2. Find the Prometheus pod name and store it in a variable
    TP_PROMETHEUS_POD="$(kubectl get pods -l app.kubernetes.io/part-of=prometheus -l app.kubernetes.io/component=server -n tanzusm -o json | jq -r '.items[0].metadata.name')"
    
  3. Port-forward the Prometheus admin API so that you can invoke the Prometheus data snapshot creation from localhost
    kubectl port-forward pods/$TP_PROMETHEUS_POD 9090:9090 -n tanzusm
    
  4. Open a new terminal, export KUBECONFIG as in step 1, and run the commands in the following steps (a consolidated sketch of steps 5 through 9 appears after this list)
  5. Create the Prometheus data snapshot in Prometheus server and store the snapshot name in a variable
    TP_PROMETHEUS_SNAPSHOT_NAME=$(curl -X POST -s http://localhost:9090/prometheus/api/v1/admin/tsdb/snapshot | jq -r .data.name)
    
  6. Create a local directory into which to copy the snapshot from the cluster
    mkdir -p prometheus-data/$TP_PROMETHEUS_SNAPSHOT_NAME
    
  7. Find the Prometheus pod name again in this terminal and store it in a variable
    TP_PROMETHEUS_POD="$(kubectl get pods -l app.kubernetes.io/part-of=prometheus -l app.kubernetes.io/component=server -n tanzusm -o json | jq -r '.items[0].metadata.name')"
    
  8. Copy the snapshot to the local directory
    kubectl cp -n tanzusm $TP_PROMETHEUS_POD:/bitnami/prometheus/data/snapshots/$TP_PROMETHEUS_SNAPSHOT_NAME prometheus-data/$TP_PROMETHEUS_SNAPSHOT_NAME
    
  9. Create a tar of the local directory and upload it
    tar -czvf prometheus-$TP_PROMETHEUS_SNAPSHOT_NAME.tar.gz prometheus-data/$TP_PROMETHEUS_SNAPSHOT_NAME
    
  10. Return to the previous terminal session and stop the port forwarding by pressing Ctrl+C
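
For convenience, steps 5 through 9 (run in the second terminal) can be combined into a short script. This is a minimal sketch that reuses the commands above and assumes the port forward from step 3 is still running and KUBECONFIG is exported:

# Trigger a Prometheus TSDB snapshot and capture its name
TP_PROMETHEUS_SNAPSHOT_NAME=$(curl -X POST -s http://localhost:9090/prometheus/api/v1/admin/tsdb/snapshot | jq -r .data.name)
# Create a local directory for the snapshot
mkdir -p prometheus-data/$TP_PROMETHEUS_SNAPSHOT_NAME
# Look up the Prometheus server pod
TP_PROMETHEUS_POD="$(kubectl get pods -l app.kubernetes.io/part-of=prometheus -l app.kubernetes.io/component=server -n tanzusm -o json | jq -r '.items[0].metadata.name')"
# Copy the snapshot out of the pod and package it
kubectl cp -n tanzusm $TP_PROMETHEUS_POD:/bitnami/prometheus/data/snapshots/$TP_PROMETHEUS_SNAPSHOT_NAME prometheus-data/$TP_PROMETHEUS_SNAPSHOT_NAME
tar -czvf prometheus-$TP_PROMETHEUS_SNAPSHOT_NAME.tar.gz prometheus-data/$TP_PROMETHEUS_SNAPSHOT_NAME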

Troubleshooting common issues

Use the following entries to troubleshoot common issues. Each entry lists the issue, its symptom and possible causes, and the solution.

Issue: After a restore, pods for a few of the services are stuck in CrashLoopBackOff

Symptom:
After restoring and waiting sufficiently for reconciliation, the pods of a few of the services are stuck in CrashLoopBackOff.

Solution:
Delete the affected pods and use kctrl to kick the app that encapsulates those pods:

kubectl -n tanzusm delete pods <pod name>
kctrl package installed kick -i <package install name> -n tanzusm
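
To identify the affected pods first, you can filter the pod list (a minimal sketch; the grep pattern assumes the standard CrashLoopBackOff status string shown by kubectl):

# List pods in the tanzusm namespace that are in CrashLoopBackOff
kubectl -n tanzusm get pods | grep -i crashloopbackoff
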
Issue: Velero storage backend misconfiguration

Symptom:
Backups are not stored correctly, or storage errors occur.

Possible Causes:
Incorrect bucket names or paths. Misconfigured storage provider credentials.

Solution:
Check the storage configuration. Verify that the bucket/container name and region are correctly specified in the Velero configuration:

velero backup-location get

Ensure that Velero's credentials have access to the storage backend.
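
You can also inspect the BackupStorageLocation resource directly to confirm that Velero reports it as Available (a minimal sketch; the secret name cloud-credentials is a common Velero default and may differ in your installation):

# Check the phase of the backup storage location
kubectl -n velero get backupstoragelocation -o wide
# Confirm that the credentials secret used by Velero exists
kubectl -n velero get secret cloud-credentials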

Issue: Pods going into the Evicted state

Symptom:
A few of the pods are observed in the Evicted state.

Possible Causes:
Insufficient cluster capacity or unhealthy nodes.

Solution:
This issue is due to cluster capacity. Run the following command to get the nodes' status and check the health of the nodes:

kubectl get nodes

Perform the necessary actions: increase the number of nodes in the cluster and ensure that the overall cluster capacity is sufficient for the installation.
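
To see which pods were evicted, you can filter on the Failed phase (a minimal sketch; evicted pods are reported with phase Failed and reason Evicted):

# List pods in the tanzusm namespace that are in the Failed phase, which includes evicted pods
kubectl -n tanzusm get pods --field-selector=status.phase=Failed
# Optionally clean them up once the capacity issue is resolved
kubectl -n tanzusm delete pods --field-selector=status.phase=Failed
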
Issue: Velero command fails

Symptom:
The velero backup create or velero restore command fails.

Solution:
Check the connectivity to the object store from within the workload cluster:

kubectl get pods -n velero
kubectl logs -n velero -f <pod name>

Check that none of the node agent and controller pods in the velero namespace are in an error state:

kubectl get pods -n velero
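
If a specific backup or restore failed, its own status and logs usually point to the cause (a minimal sketch using standard Velero CLI subcommands):

# Inspect the status and logs of a failed backup
velero backup describe <backup name>
velero backup logs <backup name>
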
Issue: After startup, the application dashboard, which shows vulnerabilities, takes time to show data

Symptom:
Due to multiple reconcilers in the system, the application takes time to sync.

Solution:
Wait for one or two hours after installation, because reconciliation of background information takes time.

Issue: Kafka coordinator issue; the consumers are not able to connect to Kafka

Symptom:
Due to an incorrect start/stop sequence, Kafka goes into an inconsistent state.

Solution:
Run the clean_kafka.sh script mentioned in the Post Restore Action section.

Issue: After adding or updating the certificate, the TPSM installer command fails

Symptom:
This happens when the certificate content is not formatted as a YAML string.

Solution:
The certificate in the `config.yaml` file must be provided as a YAML block scalar string literal. For example:

certificate: |
  -----BEGIN CERTIFICATE-----
  ....
  ....
  -----END CERTIFICATE-----

Issue: After updating the certificate, the user is not able to log in

Symptom:
Sometimes Carvel interrupts the Stakater Reloader, preventing it from restarting the deployments.

Solution:
Restart the following services after the certificate is updated:

kubectl rollout restart deployment/graphql-stitching-service deployment/uaa deployment/ucp-core-controllers -n tanzusm

Issue: After updating the certificate, the user is not able to log in

Symptom:
Sometimes the user updates the TLS certificate but does not update the private key. Logs from the Contour-Envoy pod show:

"Failed to load private key from , Cause: error:0b000074:X.509 certificate routines:OPENSSL_internal:KEY_VALUES_MISMATCH" code=13 connection=76 context=xds node_id=contour-envoy-775df8f468-rh7xj node_version=v1.28.2 response_nonce=62 version_info=

Solution:
Update the certificate and the private key together so that they match.
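
To confirm whether a certificate and private key match before applying them, you can compare their public keys (a minimal sketch; tls.crt and tls.key are placeholder file names for the certificate and key being configured):

# The two digests must be identical if the private key matches the certificate
openssl x509 -in tls.crt -noout -pubkey | openssl sha256
openssl pkey -in tls.key -pubout | openssl sha256
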
Issue: SeaweedFS reports a 'no writable space' error; use the following commands to resolve the issue

Symptom:
E1126 08:10:06.993985 s3api_object_handlers_put.go:161 upload to filer error: rpc error: code = Unknown desc = failed to find writable volumes for collection:daedalus replication:000 ttl: error: No more writable volumes!

Solution:

  1. Pause reconciliation
    kctrl package installed pause -i seaweedfs -n tanzusm
  2. Increase the storage size of the statefulset by editing or patching the PVC
    kubectl edit pvc data-seaweedfs-volume-0 -n tanzusm
  3. Port-forward the master pod to 9333
    kubectl port-forward seaweedfs-master-0 9333:9333 -n tanzusm
  4. Check the volume allocation status. The writable space allocated to the bucket should not be null.
    curl http://localhost:9333/dir/status | jq
  5. If no volume is allocated to any bucket and free space is available, use the following command to allocate volumes, then resume reconciliation as sketched after this list
    curl "http://localhost:9333/vol/grow?collection=&count=8"

Installer command-line help

To get help with the installer commands, use the --help flag. For example:

tanzu-sm-installer  --help  

You can also use this flag with sub-commands. For example:

tanzu-sm-installer install --help 

Installer sample commands

tanzu-sm-installer verify \
  -f config.yaml \
  -u "${ARTIFACTORY_USER}:${ARTIFACTORY_API_TOKEN}" \
  -r ${DOCKER_REGISTRY}/hub-self-managed/${TANZU_SM_VERSION}/repo \
  --kubeconfig ${KUBECONFIG}
tanzu-sm-installer install \
  -f config.yaml \
  -u "${ARTIFACTORY_USER}:${ARTIFACTORY_API_TOKEN}" \
  -r ${DOCKER_REGISTRY}/hub-self-managed/${TANZU_SM_VERSION}/repo \
  --yes
tanzu-sm-installer post-verify \
  --kubeconfig ${KUBECONFIG}
tanzu-sm-installer push collectors \
  -a "${REGISTRY_USERNAME}:{$REGISTRY_PASSWORD}" \
  -r "${REGISTRY_ENDPOINT}" \
  -f tanzusm-collector.tar -s
tanzu-sm-installer push tanzu-plugins \
  -u "${REGISTRY_USERNAME}:${REGISTRY_PASSWORD}" \
  -r "${REGISTRY_ENDPOINT}/${REPO_PATH}" \
  -i tanzu-bundle/tpsm-plugin-bundle.tar.gz
tanzu-sm-installer push tmc-extensions \
  -a "${REGISTRY_USERNAME}:${REGISTRY_PASSWORD}" \
  -r "${REGISTRY_ENDPOINT}/${REPO_PATH}" \
  -f agent-images.tar
tanzu-sm-installer log \
  --kubeconfig ${KUBECONFIG}
tanzu-sm-installer reset \
  --kubeconfig ${KUBECONFIG} -p
tanzu-sm-installer velero install \
  --provider aws \
  --image harbor.tanzu.io:8443/library/velero:v1.14.1 \
  --plugins harbor.tanzu.io:8443/library/velero/velero-plugin-for-aws:v1.10.0 \
  --bucket <BUCKET_NAME> \
  --secret-file <PATH_TO_CREDENTIAL_FILE> \
  --use-volume-snapshots=false \
  --features=EnableCSI \
  --use-node-agent \
  --backup-location-config region=<OBJECT_STORAGE_SERVICE_REGION>,s3ForcePathStyle="true",s3Url=<OBJECT_STORAGE_SERVICE_PATH>
tanzu-sm-installer velero backup create <KAPP_ARTIFACT_BACKUP_NAME> \
  --snapshot-move-data \
  --include-resources=apps.kappctrl.k14s.io,packageinstalls.packaging.carvel.dev \
  --include-namespaces tanzusm
tanzu-sm-installer velero restore create <SECRET_RESTORE_NAME> \
  --snapshot-move-data \
  --from-backup <FULL_BACKUP_NAME> \
  --include-resources=secrets
tanzu-sm-installer velero schedule create kapp-backup-schedule \
  --snapshot-move-data \
  --schedule="@every 4h" \
  --include-resources=apps.kappctrl.k14s.io,packageinstalls.packaging.carvel.dev \
  --include-namespaces tanzusm

Create an installation log bundle

To create an installation log bundle, run the following command:

tanzu-sm-installer log

By default, this command creates the log bundle in /tmp/tanzusm.

The following options are available for the tanzu-sm-installer log command.

-h, --help: Help for the log command. Default: NA
-k, --kubeconfig: Absolute path of the kubeconfig file that connects to the Kubernetes cluster on which Tanzu Platform is running. Default: NA
-n, --namespace: Name of the namespace in which Tanzu Platform is running. Default: tanzusm
-o, --outdir: Location in which to create the log bundle if not the default location. Default: /tmp/tanzusm
-w, --workdir: Working directory for Crashd and Starlark. Default: /tmp/tanzusm/work
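
For example, to collect the log bundle from a specific cluster into a custom directory (a minimal sketch using only the documented options; the paths are placeholders):

tanzu-sm-installer log --kubeconfig <cluster kubeconfig file path> -n tanzusm -o <output directory>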