Switching Between NSX-OVS and Upstream OVS Kernel Modules

Because NSX-OVS is not supported on the latest kernel version, you can switch from the NSX-OVS kernel module to the upstream OVS kernel module before upgrading the kernel to the latest version. If NCP does not work with the latest kernel after the upgrade, you can roll back by switching back to NSX-OVS and downgrading the kernel.
The first procedure below describes how to switch the NSX-OVS kernel module to the upstream OVS kernel module when you upgrade the kernel. The second procedure describes how to switch back to the NSX-OVS kernel module when you downgrade the kernel.
Both procedures involve the Kubernetes concepts taints and tolerations. For more information about these concepts, see https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration.
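
For example, applying a NoExecute taint to a node evicts every pod that does not tolerate that taint, while a pod whose spec carries a matching toleration stays on the node. A minimal sketch (the node name and key are only illustrative):

    # Evict pods that do not tolerate the taint
    kubectl taint nodes <node-name> example-key:NoExecute

    # A pod (or DaemonSet pod template) with this toleration is not evicted:
    tolerations:
    - key: example-key
      effect: NoExecute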

Switch to the upstream OVS kernel module

  1. Modify the tolerations of both daemonset.apps/nsx-ncp-bootstrap and daemonset.apps/nsx-node-agent (see the sketch after this procedure). Change the following:
    - effect: NoExecute
      operator: Exists
    to:
    - effect: NoExecute
      key: evict-user-pods
  2. Modify the nsx-node-agent configmap (see the sketch after this procedure). Change use_nsx_ovs_kernel_module to False.
  3. Taint worker-node1 with "evict-user-pods:NoExecute" to evict all user pods on this node to other nodes:
    kubectl taint nodes worker-node1 evict-user-pods:NoExecute
  4. Taint worker-node1 with "evict-ncp-pods:NoExecute" to evict the nsx-node-agent and nsx-ncp-bootstrap pods on this node to other nodes:
    kubectl taint nodes worker-node1 evict-ncp-pods:NoExecute
  5. Uninstall the NSX-OVS kernel module and restore the upstream OVS kernel module on worker-node1 (a shell sketch follows this procedure).
    1. Delete the kmod files vport-geneve.ko, vport-gre.ko, vport-lisp.ko, vport-stt.ko, vport-vxlan.ko, and openvswitch.ko in the directory /lib/modules/$(uname -r)/weak-updates/openvswitch.
    2. If the files vport-geneve.ko, vport-gre.ko, vport-lisp.ko, vport-stt.ko, vport-vxlan.ko, and openvswitch.ko exist in the directory /lib/modules/$(uname -r)/nsx/usr-ovs-kmod-backup, move them to the directory /lib/modules/$(uname -r)/weak-updates/openvswitch.
    3. Delete the directory /lib/modules/$(uname -r)/nsx.
  6. Upgrade the kernel of worker-node1 to the latest version and reboot it.
    Note: Set SELinux to Permissive mode on worker-node1 if containerd and kubelet fail to start.
  7. Restart kubelet.
  8. Remove the taint "evict-ncp-pods:NoExecute" from worker-node1 (removal syntax is sketched after this procedure). Verify that the nsx-ncp-bootstrap and nsx-node-agent pods can start.
  9. Remove the taint "evict-user-pods:NoExecute" from worker-node1. Verify that all pods on this node are running.
  10. Repeat steps 3-9 for other nodes.
  11. Restore the original tolerations of the nsx-ncp-bootstrap and nsx-node-agent DaemonSets that you changed in step 1.
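
The toleration change in step 1 can be made by editing each DaemonSet. A minimal sketch, assuming the NCP resources live in the nsx-system namespace (adjust to your deployment):

    kubectl -n nsx-system edit daemonset nsx-ncp-bootstrap
    kubectl -n nsx-system edit daemonset nsx-node-agent

    # In spec.template.spec.tolerations of each DaemonSet, replace
    #   - effect: NoExecute
    #     operator: Exists
    # with
    #   - effect: NoExecute
    #     key: evict-user-pods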
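
For step 2, the edit might look like the following. The configmap name nsx-node-agent-config, the nsx-system namespace, and the [nsx_node_agent] section are assumptions about a typical NCP deployment; use the names from your own manifests:

    kubectl -n nsx-system edit configmap nsx-node-agent-config

    # In the embedded ncp.ini, set:
    [nsx_node_agent]
    use_nsx_ovs_kernel_module = False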
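
Step 5 amounts to the following file operations, shown here as a shell sketch over the paths named above (run as root on worker-node1):

    KVER=$(uname -r)
    # Step 5.1: delete the NSX-OVS kmod files from the weak-updates directory
    cd /lib/modules/$KVER/weak-updates/openvswitch
    rm -f vport-geneve.ko vport-gre.ko vport-lisp.ko vport-stt.ko vport-vxlan.ko openvswitch.ko
    # Step 5.2: restore the backed-up upstream OVS kmod files, if the backup directory exists
    if [ -d /lib/modules/$KVER/nsx/usr-ovs-kmod-backup ]; then
        mv /lib/modules/$KVER/nsx/usr-ovs-kmod-backup/*.ko /lib/modules/$KVER/weak-updates/openvswitch/
    fi
    # Step 5.3: delete the NSX directory
    rm -rf /lib/modules/$KVER/nsx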
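
In steps 8 and 9, a taint is removed by repeating the taint command with a trailing hyphen after the effect:

    kubectl taint nodes worker-node1 evict-ncp-pods:NoExecute-
    kubectl taint nodes worker-node1 evict-user-pods:NoExecute-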

Switch back to the NSX-OVS kernel module

  1. Modify the tolerations of both daemonset.apps/nsx-ncp-bootstrap and daemonset.apps/nsx-node-agent. Change the following:
    - effect: NoExecute
      operator: Exists
    to:
    - effect: NoExecute
      key: evict-user-pods
  2. Modify the nsx-node-agent configmap. Change use_nsx_ovs_kernel_module to True.
  3. Taint worker-node1 with "evict-user-pods:NoExecute" to evict all user pods on this node to other nodes:
    kubectl taint nodes worker-node1 evict-user-pods:NoExecute
  4. Taint worker-node1 with "evict-ncp-pods:NoExecute" to evict the nsx-node-agent and nsx-ncp-bootstrap pods on this node to other nodes:
    kubectl taint nodes worker-node1 evict-ncp-pods:NoExecute
  5. Downgrade the kernel of worker-node1 to a supported version and reboot it.
    Note: Set SELinux to Permissive mode on worker-node1 if containerd and kubelet fail to start.
  6. Restart kubelet.
  7. Remove the taint "evict-ncp-pods:NoExecute" from worker-node1. Verify that the nsx-ncp-bootstrap and nsx-node-agent pods can start (a module check is sketched after this procedure).
  8. Remove the taint "evict-user-pods:NoExecute" from worker-node1. Verify that all pods on this node are running.
  9. Repeat steps 3-8 for other nodes.
  10. Restore the original tolerations of the nsx-ncp-bootstrap and nsx-node-agent DaemonSets that you changed in step 1.
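
After either procedure, a quick sanity check on the worker node is to confirm that an openvswitch kernel module resolves for the running kernel and is loaded; the reported file path and version depend on which module files are installed under /lib/modules/$(uname -r)/weak-updates/openvswitch:

    # Show the module file that openvswitch resolves to for the running kernel
    modinfo -n openvswitch
    # Check that the module is currently loaded
    lsmod | grep -w openvswitch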