Feature Based Change Log

Requirement	Notes	isMvp?
Secrets and ConfigMaps can get shared across namespaces		YES

Requirement	Notes	isMvp?
UI to enable and disable plugins		YES
Dynamic Plugin Framework in place		YES
Testing Infra up and running		YES
Docs and read me for creating and testing Plugins		YES
CI - MUST be running successfully with test automation	This is a requirement for ALL features.	YES
Release Technical Enablement	Provide necessary release enablement details and documents.	YES

Requirement	Notes	isMvp?
CI - MUST be running successfully with test automation	This is a requirement for ALL features.	YES
Release Technical Enablement	Provide necessary release enablement details and documents.	YES

Bug OCPBUGS-11237: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/jenkins/pull/1642

Bug OCPBUGS-2622: CI: Backend unit tests fails because devfile registry was updated (mock response)

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-1678~~. The following is the description of the original issue:
—
Description of problem:
pkg/devfile/sample_test.go fails after devfile registry was updated (https://github.com/devfile/registry/pull/126)

~~OCPBUGS-1677~~ is about updating our assertion so that the CI job runs successfully again. We might want to backport this as well.

This is about updating the code that the test should use a mock response instead of the latest registry content OR check some specific attributes instead of comparing the full JSON response.

Version-Release number of selected component (if applicable):
4.12

How reproducible:
Always

Steps to Reproduce:
1. Clone openshift/console
2. Run ./test-backend.sh

Actual results:
Unit tests fail

Expected results:
Unit tests should pass again

Additional info:

https://github.com/openshift/console/pull/12194

Bug OCPBUGS-2920: Update Prometheus Alerts

View the Description View the linked PRs

Our Prometheus alerts are inconsistent with both upstream and sometimes our own vendor folder. Let's do a clean update run before the next release is branched off.

https://github.com/openshift/cluster-etcd-operator/pull/956

Bug OCPBUGS-4598: [4.10z] Spike in pod-latency graph observed due to ovnkube-master restarts

View the Description View the linked PRs

Description of problem:

When running node-density (245 pods/node) on a 120 node cluster, we see that there is a huge spike (~22s) in Avg pod-latency. When the spike occurs we see all the ovnkube-master pods go through a restart.

The restart happens because of (ovnkube-master pods)

2022-08-10T04:04:44.494945179Z panic: reflect: call of reflect.Value.Len on ptr Value

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-08-09-114621

How reproducible:

Steps to Reproduce:
1. Run node-density on a 120 node cluster

Actual results:

Spike observed in pod-latency graph ~22s

Expected results:

Steady pod-latency graph ~4s

Additional info:

https://github.com/openshift/ovn-kubernetes/pull/1426

Bug OCPBUGS-1087: OLM Reports ResolutionFailed when there are multiple upgrade paths between channel entries

View the Description View the linked PRs

Description of problem:

Catalog affected : icr.io/cpopen/datapower-operator-catalog:1.6.2

opm render icr.io/cpopen/datapower-operator-catalog:1.6.2 -o yaml > catalog.yaml

 yq 'select(.schema == "olm.channel") | select(.name=="v1.6")' catalog.yaml
entries:
  - name: datapower-operator.v1.6.0
    skipRange: '>=1.0.0 <1.6.0'
  - name: datapower-operator.v1.6.1
    replaces: datapower-operator.v1.6.0
    skipRange: '>=1.0.0 <1.6.1'
  - name: datapower-operator.v1.6.2
    replaces: datapower-operator.v1.6.1
    skipRange: '>=1.0.0 <1.6.2'
name: v1.6
package: datapower-operator
schema: olm.channel

This have worked fine untill 4.10 resolver changes. Also using both the replaces and skiprange seems to be okay the way it was explained here
https://v0-18-z.olm.operatorframework.io/docs/concepts/olm-architecture/operator-catalog/creating-an-update-graph/#skiprange

How reproducible:
use following subscription, install 1.6.0 and then upgrade to 1.6.2

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/datapower-operator.openshift-operators: ""
  name: datapower-operator
  namespace: openshift-operators
spec:
  channel: v1.6
  installPlanApproval: Manual
  name: datapower-operator
  source: datapower
  sourceNamespace: openshift-marketplace
  startingCSV: datapower-operator.v1.6.0

Error n subscription yaml :

  conditions:
  - lastTransitionTime: "2022-09-09T13:42:08Z"
    message: all available catalogsources are healthy
    reason: AllCatalogSourcesHealthy
    status: "False"
    type: CatalogSourcesUnhealthy
  - message: 'a unique replacement chain within a channel is required to determine
      the relative order between channel entries, but 2 replacement chains were found
      in channel "v1.6" of package "datapower-operator": datapower-operator.v1.6.2...datapower-operator.v1.6.0,
      datapower-operator.v1.6.1...datapower-operator.v1.6.0'
    reason: ErrorPreventedResolution
    status: "True"
    type: ResolutionFailed

Logs

I0909 13:43:51.492784       1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"", Name:"openshift-operators", UID:"aacda32d-748f-4408-88df-f895e74a23fe", APIVersion:"v1", ResourceVersion:"1260", FieldPath:""}): type: 'Warning' reason: 'ResolutionFailed' a unique replacement chain within a channel is required to determine the relative order between channel entries, but 2 replacement chains were found in channel "v1.6" of package "datapower-operator": datapower-operator.v1.6.2...datapower-operator.v1.6.0, datapower-operator.v1.6.1...datapower-operator.v1.6.0
E0909 13:43:52.095288       1 queueinformer_operator.go:290] sync "openshift-operators" failed: a unique replacement chain within a channel is required to determine the relative order between channel entries, but 2 replacement chains were found in channel "v1.6" of package "datapower-operator": datapower-operator.v1.6.2...datapower-operator.v1.6.0, datapower-operator.v1.6.1...datapower-operator.v1.6.0

Expected results:

if this upgrade strategy which has worked fine before is still okay, this error should not be there. As per affected catalog maintainer, this seems to affecting reconciling of other resources.

https://github.com/openshift/operator-framework-olm/pull/385

Bug OCPBUGS-1239: [4.10] Disconnected IPI OCP cluster install on baremetal fails when hostname of master nodes does not include the text "master

View the Description View the linked PRs

Description of problem:

Disconnected IPI OCP 4.10.22 cluster install on baremetal fails when hostname of master nodes does not include "master"

Version-Release number of selected component (if applicable): 4.10.22

How reproducible: Perform disconnected IPI install of OCP 4.10.22 on bare metal with master nodes that do not contain the text "master"

Steps to Reproduce:

Perform disconnected IPI install of OCP 4.10.22 on bare metal with master nodes that do not contain the text "master"

Actual results: master nodes do come up.

Expected results: master nodes should come up despite that the text "master" is not in their hostname.

Additional info:

Disconnected IPI OCP 4.10.22 cluster install on baremetal fails when hostname of master nodes does not include "master"

The code for the cluster-baremetal-operator at the following link:

https://github.com/openshift/cluster-baremetal-operator/blob/49d7b249c5dcef8228f206eff4530a25f03b201f/controllers/provisioning_controller.go#L441

The following condition is concerning:

if strings.Contains(bmh.Name, "master") && len(bmh.Spec.BootMACAddress) > 0

The packages reveal that bmh.Name references the name inside the metadata of the BMH object.

Should a customer have masters with names that do not include the text "master", the above condition can never become true, and so, the following slice is never created :

macs = append(macs, bmh.Spec.BootMACAddress)

https://github.com/openshift/cluster-baremetal-operator/pull/287

Bug OCPBUGS-2781: Import: Advanced option sentence is splited into two parts and headlines has no padding

View the Description View the linked PRs

This is a manual clone of https://bugzilla.redhat.com/show_bug.cgi?id=2093597 to backport this to 4.10.

Description of problem:
When importing a component from git or from a container image, and open one or more advanced options, the sentence "Click on the names to access advanced options for ..." is splited into two parts. And the headlines have no padding and everything looks squashed.

Version-Release number of selected component (if applicable):
4.10+

How reproducible:
Always

Steps to Reproduce:
1. Switch to dev perspective
2. Navigate to the add page > Import from container
3. Scroll down and open one or more of the advanced options

Actual results:
1. The sentence "Click on the names to access advanced options for ..." is shown before the opened option. The other available options are shown below the selected option.
2. The headline is displayed directly below "Click on the names to access advanced options for"
3. Another section is also shown directly under the first one.

Expected results:
1. The sentence "Click on the names to access advanced options for ..." and the options should be "one sentence" again.
2+3. Some padding for the header and/or between the sections, similar to 4.9. It must not look exactly as in 4.9, but there should be some padding between independent sections.

Additional info:
none

https://github.com/openshift/console/pull/12204

Bug OCPBUGS-6832: Include openshift_apps_deploymentconfigs_strategy_total to recent_metrics

View the Description View the linked PRs

Description of problem:

We need to include the `openshift_apps_deploymentconfigs_strategy_total` metrics to the IO archive file.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1. Create a cluster
2. Download the IO archive
3. Check the file `config/metrics`
4. You must find `openshift_apps_deploymentconfigs_strategy_total` insde of it

Actual results:

Expected results:

You should see the `openshift_apps_deploymentconfigs_strategy_total` at the `config/metrics` file.

Additional info:

https://github.com/openshift/insights-operator/pull/726

Bug OCPBUGS-1525: package-server-manager does not migrate packageserver CSV from v0.17.0 to v0.18.3 on OCP 4.8 -> 4.9 upgrade

View the Description View the linked PRs

This is a clone of issue OCPBUGS-1104. The following is the description of the original issue:
—
Description of problem:

In OCP 4.9, the package-server-manager was introduced to manage the packageserver CSV. However, when OCP 4.8 in upgraded to 4.9, the packageserver stays stuck in v0.17.0, which is the version in OCP 4.8, and v0.18.3 does not roll out, which is the version in OCP 4.9

Version-Release number of selected component (if applicable):

How reproducible:

Always

Steps to Reproduce:

1. Install OCP 4.8

2. Upgrade to OCP 4.9 

$ oc get clusterversion 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2022-08-31-160214   True        True          50m     Working towards 4.9.47: 619 of 738 done (83% complete)

$ oc get clusterversion 
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.47    True        False         4m26s   Cluster version is 4.9.47

Actual results:

Check packageserver CSV. It's in v0.17.0 

$ oc get csv  NAME            DISPLAY          VERSION   REPLACES   PHASE packageserver   Package Server   0.17.0               Succeeded

Expected results:

packageserver CSV is at 0.18.3

Additional info:

packageserver CSV version in 4.8: https://github.com/openshift/operator-framework-olm/blob/release-4.8/manifests/0000_50_olm_15-packageserver.clusterserviceversion.yaml#L12

packageserver CSV version in 4.9: https://github.com/openshift/operator-framework-olm/blob/release-4.9/pkg/manifests/csv.yaml#L8

https://github.com/openshift/operator-framework-olm/pull/387

Bug OCPBUGS-1936: [4.10] Bootimage bump tracker

View the Description View the linked PRs

Tracker bug for bootimage bump in 4.10. This bug should block bugs which need a bootimage bump to fix.

https://github.com/openshift/installer/pull/6454

Bug OCPBUGS-1951: noAllowedAddressPairs set to true causes all the ports to be added without the allowed address pairs

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-1890~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-1765~~. The following is the description of the original issue:
—
Description of problem:

If a customer creates a machine with a networks section like this

networks:
- filter: {}
noAllowedAddressPairs: false
subnets:
- filter: {}
uuid: primary-subnet-uuid
- filter: {}
noAllowedAddressPairs: true
subnets:
- filter: {}
uuid: other-subnet-uuid
primarySubnet: primary-subnet-uuid

Then all the ports are created without the allowed address pairs.

Doing some research in the source code, I have found that:
- For each entry on the networks: section, networks are filtered as per its filter: section[1]
- Then, if the subnets: section of the network entry is not empty, for each of the network IDs found above[2], 2 things are done that are relevant for this situatoin:
- The net ID is saved on a netsWithoutAllowedAddressPairs[3]. That map is later checked while creating any port[4].
- For each subnet entry that matches the network ID, a port is created[5].

So, the problematic behavior happens due to the following:

- Both entries in the networks array have empty filters. This means that both entries selected all the neutron networks.
- This configuration results in one port per subnet as expected because, in the later traversal of the subnets array of each entry[5], it is filtering by subnet and creating a single port as expected.
- However, the entry with "noAllowedAddressPairs: true" is selecting all the neutron networks, so it adds all of them to the netsWithoutAllowedAddressPairs map[3], regardless of the subnets filtering.
- As all the networks are in noAllowedAddressPairs: true array, all the ports created for the VM have their allowed address pairs removed[4].

Why do we consider this behavior undesired?

I understand that, if we create a port for a network that has no allowed pairs, we create all the other ports in the same networks without the pairs. However, it is surprising that a port in a network is removed the allowed address pairs due to a setting in an entry that yielded no port on that network. In other words, one would expect that the same subnet filtering that happens on each network entry in what regards yielding ports for the VM would also work for the noAllowedPairs parameter.

Version-Release number of selected component (if applicable):

4.10.30

How reproducible:

Always

Steps to Reproduce:

1. Create a machineset like in the description
2.
3.

Actual results:

All ports have no address pairs

Expected results:

Only the port on the secondary subnet has no address pairs.

Additional info:

A simple workaround would be to just fill the filter so that a single network is selected for each network entry.

References:
[1] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L576
[2] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L580
[3] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L581-L583
[4] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L658-L660
[5] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L610-L625

https://github.com/openshift/cluster-api-provider-openstack/pull/245

Bug OCPBUGS-8235: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug OCPBUGS-11240: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/jenkins/pull/1644

Bug OCPBUGS-2208: [4.10] Dual stack cluster fails on installation when multi-path routing entries exist

View the Description View the linked PRs

Description of problem:

We observed that a dual stack cluster deployed with AI gui only fails.
This cluster is dhcp for ipv4, RA/RS autoconfiguration for ipv6.

It fails with error in the onvkube container

```
I0906 07:45:43.044090   87450 gateway_init.go:261] Initializing Gateway Functionality
I0906 07:45:43.046398   87450 gateway_localnet.go:152] Node local addresses initialized to: map[10.131.31.214:{10.131.31.208 fffffff0} 10.255.0.2:{10.255.0.0 fffffe00} 127.0.0.1:{127.0.0.0 ff000000} 2001:1b74:480:613a:f6e9:d4ff:fef1:6f26:{2001:1b74:480:613a:: ffffffffffffffff0000000000000000} ::1:{::1 ffffffffffffffffffffffffffffffff} fd01:0:0:1::2:{fd01:0:0:1:: ffffffffffffffff0000000000000000} fe80::8ce9:b4ff:fe1a:1208:{fe80:: ffffffffffffffff0000000000000000} fe80::c8ef:ecff:fee3:64c7:{fe80:: ffffffffffffffff0000000000000000} fe80::f6e9:d4ff:fef1:6f26:{fe80:: ffffffffffffffff0000000000000000}]
I0906 07:45:43.047759   87450 helper_linux.go:71] Provided gateway interface "br-ex", found as index: 7
I0906 07:45:43.048045   87450 helper_linux.go:97] Found default gateway interface br-ex 10.131.31.209
I0906 07:45:43.048152   87450 helper_linux.go:71] Provided gateway interface "br-ex", found as index: 7
F0906 07:45:43.048318   87450 ovnkube.go:133] failed to get default gateway interface
```

on the node we observed that there is multi-path entry during

```
default proto ra metric 48 pref medium
        nexthop via fe80::e2f6:2d01:ab14:ec71 dev br-ex weight 1
        nexthop via fe80::e2f6:2d01:ab11:c271 dev br-ex weight 1
```

I manually remove one of the entries (`ip route delete`) and then delete the ovnkube-node pod. Then the installation continues, container works.

Every time there is multiple entry, if the onvkube-node starts, it fails.

Version-Release number of selected component (if applicable):

4.10.30

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

There might a side issue: the interface of the node upon boot takes time to get the ipv6 autoconfiguration, no RS packets seemed to be sent out (observed zero on all routers).

https://github.com/openshift/ovn-kubernetes/pull/1303

Bug OCPBUGS-6596: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-api-provider-openstack/pull/251

Bug OCPBUGS-761: [OCPonRHV] CSI provisioned disks are effectively preallocated due to go-ovirt-client setting Provisioned and Initial size of the disk to the same value

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-515~~. The following is the description of the original issue:
—
When a thin provisioned COW format disk is created on OCP on RHV via CSI driver (a PVC -
https://github.com/openshift/ovirt-csi-driver/blob/master/deploy/example/storage-claim.yaml

with for example requested storage 100 GB), the go-ovirt-client behaviour makes it so that the created disk has virtual size 100 GB and it's actual size is 110 GB.

But this is thin provisioned disk, so the initial size of the disk should be default of the engine and then grow as needed, it shouldn't be this big.

This causes all the disks created this way to be functionally preallocated (since it eats all that space), which is a real waste of space.

How reproducible: 100%

Steps to Reproduce:
1. Create a storage claim (PVC) in Openshift (
https://github.com/openshift/ovirt-csi-driver/blob/master/deploy/example/storage-claim.yaml
) using the default storage class (or any other storage class with thinProvisioning: "true") and with requested storage i.e. 100Gi

$ oc create -f storage-claim.yaml

2. In the RHV web console navigate to Storage -> Disks and check Virtual size and Actual size of the created disk (PVC)

Actual results:
Disk from our example with requested storage 100GB reports virtual size 100GB and actual size 110 GB.

Expected results:
Thin provisioned disks should start with small initial size and then grow as needed, so its actual size should be considerably smaller (the default initial size set by the engine should be 2.5 GB if I'm not mistaken).

Note: The extra 10GB in the actual size are caused by overhead for the qcow2 disk format, which is 10%, and this was tracked here as a separate issue:
https://bugzilla.redhat.com/show_bug.cgi?id=2097139

https://github.com/openshift/ovirt-csi-driver/pull/117

Story MON-1949: Improve prometheus-adapter consistency

View the Description View the linked PRs

The current integration of prometheus-adapter in OpenShift uses the platform Prometheus as a backend to get metrics. The problem with this design is that we are getting metrics from 2 different Prometheus instances which don't have replicated data, so two queries sent at the same time to prometheus-adapter might yield different results since the underlying promQL queries executed by prometheus-adapter might be on different Prometheus servers. The consequence is that we end up having inconsistent data across multiple autoscaling requests.

This can be easily tested by running:

$ while true ; do date; oc adm top pod -n openshift-monitoring  prometheus-k8s-0 ; echo; sleep 1 ;done 

Mon Jul 26 03:55:07 EDT 2021
NAME               CPU(cores)   MEMORY(bytes)   
prometheus-k8s-0   208m         4879Mi          

Mon Jul 26 03:55:08 EDT 2021                               
NAME               CPU(cores)   MEMORY(bytes)   
prometheus-k8s-0   246m         4877Mi          

Mon Jul 26 03:55:09 EDT 2021                               
NAME               CPU(cores)   MEMORY(bytes)   
prometheus-k8s-0   208m         4879Mi          

Mon Jul 26 03:55:10 EDT 2021
NAME               CPU(cores)   MEMORY(bytes)   
prometheus-k8s-0   246m         4877Mi

This isn't a bug in itself since it was designed that way, but we could do better by using thanos-querier as a backend instead of the platform Prometheus because it will duplicate the metrics from both instances and serve one consistent result based on the data that it will get from the Prometheuses.

DoD:

Use thanos-querier as a backend for prometheus-adapter

https://github.com/openshift/cluster-monitoring-operator/pull/1417

Bug OCPBUGS-10020: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/jenkins/pull/1639

Bug OCPBUGS-4134: configure-ovs: delays boot waiting for patch port to be unmanaged

View the Description View the linked PRs

This bug is a backport clone of [Bugzilla Bug 2072040](https://bugzilla.redhat.com/show_bug.cgi?id=2072040). The following is the description of the original bug:
—
Description of problem:

configure-ovs resets network configuration on boot, and while doing so waits for all devices to become unmanaged in NetworkManager. Currently the patch port between br-int and br-ex created by ovn-kubernetes is managed by NetworkManager, never becomes unmanaged and causes an accumulated delay of 2 minutes on boot.

This patct port should never be managed by NetworkManager.

https://github.com/openshift/machine-config-operator/pull/3435

Bug OCPBUGS-6911: NMstate removes egressip in OpenShift cluster with SDN plugin

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-5926~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-4486~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-95~~. The following is the description of the original issue:
—
In an OpenShift cluster with OpenShiftSDN network plugin with egressIP and NMstate operator configured, there are some conditions when the egressIP is deconfigured from the network interface.

The bug is 100% reproducible.

Steps for reproducing the issue are:

1. Install a cluster with OpenShiftSDN network plugin.

2. Configure egressip for a project.

3. Install NMstate operator.

4. Create a NodeNetworkConfigurationPolicy.

5. Identify on which node the egressIP is present.

6. Restart the nmstate-handler pod running on the identified node.

7. Verify that the egressIP is no more present.

Restarting the sdn pod related to the identified node will reconfigure the egressIP in the node.

This issue has a high impact since any changes triggered for the NMstate operator will prevent application traffic. For example, in the customer environment, the issue is triggered any time a new node is added to the cluster.

The expectation is that NMstate operator should not interfere with SDN configuration.

https://github.com/openshift/sdn/pull/501

Bug OCPBUGS-3743: Installer fails to build on ppc64le with golang 1.17.12

View the Description View the linked PRs

Description of problem:

Building the installer image started to fail on ppc64le

Version-Release number of selected component (if applicable):

4.10.0

How reproducible:

Always on the brew builders

Steps to Reproduce:

1.
2.
3.

Actual results:

2022-11-14 11:55:00,819 - atomic_reactor.tasks.binary_container_build - INFO - + go build -mod=vendor -ldflags ' -X github.com/openshift/installer/pkg/version.Raw=v4.10.0 -X github.com/openshift/installer/pkg/version.Commit=4e8922c14379dce4845f362ac3a83ff80f1dc655 -X github.com/openshift/installer/pkg/version.defaultArch=ppc64le -s -w' -tags ' release' -o bin/openshift-install ./cmd/openshift-install
2022-11-14 11:58:01,556 - atomic_reactor.tasks.binary_container_build - INFO - # github.com/openshift/installer/cmd/openshift-install
2022-11-14 11:58:01,556 - atomic_reactor.tasks.binary_container_build - INFO - github.com/aliyun/alibaba-cloud-sdk-go/services/cms.(*Client).DescribeMonitoringAgentHostsWithChan.func1: direct call too far: runtime.duffzero+1f0-tramp0 -20000e8

Expected results:

Successful build

Additional info:

This could potentially be a golang issue. After some preliminary investigation, we tried setting `-linkmode=external` in the ldflags which seems to allow the build to finish.
So while we investigate the real cause of the issue, changing the link mode will serve as a workaround to unblock the build pipeline.

https://github.com/openshift/installer/pull/6597

Bug OCPBUGS-1634: Developer catalog fails to load

View the Description View the linked PRs

Description of problem:

Customer is facing issue similar to https://github.com/devfile/api/issues/897

Version-Release number of selected component (if applicable):

OCP 4.10.17

How reproducible:
N/A
Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Tried working around it with ALL_PROXY but it did not help. Note because the console operator reverts changes pretty quickly testing this was a bit of a PITA

https://github.com/openshift/console/pull/12040

Bug OCPBUGS-3103: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/etcd/pull/171

Task MON-1656: Add a Makefile rule in CMO for verifications and checks

View the Description View the linked PRs

Add a Makefile rule in CMO to execute all the different rule that are used for verification and validation. Currenctly, some of them might not be at the right place, for example `check-assets` which is part of `generate` despite not being responsible of any generation. https://github.com/openshift/cluster-monitoring-operator/pull/1151/files#r629371735

DoD:

Add a new rule in CMO to handle verification
Add a CI job for this rule

Task MON-975: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-monitoring-operator/pull/1338

Bug OCPBUGS-2196: Symptom Detection.Undiagnosed panic detected in pod

View the Description View the linked PRs

This bug is a backport clone of [Bugzilla Bug 2075091](https://bugzilla.redhat.com/show_bug.cgi?id=2075091). The following is the description of the original bug:
—
Symptom Detection.Undiagnosed panic detected in pod

is failing frequently in CI, see:
https://sippy.ci.openshift.org/sippy-ng/tests/4.11/analysis?test=Symptom%20Detection.Undiagnosed%20panic%20detected%20in%20pod

This problem seemed existing before. But number of cases surged and caused two nightly payloads to be rejected:

https://amd64.ocp.releases.ci.openshift.org/releasestream/4.11.0-0.nightly/release/4.11.0-0.nightly-2022-04-12-150057
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.11.0-0.nightly/release/4.11.0-0.nightly-2022-04-12-185124

After that, it mysteriously disappeared.

Here is a specific case:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1513895844351315968

Message from the test case:

{ pods/openshift-monitoring_kube-state-metrics-67c5b7c7c6-88vxn_kube-state-metrics_previous.log.gz:E0412 15:52:33.358619 1 runtime.go:78] Observed a panic: runtime.boundsError

{x:4, y:4, signed:true, code:0x0}

(runtime error: index out of range [4] with length 4)}

Panic trace from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1513895844351315968/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/pods/openshift-monitoring_kube-state-metrics-67c5b7c7c6-88vxn_kube-state-metrics_previous.log:

E0412 15:52:33.358619 1 runtime.go:78] Observed a panic: runtime.boundsError

{x:4, y:4, signed:true, code:0x0}

(runtime error: index out of range [4] with length 4)
goroutine 77 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(

{0x1741840, 0xc000b635f0})
/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x7d
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000ac9740})
/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x75
panic({0x1741840, 0xc000b635f0}

)
/usr/lib/golang/src/runtime/panic.go:1038 +0x215
k8s.io/kube-state-metrics/v2/internal/store.createPodContainerInfoFamilyGenerator.func1(0xc003422c00)
/go/src/k8s.io/kube-state-metrics/internal/store/pod.go:134 +0x375
k8s.io/kube-state-metrics/v2/internal/store.wrapPodFunc.func1(

{0x1804880, 0xc003422c00})
/go/src/k8s.io/kube-state-metrics/internal/store/pod.go:1386 +0x5a
k8s.io/kube-state-metrics/v2/pkg/metric_generator.(*FamilyGenerator).Generate(...)
/go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:67
k8s.io/kube-state-metrics/v2/pkg/metric_generator.ComposeMetricGenFuncs.func1({0x1804880, 0xc003422c00}

)
/go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:107 +0xd8
k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Add(0xc0000c13c0,

{0x1804880, 0xc003422c00})
/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_store.go:72 +0xd4
k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Update(0xc003422c00, {0x1804880, 0xc003422c00}

)
/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_store.go:87 +0x25
k8s.io/client-go/tools/cache.(*Reflector).watchHandler(0xc000192fc0,

{0x0, 0x0, 0x26cdee0}

,

{0x1a373f8, 0xc0011c24c0}

, 0xc000623d60, 0xc0005ff380, 0xc0002cc480)
/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:506 +0xa55
k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch(0xc000192fc0, 0xc0002cc480)
/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:429 +0x696
k8s.io/client-go/tools/cache.(*Reflector).Run.func1()
/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:221 +0x26
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f02ffada1d0)
/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00036a2c0,

{0x1a1daa0, 0xc000386e60}

, 0x1, 0xc0002cc480)
/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/client-go/tools/cache.(*Reflector).Run(0xc000192fc0, 0xc0002cc480)
/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:220 +0x1f8
created by k8s.io/kube-state-metrics/v2/internal/store.(*Builder).startReflector
/go/src/k8s.io/kube-state-metrics/internal/store/builder.go:508 +0x2c8
panic: runtime error: index out of range [4] with length 4 [recovered]
panic: runtime error: index out of range [4] with length 4

It points to https://github.com/openshift/kube-state-metrics/blob/6efa87f858ee53028fd2de40941b61c09e9ee049/internal/store/pod.go#L134 where the len of p.Status.ContainerStatuses and p.Spec.Containers seems to diverge.

Unfortunately the condition is ephemeral and the condition that caused the panic does not exist in the must-gather data.

The ask is to safe guard the code to avoid the panic and log useful debugging info to track down offenders.

https://github.com/openshift/kube-state-metrics/pull/79

Bug OCPBUGS-4128: catalog-operator fatal error: concurrent map writes

View the Description View the linked PRs

This bug is a backport clone of [Bugzilla Bug 2117324](https://bugzilla.redhat.com/show_bug.cgi?id=2117324). The following is the description of the original bug:
—
+++ This bug was initially created as a clone of Bug #2101357 +++

Description of problem:

message: "her.go:105 +0xe5\ncreated by k8s.io/apimachinery/pkg/watch.NewStreamWatcher\n\t/build/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:76
+0x130\n\ngoroutine 5545 [select, 7 minutes]:\ngolang.org/x/net/http2.(*clientStream).writeRequest(0xc00240a780,
0xc003321a00)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1345
+0x9c9\ngolang.org/x/net/http2.(*clientStream).doRequest(0xc002efea80?,
0xc0009cc7a0?)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1207
+0x1e\ncreated by golang.org/x/net/http2.(*ClientConn).RoundTrip\n\t/build/vendor/golang.org/x/net/http2/transport.go:1136
+0x30a\n\ngoroutine 5678 [select, 3 minutes]:\ngolang.org/x/net/http2.(*clientStream).writeRequest(0xc000b70480,
0xc0035d4500)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1345
+0x9c9\ngolang.org/x/net/http2.(*clientStream).doRequest(0x6e5326?, 0xc002999e90?)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1207
+0x1e\ncreated by golang.org/x/net/http2.(*ClientConn).RoundTrip\n\t/build/vendor/golang.org/x/net/http2/transport.go:1136
+0x30a\n\ngoroutine 5836 [select, 1 minutes]:\ngolang.org/x/net/http2.(*clientStream).writeRequest(0xc003b00180,
0xc003ff8a00)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1345
+0x9c9\ngolang.org/x/net/http2.(*clientStream).doRequest(0x6e5326?, 0xc003a1c8d0?)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1207
+0x1e\ncreated by golang.org/x/net/http2.(*ClientConn).RoundTrip\n\t/build/vendor/golang.org/x/net/http2/transport.go:1136
+0x30a\n\ngoroutine 5905 [chan receive, 1 minutes]:\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver.(*sourceInvalidator).GetValidChannel.func1()\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver/source_registry.go:51
+0x85\ncreated by github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver.(*sourceInvalidator).GetValidChannel\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver/source_registry.go:50
+0x231\n"
reason: Error
startedAt: "2022-06-27T00:00:59Z"

Version-Release number of selected component (if applicable):
mac:~ jianzhang$ oc exec catalog-operator-66cb8fd8c5-j7vkx – olm --version
OLM version: 0.19.0
git commit: 8c2bd46147a90d58e98de73d34fd79477769f11f
mac:~ jianzhang$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-06-25-081133 True False 10h Cluster version is 4.11.0-0.nightly-2022-06-25-081133

How reproducible:
always

Steps to Reproduce:
1. Install OCP 4.11
2. Check OLM pods

Actual results:

mac:~ jianzhang$ oc get pods
NAME READY STATUS RESTARTS AGE
catalog-operator-66cb8fd8c5-j7vkx 1/1 Running 2 (8h ago) 10h
collect-profiles-27605340-wgsvf 0/1 Completed 0 42m
collect-profiles-27605355-ffgxd 0/1 Completed 0 27m
collect-profiles-27605370-w7ds7 0/1 Completed 0 12m
olm-operator-6cfd444b8f-r5q4t 1/1 Running 0 10h
package-server-manager-66589d4bf8-csr7j 1/1 Running 0 10h
packageserver-59977db6cf-nkn5w 1/1 Running 0 10h
packageserver-59977db6cf-nxbnx 1/1 Running 0 10h

mac:~ jianzhang$ oc get pods catalog-operator-66cb8fd8c5-j7vkx -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
k8s.v1.cni.cncf.io/network-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.130.0.26"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.130.0.26"
],
"default": true,
"dns": {}
}]
openshift.io/scc: nonroot-v2
seccomp.security.alpha.kubernetes.io/pod: runtime/default
creationTimestamp: "2022-06-26T23:12:45Z"
generateName: catalog-operator-66cb8fd8c5-
labels:
app: catalog-operator
pod-template-hash: 66cb8fd8c5
name: catalog-operator-66cb8fd8c5-j7vkx
namespace: openshift-operator-lifecycle-manager
ownerReferences:

apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: catalog-operator-66cb8fd8c5
uid: bcf173be-97bc-4152-8cec-d45f820e167c
resourceVersion: "67395"
uid: bb34c37b-b22d-4412-bbc0-1e57a1b2bd3a
spec:
containers:
args:
--namespace
openshift-marketplace
--configmapServerImage=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:033bd57b54bb2e827b0ae95a03cb6f7ceeb65422e85de635185cfed88c17b2b4
--opmImage=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:033bd57b54bb2e827b0ae95a03cb6f7ceeb65422e85de635185cfed88c17b2b4
--util-image
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8d42dfe153a5bfa3a66e5e9d4664b4939aed376b4afbd9e098291c259daca2f8
--writeStatusName
operator-lifecycle-manager-catalog
--tls-cert
/srv-cert/tls.crt
--tls-key
/srv-cert/tls.key
--client-ca
/profile-collector-cert/tls.crt
command:
/bin/catalog
env:
name: RELEASE_VERSION
value: 4.11.0-0.nightly-2022-06-25-081133
image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8d42dfe153a5bfa3a66e5e9d4664b4939aed376b4afbd9e098291c259daca2f8
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 8443
scheme: HTTPS
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
name: catalog-operator
ports:
containerPort: 8443
name: metrics
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 8443
scheme: HTTPS
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources:
requests:
cpu: 10m
memory: 80Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
ALL
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
mountPath: /srv-cert
name: srv-cert
readOnly: true
mountPath: /profile-collector-cert
name: profile-collector-cert
readOnly: true
mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-7txv8
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: ip-10-0-173-128.ap-southeast-1.compute.internal
nodeSelector:
kubernetes.io/os: linux
node-role.kubernetes.io/master: ""
preemptionPolicy: PreemptLowerPriority
priority: 2000000000
priorityClassName: system-cluster-critical
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
runAsNonRoot: true
runAsUser: 65534
seLinuxOptions:
level: s0:c20,c0
seccompProfile:
type: RuntimeDefault
serviceAccount: olm-operator-serviceaccount
serviceAccountName: olm-operator-serviceaccount
terminationGracePeriodSeconds: 30
tolerations:
effect: NoSchedule
key: node-role.kubernetes.io/master
operator: Exists
effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 120
effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 120
effect: NoSchedule
key: node.kubernetes.io/memory-pressure
operator: Exists
volumes:
name: srv-cert
secret:
defaultMode: 420
secretName: catalog-operator-serving-cert
name: profile-collector-cert
secret:
defaultMode: 420
secretName: pprof-cert
name: kube-api-access-7txv8
projected:
defaultMode: 420
sources:
serviceAccountToken:
expirationSeconds: 3607
path: token
configMap:
items:
key: ca.crt
path: ca.crt
name: kube-root-ca.crt
downwardAPI:
items:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
configMap:
items:
key: service-ca.crt
path: service-ca.crt
name: openshift-service-ca.crt
status:
conditions:
lastProbeTime: null
lastTransitionTime: "2022-06-26T23:16:59Z"
status: "True"
type: Initialized
lastProbeTime: null
lastTransitionTime: "2022-06-27T01:19:09Z"
status: "True"
type: Ready
lastProbeTime: null
lastTransitionTime: "2022-06-27T01:19:09Z"
status: "True"
type: ContainersReady
lastProbeTime: null
lastTransitionTime: "2022-06-26T23:16:59Z"
status: "True"
type: PodScheduled
containerStatuses:
containerID: cri-o://f3fce12480556b5ab279236c290acdf15e1bc850426078730dcfd333ecda6795
image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8d42dfe153a5bfa3a66e5e9d4664b4939aed376b4afbd9e098291c259daca2f8
imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8d42dfe153a5bfa3a66e5e9d4664b4939aed376b4afbd9e098291c259daca2f8
lastState:
terminated:
containerID: cri-o://5e3ce04f1cf538f679af008dcb62f9477131aae1772f527d6207fdc7bb247a55
exitCode: 2
finishedAt: "2022-06-27T01:19:08Z"
message: "her.go:105 +0xe5\ncreated by k8s.io/apimachinery/pkg/watch.NewStreamWatcher\n\t/build/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:76
+0x130\n\ngoroutine 5545 [select, 7 minutes]:\ngolang.org/x/net/http2.(*clientStream).writeRequest(0xc00240a780,
0xc003321a00)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1345
+0x9c9\ngolang.org/x/net/http2.(*clientStream).doRequest(0xc002efea80?,
0xc0009cc7a0?)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1207
+0x1e\ncreated by golang.org/x/net/http2.(*ClientConn).RoundTrip\n\t/build/vendor/golang.org/x/net/http2/transport.go:1136
+0x30a\n\ngoroutine 5678 [select, 3 minutes]:\ngolang.org/x/net/http2.(*clientStream).writeRequest(0xc000b70480,
0xc0035d4500)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1345
+0x9c9\ngolang.org/x/net/http2.(*clientStream).doRequest(0x6e5326?, 0xc002999e90?)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1207
+0x1e\ncreated by golang.org/x/net/http2.(*ClientConn).RoundTrip\n\t/build/vendor/golang.org/x/net/http2/transport.go:1136
+0x30a\n\ngoroutine 5836 [select, 1 minutes]:\ngolang.org/x/net/http2.(*clientStream).writeRequest(0xc003b00180,
0xc003ff8a00)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1345
+0x9c9\ngolang.org/x/net/http2.(*clientStream).doRequest(0x6e5326?, 0xc003a1c8d0?)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1207
+0x1e\ncreated by golang.org/x/net/http2.(*ClientConn).RoundTrip\n\t/build/vendor/golang.org/x/net/http2/transport.go:1136
+0x30a\n\ngoroutine 5905 [chan receive, 1 minutes]:\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver.(*sourceInvalidator).GetValidChannel.func1()\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver/source_registry.go:51
+0x85\ncreated by github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver.(*sourceInvalidator).GetValidChannel\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver/source_registry.go:50
+0x231\n"
reason: Error
startedAt: "2022-06-27T00:00:59Z"
name: catalog-operator
ready: true
restartCount: 2
started: true
state:
running:
startedAt: "2022-06-27T01:19:08Z"
hostIP: 10.0.173.128
phase: Running
podIP: 10.130.0.26
podIPs:
ip: 10.130.0.26
qosClass: Burstable
startTime: "2022-06-26T23:16:59Z"

Expected results:
catalog-operator works well.

Additional info:
Operators can be subscribed successfully.
mac:~ jianzhang$ oc get sub -A
NAMESPACE NAME PACKAGE SOURCE CHANNEL
jian learn learn qe-app-registry beta
openshift-logging cluster-logging cluster-logging qe-app-registry stable
openshift-operators-redhat elasticsearch-operator elasticsearch-operator qe-app-registry stable
mac:~ jianzhang$
mac:~ jianzhang$ oc get pods -n jian
NAME READY STATUS RESTARTS AGE
552b4660850a7fe1e1f142091eb5e4305f18af151727c56f70aa5dffc1dg8cg 0/1 Completed 0 54m
learn-operator-666b687bfb-7qppm 1/1 Running 0 54m
qe-app-registry-hbzxg 1/1 Running 0 58m
mac:~ jianzhang$ oc get csv -n jian
NAME DISPLAY VERSION REPLACES PHASE
elasticsearch-operator.v5.5.0 OpenShift Elasticsearch Operator 5.5.0 Succeeded
learn-operator.v0.0.3 Learn Operator 0.0.3 learn-operator.v0.0.2 Succeeded

— Additional comment from jiazha@redhat.com on 2022-06-27 09:58:18 UTC —

Created attachment 1892927
olm must-gather

— Additional comment from jiazha@redhat.com on 2022-06-27 09:59:01 UTC —

Created attachment 1892928
marketplace project must-gather

— Additional comment from jiazha@redhat.com on 2022-06-28 02:05:39 UTC —

mac:~ jianzhang$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-06-25-132614 True False 145m Cluster version is 4.11.0-0.nightly-2022-06-25-132614

mac:~ jianzhang$ oc get pods
NAME READY STATUS RESTARTS AGE
catalog-operator-869fb4bd4d-lbhgj 1/1 Running 3 (9m25s ago) 170m
collect-profiles-27606330-4wg5r 0/1 Completed 0 33m
collect-profiles-27606345-lmk4q 0/1 Completed 0 18m
collect-profiles-27606360-mksv6 0/1 Completed 0 3m17s
olm-operator-5f485d9d5f-wczjc 1/1 Running 0 170m
package-server-manager-6cf996b4cc-79lrw 1/1 Running 2 (156m ago) 170m
packageserver-5f668f98d7-2vjdn 1/1 Running 0 165m
packageserver-5f668f98d7-mb2wc 1/1 Running 0 165m
mac:~ jianzhang$ oc get pods catalog-operator-869fb4bd4d-lbhgj -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
k8s.v1.cni.cncf.io/network-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.130.0.34"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.130.0.34"
],
"default": true,
"dns": {}
}]
openshift.io/scc: nonroot-v2
seccomp.security.alpha.kubernetes.io/pod: runtime/default
creationTimestamp: "2022-06-27T23:13:12Z"
generateName: catalog-operator-869fb4bd4d-
labels:
app: catalog-operator
pod-template-hash: 869fb4bd4d
name: catalog-operator-869fb4bd4d-lbhgj
namespace: openshift-operator-lifecycle-manager
ownerReferences:

apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: catalog-operator-869fb4bd4d
uid: 3a1a8cd3-2151-4650-a96b-c1951d461c67
resourceVersion: "75671"
uid: 2f06d663-0697-4e88-9e9a-f802c6641efe
spec:
containers:
args:
--namespace
openshift-marketplace
--configmapServerImage=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:40552deca32531d2fe754f703613f24b514e27ffaa660b57d760f9cd984d9000
--opmImage=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:40552deca32531d2fe754f703613f24b514e27ffaa660b57d760f9cd984d9000
--util-image
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
--writeStatusName
operator-lifecycle-manager-catalog
--tls-cert
/srv-cert/tls.crt
--tls-key
/srv-cert/tls.key
--client-ca
/profile-collector-cert/tls.crt
command:
/bin/catalog
env:
name: RELEASE_VERSION
value: 4.11.0-0.nightly-2022-06-25-132614
image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 8443
scheme: HTTPS
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
name: catalog-operator
ports:
containerPort: 8443
name: metrics
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 8443
scheme: HTTPS
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources:
requests:
cpu: 10m
memory: 80Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
ALL
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
mountPath: /srv-cert
name: srv-cert
readOnly: true
mountPath: /profile-collector-cert
name: profile-collector-cert
readOnly: true
mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-sdjns
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: ip-10-0-190-130.ap-south-1.compute.internal
nodeSelector:
kubernetes.io/os: linux
node-role.kubernetes.io/master: ""
preemptionPolicy: PreemptLowerPriority
priority: 2000000000
priorityClassName: system-cluster-critical
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
runAsNonRoot: true
runAsUser: 65534
seLinuxOptions:
level: s0:c20,c0
seccompProfile:
type: RuntimeDefault
serviceAccount: olm-operator-serviceaccount
serviceAccountName: olm-operator-serviceaccount
terminationGracePeriodSeconds: 30
tolerations:
effect: NoSchedule
key: node-role.kubernetes.io/master
operator: Exists
effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 120
effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 120
effect: NoSchedule
key: node.kubernetes.io/memory-pressure
operator: Exists
volumes:
name: srv-cert
secret:
defaultMode: 420
secretName: catalog-operator-serving-cert
name: profile-collector-cert
secret:
defaultMode: 420
secretName: pprof-cert
name: kube-api-access-sdjns
projected:
defaultMode: 420
sources:
serviceAccountToken:
expirationSeconds: 3607
path: token
configMap:
items:
key: ca.crt
path: ca.crt
name: kube-root-ca.crt
downwardAPI:
items:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
configMap:
items:
key: service-ca.crt
path: service-ca.crt
name: openshift-service-ca.crt
status:
conditions:
lastProbeTime: null
lastTransitionTime: "2022-06-27T23:16:47Z"
status: "True"
type: Initialized
lastProbeTime: null
lastTransitionTime: "2022-06-28T01:53:53Z"
status: "True"
type: Ready
lastProbeTime: null
lastTransitionTime: "2022-06-28T01:53:53Z"
status: "True"
type: ContainersReady
lastProbeTime: null
lastTransitionTime: "2022-06-27T23:16:47Z"
status: "True"
type: PodScheduled
containerStatuses:
containerID: cri-o://2a83d3ba503aa760d27aeef28bf03af84b9ff134ea76c6545251f3253507c22b
image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
lastState:
terminated:
containerID: cri-o://2766bb3bad38c02f531915aac0321ffb18c7a993a7d84b27e8c7199935f7c858
exitCode: 2
finishedAt: "2022-06-28T01:53:52Z"
message: "izer/streaming/streaming.go:77 +0xa7\nk8s.io/client-go/rest/watch.(*Decoder).Decode(0xc002b12500)\n\t/build/vendor/k8s.io/client-go/rest/watch/decoder.go:49
+0x4f\nk8s.io/apimachinery/pkg/watch.(*StreamWatcher).receive(0xc0036178c0)\n\t/build/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:105
+0xe5\ncreated by k8s.io/apimachinery/pkg/watch.NewStreamWatcher\n\t/build/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:76
+0x130\n\ngoroutine 3479 [sync.Cond.Wait]:\nsync.runtime_notifyListWait(0xc003a987c8,
0x8)\n\t/usr/lib/golang/src/runtime/sema.go:513 +0x13d\nsync.(*Cond).Wait(0xc001f9bf10?)\n\t/usr/lib/golang/src/sync/cond.go:56
+0x8c\ngolang.org/x/net/http2.(*pipe).Read(0xc003a987b0, {0xc0036e4001, 0xdff, 0xdff}
)\n\t/build/vendor/golang.org/x/net/http2/pipe.go:76 +0xeb\ngolang.org/x/net/http2.transportResponseBody.Read(
{0x0?}
,

{0xc0036e4001?, 0x2?, 0x203511d?}
)\n\t/build/vendor/golang.org/x/net/http2/transport.go:2407
+0x85\nencoding/json.(*Decoder).refill(0xc002fc0640)\n\t/usr/lib/golang/src/encoding/json/stream.go:165
+0x17f\nencoding/json.(*Decoder).readValue(0xc002fc0640)\n\t/usr/lib/golang/src/encoding/json/stream.go:140
+0xbb\nencoding/json.(*Decoder).Decode(0xc002fc0640,
{0x1d377c0, 0xc003523098}
)\n\t/usr/lib/golang/src/encoding/json/stream.go:63
+0x78\nk8s.io/apimachinery/pkg/util/framer.(*jsonFrameReader).Read(0xc003127770,

{0xc0036dd500, 0x1000, 0x1500}
)\n\t/build/vendor/k8s.io/apimachinery/pkg/util/framer/framer.go:152
+0x19c\nk8s.io/apimachinery/pkg/runtime/serializer/streaming.(*decoder).Decode(0xc003502aa0,
0xc001f9bf10?,
{0x2366870, 0xc0044dca80}
)\n\t/build/vendor/k8s.io/apimachinery/pkg/runtime/serializer/streaming/streaming.go:77
+0xa7\nk8s.io/client-go/rest/watch.(*Decoder).Decode(0xc00059f700)\n\t/build/vendor/k8s.io/client-go/rest/watch/decoder.go:49
+0x4f\nk8s.io/apimachinery/pkg/watch.(*StreamWatcher).receive(0xc0044dcd40)\n\t/build/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:105
+0xe5\ncreated by k8s.io/apimachinery/pkg/watch.NewStreamWatcher\n\t/build/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:76
+0x130\n"
reason: Error
startedAt: "2022-06-28T01:06:59Z"
name: catalog-operator
ready: true
restartCount: 3
started: true
state:
running:
startedAt: "2022-06-28T01:53:53Z"
hostIP: 10.0.190.130
phase: Running
podIP: 10.130.0.34
podIPs:
ip: 10.130.0.34
qosClass: Burstable
startTime: "2022-06-27T23:16:47Z"

— Additional comment from jiazha@redhat.com on 2022-06-28 02:09:23 UTC —

mac:~ jianzhang$ oc get pods package-server-manager-6cf996b4cc-79lrw -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
k8s.v1.cni.cncf.io/network-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.130.0.13"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.130.0.13"
],
"default": true,
"dns": {}
}]
openshift.io/scc: nonroot-v2
seccomp.security.alpha.kubernetes.io/pod: runtime/default
creationTimestamp: "2022-06-27T23:13:10Z"
generateName: package-server-manager-6cf996b4cc-
labels:
app: package-server-manager
pod-template-hash: 6cf996b4cc
name: package-server-manager-6cf996b4cc-79lrw
namespace: openshift-operator-lifecycle-manager
ownerReferences:

apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: package-server-manager-6cf996b4cc
uid: 191cb627-a1f3-452b-aed2-336de7871004
resourceVersion: "19056"
uid: 63cb7200-32d6-4b0d-bfb2-ba725cda7fbd
spec:
containers:
args:
--name
$(PACKAGESERVER_NAME)
--namespace
$(PACKAGESERVER_NAMESPACE)
command:
/bin/psm
start
env:
name: PACKAGESERVER_NAME
value: packageserver
name: PACKAGESERVER_IMAGE
value: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
name: PACKAGESERVER_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
name: RELEASE_VERSION
value: 4.11.0-0.nightly-2022-06-25-132614
image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 8080
scheme: HTTP
initialDelaySeconds: 30
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
name: package-server-manager
readinessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 8080
scheme: HTTP
initialDelaySeconds: 30
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources:
requests:
cpu: 10m
memory: 50Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
ALL
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-k9hh2
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: ip-10-0-190-130.ap-south-1.compute.internal
nodeSelector:
kubernetes.io/os: linux
node-role.kubern

— Additional comment from jiazha@redhat.com on 2022-06-28 02:10:02 UTC —

preemptionPolicy: PreemptLowerPriority
priority: 2000000000
priorityClassName: system-cluster-critical
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
runAsNonRoot: true
runAsUser: 65534
seLinuxOptions:
level: s0:c20,c0
seccompProfile:
type: RuntimeDefault
serviceAccount: olm-operator-serviceaccount
serviceAccountName: olm-operator-serviceaccount
terminationGracePeriodSeconds: 30
tolerations:

effect: NoSchedule
key: node-role.kubernetes.io/master
operator: Exists
effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 120
effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 120
effect: NoSchedule
key: node.kubernetes.io/memory-pressure
operator: Exists
volumes:
name: kube-api-access-k9hh2
projected:
defaultMode: 420
sources:
serviceAccountToken:
expirationSeconds: 3607
path: token
configMap:
items:
key: ca.crt
path: ca.crt
name: kube-root-ca.crt
downwardAPI:
items:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
configMap:
items:
key: service-ca.crt
path: service-ca.crt
name: openshift-service-ca.crt
status:
conditions:
lastProbeTime: null
lastTransitionTime: "2022-06-27T23:16:47Z"
status: "True"
type: Initialized
lastProbeTime: null
lastTransitionTime: "2022-06-27T23:27:28Z"
status: "True"
type: Ready
lastProbeTime: null
lastTransitionTime: "2022-06-27T23:27:28Z"
status: "True"
type: ContainersReady
lastProbeTime: null
lastTransitionTime: "2022-06-27T23:16:47Z"
status: "True"
type: PodScheduled
containerStatuses:
containerID: cri-o://2c56089fabbaa6d7067feb231750ad20422e299dc33dabc0cd19ceca5951ed3a
image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
lastState:
terminated:
containerID: cri-o://76cc5448de346ec636070a71868432c8f3aa213e16fe211c8fe18a3fee112d23
exitCode: 1
finishedAt: "2022-06-27T23:26:36Z"
message: "d/vendor/github.com/spf13/cobra/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/build/vendor/github.com/spf13/cobra/command.go:974\ngithub.com/spf13/cobra.(*Command).Execute\n\t/build/vendor/github.com/spf13/cobra/command.go:902\nmain.main\n\t/build/cmd/package-server-manager/main.go:36\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:250\n1.6563723963629913e+09\tERROR\tFailed
to get API Group-Resources\t {\"error\": \"Get \\\"https://172.30.0.1:443/api?timeout=32s\\\": dial tcp 172.30.0.1:443: connect: connection refused\"}
\nsigs.k8s.io/controller-runtime/pkg/cluster.New\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/cluster/cluster.go:160\nsigs.k8s.io/controller-runtime/pkg/manager.New\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/manager/manager.go:322\nmain.run\n\t/build/cmd/package-server-manager/main.go:67\ngithub.com/spf13/cobra.(*Command).execute\n\t/build/vendor/github.com/spf13/cobra/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/build/vendor/github.com/spf13/cobra/command.go:974\ngithub.com/spf13/cobra.(*Command).Execute\n\t/build/vendor/github.com/spf13/cobra/command.go:902\nmain.main\n\t/build/cmd/package-server-manager/main.go:36\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:250\n1.6563723963631017e+09\tERROR\tsetup\tfailed
to setup manager instance\t
{\"error\": \"Get \\\"https://172.30.0.1:443/api?timeout=32s\\\": dial tcp 172.30.0.1:443: connect: connection refused\"}
\ngithub.com/spf13/cobra.(*Command).execute\n\t/build/vendor/github.com/spf13/cobra/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/build/vendor/github.com/spf13/cobra/command.go:974\ngithub.com/spf13/cobra.(*Command).Execute\n\t/build/vendor/github.com/spf13/cobra/command.go:902\nmain.main\n\t/build/cmd/package-server-manager/main.go:36\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:250\nError:
Get \"https://172.30.0.1:443/api?timeout=32s\": dial tcp 172.30.0.1:443:
connect: connection refused\nencountered an error while executing the binary:
Get \"https://172.30.0.1:443/api?timeout=32s\": dial tcp 172.30.0.1:443:
connect: connection refused\n"
reason: Error
startedAt: "2022-06-27T23:26:11Z"
name: package-server-manager
ready: true
restartCount: 2
started: true
state:
running:
startedAt: "2022-06-27T23:26:54Z"
hostIP: 10.0.190.130
phase: Running
podIP: 10.130.0.13
podIPs:
ip: 10.130.0.13
qosClass: Burstable
startTime: "2022-06-27T23:16:47Z"

— Additional comment from jiazha@redhat.com on 2022-06-29 08:43:51 UTC —

Observed the error restarts:

mac:~ jianzhang$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-06-28-160049 True False 5h57m Cluster version is 4.11.0-0.nightly-2022-06-28-160049

mac:~ jianzhang$ oc get pods
NAME READY STATUS RESTARTS AGE
catalog-operator-7b88dddfbc-rsfhz 1/1 Running 6 (26m ago) 5h51m
collect-profiles-27608160-6m7r6 0/1 Completed 0 37m
collect-profiles-27608175-94n56 0/1 Completed 0 22m
collect-profiles-27608190-nbzcf 0/1 Completed 0 7m55s
olm-operator-5977ffb855-lgfn8 1/1 Running 0 9h
package-server-manager-75db6dcfc-hql4v 1/1 Running 0 9h
packageserver-5955fb79cd-9n56n 1/1 Running 0 9h
packageserver-5955fb79cd-xf6f6 1/1 Running 0 9h

mac:~ jianzhang$ oc get pods catalog-operator-7b88dddfbc-rsfhz -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
k8s.v1.cni.cncf.io/network-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.130.0.121"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.130.0.121"
],
"default": true,
"dns": {}
}]
openshift.io/scc: nonroot-v2
seccomp.security.alpha.kubernetes.io/pod: runtime/default
creationTimestamp: "2022-06-29T02:46:23Z"
generateName: catalog-operator-7b88dddfbc-
labels:
app: catalog-operator
pod-template-hash: 7b88dddfbc
name: catalog-operator-7b88dddfbc-rsfhz
namespace: openshift-operator-lifecycle-manager
ownerReferences:

apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: catalog-operator-7b88dddfbc
uid: 4c796f4d-b2f2-4e0d-b2de-6e8ab5f45ce2
resourceVersion: "278732"
uid: 10e8c6e9-8f1c-4484-b934-c5208016c426
spec:
containers:
args:
--debug
--namespace
openshift-marketplace
--configmapServerImage=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:40552deca32531d2fe754f703613f24b514e27ffaa660b57d760f9cd984d9000
--opmImage=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:40552deca32531d2fe754f703613f24b514e27ffaa660b57d760f9cd984d9000
--util-image
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
--writeStatusName
operator-lifecycle-manager-catalog
--tls-cert
/srv-cert/tls.crt
--tls-key
/srv-cert/tls.key
--client-ca
/profile-collector-cert/tls.crt
command:
/bin/catalog
env:
name: RELEASE_VERSION
value: 4.11.0-0.nightly-2022-06-28-160049
image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 8443
scheme: HTTPS
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
name: catalog-operator
ports:
containerPort: 8443
name: metrics
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 8443
scheme: HTTPS
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources:
requests:
cpu: 10m
memory: 80Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
ALL
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
mountPath: /srv-cert
name: srv-cert
readOnly: true
mountPath: /profile-collector-cert
name: profile-collector-cert
readOnly: true
mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-cv642
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
imagePullSecrets:
name: olm-operator-serviceaccount-dockercfg-f26bf
nodeName: ip-10-0-130-83.ap-south-1.compute.internal
nodeSelector:
kubernetes.io/os: linux
node-role.kubernetes.io/master: ""
preemptionPolicy: PreemptLowerPriority
priority: 2000000000
priorityClassName: system-cluster-critical
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
runAsNonRoot: true
runAsUser: 65534
seLinuxOptions:
level: s0:c20,c0
seccompProfile:
type: RuntimeDefault
serviceAccount: olm-operator-serviceaccount
serviceAccountName: olm-operator-serviceaccount
terminationGracePeriodSeconds: 30
tolerations:
effect: NoSchedule
key: node-role.kubernetes.io/master
operator: Exists
effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 120
effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 120
effect: NoSchedule
key: node.kubernetes.io/memory-pressure
operator: Exists
volumes:
name: srv-cert
secret:
defaultMode: 420
secretName: catalog-operator-serving-cert
name: profile-collector-cert
secret:
defaultMode: 420
secretName: pprof-cert
name: kube-api-access-cv642
projected:
defaultMode: 420
sources:
serviceAccountToken:
expirationSeconds: 3607
path: token
configMap:
items:
key: ca.crt
path: ca.crt
name: kube-root-ca.crt
downwardAPI:
items:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
configMap:
items:
key: service-ca.crt
path: service-ca.crt
name: openshift-service-ca.crt
status:
conditions:
lastProbeTime: null
lastTransitionTime: "2022-06-29T02:46:23Z"
status: "True"
type: Initialized
lastProbeTime: null
lastTransitionTime: "2022-06-29T08:11:56Z"
status: "True"
type: Ready
lastProbeTime: null
lastTransitionTime: "2022-06-29T08:11:56Z"
status: "True"
type: ContainersReady
lastProbeTime: null
lastTransitionTime: "2022-06-29T02:46:23Z"
status: "True"
type: PodScheduled
containerStatuses:
containerID: cri-o://64a99805cca11e9af66452d739a91259566a35a9e580bfac81dd7205f9272e18
image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
lastState:
terminated:
containerID: cri-o://938e2eb90465caf8ac99ff4405cfbb9a0f03b16598474f944cca44c3e70af9df
exitCode: 2
finishedAt: "2022-06-29T08:11:54Z"
message: "ub.com/operator-framework/operator-lifecycle-manager/pkg/lib/kubestate/kubestate.go:128
+0xc3 fp=0xc003c11180 sp=0xc003c11118 pc=0x1a36603\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*subscriptionSyncer).Sync(0xc000a2f420,
{0x237a238, 0xc0004ce4c0}, {0x236a448, 0xc004993ce0})\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/syncer.go:77
+0x535 fp=0xc003c11648 sp=0xc003c11180 pc=0x1a4f535\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*QueueInformer).Sync(...)\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer.go:35\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).processNextWorkItem(0xc0001e3ad0,
{0x237a238, 0xc0004ce4c0}
, 0xc000cee360)\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:287
+0x57c fp=0xc003c11f70 sp=0xc003c11648 pc=0x1a3ca7c\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).worker(0x10000c0008fd6e0?,

{0x237a238, 0xc0004ce4c0}
, 0xc0004837b8?)\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:231
+0x45 fp=0xc003c11fb0 sp=0xc003c11f70 pc=0x1a3c4a5\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).start.func3()\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:221
+0x32 fp=0xc003c11fe0 sp=0xc003c11fb0 pc=0x1a3c152\nruntime.goexit()\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1571
+0x1 fp=0xc003c11fe8 sp=0xc003c11fe0 pc=0x4719c1\ncreated by github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).start\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:221
+0x557\n"
reason: Error
startedAt: "2022-06-29T07:56:16Z"
name: catalog-operator
ready: true
restartCount: 6
started: true
state:
running:
startedAt: "2022-06-29T08:11:55Z"
hostIP: 10.0.130.83
phase: Running
podIP: 10.130.0.121
podIPs:
ip: 10.130.0.121
qosClass: Burstable
startTime: "2022-06-29T02:46:23Z"

— Additional comment from jiazha@redhat.com on 2022-07-04 03:50:38 UTC —

Please ignore comment 4, 5, they are nothing with this issue.

— Additional comment from jiazha@redhat.com on 2022-07-04 06:57:24 UTC —

Check the `previous` log.

mac:~ jianzhang$ oc logs catalog-operator-f8ddcb57b-j5rf2 --previous
time="2022-07-03T23:49:00Z" level=info msg="log level info"
...
...
time="2022-07-04T03:43:25Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-07-04T03:43:25Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
fatal error: concurrent map writes
fatal error: concurrent map writes

goroutine 559 [running]:
runtime.throw(

{0x2048546?, 0x0?}

)
/usr/lib/golang/src/runtime/panic.go:992 +0x71 fp=0xc001f9c508 sp=0xc001f9c4d8 pc=0x43e9f1
runtime.mapassign_faststr(0x1d09880, 0xc0031847b0,

{0x2079799, 0x2e}

)
/usr/lib/golang/src/runtime/map_faststr.go:295 +0x38b fp=0xc001f9c570 sp=0xc001f9c508 pc=0x419b4b
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler.Pod(0xc001f4a900,

{0x203dd0f, 0xf}

,

{0xc00132ccc0, 0x38}

,

{0xc003582d50, 0x13}, 0xc00452c1e0, 0xc0031847b0, 0x5, ...)
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler/reconciler.go:227 +0xbdf fp=0xc001f9cbb0 sp=0xc001f9c570 pc=0x1a475ff
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler.(*grpcCatalogSourceDecorator).Pod(0xc001f9ce50, {0xc003582d50, 0x13}

)
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler/grpc.go:125 +0xf9 fp=0xc001f9cc30 sp=0xc001f9cbb0 pc=0x1a42c99
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler.(*GrpcRegistryReconciler).currentPodsWithCorrectImageAndSpec(0xc001f9ce68?,

{0xc001f4a900}

,

{0xc003582d50, 0x13}

)
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler/grpc.go:190 +0x198 fp=0xc001f9ce48 sp=0xc001f9cc30 pc=0x1a437b8
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler.(*GrpcRegistryReconciler).CheckRegistryServer(0xc000bcbf80?, 0x493b77?)
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler/grpc.go:453 +0x4c fp=0xc001f9ce88 sp=0xc001f9ce48 pc=0x1a45fcc
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*catalogHealthReconciler).healthy(0x38ca8453?, 0xc001f4a900)
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/reconciler.go:196 +0x7e fp=0xc001f9ced0 sp=0xc001f9ce88 pc=0x1a4ae1e
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*catalogHealthReconciler).health(0x1bc37c0?, 0xc003e7e7e0, 0x8?)
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/reconciler.go:159 +0x2a fp=0xc001f9cf10 sp=0xc001f9ced0 pc=0x1a4ac8a
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*catalogHealthReconciler).catalogHealth(0xc000a59a90,

{0xc003356a50, 0x11}

)
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/reconciler.go:137 +0x387 fp=0xc001f9d040 sp=0xc001f9cf10 pc=0x1a4a827
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*catalogHealthReconciler).Reconcile(0xc000a59a90,

{0x237a238, 0xc0003fe5c0}, {0x7f9f6e5b3900?, 0xc0050f64f0?})
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/reconciler.go:82 +0x21e fp=0xc001f9d118 sp=0xc001f9d040 pc=0x1a4a1be
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/kubestate.ReconcilerChain.Reconcile({0xc0008e43c0?, 0x3, 0xc001f9d258?}, {0x237a238, 0xc0003fe5c0}

,

{0x7f9f6e5b3328?, 0xc0050f6490?}

)
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/kubestate/kubestate.go:128 +0xc3 fp=0xc001f9d180 sp=0xc001f9d118 pc=0x1a36603
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*subscriptionSyncer).Sync(0xc0004dfd50,

{0x237a238, 0xc0003fe5c0}, {0x236a448, 0xc003cbb760})
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/syncer.go:77 +0x535 fp=0xc001f9d648 sp=0xc001f9d180 pc=0x1a4f535
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*QueueInformer).Sync(...)
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer.go:35
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).processNextWorkItem(0xc00057a580, {0x237a238, 0xc0003fe5c0}

, 0xc000954720)
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:287 +0x57c fp=0xc001f9df70 sp=0xc001f9d648 pc=0x1a3ca7c
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).worker(0x0?,

{0x237a238, 0xc0003fe5c0}

, 0x0?)
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:231 +0x45 fp=0xc001f9dfb0 sp=0xc001f9df70 pc=0x1a3c4a5
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).start.func3()
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:221 +0x32 fp=0xc001f9dfe0 sp=0xc001f9dfb0 pc=0x1a3c152
runtime.goexit()
/usr/lib/golang/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc001f9dfe8 sp=0xc001f9dfe0 pc=0x4719c1
created by github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).start
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:221 +0x557

Seems like it failed at: https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/registry/reconciler/reconciler.go#L227

— Additional comment from agreene@redhat.com on 2022-07-05 16:01:22 UTC —

As Jian pointed out, the catalog operator is failing due to a concurrent write at https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/registry/reconciler/reconciler.go#L227.

This is happening because:

The grpcCatalogSourceDecorator includes a pointer to a catalogSource: https://github.com/operator-framework/operator-lifecycle-manager/blob/0b7970c3d609e1f483a0ba68f599a5ffd2e139b5/pkg/controller/registry/reconciler/grpc.go#L32-L35
The grpcCatalogSourceDecorator's Annotations function returns the catalogSource's annotations: https://github.com/operator-framework/operator-lifecycle-manager/blob/0b7970c3d609e1f483a0ba68f599a5ffd2e139b5/pkg/controller/registry/reconciler/grpc.go#L66
The grpcCatalogSourceDecorator's Pod function uses the annotations returned from it's Annotations function: https://github.com/operator-framework/operator-lifecycle-manager/blob/0b7970c3d609e1f483a0ba68f599a5ffd2e139b5/pkg/controller/registry/reconciler/grpc.go#L125
These catalogSource's annotations are updated within the Pod function here: https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/registry/reconciler/reconciler.go#L227

Line 227 in the reconciler.go directly mutates the catalogSource's annotations. The grpcCatalogSourceDecorator's Annotations function should be returning a copy of the annotations or it should be created with a deepcopy of the catalogSource to avoid mutating an object in the lister cache.

This doesn't seem to be a blocker, but we should get a fix in swiftly.

— Additional comment from jiazha@redhat.com on 2022-07-13 05:02:04 UTC —

1, Create a cluster with the fixed PR via the Cluster-bot.
mac:~ jianzhang$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.ci.test-2022-07-13-022646-ci-ln-41fvni2-latest True False 126m Cluster version is 4.11.0-0.ci.test-2022-07-13-022646-ci-ln-41fvni2-latest

2, Subscribe some operators.
mac:~ jianzhang$ oc get sub -A
NAMESPACE NAME PACKAGE SOURCE CHANNEL
default etcd etcd community-operators singlenamespace-alpha
openshift-logging cluster-logging cluster-logging redhat-operators stable
openshift-operators-redhat elasticsearch-operator elasticsearch-operator redhat-operators stable

mac:~ jianzhang$ oc get sub -A
NAMESPACE NAME PACKAGE SOURCE CHANNEL
default etcd etcd community-operators singlenamespace-alpha
openshift-logging cluster-logging cluster-logging redhat-operators stable
openshift-operators-redhat elasticsearch-operator elasticsearch-operator redhat-operators stable
mac:~ jianzhang$
mac:~ jianzhang$
mac:~ jianzhang$ oc get csv -n openshift-operators-redhat
NAME DISPLAY VERSION REPLACES PHASE
elasticsearch-operator.5.4.2 OpenShift Elasticsearch Operator 5.4.2 Succeeded
mac:~ jianzhang$ oc get csv -n openshift-logging
NAME DISPLAY VERSION REPLACES PHASE
cluster-logging.5.4.2 Red Hat OpenShift Logging 5.4.2 Succeeded
elasticsearch-operator.5.4.2 OpenShift Elasticsearch Operator 5.4.2 Succeeded
mac:~ jianzhang$ oc get csv -n default
NAME DISPLAY VERSION REPLACES PHASE
elasticsearch-operator.5.4.2 OpenShift Elasticsearch Operator 5.4.2 Succeeded
etcdoperator.v0.9.4 etcd 0.9.4 etcdoperator.v0.9.2 Succeeded

3, Check OLM catalog-operator pods status.
mac:~ jianzhang$ oc get pods
NAME READY STATUS RESTARTS AGE
catalog-operator-546db7cdf5-7pldg 1/1 Running 0 145m
collect-profiles-27628110-lr2nv 0/1 Completed 0 30m
collect-profiles-27628125-br8b8 0/1 Completed 0 15m
collect-profiles-27628140-m64gp 0/1 Completed 0 38s
olm-operator-754d7f6f56-26qhw 1/1 Running 0 145m
package-server-manager-77d5cbf696-v9w4p 1/1 Running 0 145m
packageserver-6884994d98-2smtw 1/1 Running 0 143m
packageserver-6884994d98-5d7jg 1/1 Running 0 143m

mac:~ jianzhang$ oc logs catalog-operator-546db7cdf5-7pldg --previous
Error from server (BadRequest): previous terminated container "catalog-operator" in pod "catalog-operator-546db7cdf5-7pldg" not found

No terminated container. catalog-operator works well. Verify it.

— Additional comment from aos-team-art-private@redhat.com on 2022-07-13 22:50:04 UTC —

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.12 release.

— Additional comment from jiazha@redhat.com on 2022-07-18 07:23:08 UTC —

Changed the status to VERIFIED based on comment 10.

https://github.com/openshift/operator-framework-olm/pull/415

Bug OCPBUGS-5575: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-network-operator/pull/1682

Bug OCPBUGS-2884: [release-4.10] Make ccoctl set sts endpoints to regional in AWS credentials secrets

View the Description View the linked PRs

This bug card represents work done in https://issues.redhat.com/browse/CCO-257 to set STS endpoints to regional in AWS credentials secrets and is created to facilitate backporting the change to previous releases as required by the backport process [1].

[1] https://docs.google.com/document/d/1qbKUl4K9PPW4tCbdDgWSXd5klosX77ifkxQLqzjztGE/edit#heading=h.mml0rz88o0nj

https://github.com/openshift/cloud-credential-operator/pull/504

Bug OCPBUGS-233: Upgrade failing because restrictive scc is injected into version pod

View the Description View the linked PRs

Description of problem:

OCP Upgrade failing

Version-Release number of the following components:

oc version
Client Version: 4.8.0-202108312109.p0.git.0d10c3f.assembly.stream-0d10c3f
Server Version: 4.10.13
Kubernetes Version: v1.23.5+b463d71

How reproducible: Always

Steps to Reproduce:
1. Create the following SCC (that has `with readOnlyRootFilesystem: true`):
~~~
cat << EOF | oc create -f -
allowHostDirVolumePlugin: true
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: true
allowPrivilegedContainer: true
allowedCapabilities: []
apiVersion: security.openshift.io/v1
defaultAddCapabilities: []
fsGroup:
type: MustRunAs
groups: []
kind: SecurityContextConstraints
metadata:
annotations:
meta.helm.sh/release-name: azure-arc
meta.helm.sh/release-namespace: default
labels:
app.kubernetes.io/managed-by: Helm
name: kube-aad-proxy-scc
priority: null
readOnlyRootFilesystem: true
requiredDropCapabilities: []
runAsUser:
type: RunAsAny
seLinuxContext:
type: MustRunAs
supplementalGroups:
type: RunAsAny
users:

system:serviceaccount:azure-arc:azure-arc-kube-aad-proxy-sa
volumes:
configMap
hostPath
secret
EOF
~~~

2. oc adm upgrade --to=4.10.20

Actual results:

SCC kube-aad-proxy-scc, which has readOnlyRootFilesystem is injected inside the pod version-4.10.20-smvt9-6vqwc, causing it to fail.
~~~

oc get po -n openshift-cluster-version
NAME READY STATUS RESTARTS AGE
cluster-version-operator-6b5c8ff5c8-4bmxx 1/1 Running 0 33m
version-4.10.20-smvt9-6vqwc 0/1 Error 0 10s
oc logs version-4.10.20-smvt9-6vqwc -n openshift-cluster-version
oc logs version-4.10.20-smvt9-6vqwc
mv: cannot remove '/manifests/0000_00_cluster-version-operator_00_namespace.yaml': Read-only file system
mv: cannot remove '/manifests/0000_00_cluster-version-operator_01_adminack_configmap.yaml': Read-only file system
mv: cannot remove '/manifests/0000_00_cluster-version-operator_01_admingate_configmap.yaml': Read-only file system
mv: cannot remove '/manifests/0000_00_cluster-version-operator_01_clusteroperator.crd.yaml': Read-only file system
mv: cannot remove '/manifests/0000_00_cluster-version-operator_01_clusterversion.crd.yaml': Read-only file system
mv: cannot remove '/manifests/0000_00_cluster-version-operator_02_roles.yaml': Read-only file system
mv: cannot remove '/manifests/0000_00_cluster-version-operator_03_deployment.yaml': Read-only file system
mv: cannot remove '/manifests/0000_90_cluster-version-operator_00_prometheusrole.yaml': Read-only file system
mv: cannot remove '/manifests/0000_90_cluster-version-operator_01_prometheusrolebinding.yaml': Read-only file system
mv: cannot remove '/manifests/0000_90_cluster-version-operator_02_servicemonitor.yaml': Read-only file system
mv: cannot remove '/manifests/0001_00_cluster-version-operator_03_service.yaml': Read-only file system
~~~

Expected results:

Pod version-4.10.20-smvt9-6vqwc should run fine

Additional info:

I don't know why, but SCC kube-aad-proxy-scc is injected inside pod version-4.10.20-smvt9-6vqwc:
~~~
apiVersion: v1
kind: Pod
metadata:
annotations:
k8s.v1.cni.cncf.io/network-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.129.0.70"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.129.0.70"
],
"default": true,
"dns": {}
}]
openshift.io/scc: kube-aad-proxy-scc ### HERE
creationTimestamp: "2022-07-25T16:47:39Z"
generateName: version-4.10.20-5xqtv-
labels:
controller-uid: ba707bbe-1825-4f80-89ce-f6bf2301a812
job-name: version-4.10.20-5xqtv
name: version-4.10.20-5xqtv-9gcwk
namespace: openshift-cluster-version
ownerReferences:

apiVersion: batch/v1
blockOwnerDeletion: true
controller: true
kind: Job
name: version-4.10.20-5xqtv
uid: ba707bbe-1825-4f80-89ce-f6bf2301a812
resourceVersion: "40040"
uid: 0d668d3d-7452-463f-a421-4dfee9c89c23
spec:
containers:
args:
-c
mkdir -p /etc/cvo/updatepayloads/KsrCX7X9QbtoXkW3TkPcww && mv /manifests /etc/cvo/updatepayloads/KsrCX7X9QbtoXkW3TkPcww/manifests
&& mkdir -p /etc/cvo/updatepayloads/KsrCX7X9QbtoXkW3TkPcww && mv /release-manifests
/etc/cvo/updatepayloads/KsrCX7X9QbtoXkW3TkPcww/release-manifests
command:
/bin/sh
image: quay.io/openshift-release-dev/ocp-release@sha256:b89ada9261a1b257012469e90d7d4839d0d2f99654f5ce76394fa3f06522b600
imagePullPolicy: IfNotPresent
name: payload
resources:
requests:
cpu: 10m
ephemeral-storage: 2Mi
memory: 50Mi
securityContext:
privileged: true
readOnlyRootFilesystem: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
mountPath: /etc/cvo/updatepayloads
name: payloads
mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-fwblb
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
imagePullSecrets:
name: default-dockercfg-smmf4
nodeName: ip-10-0-215-206.eu-central-1.compute.internal
nodeSelector:
node-role.kubernetes.io/master: ""
preemptionPolicy: PreemptLowerPriority
priority: 1000000000
priorityClassName: openshift-user-critical
restartPolicy: OnFailure
schedulerName: default-scheduler
securityContext:
fsGroup: 1000030000
seLinuxOptions:
level: s0:c6,c0
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 30
tolerations:
key: node-role.kubernetes.io/master
effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
effect: NoSchedule
key: node.kubernetes.io/memory-pressure
operator: Exists
volumes:
hostPath:
path: /etc/cvo/updatepayloads
type: ""
name: payloads
name: kube-api-access-fwblb
projected:
defaultMode: 420
sources:
serviceAccountToken:
expirationSeconds: 3607
path: token
configMap:
items:
key: ca.crt
path: ca.crt
name: kube-root-ca.crt
downwardAPI:
items:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
configMap:
items:
key: service-ca.crt
path: service-ca.crt
name: openshift-service-ca.crt
status:
conditions:
lastProbeTime: null
lastTransitionTime: "2022-07-25T16:47:39Z"
status: "True"
type: Initialized
lastProbeTime: null
lastTransitionTime: "2022-07-25T16:47:39Z"
message: 'containers with unready status: [payload]'
reason: ContainersNotReady
status: "False"
type: Ready
lastProbeTime: null
lastTransitionTime: "2022-07-25T16:47:39Z"
message: 'containers with unready status: [payload]'
reason: ContainersNotReady
status: "False"
type: ContainersReady
lastProbeTime: null
lastTransitionTime: "2022-07-25T16:47:39Z"
status: "True"
type: PodScheduled
containerStatuses:
containerID: cri-o://ac6f6a5d8925620f1a2835a50fe26ea02d35e3a5c2d033015f38fde5206daf8c
image: quay.io/openshift-release-dev/ocp-release@sha256:b89ada9261a1b257012469e90d7d4839d0d2f99654f5ce76394fa3f06522b600
imageID: quay.io/openshift-release-dev/ocp-release@sha256:b89ada9261a1b257012469e90d7d4839d0d2f99654f5ce76394fa3f06522b600
lastState:
terminated:
containerID: cri-o://fdac85e975eb00a3abd08e18061ae3673a857769ddfc87ca94a3527a8c7b83f3
exitCode: 1
finishedAt: "2022-07-25T16:47:42Z"
reason: Error
startedAt: "2022-07-25T16:47:42Z"
name: payload
ready: false
restartCount: 2
started: false
state:
terminated:
containerID: cri-o://ac6f6a5d8925620f1a2835a50fe26ea02d35e3a5c2d033015f38fde5206daf8c
exitCode: 1
finishedAt: "2022-07-25T16:47:56Z"
reason: Error
startedAt: "2022-07-25T16:47:56Z"
hostIP: 10.0.215.206
phase: Running
podIP: 10.129.0.70
podIPs:
ip: 10.129.0.70
qosClass: Burstable
startTime: "2022-07-25T16:47:39Z"
~~~

https://github.com/openshift/cluster-version-operator/pull/824

Bug OCPBUGS-2743: Fatal error while configuring sts endpoint in install-config

View the Description View the linked PRs

Description of problem:

Install failed if specify sts endpoint in the install-config.yaml file:

platform:
  aws:
    region: us-east-2
    serviceEndpoints:
    - name: sts
      url: https://sts.us-east-2.amazonaws.com

Errors:

level=error msg=Error: error configuring Terraform AWS Provider: error validating provider credentials: error calling sts:GetCallerIdentity: SignatureDoesNotMatch: Credential should be scoped to a valid region. 
level=error msg=  status code: 403, request id: f4e877fe-9e90-4cba-a455-2538d489a8d0
level=error
level=error msg=  on ../../../../../../../../tmp/openshift-install-cluster-1552617948/main.tf line 11, in provider "aws":
level=error msg=  11: provider "aws" {
level=error
level=error
level=error msg=Failed to read tfstate: open /tmp/openshift-install-cluster-1552617948/terraform.cluster.tfstate: no such file or directory

Version-Release number of selected component (if applicable):

4.10.z

How reproducible:

* Always

Steps to Reproduce:

1. Create an install-config.yaml, and specify sts endpoint 
2. Create an STS cluster

Actual results:

Error: error configuring Terraform AWS Provider: error validating provider credentials: error calling sts:GetCallerIdentity: SignatureDoesNotMatch: Credential should be scoped to a valid region.

Expected results:

No errors, create cluster successfully

Additional info:

No such issues on 4.8, 4.9, 4.11, 4.12

https://github.com/openshift/installer/pull/6518

Bug OCPBUGS-2623: Backport Request for BZ 2037513

View the Description View the linked PRs

Description of problem:

View https://bugzilla.redhat.com/show_bug.cgi?id=2037513

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/cluster-monitoring-operator/pull/1798

Bug OCPBUGS-4839: [4.10] Pod LSP missing from PortGroup

View the Description View the linked PRs

Description of problem:

For some reason, the LSP of a pod is not properly added to the port group where the ACL of a NetworkPolicy is applied. This results on the networkpolicy not being applied to the pod and communication not possible.

Version-Release number of selected component (if applicable):

4.10

How reproducible:

Always with a concrete pod at customer environment.

Steps to Reproduce:

(not known exactly yet)

Actual results:

LSP not in port group. ACL not applied. Netpol not in effect.

Expected results:

LSP in port group. ACL applied. Netpol in effect.

Additional info:

Details in private comments, as they involve sensitive data.

Deleting the pod does nothing, but it is possible that this has something to do with the pod being recreated with the same name (although the LSPs UUIDs are different in each incarnation).

https://github.com/openshift/ovn-kubernetes/pull/1452

Bug OCPBUGS-4864: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ovn-kubernetes/pull/1453

Bug OCPBUGS-5955: ip-reconciler removes the overlappingrangeipreservations whether the pod is alive or not [backport 4.10]

View the Description View the linked PRs

Description of problem:

The reconciler removes the overlappingrangeipreservations.whereabouts.cni.cncf.io resources whether the pod is alive or not.

Version-Release number of selected component (if applicable):

How reproducible:

Always

Steps to Reproduce:

1. Create pods and check the overlappingrangeipreservations.whereabouts.cni.cncf.io resources:

$ oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -A
NAMESPACE          NAME                      AGE
openshift-multus   2001-1b70-820d-4b04--13   4m53s
openshift-multus   2001-1b70-820d-4b05--13   4m49s

2. Verify that when the ip-reconciler cronjob removes the overlappingrangeipreservations.whereabouts.cni.cncf.io resources when run:

$ oc get cronjob -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        14m             4d13h

$ oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -A
No resources found

$ oc get cronjob -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        5s              4d13h

Actual results:

The overlappingrangeipreservations.whereabouts.cni.cncf.io resources are removed for each created pod by the ip-reconciler cronjob.
The "overlapping ranges" are not used.

Expected results:

The overlappingrangeipreservations.whereabouts.cni.cncf.io should not be removed regardless of if a pod has used an IP in the overlapping ranges.

Additional info:

https://github.com/openshift/cluster-network-operator/pull/1687

Bug OCPBUGS-7990: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ovn-kubernetes/pull/1542

Bug OCPBUGS-2132: openshift template cakephp-mysql-persistent has a typo in apiVersion for DeploymentConfig

View the Description View the linked PRs

Description of problem:

openshift template cakephp-mysql-persistent has a typo in apiVersion for DeploymentConfig

Version-Release number of selected component (if applicable):

4.10

How reproducible:

This bug was resolved for ocp 4.11, Now there is requirement to backport this fix to ocp4.10

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/cluster-samples-operator/pull/466

Bug OCPBUGS-8672: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/multus-cni/pull/147

Bug OCPBUGS-1696: All Nodes overview in console are showing "Something went wrong"

View the Description View the linked PRs

Description of problem:

All Nodes overview in console are showing "Something went wrong"

Version-Release number of selected component (if applicable):

4.10

How reproducible:

I have tested in test-lab and i am able see node overviews

Steps to Reproduce:

1.
2.
3.

Actual results:

In the Console, Under compute tab nodes overview showing error "Oh no! something went wrong"

Expected results:

It should show the node overview

Additional info:

It is happening for all users with admin roles

https://github.com/openshift/console/pull/12129

Bug OCPBUGS-2973: `oc explain` output of default overlaySize is outdated

View the Description View the linked PRs

This bug is a backport clone of [Bugzilla Bug 2081119](https://bugzilla.redhat.com/show_bug.cgi?id=2081119). The following is the description of the original bug:
—
Description of problem:

`oc explain` output of default overlaySize is 10G. The outdated doc confused the customer with default overlaySize. The description should be updated since the default was removed from the implementation of storage.conf.

Version-Release number of selected component (if applicable):

How reproducible:

Actual results:

$ oc explain containerRuntimeConfig.spec.containerRuntimeConfig.logLevel
KIND: ContainerRuntimeConfig
VERSION: machineconfiguration.openshift.io/v1

FIELD: logLevel <string>

DESCRIPTION:
logLevel specifies the verbosity of the logs based on the level it is set
to. Options are fatal, panic, error, warn, info, and debug.
[qiwan@qiwan ~]$ oc explain containerRuntimeConfig.spec.containerRuntimeConfig.overlaySize
KIND: ContainerRuntimeConfig
VERSION: machineconfiguration.openshift.io/v1

FIELD: overlaySize <string>

DESCRIPTION:
overlaySize specifies the maximum size of a container image. This flag can
be used to set quota on the size of container images. (default: 10GB).

Expected results:

$ oc explain containerRuntimeConfig.spec.containerRuntimeConfig.logLevel
KIND: ContainerRuntimeConfig
VERSION: machineconfiguration.openshift.io/v1

FIELD: logLevel <string>

DESCRIPTION:
logLevel specifies the verbosity of the logs based on the level it is set
to. Options are fatal, panic, error, warn, info, and debug.
[qiwan@qiwan ~]$ oc explain containerRuntimeConfig.spec.containerRuntimeConfig.overlaySize
KIND: ContainerRuntimeConfig
VERSION: machineconfiguration.openshift.io/v1

FIELD: overlaySize <string>

DESCRIPTION:
overlaySize specifies the maximum size of a container image. This flag can
be used to set quota on the size of container images.

Additional info:

https://github.com/openshift/machine-config-operator/pull/3395

Bug OCPBUGS-3996: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/sdn/pull/472

Bug OCPBUGS-10777: TestNewAppRun unit test failing

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-10622~~. The following is the description of the original issue:
—
Description of problem:

Unit test failing 

=== RUN   TestNewAppRunAll/app_generation_using_context_dir
    newapp_test.go:907: app generation using context dir: Error mismatch! Expected <nil>, got supplied context directory '2.0/test/rack-test-app' does not exist in 'https://github.com/openshift/sti-ruby'
    --- FAIL: TestNewAppRunAll/app_generation_using_context_dir (0.61s)

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

see for example https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/1376/pull-ci-openshift-oc-master-images/1638172620648091648

Actual results:

unit tests fail

Expected results:

TestNewAppRunAll unit test should pass

Additional info:

https://github.com/openshift/oc/pull/1383

Bug OCPBUGS-1737: [OCPonOpenstack] Remove clustername length limitation

View the Description View the linked PRs

Description of problem:

Currently when installing Openshift on the Openstack cluster name length limit is allowed to  14 characters.
Customer wants to know if is it possible to override this validation when installing Openshift on Openstack and create a cluster name that is greater than 14 characters.

Version : OCP 4.8.5 UPI Disconnected 
Environment : Openstack 16 

Issue:
User reports that they are getting error for OCP cluster in Openstack UPI, where the name of the cluster is > 14 characters.

Error events :
~~~
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["/usr/local/bin/openshift-install", "create", "manifests", "--dir=/home/gitlab-runner/builds/WK8mkokN/0/CPE/SKS/pipelines/non-prod/ocp4-openstack-build/ocpinstaller/install-upi"], "delta": "0:00:00.311397", "end": "2022-09-03 21:38:41.974608", "msg": "non-zero return code", "rc": 1, "start": "2022-09-03 21:38:41.663211", "stderr": "level=fatal msg=failed to fetch Master Machines: failed to load asset \"Install Config\": invalid \"install-config.yaml\" file: metadata.name: Invalid value: \"sks-osp-inf-cpe-1-cbr1a\": cluster name is too long, please restrict it to 14 characters", "stderr_lines": ["level=fatal msg=failed to fetch Master Machines: failed to load asset \"Install Config\": invalid \"install-config.yaml\" file: metadata.name: Invalid value: \"sks-osp-inf-cpe-1-cbr1a\": cluster name is too long, please restrict it to 14 characters"], "stdout": "", "stdout_lines": []}
~~~

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Actual results:

Users are getting error "cluster name is too long" when clustername contains more than 14 characters for OCP on Openstack

Expected results:

The 14 characters limit should be change for the OCP clustername on Openstack

Additional info:

https://github.com/openshift/installer/pull/6420

Bug OCPBUGS-2731: OpenStack UPI scripts do not create server group for Computes

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-1346~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-1226~~. The following is the description of the original issue:
—
We added server groups for control plane and computes as part of ~~OSASINFRA-2570~~, except for UPI that only creates server group for the control plane.

We need to update the UPI scripts to create server group for computes to be consistent with IPI and have the instruction at https://docs.openshift.com/container-platform/4.11/machine_management/creating_machinesets/creating-machineset-osp.html work out of the box in case customers want to create MachineSets on their UPI clusters.

Related to OCPCLOUD-1135.

https://github.com/openshift/installer/pull/6510

Bug OCPBUGS-3778: [4.10] ovn-k network policy races

View the Description View the linked PRs

Description of problem:

Network policy code has some problems, most of them are races, therefore it can be difficult to reproduce and verify, here is the list

1. all kinds of add/delete port to/from default deny port group failures, possible symptoms:
  - port should’ve been added to default deny port group, but wasn’t: connections that should’ve been dropped are allowed
  - port should’ve been deleted from default deny port group, but wasn’t: connections that should be allowed are dropped
  - db ops failures when an attempt to add/delete port to/from default deny port group fails, e.g. because this operation already was done
2. default deny port group was overwritten when 2 network policies are created in a namespace at the same time. Can lead to ports not being added to the default deny port group => denied connections will be allowed
3. handle error when getting local pod from the cache fails, possible symptoms
  - "Failed to get LSP after multiple retries for pod %s/%s for networkPolicy" log message
  - pod is not added to netpol port groups, network policy is not applied
4. creating deleted namespace via ensureNamespaceLocked, symptoms:
  - namespace was deleted, but address set is present in the db
5. policy acl loglevel update wasn’t applied, possible symptoms:
  - netpol acl log level isn’t set/updated to namespace loglevel
6. netpol cleanup failures, symptoms:
  - network policy failed to be deleted, something is still left in the db, error messages like
  - "failed to destroy network policy"
  - "Rollback of default port groups and acls for policy: %s/%s failed, Unable to ensure namespace for network policy"
7. concurrent write to sets.String - this will panic, you won’t miss
8. retry for network policy handler after network policy was deleted, you should see failures saying that some network policy related object is nil or doesn’t exist, e.g.
  - "peer AddressSet is nil, cannot add <object>"
9. host network and completed pods selected by network policy can produce error logs, no real harm
  - "Failed to get LSP for pod <namespace>/<name> for networkPolicy %s refetching err"
10. namespace pod handlers are never stopped, can affect memory usage and look like a memory leak
11. add local pod failure, since netpol port group is not committed to db yet, error looks like
  - "Failed to create *factory.localPodSelector <name>, error: object not found"

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Example 1
1. Create network policy with [in/e]gress selector that applies to a namespace labeled project: myproject
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-network-policy
  namespace: test
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              project: myproject

2. Use oc apply to delete network policy and crate a pod in project: myproject namespace at the same time
3. check ovnkube-master logs for "peer AddressSet is nil, cannot add peer pod(s)", this should retry with the same error 15 times
4. This may not work from the first try, since we need to hit specific order of network policy delete and pod add handling
5. With the new version no error messages should be present

Example 2
1. create network policy that applies to a namespace test
piVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-network-policy
  namespace: test
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
2. Create host network pod in namespace test
3. Check 15 logs saying "Failed to get LSP for pod %s/%s for networkPolicy %s refetching err: "
4. check final log "Failed to get LSP after multiple retries for pod %s/%s for networkPolicy"
5. With the new version no error message should be present

All the other cases are difficult to reproduce, maybe just running some standard network policy tests and making sure everything works will be a good verification.

Actual results:

Expected results:

Additional info:

Only some parts were backported to 4.10 due to significant releases differences.
The problems that are fixed, and perf improvement:
1. don't retry unscheduled pod, wait for update event instead. 
2.  Cleanup pod handler for namespaceAndPod handler on namespace delete
event.
3. Only update localPods after successful db transaction, 
return fast from localPod handlers based on `np.localPods`
4. Don't retry fetching lsp from the lspCache, that "stops" the handler for 1 second
5.  Use stored portUUID to delete local pods instead of getting that info
from lspCache

https://github.com/openshift/ovn-kubernetes/pull/1434

Bug OCPBUGS-1266: Backport: https://github.com/openshift/kubernetes/pull/1295

View the Description View the linked PRs

+++ This bug was initially created as a clone of Bug #2117423 +++

Description of problem:

Backport https://github.com/openshift/kubernetes/pull/1295 to 4.10

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/kubernetes/pull/1362

Bug OCPBUGS-2523: e2e tests: Installs Red Hat Integration - 3scale operator test is failing due to change of Operator name

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-2451~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-2181~~. The following is the description of the original issue:
—
Description of problem:

E2E test Installs Red Hat Integration - 3scale operator test is failing due to change of Operator name

CI Search: https://search.ci.openshift.org/?search=Installs+Red+Hat+Integration+-+3scale+operator+in+test+namespace+and+creates+3scale+Backend+Schema+operand+instance&maxAge=24h&context=1&type=bug%2Bissue%2Bjunit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

https://github.com/openshift/console/pull/12184

Bug OCPBUGS-2070: Missing $SEARCH domain in /etc/resolve.conf for OCP v4.9.31 cluster

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-1099~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-224~~. The following is the description of the original issue:
—
Description of problem:
OCP v4.9.31 cluster didn't have the $search domain in /etc/resolv.conf, which was there in the v4.8.29 OCP cluster. This was observed in all the nodes of the v4.9.31 cluster.
~~~
OpenShift 4.9.31
sh-4.4# cat /etc/resolv.conf

Generated by KNI resolv prepender NM dispatcher script
nameserver 172.xx.xx.xx
nameserver 10.xx.xx.xx
nameserver 10.xx.xx.xx
nameserver 10.xx.xx.xx

OpenShift 4.8.29

Generated by KNI resolv prepender NM dispatcher script
search sepia.lab.iad2.dc.paas.redhat.com
nameserver 172.xx.xx.xx
nameserver 10.xx.xx.xx
nameserver 10.xx.xx.xx
nameserver 10.xx.xx.xx
~~~

ENV: OpenStack IAD2, IPI installation. Connected cluster.

Version-Release number of selected component (if applicable):
OCP v4.9.31

How reproducible:
Always

Steps to Reproduce:
1. Install IPI cluster on OpenStack IAD2 platform having cluster version 4.9.31
2. Debug to any of the node(master/worker)
3. Check and confirm the missing search domain on all nodes of the cluster.

Actual results:
The search domain was missing when checked in `/etc/resolv.conf` file on all nodes of the cluster causing serious issues in the cluster.

Expected results:
The installer should embed the search domain in /etc/resolv.conf file on all nodes of the cluster.

Additional info:

Cu was trying to deploy secure Kerberos on the CoreOS nodes and it failed when the IPA-client install command failed. This is when the customer noticed this unusual behavior. They did not manually update the resolv.conf file to include the $search domain. They instead added the script below to /etc/NetworkManager/dispatcher.d/ and restarted NetworkManager on the node to fix this issue and installation was successful.
~~~
#!/bin/bash

set -eo pipefail

DISPATCHER_FILE="/etc/NetworkManager/dispatcher.d/30-resolv-prepender"
DOMAINS="$(grep -E '\s*DOMAINS=.*iad2.dc.paas.redhat.com' $DISPATCHER_FILE \

grep -oE '[a-z0-9]*.dev.iad2.dc.paas.redhat.com' \

tr '\n' ' ')"

>&2 echo "IT-PaaS: overwriting search domains in /etc/resolv.conf with: $DOMAINS"

sed -e "/^search/d" \
-e "/Generated by/c# Generated by KNI resolv prepender NM dispatcher script \nsearch $DOMAINS" \
/etc/resolv.conf > /etc/resolv.tmp

mv /etc/resolv.tmp /etc/resolv.conf
~~~

Cu confirms that the $search domain was missing since the cluster was freshly installed/ They even confirmed this with a fresh new cluster as well that it was missing.

The fresh cluster was initially installed at v4.9.31 but was updated afterward to v4.9.43 (the latest z-stream) to see if the updates fixed anything but it didn't make any difference. The cluster is currently running v4.9.43 and shows the $search domain missing in the /etc/resolv.conf file on all nodes.

https://github.com/openshift/machine-config-operator/pull/3366

Bug OCPBUGS-2800: [4.10] etcd and kube-apiserver pods get restarted due to failed liveness probes while deleting/re-creating pods on SNO

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-2113~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-1329~~. The following is the description of the original issue:
—
Description of problem:

etcd and kube-apiserver pods get restarted due to failed liveness probes while deleting/re-creating pods on SNO

Version-Release number of selected component (if applicable):

4.10.32

How reproducible:

Not always, after ~10 attempts

Steps to Reproduce:

1. Deploy SNO with Telco DU profile applied
2. Create multiple pods with local storage volumes attached(attaching yaml manifest)
3. Force delete and re-create pods 10 times

Actual results:

etcd and kube-apiserver pods get restarted, making to cluster unavailable for a period of time

Expected results:

etcd and kube-apiserver do not get restarted

Additional info:

Attaching must-gather.

Please let me know if any additional info is required. Thank you!

https://github.com/openshift/cluster-etcd-operator/pull/953

Bug OCPBUGS-3485: Console shouldn't try to install dynamic plugins if permissions aren't available

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-3300~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-3265~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-3172~~. The following is the description of the original issue:
—
Customer is trying to install the Logging operator, which appears to attempt to install a dynamic plugin. The operator installation fails in the console because permissions aren't available to "patch resource consoles".

We shouldn't block operator installation if permission issues prevent dynamic plugin installation.

This is an OSD cluster, presumably for a customer with "cluster-admin", although it may be a paired down permission set called "dedicated-admin".

See https://docs.google.com/document/d/1hYS-bm6aH7S6z7We76dn9XOFcpi9CGYcGoJys514YSY/edit for permissions investigation work on OSD

https://github.com/openshift/console/pull/12258

Bug OCPBUGS-4604: fix https://github.com/kubernetes/kubernetes/pull/109137 in OpenShift 4.10

View the Description View the linked PRs

Description of problem:

Custonmer is facing this problem in OCP 4.10.40.
https://github.com/coredns/coredns/issues/5593

 Its root cause seems to be this bug in k8s code:

https://github.com/kubernetes/kubernetes/issues/109115
https://github.com/kubernetes/kubernetes/pull/109137

The issue seems to be fixed in OpenShift 4.11, but the customer can't update at this moment.

Can this fix be backported to OpenShift 4.10?

Version-Release number of selected component (if applicable):

4.10.40

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/openshift-apiserver/pull/323

Bug OCPBUGS-642: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/machine-config-operator/pull/3315

Task MON-1218: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-monitoring-operator/pull/1379

Bug OCPBUGS-3858: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ovn-kubernetes/pull/1385

Bug OCPBUGS-691: [2112237] [ Cluster storage Operator 4.x(10/11) ] DefaultStorageClassController report fake message "No default StorageClass for this platform" on Alicloud, IBM, Nutanix

View the Description View the linked PRs

Backport to 4.10.z

https://github.com/openshift/cluster-storage-operator/pull/314

Bug OCPBUGS-7787: [4.10.z] Fix kubevirt-console tests

View the Description View the linked PRs

Description of problem:

[4.10.z] Fix kubevirt-console tests

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/console/pull/12580

Bug OCPBUGS-1538: Make northd probe interval default to 10 seconds

View the Description View the linked PRs

Description of problem:

Tracking this for backport of https://bugzilla.redhat.com/show_bug.cgi?id=2072710

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/cluster-network-operator/pull/1563

Bug OCPBUGS-3619: List pages in pipelines is taking more time to load when there are too many items

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-2077~~. The following is the description of the original issue:
—
Description of problem:

Pipeline list page fetches all the pipelineruns to find the last pipeline run and which results in more load time. This performance issue needs to be addressed in all the pieplines list pages wherever applicable.

Version-Release number of selected component (if applicable):
4.9

How reproducible:
Always

Steps to Reproduce:
1. Create 10+ pipelines in a namespace
2. Create more number of pipelineruns under each pipeline
3. navigate to piplines list page.

Actual results:
Pipelines list will take a long time to load the list.

Expected results:

Pipeline list should not take more time to load the list.

Additional info:

Reduce the amount to data fetched to find the last pipelinerun, maybe use PartialMetadata to find the latest pipeline run and to improve the performance.

https://github.com/openshift/console/pull/12265

Bug OCPBUGS-205: [release-4.10] Add support for BUILDAH_QUIET environment variable

View the Description View the linked PRs

Description of problem:
Previously we had a bug opened for "Reduce buildah log level for default build log level [NEEDINFO]":

https://bugzilla.redhat.com/show_bug.cgi?id=1996883

We suggested customer using secrets, however, customer confirmed that they are using secrets as per:
https://docs.openshift.com/container-platform/4.8/cicd/builds/creating-build-inputs.html#builds-input-secrets-configmaps_creating-build-inputs

But not able to use "--quiet" build argument since they are using openshift s2i config:

strategy:
type: Source
sourceStrategy:
from:
kind: ImageStreamTag
namespace: xxxxxxx
name: 's2i-xxxxxx-xxxxxxx:v1.0.0'

Actual results:
Under this setting, the secret as well as every other openshift secret are printed.

Expected results:
Sensitive information (ENV) should not appear in build logs

Additional info:
Maybe pass the --quiet option via the buildconfig fir s2i?

— Additional comment from taxu@redhat.com on 2022-07-01 03:42:46 UTC —

Hi Build team,

I see that the target release for this bug is 4.11.0

Please let us know where to add the LOG_LEVEL for s2i and/or passing "--quiet" options once the fix is deployed.

Kind regards,

Tao Xu

— Additional comment from aos-team-art-private@redhat.com on 2022-07-13 06:44:30 UTC —

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.12 release.

— Additional comment from rdlugyhe@redhat.com on 2022-07-26 20:34:06 UTC —

Please approve the updated Doc Text.

— Additional comment from cdaley@redhat.com on 2022-07-26 20:49:05 UTC —

Approved

— Additional comment from jitsingh@redhat.com on 2022-07-27 04:48:18 UTC —

verified

— Additional comment from oarribas@redhat.com on 2022-08-15 12:11:23 UTC —

@Corey, I can see this BZ now verified for 4.12. Can this be backported to previous OCP versions?

— Additional comment from cdaley@redhat.com on 2022-08-15 13:57:02 UTC —

What version are you interested in getting it backported to?

— Additional comment from oarribas@redhat.com on 2022-08-15 15:46:10 UTC —

@Corey, I can see only 4.10 and 4.11 as "Full Support" in [1], so at least to those versions.

https://github.com/openshift/builder/pull/310

Bug OCPBUGS-4861: [4.10] Network Policy executes duplicate transactions for every pod update

View the Description View the linked PRs

Description of problem:

With every pod update we are executing a mutate operation to add the pod port to the port group or add the pod IP to an address set. This functionally doesn't hurt, since mutate will not add duplicate values to the same set. However, this is bad for performance. For example, with a 730 network policies affecting a pod, and issuing 7 pod updates would result in over 5k transactions.

https://github.com/openshift/ovn-kubernetes/pull/1452

Bug OCPBUGS-5299: [4.10] [OVN]Sometimes after reboot egress node, egress IP cannot be applied anymore.

View the Description View the linked PRs

Description of problem:

[OVN][OSP] After reboot egress node, egress IP cannot be applied anymore.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-11-07-181244

How reproducible:

Frequently happened in automation. But didn't reproduce it in manual.

Steps to Reproduce:

1. Label one node as egress node

2.
Config one egressIP object
STEP: Check  one EgressIP assigned in the object.

Nov  8 15:28:23.591: INFO: egressIPStatus: [{"egressIP":"192.168.54.72","node":"huirwang-1108c-pg2mt-worker-0-2fn6q"}]

3.
Reboot the node, wait for the node ready.

Actual results:

EgressIP cannot be applied anymore. Waited more than 1 hour.
 oc get egressip
NAME             EGRESSIPS       ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-47031   192.168.54.72

Expected results:

The egressIP should be applied correctly.

Additional info:


Some logs
E1108 07:29:41.849149       1 egressip.go:1635] No assignable nodes found for EgressIP: egressip-47031 and requested IPs: [192.168.54.72]
I1108 07:29:41.849288       1 event.go:285] Event(v1.ObjectReference{Kind:"EgressIP", Namespace:"", Name:"egressip-47031", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'NoMatchingNodeFound' no assignable nodes for EgressIP: egressip-47031, please tag at least one node with label: k8s.ovn.org/egress-assignable


W1108 07:33:37.401149       1 egressip_healthcheck.go:162] Could not connect to huirwang-1108c-pg2mt-worker-0-2fn6q (10.131.0.2:9107): context deadline exceeded
I1108 07:33:37.401348       1 master.go:1364] Adding or Updating Node "huirwang-1108c-pg2mt-worker-0-2fn6q"
I1108 07:33:37.437465       1 egressip_healthcheck.go:168] Connected to huirwang-1108c-pg2mt-worker-0-2fn6q (10.131.0.2:9107)

After this log, seems like no logs related to "192.168.54.72" happened.

https://github.com/openshift/ovn-kubernetes/pull/1464

Task MON-1873: Tag all resources created by CMO e2e tests

View the Description View the linked PRs

The CMO e2e tests create a bunch of resources. These should be cleaned up on a successful run. However:

Some test failures leave the create resource behind, which have to be cleaned up before a re-run.
There have been developer reports that even successful runs don't tidy up everything.

In a CI context this is rarely a problem, however running the tests locally can be made quite awkward, especially repeated runs on the same cluster.

We should tag all resources created by the e2e tests with a label (app.kubernetes.io/created-by: cmo-e2e-test).
This will allow easy cleanup by deleting all resources with that label and will allow for checking proper clean-up.

DoD:
All e2e resources get properly tagged.
It is straight forward to ensure that future code changes don't skip adding this tag.

https://github.com/openshift/cluster-monitoring-operator/pull/1397

Bug OCPBUGS-3078: Backkport OCPBUGS-1718 to OCP 4.10

View the Description View the linked PRs

This is a request to back port the fix in ~~OCPBUGS-1718~~ to Openshift 4.10.

Description cloned from that bug:
Description of problem:

prometheus-k8s-0 ends in CrashLoopBackOff with evel=error err="opening storage failed: /prometheus/chunks_head/000002: invalid magic number 0" on SNO after hard reboot tests

Version-Release number of selected component (if applicable):

4.11.6

How reproducible:

Not always, after ~10 attempts

Steps to Reproduce:

1. Deploy SNO with Telco DU profile applied
2. Hard reboot node via out of band interface
3. oc -n openshift-monitoring get pods prometheus-k8s-0

Actual results:

NAME               READY   STATUS             RESTARTS          AGE
prometheus-k8s-0   5/6     CrashLoopBackOff   125 (4m57s ago)   5h28m

Expected results:

Running

Additional info:

Attaching must-gather.

The pod recovers successfully after deleting/re-creating.


[kni@registry.kni-qe-0 ~]$ oc -n openshift-monitoring logs prometheus-k8s-0
ts=2022-09-26T14:54:01.919Z caller=main.go:552 level=info msg="Starting Prometheus Server" mode=server version="(version=2.36.2, branch=rhaos-4.11-rhel-8, revision=0d81ba04ce410df37ca2c0b1ec619e1bc02e19ef)"
ts=2022-09-26T14:54:01.919Z caller=main.go:557 level=info build_context="(go=go1.18.4, user=root@371541f17026, date=20220916-14:15:37)"
ts=2022-09-26T14:54:01.919Z caller=main.go:558 level=info host_details="(Linux 4.18.0-372.26.1.rt7.183.el8_6.x86_64 #1 SMP PREEMPT_RT Sat Aug 27 22:04:33 EDT 2022 x86_64 prometheus-k8s-0 (none))"
ts=2022-09-26T14:54:01.919Z caller=main.go:559 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2022-09-26T14:54:01.919Z caller=main.go:560 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2022-09-26T14:54:01.921Z caller=web.go:553 level=info component=web msg="Start listening for connections" address=127.0.0.1:9090
ts=2022-09-26T14:54:01.922Z caller=main.go:989 level=info msg="Starting TSDB ..."
ts=2022-09-26T14:54:01.924Z caller=tls_config.go:231 level=info component=web msg="TLS is disabled." http2=false
ts=2022-09-26T14:54:01.926Z caller=main.go:848 level=info msg="Stopping scrape discovery manager..."
ts=2022-09-26T14:54:01.926Z caller=main.go:862 level=info msg="Stopping notify discovery manager..."
ts=2022-09-26T14:54:01.926Z caller=manager.go:951 level=info component="rule manager" msg="Stopping rule manager..."
ts=2022-09-26T14:54:01.926Z caller=manager.go:961 level=info component="rule manager" msg="Rule manager stopped"
ts=2022-09-26T14:54:01.926Z caller=main.go:899 level=info msg="Stopping scrape manager..."
ts=2022-09-26T14:54:01.926Z caller=main.go:858 level=info msg="Notify discovery manager stopped"
ts=2022-09-26T14:54:01.926Z caller=main.go:891 level=info msg="Scrape manager stopped"
ts=2022-09-26T14:54:01.926Z caller=notifier.go:599 level=info component=notifier msg="Stopping notification manager..."
ts=2022-09-26T14:54:01.926Z caller=main.go:844 level=info msg="Scrape discovery manager stopped"
ts=2022-09-26T14:54:01.926Z caller=manager.go:937 level=info component="rule manager" msg="Starting rule manager..."
ts=2022-09-26T14:54:01.926Z caller=main.go:1120 level=info msg="Notifier manager stopped"
ts=2022-09-26T14:54:01.926Z caller=main.go:1129 level=error err="opening storage failed: /prometheus/chunks_head/000002: invalid magic number 0"

https://github.com/openshift/prometheus/pull/146

Bug OCPBUGS-4136: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/machine-config-operator/pull/3436

Bug OCPBUGS-7012: [release-4.10] Egress FW ACL rules are invalid in dualstack mode

View the Description View the linked PRs

Description of problem:
Follow-up of: https://issues.redhat.com/browse/SDN-2988

This failure is perma-failing in the e2e-metal-ipi-ovn-dualstack-local-gateway jobs.

Example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-metal-ipi-ovn-dualstack-local-gateway/1597574181430497280
Search CI: https://search.ci.openshift.org/?search=when+using+openshift+ovn-kubernetes+should+ensure+egressfirewall+is+created&maxAge=336h&context=1&type=junit&name=e2e-metal-ipi-ovn-dualstack-local-gateway&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Sippy: https://sippy.dptools.openshift.org/sippy-ng/jobs/4.13/analysis?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-nightly-4.13-e2e-metal-ipi-ovn-dualstack-local-gateway%22%7D%5D%7D

Version-Release number of selected component (if applicable):

4.12,4.13

How reproducible:

Every time

Steps to Reproduce:

1. Setup dualstack KinD cluster
2. Create egress fw policy with spec
Spec:
  Egress:
    To:
      Cidr Selector:  0.0.0.0/0
    Type:             Deny
3. create a pod and ping to 1.1.1.1

Actual results:

Egress policy does not block flows to external IP

Expected results:

Egress policy blocks flows to external IP

Additional info:

It seems mixing ip4 and ip6 operands in ACL matchs doesnt work

https://github.com/openshift/ovn-kubernetes/pull/1508

Task MON-1659: set relatedObjects in ClusterOperator manifests

View the Description View the linked PRs

As mentioned in [1], the cluster monitoring operator doesn't define the relatedObjects field in the ClusterOperator manifest which is initially deployed by CVO [2].
If the CMO pod fails to start, the must-gather might miss information from the monitoring namespace. Note that once CMO runs, it will update the initial ClusterOperator object with the proper information [3].

[1] http://mailman-int.corp.redhat.com/archives/aos-devel/2021-May/msg00139.html
[2] https://github.com/openshift/cluster-monitoring-operator/blob/master/manifests/0000_50_cluster-monitoring-operator_06-clusteroperator.yaml
[3] https://github.com/openshift/cluster-monitoring-operator/blob/a6bc9824035ceb8dbfe7c53cf0c138bfb2ec5643/pkg/client/status_reporter.go#L49-L63

https://github.com/openshift/cluster-monitoring-operator/pull/1483

Bug OCPBUGS-1788: oc adm inspect --rotated-pod-logs not working properly for static pods

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-613~~. The following is the description of the original issue:
—

Description of problem:

The path used by --rotated-pod-logs to gather the rotated pod logs from /var/log/pods node folder via /api/v1/nodes/${NODE}/proxy/logs/${LOG_PATH} is only valid for regular pods but not for static pods.

The main problem is that, while normal pods have their rotated logs at this /var/log/pods/${POD_NAME}_${POD_UID_IN_API}/${CONTAINER_NAME}, static pods have them at /var/log/pods/${POD_NAME}_${CONFIG_HASH}/${CONTAINER_NAME} because the UID cannot be known at the time that the static pod is born (because static pods are created by kubelet before registering them in the kube-apiserver, and UID is assigned by the kube-apiserver).

The visible results of that are:

Spurious errors of not found resources related to the pods.
Rotated pod logs are not gathered even if present.

Version-Release number of selected component (if applicable):

4.10

How reproducible:

Always if there are static pods.

Steps to Reproduce:

1. oc adm inspect --rotated-pod-logs ns/openshift-etcd (or any other project with static pods).

Actual results:

Rotated pods not gathered.

Errors like these

error: errors occurred while gathering data:
    one or more errors occurred while gathering pod-specific data for namespace: openshift-etcd

    [one or more errors occurred while gathering container data for pod etcd-master-0.example.net:

    the server could not find the requested resource, one or more errors occurred while gathering container data for pod etcd-master-1.example.net:

    the server could not find the requested resource, one or more errors occurred while gathering container data for pod etcd-master-2.example.net:

    the server could not find the requested resource]

Expected results:

No errors like the ones above and rotated pod logs to be gathered, if present.

Additional info:

Despite being marked as experimental, this --rotated-pod-logs is used in must-gather, so this issue can be easily reproduced by just running a default must-gather. I focused on bare oc adm inspect reproducers for simplicity.

https://github.com/openshift/oc/pull/1246

Bug OCPBUGS-4494: [release-4.10] openshift-ingress-operator with mTLS does not download CRL

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-3049~~. The following is the description of the original issue:
—
This is a copy of Bugzilla bug 2117524 for backport to 4.11.z

Original Text:

Description of problem:
On routers configured with mTLS and CRL defined in the CA with a CDP ; new CRL is downloaded only when restarting the ingress-operator.

2022-07-20T23:36:26.943Z	INFO	operator.clientca_configmap_controller	controller/controller.go:298	reconciling	{"request": "openshift-ingress-operator/service-bdrc"}
2022-07-20T23:36:26.943Z	INFO	operator.crl	crl/crl_configmap.go:69	certificate revocation list has expired	{"subject key identifier": "6aa909992e9890457b2a8de5659a44cab8e867a8"}
2022-07-20T23:36:26.943Z	INFO	operator.crl	crl/crl_configmap.go:69	retrieving certificate revocation list	{"subject key identifier": "6aa909992e9890457b2a8de5659a44cab8e867a8"}
2022-07-20T23:36:26.943Z	INFO	operator.crl	crl/crl_configmap.go:169	retrieving CRL distribution point	{"distribution point": "http://crl.domain.com/der/CN=XXXX,OU=XXX,O=XXX,C=XXX"}

Version-Release number of selected component (if applicable):
4.9.33

How reproducible:
Enable mTLS with a CRL

Actual results:
CRL is not download when expired
Clients get "SSL client certificate not trusted" errors while accessing resources

Expected results:
ingress-operator triggers CRL download when approaching expiration date so that the configmap is updated without manual action

https://github.com/openshift/cluster-ingress-operator/pull/861

Bug OCPBUGS-473: Various Jenkins CVEs for August 2022 [openshift-4.10.z]

View the Description View the linked PRs

CVE-2022-36882
CVE-2022-29047
CVE-2022-30945
CVE-2022-30946
CVE-2022-30948
CVE-2022-30952
CVE-2022-30953
CVE-2022-30954
CVE-2022-34174
CVE-2022-36883
CVE-2022-36884
CVE-2022-36885
CVE-2022-34177
CVE-2022-34176
CVE-2022-36881

https://github.com/openshift/jenkins/pull/1479

Bug OCPBUGS-579: Setting a telemeter proxy in the cluster-monitoring-config config map does not work as expected

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-516~~. The following is the description of the original issue:
—
Description of problem:

Setting a telemeter proxy in the cluster-monitoring-config config map does not work as expected

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
the following KCS details steps to add a proxy.
The steps have been verified at 4.7 but do not work at 4.8, 4.9 or 4.10

https://access.redhat.com/solutions/6172402

When testing at 4.8, 4.9 and 4.10 the proxy setting where also nested under `telemeterClient`

which triggered a telemeter restart but the proxy setting do not get set in the deployment as they do in 4.7

Actual results:

4.8, 4.9 and 4.10 without the nested `telemeterClient`
does not trigger a restart of the telemeter pod

Expected results:

I think the proxy setting should be nested under telemeterClient
but should set the environment variables in the deployment

Additional info:

This is a backport of https://bugzilla.redhat.com/show_bug.cgi?id=2116382 from 4.12 to 4.11.z. Creating manually because as seen in https://github.com/openshift/cluster-monitoring-operator/pull/1743 `/cherry-pick` doesn't work for bugs originally created in bugzilla

https://github.com/openshift/cluster-monitoring-operator/pull/1751

Bug OCPBUGS-7052: Sync jenkins-version.txt, base-plugins.txt and bundle-plugins.txt from master branch

View the Description View the linked PRs

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/jenkins/pull/1582

Bug OCPBUGS-784: [4.10] Remove `yq` curls from CI steps

View the Description View the linked PRs

Description of problem:

+++ This bug was initially created as a clone of 

Bug #2118514

Various CI steps use the upi-installer container for it's access to the
aws cli tools among other things. However, most of those steps also
curl yq directly from GitHub. We can save ourselves some headaches
when GitHub is down by just embedding the binary in the image already.

Whenever GitHub has issues or throttles us, YQ hash mismatch error out. The hash mismatch error is because github is probably returning an error page, although our scripts hide it.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/installer/pull/6285

Bug OCPBUGS-2614: e2e-gcp-builds is permafailing

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-1237~~. The following is the description of the original issue:
—
job=pull-ci-openshift-origin-master-e2e-gcp-builds=all

This test has started permafailing on e2e-gcp-builds:

[sig-builds][Feature:Builds][Slow] s2i build with environment file in sources Building from a template should create a image from "test-env-build.json" template and run it in a pod [apigroup:build.openshift.io][apigroup:image.openshift.io]

The error in the test says

Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:21 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Pulling: Pulling image "image-registry.openshift-image-registry.svc:5000/e2e-test-build-sti-env-nglnt/test@sha256:262820fd1a94d68442874346f4c4024fdf556631da51cbf37ce69de094f56fe8"
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:23 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Pulled: Successfully pulled image "image-registry.openshift-image-registry.svc:5000/e2e-test-build-sti-env-nglnt/test@sha256:262820fd1a94d68442874346f4c4024fdf556631da51cbf37ce69de094f56fe8" in 1.763914719s
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:23 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Created: Created container test
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:23 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Started: Started container test
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:24 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Pulled: Container image "image-registry.openshift-image-registry.svc:5000/e2e-test-build-sti-env-nglnt/test@sha256:262820fd1a94d68442874346f4c4024fdf556631da51cbf37ce69de094f56fe8" already present on machine
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:25 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Unhealthy: Readiness probe failed: Get "http://10.129.2.63:8080/": dial tcp 10.129.2.63:8080: connect: connection refused
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:26 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} BackOff: Back-off restarting failed container

https://github.com/openshift/origin/pull/27479

Bug OCPBUGS-4260: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/grafana/pull/93

Bug OCPBUGS-10814: Risk cache warming takes too long on channel changes

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-10671~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-10514~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-10221~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-5469~~. The following is the description of the original issue:
—
Description of problem:

When changing channels it's possible that multiple new conditional update risks will need to be evaluated. For instance, a cluster running 4.10.34 in a 4.10 channel today only has to evaluate `OpenStackNodeCreationFails` but when the channel is changed to a 4.11 channel multiple new risks require evaluation and the evaluation of new risks is throttled at one every 10 minutes. This means if there are three new risks it may take up to 30 minutes after the channel has changed for the full set of conditional updates to be computed. This leads to a perception that no update paths are recommended because most will not wait 30 minutes, they expect immediate feedback.

Version-Release number of selected component (if applicable):

4.10.z, 4.11.z, 4.12, 4.13

How reproducible:

100%

Steps to Reproduce:

1. Install 4.10.34
2. Switch from stable-4.10 to stable-4.11
3.

Actual results:

Observe no recommended updates for 10-20 minutes because all available paths to 4.11 have a risk associated with them

Expected results:

Risks are computed in a timely manner for an interactive UX, lets say < 10s

Additional info:

This was intentional in the design, we didn't want risks to continuously re-evaluate or overwhelm the monitoring stack, however we didn't anticipate that we'd have long standing pile of risks and realize how confusing the user experience would be.

We intend to work around this in the deployed fleet by converting older risks from `type: promql` to `type: Always` avoiding the evaluation period but preserving the notification. While this may lead customers to believe they're exposed to a risk they may not be, as long as the set of outstanding risks to the latest version is limited to no more than one it's likely no one will notice. All 4.10 and 4.11 clusters currently have a clear path toward relatively recent 4.10.z or 4.11.z with no more than one risk to be evaluated.

https://github.com/openshift/cluster-version-operator/pull/917

Bug OCPBUGS-3336: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/console/pull/12249

Bug OCPBUGS-3997: [4.10] provisioning interface on master node not getting ipv4 dhcp ip address from bootstrap dhcp server on OCP IPI BareMetal install

View the Description View the linked PRs

Description of problem:

Provisioning interface on master node not getting ipv4 dhcp ip address from bootstrap dhcp server on OCP 4.10.16 IPI BareMetal install.

Customer is performing an OCP 4.10.16 IPI BareMetal install and bootstrap node provisions just fine, but when master nodes are booted for provisioning, they are not getting an ipv4 address via dhcp. As such, the install is not moving forward at this point.

Version-Release number of selected component (if applicable):

OCP 4.10.16

How reproducible:

Perform OCP 4.10.16 IPI BareMetal install.

Actual results:

provisioning interface comes up (as evidenced by ipv6 address) but is not getting an ipv4 address via dhcp. OCP install / provisioning fails at this point.

Expected results:

provisioning interface successfully received an ipv4 ip address and successfully provisioned master nodes (and subsequently worker nodes as well.)

Additional info:

As a troubleshooting measure, manually adding an ipv4 ip address did allow the coreos image on the bootstrap node to be reached via curl.

Further, the kernel boot line for the first master node was updated for a static ip addresss assignment for further confirmation that the master node would successfully image this way which further confirming that the issue is the provisioning interface not receiving an ipv4 ip address from the dhcp server.

https://github.com/openshift/cluster-baremetal-operator/pull/308

Bug OCPBUGS-5692: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/machine-config-operator/pull/3500

Bug OCPBUGS-2448: Downward API (annotations) is missing PCI information when using the tuning metaPlugin on SR-IOV Networks

View the Description View the linked PRs

This bug is a backport clone of [Bugzilla Bug 2092839](https://bugzilla.redhat.com/show_bug.cgi?id=2092839). The following is the description of the original bug:
—
Description of problem:

Application of customer uses the Downward API (/etc/podnetinfo/annotations) to get the pci-addresses for SR-IOV interfaces. However, when it adds the tuning metaplugin for sysctl configuration on the SR-IOV Networks, it does not get the pci-addresses for the SR-IOV interfaces.basically customer want to fetch the PCI-ID to bind the interface with the application and dpdk driver.

Version-Release number of selected component (if applicable):

OCP 4.7

Additional info:

OCP : 4.7
Customer: Ericsson
Case: 03221272
Customer has shared all the logs attached with the case.

Please do let us know, if you are able to access the attached log in the case.

For more information, please refer the case.

https://github.com/openshift/multus-cni/pull/137

Bug OCPBUGS-4160: Fails to deprovision cluster when swift omits 'content-type'

View the Description View the linked PRs

Description of problem: This is a follow-up to ~~OCPBUGS-2795~~ and ~~OCPBUGS-2941~~.

The installer fails to destroy the cluster when the OpenStack object storage omits 'content-type' from responses. This can happen on responses with HTTP status code 204, where a reverse proxy is truncating content-related headers (see this nginX bug report). In such cases, the Installer errors with:

level=error msg=Bulk deleting of container "5ifivltb-ac890-chr5h-image-registry-fnxlmmhiesrfvpuxlxqnkoxdbl" objects failed: Cannot extract names from response with content-type: []

Listing container object suffers from the same issue as listing the containers and this one isn't fixed in latest versions of gophercloud. I've reported https://github.com/gophercloud/gophercloud/issues/2509 and fixing it with https://github.com/gophercloud/gophercloud/issues/2510, however we likely won't be able to backport the bump to gophercloud master back to release-4.8 so we'll have to look for alternatives.

I'm setting the priority to critical as it's causing all our jobs to fail in master.

Version-Release number of selected component (if applicable):

4.8.z

How reproducible:

Likely not happening in customer environments where Swift is exposed directly. We're seeing the issue in our CI where we're using a non-RHOSP managed cloud.

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/installer/pull/6628

Bug OCPBUGS-4888: [4.10] Pods completed + deleted may leak

View the Description View the linked PRs

Description of problem:

When a pod runs to a completed state, we typically rely on the update event that will indicate to us that this pod is completed. At that point the pod IP is released and the port configuration is removed in OVN. The subsequent delete event for this pod will be ignored because it should have been cleaned up in the previous update.

However, there can be cases where the update event is missed with pod completed. In this case we will only receive a delete with pod completed event, and ignore tearing down the pod. The end result is the pod is not cleaned up in OVN and the IP address remains allocated, reducing the amount of address range available to launch another pod. This can lead to exhausting all IP addresses available for pod allocation on a node.

Version-Release number of selected component (if applicable):

4.10.24

How reproducible:

Not sure how to reproduce this. I'm guessing some lag in kapi updates can cause the completed update event and the final delete event to be combined into a single event.

Steps to Reproduce:

1.
2.
3.

Actual results:

Port still exists in OVN, IP remains allocated for a deleted pod.

Expected results:

IP should be freed, port should be removed from OVN.

Additional info:

Similar bug for completed pods with "err: the range is full" fixed in 4.10.21 https://bugzilla.redhat.com/show_bug.cgi?id=2091157

https://github.com/openshift/ovn-kubernetes/pull/1458

Bug OCPBUGS-5929: [4.10] nodeport not reachable port connection timeout

View the Description View the linked PRs

Description of problem:

NodePort port not accessible

Version-Release number of selected component (if applicable):

OCP 4.8.20

How reproducible:

$oc -n ui-nprd get services -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
docker-registry ClusterIP 10.201.219.240 <none> 5000/TCP 24d app=registry
docker-registry-lb LoadBalancer 10.201.252.253 internal-xxxxxx.xx-xxxx-1.elb.amazonaws.com 5000:30779/TCP 3d22h app=registry
docker-registry-np NodePort 10.201.216.26 <none> 5000:32428/TCP 3d16h app=registry

$oc debug node/ip-xxx.ca-central-1.compute.internal
Starting pod/ip-xxx.ca-central-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.81.23.96
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# nc -vz 10.81.23.96 32428
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connection timed out.

In a new-created namespaces the same deployment works:

[RHEL7:> oc project
Using project "test-c1" on server "https://api.xx.xx.xxxx.xx.xx:6443".
[RHEL7:- ~/tmp]> oc port-forward service/docker-registry-np 5000:5000
Forwarding from 127.0.0.1:5000 -> 5000

[1]+ Stopped oc4 port-forward service/docker-registry-np 5000:5000
[RHEL7: ~/tmp]> bg %1
[1]+ oc4 port-forward service/docker-registry-np 5000:5000 &
[RHEL7: ~/tmp]> nc -v localhost 5000
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 127.0.0.1:5000.
Handling connection for 5000

[RHEL7: ~/tmp]> kill %1
[RHEL7: ~/tmp]>
[1]+ Terminated oc4 port-forward service/docker-registry-np 5000:5000
[RHEL7: ~/tmp]> oc get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
docker-registry-np NodePort 10.201.224.174 <none> 5000:31793/TCP 68s

[RHEL7: ~/tmp]> oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
registry-75b7c7fd94-rx29j 1/1 Running 0 7m5s 10.201.1.29 ip-xxx.ca-central-1.compute.internal <none> <none>
[RHEL7: ~/tmp]> oc debug node/ip-xxx.ca-central-1.compute.internal
Starting pod/ip-xxxca-central-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.81.23.87
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# nc -v 10.81.23.87 31793
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.81.23.87:31793.

Actual results:

Working on new created namespace
Not working on already created namespace

Expected results:

Suppose to work on all namespaces.

Additional info:

This cluster get upgrade from 4.7.x to 4.8 and then they manually enable OVN.
The issue was happening on all namespaces but after restarting the ovnkube-master-xxxx pods only the newly created namespaces work.

https://github.com/openshift/ovn-kubernetes/pull/1482

Bug OCPBUGS-10012: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/jenkins/pull/1639

Task MON-1872: Use upstream kube-thanos in cluster-monitoring-operator jsonnet

View the Description View the linked PRs

As per [1], the jsonnet code for managing thanos-ruler resources should reuse the upstream kube-thanos project.

[1] https://github.com/openshift/cluster-monitoring-operator/blob/399c84dbca596b611b0c30a0d2df63a5d2b0b8cc/jsonnet/components/thanos-ruler.libsonnet#L1

https://github.com/openshift/cluster-monitoring-operator/pull/1478

Bug OCPBUGS-2660: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ovn-kubernetes/pull/1326

Bug OCPBUGS-5296: Developer Topology always blanks with large contents when first rendering

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-5206~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-4897~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-2500~~. The following is the description of the original issue:
—
Description of problem:

When the Ux switches to the Dev console the topology is always blank in a Project that has a large number of components.

Version-Release number of selected component (if applicable):

How reproducible:

Always occurs

Steps to Reproduce:

1.Create a project with at least 12 components (Apps, Operators, knative Brokers)
2. Go to the Administrator Viewpoint
3. Switch to Developer Viewpoint/Topology
4. No components displayed
5. Click on 'fit to screen'
6. All components appear

Actual results:

Topology renders with all controls but no components visible (see screenshot 1)

Expected results:

All components should be visible

Additional info:

https://github.com/openshift/console/pull/12485

Bug OCPBUGS-6903: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-provider-openstack/pull/177

Bug OCPBUGS-7238: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/jenkins/pull/1587

Bug OCPBUGS-924: alertmanager-main pods failing to start due to startupprobe timeout

View the Description View the linked PRs

Description of problem:
During a fresh installation on a BareMetal platform, the monitoring cluster operator fails and becomes degraded. Further troubleshooting shows that the "alertmanagers" are not in a ready state (5/6).

Logs from the alertmanager:

level=info ts=2022-05-03T07:18:08.011Z caller=main.go:225 msg="Starting Alertmanager" version="(version=0.23.0, branch=rhaos-4.10-rhel-8, revision=0993e91aab7afce476de5c45bead4ebb8d1295a7)"
level=info ts=2022-05-03T07:18:08.011Z caller=main.go:226 build_context="(go=go1.17.5, user=root@df86d88450ef, date=20220409-10:25:31)"

alertmanager-main pods are failing to start due to startupprobe timeout, it seems related to BZ 2037073
We tried to manually increase the timers in the startupprobe, but it was not possible.

Version-Release number of selected component (if applicable):
OCP 4.10.10

How reproducible:
OCP IPI Baremetal Install on HPE ProLiant BL460c Gen10, CU tried several time to redeploy always with the same outcome.

Actual results:
CMO is not being deployed

Expected results:
CMO deploys without errors

Additional info:

CU is deploying OCP 4.10 IPI on a baremetal disconnected cluster
cluster is 3 nodes with masters schedulable

https://github.com/openshift/cluster-monitoring-operator/pull/1760

Bug OCPBUGS-3500: systemReserved:ephemeral-storage in KubeletConfig doesn't work as expected

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-2079~~. The following is the description of the original issue:
—
Description of problem:

The setting of systemReserved: ephemeral-storage in KubeletConfig is not working as expected.

Version-Release number of selected component (if applicable):

4.10.z, may exist on other OCP versions as well.

How reproducible:

always

Steps to Reproduce:

1. Create a KubeletConfig on the node:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: system-reserved-config
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/master: ""
  kubeletConfig:
    systemReserved:
      cpu: 500m
      memory: 500Mi
      ephemeral-storage: 10Gi


2. Check node allocatable storage with command: oc describe node |grep -C 5 ephemeral-storage

Actual results:

The Allocatable:ephemeral-storage on the node is not capacity.ephemeral-storage - systemReserved.ephemeral-storage - eviction-thresholds (10% of the capacity.ephemeral-storage by default)

Expected results:

The Allocatable:ephemeral-storage on the node should be capacity.ephemeral-storage - systemReserved.ephemeral-storage - eviction-thresholds (10% of the capacity.ephemeral-storage by default)

Additional info:

The root cause might be: process argument '--system-reserved=cpu=500m,memory=500Mi' overwrote the setting in /etc/kubernetes/kubelet.conf, one example:

root        6824       1 27 Sep30 ?        1-09:00:24 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/master,node.openshift.io/os_id=rhcos --node-ip=192.168.58.47 --minimum-container-ttl-duration=6m0s --cloud-provider= --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --hostname-override= --register-with-taints=node-role.kubernetes.io/master=:NoSchedule --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4a7b6408460148cb73c59677dbc2c261076bc07226c43b0c9192cc70aef5ba62 --system-reserved=cpu=500m,memory=500Mi --v=2 --housekeeping-interval=30s

https://github.com/openshift/machine-config-operator/pull/3408

Bug OCPBUGS-1815: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/console/pull/12110

Bug OCPBUGS-10750: 'oc adm upgrade channel ...' should not clobber unrecognized spec properties

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-8205~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-7960~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-7780~~. The following is the description of the original issue:
—
Description of problem:

4.9 and 4.10 oc calls to oc adm upgrade channel ... for 4.11+ clusters would clear spec.capabilities. Not all that many clusters try to restrict capabilities, but folks will need to bump their channel for at least every other minor (if their using EUS channels), and while we recommend folks use an oc from the 4.y they're heading towards, we don't have anything in place to enforce that.

Version-Release number of selected component (if applicable):

4.9 and 4.10 oc are exposed vs. the new-in-4.11 spec.capabilities. Newer oc could theoretically be exposed vs. any new ClusterVersion spec capabilities.

How reproducible:

100%

Steps to Reproduce:

1. Install a 4.11+ cluster with None capabilities.
2. Set the channel with a 4.10.51 oc, like oc adm upgrade channel fast-4.11.
3. Check the capabilities with oc get -o json clusterversion version | jq -c .spec.capabilities.

Actual results:

null

Expected results:

{"baselineCapabilitySet":"None"}

https://github.com/openshift/oc/pull/1379

Bug OCPBUGS-900: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/jenkins/pull/1561

Bug OCPBUGS-10524: Traffic from egress IPs was interrupted after Cluster patch to Openshift 4.10.46

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-8399~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-7474~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-6714~~. The following is the description of the original issue:
—
Description of problem:

Traffic from egress IPs was interrupted after Cluster patch to Openshift 4.10.46

a customer cluster was patched. It is an Openshift 4.10.46 cluster with SDN.

More description about issue is available in private comment below since it contains customer data.

https://github.com/openshift/sdn/pull/519

Bug OCPBUGS-4086: Kube-State-metrics pod fails to start due to panic

View the Description View the linked PRs

The kube-state-metric pod inside the openshift-monitoring namespace is not running as expected.

On checking the logs I am able to see that there is a memory panic

~~~
2022-11-22T09:57:17.901790234Z I1122 09:57:17.901768 1 main.go:199] Starting kube-state-metrics self metrics server: 127.0.0.1:8082
2022-11-22T09:57:17.901975837Z I1122 09:57:17.901951 1 main.go:66] levelinfomsgTLS is disabled.http2false
2022-11-22T09:57:17.902389844Z I1122 09:57:17.902291 1 main.go:210] Starting metrics server: 127.0.0.1:8081
2022-11-22T09:57:17.903191857Z I1122 09:57:17.903133 1 main.go:66] levelinfomsgTLS is disabled.http2false
2022-11-22T09:57:17.906272505Z I1122 09:57:17.906224 1 builder.go:191] Active resources: certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments
2022-11-22T09:57:17.917758187Z E1122 09:57:17.917560 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
2022-11-22T09:57:17.917758187Z goroutine 24 [running]:
2022-11-22T09:57:17.917758187Z k8s.io/apimachinery/pkg/util/runtime.logPanic(

{0x1635600, 0x2696e10})
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x7d
2022-11-22T09:57:17.917758187Z k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xfffffffe})
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x75
2022-11-22T09:57:17.917758187Z panic({0x1635600, 0x2696e10}

)
2022-11-22T09:57:17.917758187Z /usr/lib/golang/src/runtime/panic.go:1038 +0x215
2022-11-22T09:57:17.917758187Z k8s.io/kube-state-metrics/v2/internal/store.ingressMetricFamilies.func6(0x40)
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/internal/store/ingress.go:136 +0x189
2022-11-22T09:57:17.917758187Z k8s.io/kube-state-metrics/v2/internal/store.wrapIngressFunc.func1(

{0x17fe520, 0xc00063b590})
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/internal/store/ingress.go:175 +0x49
2022-11-22T09:57:17.917758187Z k8s.io/kube-state-metrics/v2/pkg/metric_generator.(*FamilyGenerator).Generate(...)
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:67
2022-11-22T09:57:17.917758187Z k8s.io/kube-state-metrics/v2/pkg/metric_generator.ComposeMetricGenFuncs.func1({0x17fe520, 0xc00063b590}

)
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:107 +0xd8
~~~

Logs are attached to the support case

https://github.com/openshift/kube-state-metrics/pull/86

Bug OCPBUGS-5077: Service spec value `externalTrafficPolicy` does not trigger rules update in ovnkube-node pod handlers on edit

View the Description View the linked PRs

Description of problem:

[ovn] [ocp 4.10.z] Service `spec.externalTrafficPolicy` does not trigger rules update in ovnkube-node pod handlers on edit, even though it does successfully update the rules if deployed explicitly with that spec value set, or if you delete the handler pods for ovn (forces a refresh).

Version-Release number of selected component (if applicable):

observed in 4.10.32 and 4.10.40, tested on azure platform.

How reproducible:

every time

Steps to Reproduce:

1. Deploy a test pod with curlable resource in a test namespace

2. create a service from yaml exposing pod at internal clusterIP (example yaml provided by customer below)
~~~ 
apiVersion: v1
kind: Service
metadata:
  labels:
    run: test
  name: test
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/azure-load-balancer-internal-subnet: paas1
spec:
  allocateLoadBalancerNodePorts: true
  externalTrafficPolicy: Cluster ##MODIFY THIS SPEC VALUE AND OBSERVE FAIL CONDITION
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    run: test
  sessionAffinity: None
  type: LoadBalancer
~~~

3. curl against service succeeds
4. edit service to change `spec.externalTrafficPolicy: local`
5. observe externalIP does not change, but healthz port updates
6. curl against same externalIP:port time out indefinitely, no response.

//workaround: 

delete service and redeploy with spec line set already to `local`, or delete ovnkube-node pod serving pod(s) to force refresh the local ruleset and allow traffic (curls subsequently will succeed).

Actual results:

spec change appears to update properly in the database but does not send a notification to update the ovnkube-node pod handlers (or similar) to allow traffic through once the externalTrafficPolicy spec value is changed.

Expected results:

spec change to service yaml should be immediately updated in DB AND update ovnkube-node handlers for same.

Additional info:

Attachments available and case number with specifics in next internal comment.

https://github.com/openshift/ovn-kubernetes/pull/1463

Bug OCPBUGS-6641: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ovn-kubernetes/pull/1492

Bug OCPBUGS-7795: thanos-ruler-user-workload-1 pod is getting repeatedly re-created after upgrade do 4.10.41

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-7590~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-7458~~. The following is the description of the original issue:
—
Description of problem:

- After upgrading to OCP 4.10.41, thanos-ruler-user-workload-1 in the openshift-user-workload-monitoring namespace is consistently being created and deleted.
- We had to scale down the Prometheus operator multiple times so that the upgrade is considered as successful.
- This fix is temporary. After some time it appears again and Prometheus operator needs to be scaled down and up again.
- The issue is present on all clusters in this customer environment which are upgraded to 4.10.41.

Version-Release number of selected component (if applicable):

How reproducible:

N/A, I wasn't able to reproduce the issue.

Steps to Reproduce:

Actual results:

Expected results:

Additional info:

https://github.com/openshift/prometheus-operator/pull/219

Bug OCPBUGS-1024: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ovn-kubernetes/pull/1271

Bug OCPBUGS-2107: unable to install IPI PRIVATE OpenShift cluster in Azure due to organization policies

View the Description View the linked PRs

This bug is a backport clone of [Bugzilla Bug 2056519](https://bugzilla.redhat.com/show_bug.cgi?id=2056519). The following is the description of the original bug:
—
Version: 4.9

Platform:
azure

Please specify:

IPI

What happened?
Issue: Customer reports unable to install IPI PRIVATE OpenShift cluster in Azure. They have orgnization policy which do not allow them to create storage account with public access. It should be disallowed.

What did you expect to happen?
Installer completes successfully.

https://github.com/openshift/cluster-image-registry-operator/pull/804

Bug OCPBUGS-6575: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/telemeter/pull/446

Bug OCPBUGS-920: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/image-registry/pull/344

Task IR-227: Remove legacy code for platformStatus

View the Description View the linked PRs

Before platformStatus, the operator used to get information about AWS and GCP from the install-config config map. This code can be removed.

https://github.com/openshift/cluster-image-registry-operator/pull/739

Bug OCPBUGS-1828: CI: Backend unit tests fails because devfile registry was updated (fix assertion)

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-1786~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-1677~~. The following is the description of the original issue:
—
Description of problem:
pkg/devfile/sample_test.go fails after devfile registry was updated (https://github.com/devfile/registry/pull/126)

This issue is about updating our assertion so that the CI job runs successfully again. We might want to backport this as well.

~~OCPBUGS-1678~~ is about updating the code that the test should use a mock response instead of the latest registry content OR check some specific attributes instead of comparing the full JSON response.

Version-Release number of selected component (if applicable):
4.12

How reproducible:
Always

Steps to Reproduce:
1. Clone openshift/console
2. Run ./test-backend.sh

Actual results:
Unit tests fail

Expected results:
Unit tests should pass again

Additional info:

https://github.com/openshift/console/pull/12112

Bug OCPBUGS-5788: ClusterResourceQuota values are not reflecting.

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-723~~. The following is the description of the original issue:
—
Description of problem:
I have a customer who created clusterquota for one of the namespace, it got created but the values were not reflecting under limits or not displaying namespace details.
~~~
$ oc describe AppliedClusterResourceQuota
Name: test-clusterquota
Created: 19 minutes ago
Labels: size=custom
Annotations: <none>
Namespace Selector: []
Label Selector:
AnnotationSelector: map[openshift.io/requester:system:serviceaccount:application-service-accounts:test-sa]
Scopes: NotTerminating
Resource Used Hard
-------- ---- ----
~~~

WORKAROUND: They recreated the clusterquota object (cache it off, delete it, create new) after which it displayed values as expected.

In the past, they saw similar behavior on their test cluster, there it was heavily utilized the etcd DB was much larger in size (>2.5Gi), and had many more objects (at that time, helm secrets were being cached for all deployments, and keeping a history of 10, so etcd was being bombarded).

This cluster the same "symptom" was noticed however etcd was nowhere near that in size nor the amount of etcd objects and/or helm cached secrets.

Version-Release number of selected component (if applicable): OCP 4.9

How reproducible: Occurred only twice(once in test and in current cluster)

Steps to Reproduce:
1. Create ClusterQuota
2. Check AppliedClusterResourceQuota
3. The values and namespace is empty

Actual results: ClusterQuota should display the values

Expected results: ClusterQuota not displaying values

https://github.com/openshift/cluster-policy-controller/pull/97

Bug OCPBUGS-400: Disable xmlstarlet code to fix current CVEs [release-4.10]

View the Description View the linked PRs

+++ This bug was initially created as a clone of Bug #2117811 +++

Description of problem:
We are currently unable to merge any pull requests to fix CVEs because of the use of the xmlstarlet command line utility which is not currently packaged for and available for RHEL.

Version-Release number of selected component (if applicable):

How reproducible:
ALways

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Another bug will be opened to reenable this code once the xmlstarlet package is available in RHEL or we find an alternative fix.

https://github.com/openshift/jenkins/pull/1469

Bug OCPBUGS-5831: Empty/missing node-sizing SYSTEM_RESERVED_ES parameter can result in kubelet not starting

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-4945~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-4805~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-4101~~. The following is the description of the original issue:
—
Description of problem:

We experienced two separate upgrade failures relating to the introduction of the SYSTEM_RESERVED_ES node sizing parameter, causing kubelet to stop running.

One cluster (clusterA) upgraded from 4.11.14 to 4.11.17. It experienced an issue whereby
/etc/node-sizing.env
on its master nodes contained an empty SYSTEM_RESERVED_ES value:

---
cat /etc/node-sizing.env
SYSTEM_RESERVED_MEMORY=5.36Gi
SYSTEM_RESERVED_CPU=0.11
SYSTEM_RESERVED_ES=
---

causing the kubelet to not start up. To restore service, this file was manually updated to set a value (1Gi), and kubelet was restarted.

We are uncertain what conditions led to this occuring on the clusterA master nodes as part of the upgrade.

A second cluster (clusterB) upgraded from 4.11.16 to 4.11.17. It experienced an issue whereby worker nodes were impacted by a similar problem, however this was because a custom node-sizing-enabled.env MachineConfig which did not set SYSTEM_RESERVED_ES

This caused existing worker nodes to go into a NotReady state after the ugprade, and additionally new nodes did not join the cluster as their kubelet would become impacted.

For clusterB the conditions are more well-known of why the value is empty.

However, for both clusters, if SYSTEM_RESERVED_ES ends up as empty on a node it can cause the kubelet to not start.

We have some asks as a result:
- Can MCO be made to recover from this situation if it occurs, perhaps through application of a safe default if none exists, such that kubelet would start correctly?
- Can there possibly be alerting that could indicate and draw attention to the misconfiguration?

Version-Release number of selected component (if applicable):

4.11.17

How reproducible:

Have not been able to reproduce it on a fresh cluster upgrading from 4.11.16 to 4.11.17

Expected results:

If SYSTEM_RESERVED_ES is empty in /etc/node-sizing*env then a default should be applied and/or kubelet able to continue running.

Additional info:

https://github.com/openshift/machine-config-operator/pull/3487

Story CONSOLE-2768: console-operator should use bindata instead of inlining manifests

View the Description View the linked PRs

console-operator codebase contains a lot of inline manifests. Instead we should put those manifests into a `/bindata` folder, from which they will be read and then updated per purpose.

https://github.com/openshift/console-operator/pull/550

Bug OCPBUGS-7741: [release-4.11] fail early on missing node status envs

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-7510~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-7373~~. The following is the description of the original issue:
—
Originally reported by lance5890 in issue https://github.com/openshift/cluster-etcd-operator/issues/1000

Under some circumstances the static pod machinery fails to populate the node status in time to generate the correct env variables for ETCD_URL_HOST, ETCD_NAME etc. The pods that come up will fail to accept those variables.

This is particularly pronounced in SNO topologies, leading to installation failures.

The fix is to fail fast in the targetconfig/envvar controller to ensure the CEO goes degraded instead of silently failing on the rollout of an invalid static pod.

https://github.com/openshift/cluster-etcd-operator/pull/1008

Bug OCPBUGS-3697: [release-4.10] Update OWNERS file to reflect current team members

View the Description View the linked PRs

The OWNERS file for multiple branches in the openshift/jenkins repository need to be updated to reflect current team members for approvals.

https://github.com/openshift/jenkins/pull/1521

Bug OCPBUGS-1454: [release-4.10] Unable to create Network Policies: error: unexpectedly found multiple equivalent ACLs (arp v/s arp||nd) (ns_netpol1 v/s ns_netpol2)

View the Description View the linked PRs

Description of problem:

Intended to backport the corresponding https://bugzilla.redhat.com/show_bug.cgi?id=2095852 which has been fixed already for this version.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

Bug OCPBUGS-3434: Pod logs: Long lines are corrupted when using timestamps=true

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-3174~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-3117~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-3084~~. The following is the description of the original issue:
—
Upstream Issue: https://github.com/kubernetes/kubernetes/issues/77603

Long log lines get corrupted when using '--timestamps' by the Kubelet.

The root cause is that the buffer reads up to a new line. If the line is greater than 4096 bytes and '--timestamps' is turrned on the kubelet will write the timestamp and the partial log line. We will need to refactor the ReadLogs function to allow for a partial line read.

https://github.com/kubernetes/kubernetes/blob/f892ab1bd7fd97f1fcc2e296e85fdb8e3e8fb82d/pkg/kubelet/kuberuntime/logs/logs.go#L325

apiVersion: v1
kind: Pod
metadata:
  name: logs
spec:
  restartPolicy: Never
  containers:
  - name: logs
    image: fedora
    args:
    - bash
    - -c
    - 'for i in `seq 1 10000000`; do echo -n $i; done'

kubectl logs logs --timestamps

https://github.com/openshift/kubernetes/pull/1415

Bug OCPBUGS-377: downloading govc is impacted by github rate limiting

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-262~~. The following is the description of the original issue:
—
github rate limit failures for upi image downloading govc.

https://search.ci.openshift.org/?search=build+error%3A+error+building+at+STEP+%22RUN+curl+-L+-o&maxAge=48h&context=1&type=all&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

https://github.com/openshift/installer/pull/6245

Bug OCPBUGS-5977: Implement LIST call chunking in openshift-sdn

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-4607~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-4422~~. The following is the description of the original issue:
—
This bug is a backport clone of [Bugzilla Bug 2050230](https://bugzilla.redhat.com/show_bug.cgi?id=2050230). The following is the description of the original bug:
—
Description of problem:
In a large cluster, sdn daemonset can DoS the kube-apiserver with un-paginated LIST calls on high count resources.

Version-Release number of selected component (if applicable):

How reproducible:
NA

Steps to Reproduce:
NA

Actual results:
Kube API Server and Openshift API Server in one of the cluster keeps restarting, without proper exception. The cluster is not accessible.

Expected results:
Kube API Server and Openshift API Server should be stable.

Additional info:

https://github.com/openshift/sdn/pull/494

Bug OCPBUGS-620: oVirt CSI driver should use latest go-ovirt-client

View the Description View the linked PRs

Description of problem:

To avoid any potential bugs, the oVirt CSI driver should use the latest go-ovirt-client, preferably the tagged 1.0.0 version.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/ovirt-csi-driver/pull/116

Bug OCPBUGS-938: machineConfigPool without nodes assigned stuck at 0% progress

View the Description View the linked PRs

Description of problem:

When a custom machineConfigPool is created and no node is associated with it, the mcp remains at 0% progress.

Version-Release number of selected component (if applicable):

How reproducible:

100%

Steps to Reproduce:

1. Create a custom mcp:
~~~
cat << EOF | oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: custom
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,custom]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/custom: "" 
EOF
~~~

Actual results:

The mcp is visible from the from "Administrator view > Cluster Settings > Details" at 0% progress

Expected results:

It shouldn't be stuck at 0%

Additional info:

https://github.com/openshift/console/pull/12061

Bug OCPBUGS-5961: Add support for API version v1beta1 for knativeServing and knativeEventing

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-5258~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-5191~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-5164~~. The following is the description of the original issue:
—
Description of problem:

It looks like the ODC doesn't register KNATIVE_SERVING and KNATIVE_EVENTING flags. Those are based on KnativeServing and KnativeEventing CRs, but they are looking for v1alpha1 version of those: https://github.com/openshift/console/blob/f72519fdf2267ad91cc0aa51467113cc36423a49/frontend/packages/knative-plugin/console-extensions.json#L6-L8
This PR https://github.com/openshift-knative/serverless-operator/pull/1695 moved the CRs to v1beta1, and that breaks that ODC discovery.

Version-Release number of selected component (if applicable):

Openshift 4.8, Serverless Operator 1.27

Additional info:

https://coreos.slack.com/archives/CHGU4P8UU/p1671634903447019

https://github.com/openshift/console/pull/12441

Bug OCPBUGS-6702: The MCO can generate a rendered config with old KubeletConfig contents, blocking upgrades

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-6622~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-6018~~. The following is the description of the original issue:
—
This is a public clone of OCPBUGS-3821

The MCO can sometimes render a rendered-config in the middle of an upgrade with old MCs, e.g.:

the containerruntimeconfigcontroller creates a new containerruntimeconfig due to the update
the template controller finishes re-creating the base configs
the kubeletconfig errors long enough and doesn't finish until after 2

This will cause the render controller to create a new rendered MC that uses the OLD kubeletconfig-MC, which at best is a double reboot for 1 node, and at worst block the update and break maxUnavailable nodes per pool.

https://github.com/openshift/machine-config-operator/pull/3517

Bug OCPBUGS-10912: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/sdn/pull/524

Bug OCPBUGS-3968: must-gather namespace should have “privileged” warn and audit pod security labels besides enforce

View the Description View the linked PRs

Description of problem:

https://bugzilla.redhat.com/show_bug.cgi?id=2103126

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/oc/pull/1293

Bug OCPBUGS-2185: Bump Jenkins version to 2.361.1

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-1942~~. The following is the description of the original issue:
—
Description of problem:

Bump Jenkins version to 2.361.1 and also test the images built by running verify-jenkins.sh script. This script verifies the jenkins versions and plugin in an image. Verify script is present at https://gist.githubusercontent.com/coreydaley/fbf11d3b1a7a567f8c494da6a07bad41/raw/80e569131479c212d5e023bc41ce26fb15a17752/verify-jenkins.sh

Version-Release number of selected component (if applicable):

2.361.1

Additional info:

Verify script is present at https://gist.githubusercontent.com/coreydaley/fbf11d3b1a7a567f8c494da6a07bad41/raw/80e569131479c212d5e023bc41ce26fb15a17752/verify-jenkins.sh

https://github.com/openshift/jenkins/pull/1499

Bug OCPBUGS-2943: Fix sshd embedded version

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-2250~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-2099~~. The following is the description of the original issue:
—
Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/jenkins/pull/1509

Task IR-224: Bump openshift/api package

View the Description View the linked PRs

Acceptance criteria:

All tests (including e2e) pass
No regressions are introduced
openshift/api points to a recent commit on the master branch

https://github.com/openshift/cluster-image-registry-operator/pull/728

Bug OCPBUGS-2795: Fail to deprovision cluster on 4.10 and below when swift speaks HTTP2

View the Description View the linked PRs

Description of problem:

We're seeing CI fail to deprovision the cluster on OpenStack.

Deprovisioning cluster ...
/tmp/secret/metadata.json
Copying the installation artifacts to the Installer's asset directory...
Running the Installer's 'destroy cluster' command...
level=error msg=Cannot extract names from response with content-type: []
level=error msg=Cannot extract names from response with content-type: []
level=error msg=Cannot extract names from response with content-type: []
level=error msg=Cannot extract names from response with content-type: []
level=error msg=Cannot extract names from response with content-type: []
level=error msg=Cannot extract names from response with content-type: []
level=error msg=Cannot extract names from response with content-type: []
level=error msg=Cannot extract names from response with content-type: []
level=error msg=Cannot extract names from response with content-type: []
level=error msg=Cannot extract names from response with content-type: []
level=error msg=Cannot extract names from response with content-type: []
level=error msg=Cannot extract names from response with content-type: []
level=error msg=Cannot extract names from response with content-type: []
level=error msg=Cannot extract names from response with content-type: []
level=error msg=Cannot extract names from response with content-type: []
level=error msg=Cannot extract names from response with content-type: []
level=error msg=Cannot extract names from response with content-type: []
{"component":"entrypoint","file":"k8s.io/test-infra/prow/entrypoint/run.go:164","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 1h0m0s timeout","severity":"error","time":"2022-10-23T16:40:20Z"}
Copying the Installer logs to the artifacts directory...
{"component":"entrypoint","file":"k8s.io/test-infra/prow/entrypoint/run.go:254","func":"k8s.io/test-infra/prow/entrypoint.gracefullyTerminate","level":"error","msg":"Process did not exit before 10m0s grace period","severity":"error","time":"2022-10-23T16:50:20Z"}
{"component":"entrypoint","error":"os: process already finished","file":"k8s.io/test-infra/prow/entrypoint/run.go:256","func":"k8s.io/test-infra/prow/entrypoint.gracefullyTerminate","level":"error","msg":"Could not kill process after grace period","severity":"error","time":"2022-10-23T16:50:20Z"}

A quick search shows this only affects OpenStack clusters, for OCP versions <= 4.10:
https://search.ci.openshift.org/?search=Cannot+extract+names+from+response+with+content-type&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Version-Release number of selected component (if applicable):

4.7 to 4.10

How reproducible:

Always? To be confirmed. It seems to have started happening sometime in Aug.

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/installer/pull/6529

Bug OCPBUGS-11236: Bump Jenkins to 2.387.1

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-11163~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-10976~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-10934~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-10917~~. The following is the description of the original issue:
—
Description of problem:

Product security has set a required Jenkins version to 2.387.1 for June 6th, 2023

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/jenkins/pull/1642

Bug OCPBUGS-6645: [release-4.10] openshift4/ose-jenkins:v4.10.0 run script throws too many arguments error

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-5947~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-4833~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-4819~~. The following is the description of the original issue:
—
Description of problem:

s2i/run script has a bug - /usr/libexec/s2i/run: line 578: [: too many arguments

Version-Release number of selected component (if applicable):

v4.10

How reproducible:

Start jenkins from the container image using /usr/libexec/s2i/run while having a route that contains a certificate or key that includes special characters.

Steps to Reproduce:

1. create a route that contains a TLS certificate
2. start a pod using openshift4/ose-jenkins:v4.10.0
3. view the log

Actual results:

2022/12/12 17:30:33 [go-init] No pre-start command defined, skip
2022/12/12 17:30:33 [go-init] Main command launched : /usr/libexec/s2i/run
CONTAINER_MEMORY_IN_MB='12288', using /usr/lib/jvm/java-11-openjdk-11.0.16.0.8-1.el8_4.x86_64/bin/java and /usr/lib/jvm/java-11-openjdk-11.0.16.0.8-1.el8_4.x86_64/bin/javac
Administrative monitors that contact the update center will remain active
Migrating slave image configuration to current version tag ...
/usr/libexec/s2i/run: line 578: [: too many arguments
Using JENKINS_SERVICE_NAME=jenkins
Generating jenkins.model.JenkinsLocationConfiguration.xml using (/var/lib/jenkins/jenkins.model.JenkinsLocationConfiguration.xml.tpl) ...
Jenkins URL set to: https://bojenkinsdev.micron.com/ in file: /var/lib/jenkins/jenkins.model.JenkinsLocationConfiguration.xml
+ exec java -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:+ParallelRefProcEnabled -XX:+UseG1GC -XX:+UseStringDeduplication -XX:HeapDumpPath=/var/log/jenkins/ '-Xlog:gc*=debug:file=/var/log/jenkins-engserv/gc-%t.log:utctime:filecount=2,filesize=100m' -Xms2g -Xmx8g -Dfile.encoding=UTF8 -Djavamelody.displayed-counters=log,error -Djava.util.logging.config.file=/var/lib/jenkins/logging.properties -Djavax.net.ssl.trustStore=/var/lib/jenkins/ca-anchors-keystore -Dcom.redhat.fips=false -Djdk.http.auth.tunneling.disabledSchemes= -Djdk.http.auth.proxying.disabledSchemes= -Duser.home=/var/lib/jenkins -Djavamelody.application-name=jenkins -Dhudson.security.csrf.GlobalCrumbIssuerConfiguration.DISABLE_CSRF_PROTECTION=true -Djenkins.install.runSetupWizard=false -Dhudson.security.csrf.GlobalCrumbIssuerConfiguration.DISABLE_CSRF_PROTECTION=false -XX:+AlwaysPreTouch -XX:ErrorFile=/var/log/jenkins-engserv -Dhudson.model.ParametersAction.keepUndefinedParameters=false -jar /usr/lib/jenkins/jenkins.war --prefix=/je...
Picked up JAVA_TOOL_OPTIONS: -XX:+UnlockExperimentalVMOptions -Dsun.zip.disableMemoryMapping=true

Expected results:

not have the following error:
/usr/libexec/s2i/run: line 578: [: too many arguments

Additional info:

https://github.com/openshift/jenkins/pull/1562

Bug OCPBUGS-8314: Bump to kubernetes 1.23.17

View the Description View the linked PRs

This fix contains the following changes coming from updated version of kubernetes up to v1.23.17:

Changelog:

v1.23.17: https://github.com/kubernetes/kubernetes/blob/release-1.23/CHANGELOG/CHANGELOG-1.23.md#changelog-since-v12316
v1.23.16: https://github.com/kubernetes/kubernetes/blob/release-1.23/CHANGELOG/CHANGELOG-1.23.md#changelog-since-v12315
v1.23.15: https://github.com/kubernetes/kubernetes/blob/release-1.23/CHANGELOG/CHANGELOG-1.23.md#changelog-since-v12314
v1.23.14: https://github.com/kubernetes/kubernetes/blob/release-1.23/CHANGELOG/CHANGELOG-1.23.md#changelog-since-v12313
v1.23.13: https://github.com/kubernetes/kubernetes/blob/release-1.23/CHANGELOG/CHANGELOG-1.23.md#changelog-since-v12312

https://github.com/openshift/kubernetes/pull/1497

Bug OCPBUGS-2553: [release-4.10] member loses rights after some other user login in openid / group sync

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-533~~. The following is the description of the original issue:
—
Description of problem:

customer is using Azure AD as openid provider and groups synchronization from the provider.

The scenario is the following:

1)

user A login.
groups are created with the membership.
User A is member of a group with admin rights and it's cluster-admin

2)

user B login:
groups are updated with membership
UserB is also member of the group with admin rights and it's cluster admin

3)

user A login:
groups are identical as in the former step.
user A has no administration rights.

The groups memberships are the same in step 2 and 3.
The cluster role bindings of the groups have never changed.

the only way to have user A again the admin rights is to delete the membership from the group and have user A login again.

I have not managed to reproduce this using RH SSO. Neither Azure AD.

But my configuration is not exactly the same yet.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/oauth-apiserver/pull/85

Bug OCPBUGS-2037: OpenShift webconsole , metrics tab does not display data

View the Description View the linked PRs

Description of problem:

After upgrading the cluster to version 4.10.32 the web console and metrics tab  does not display any data.

we have to restart the thanos pods to get the functionality back to running state


Version-Release number of selected component (if applicable):

4.10.32

How reproducible:

NA

Steps to Reproduce:

NA

Actual results:

The data is not getting displayed and monitoring of the cluster is hampered

Expected results:

The data should be displayed properly without having to restart the pods

Additional info:

Bug OCPBUGS-8359: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/machine-api-operator/pull/1125

Bug OCPBUGS-7021: [OVN] Egress traffic not balanced when spec.egressIPs has 2 IPs

View the Description View the linked PRs

Description of problem:

After an incident occurred on one of the 2 egress nodes the router policies in the ovn_cluster_router are not correct, and are not reconciled making the pods created before the incident to use only one egress node, while new pods use both.

Version-Release number of selected component (if applicable):

4.10.40

How reproducible:

Not know, visible at the customer cluster after an incident.

Steps to Reproduce:

1.
2.
3.

Actual results:

Pods created before the incident are not using one of the 2 egress nodes.

Expected results:

The configuration should be reconciled to makes the old pods to use all the egress nodes configured.

Additional info:

Could it be a regression of https://issues.redhat.com/browse/OCPBUGSM-33570 fix in 4.10.3?

https://github.com/openshift/ovn-kubernetes/pull/1542

Bug OCPBUGS-11499: Mailer Plugin (mailer) has breaking changes in newest version

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-11489~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-11348~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-11329~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-11158~~. The following is the description of the original issue:
—
The Mailer Plugin (mailer) version 435.438.v5b_81173f5b_a_1 is not compatible with the Pipeline: Basic Steps (workflow-basic-steps) plugin version 2.20.

Both plugins need to be updated to newer versions at the same time per https://github.com/jenkinsci/mailer-plugin/releases/tag/435.v79ef3972b_5c7

https://github.com/openshift/jenkins/pull/1658

Bug OCPBUGS-3943: Whereabouts CNI timesout while iterating exclude range [backport 4.10]

View the Description View the linked PRs

Description of problem:

When creating a pod with an additional network that contains a `spec.config.ipam.exclude` range, any address within the excluded range is still iterated while searching for a suitable IP candidate. As a result, pod creation times out when large exclude ranges are used.

Version-Release number of selected component (if applicable):

How reproducible:

with big exclude ranges, 100%

Steps to Reproduce:

1. create network-attachment-definition with a large range:

$ cat <<EOF| oc apply -f -       
apiVersion: k8s.cni.cncf.io/v1                                            
kind: NetworkAttachmentDefinition
metadata:
  name: nad-w-excludes
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "name": "macvlan-net",
      "type": "macvlan",
      "master": "ens3",
      "mode": "bridge",
      "ipam": {
         "type": "whereabouts",
         "range": "fd43:01f1:3daa:0baa::/64",
         "exclude": [ "fd43:01f1:3daa:0baa::/100" ],
         "log_file": "/tmp/whereabouts.log",
         "log_level" : "debug"
      }
    }
EOF
2. create a pod with the network attached:

$ cat <<EOF|oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-exclude-range
  annotations:
    k8s.v1.cni.cncf.io/networks: nad-w-excludes
spec:
  containers:
  - name: pod-1
    image: openshift/hello-openshift
EOF

3. check pod status, event log and whereabouts logs after a while: 

$ oc get pods
NAME                        READY   STATUS              RESTARTS   AGE
pod-with-exclude-range      0/1     ContainerCreating   0          2m23s

$ oc get events
<...>
6m39s       Normal    Scheduled                                    pod/pod-with-exclude-range                   Successfully assigned default/pod-with-exclude-range to <worker-node>
6m37s       Normal    AddedInterface                               pod/pod-with-exclude-range                   Add eth0 [10.129.2.49/23] from openshift-sdn
2m39s       Warning   FailedCreatePodSandBox                       pod/pod-with-exclude-range                   Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded

$ oc debug node/<worker-node> - tail /host/tmp/whereabouts.log
Starting pod/<worker-node>-debug ...
To use host binaries, run `chroot /host`
2022-10-27T14:14:50Z [debug] Finished leader election
2022-10-27T14:14:50Z [debug] IPManagement: {fd43:1f1:3daa:baa::1 ffffffffffffffff0000000000000000} , <nil>
2022-10-27T14:14:59Z [debug] Used defaults from parsed flat file config @ /etc/kubernetes/cni/net.d/whereabouts.d/whereabouts.conf
2022-10-27T14:14:59Z [debug] ADD - IPAM configuration successfully read: {Name:macvlan-net Type:whereabouts Routes:[] Datastore:kubernetes Addresses:[] OmitRanges:[fd43:01f1:3daa:0baa::/80] DNS: {Nameservers:[] Domain: Search:[] Options:[]} Range:fd43:1f1:3daa:baa::/64 RangeStart:fd43:1f1:3daa:baa:: RangeEnd:<nil> GatewayStr: EtcdHost: EtcdUsername: EtcdPassword:********* EtcdKeyFile: EtcdCertFile: EtcdCACertFile: LeaderLeaseDuration:1500 LeaderRenewDeadline:1000 LeaderRetryPeriod:500 LogFile:/tmp/whereabouts.log LogLevel:debug OverlappingRanges:true SleepForRace:0 Gateway:<nil> Kubernetes: {KubeConfigPath:/etc/kubernetes/cni/net.d/whereabouts.d/whereabouts.kubeconfig K8sAPIRoot:} ConfigurationPath:PodName:pod-with-exclude-range PodNamespace:default} 
2022-10-27T14:14:59Z [debug] Beginning IPAM for ContainerID: f4ffd0e07d6c1a2b6ffb0fa29910c795258792bb1a1710ff66f6b48fab37af82
2022-10-27T14:14:59Z [debug] Started leader election
2022-10-27T14:14:59Z [debug] OnStartedLeading() called
2022-10-27T14:14:59Z [debug] Elected as leader, do processing
2022-10-27T14:14:59Z [debug] IPManagement - mode: 0 / containerID:f4ffd0e07d6c1a2b6ffb0fa29910c795258792bb1a1710ff66f6b48fab37af82 / podRef: default/pod-with-exclude-range
2022-10-27T14:14:59Z [debug] IterateForAssignment input >> ip: fd43:1f1:3daa:baa:: | ipnet: {fd43:1f1:3daa:baa:: ffffffffffffffff0000000000000000} | first IP: fd43:1f1:3daa:baa::1 | last IP: fd43:1f1:3daa:baa:ffff:ffff:ffff:ffff

Actual results:

Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Expected results:

additional network gets attached to the pod

Additional info:

https://github.com/openshift/whereabouts-cni/pull/105

Bug OCPBUGS-4830: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/machine-config-operator/pull/3278

Bug OCPBUGS-6736: [2072601] ovn-nbctl.log is never rotated

View the Description View the linked PRs

Description of problem:

Description of problem:
The startup script for ovnkube-master container starts ovn-nbctl daemon that logs to ovn-nbctl.log file[1]. However, apart from tailing it, nothing is done with that file. Concretely, it is never rotated.
Recently, one 4.8 cluster has hit the situation where this file grew up to ~24GB size. It was necessary to manually remove that file and restart ovnkube-master.
Version-Release number of selected component (if applicable):
4.8. It seems that 4.10 can hit this problem as well[2] but latest master branch has changed so much that I am not sure if it will be possible for this to happen in 4.11.
How reproducible:
Always (but may take very long and requires no ovnkube-master container restart)
Steps to Reproduce:
1. Let /run/ovn/ovn-nbctl.log grow for long enough
Actual results:
File growing without control
Expected results:
Log file to be eventually rotated.
References:
[1] - https://github.com/openshift/cluster-network-operator/blob/release-4.8/bindata/network/ovn-kubernetes/ovnkube-master.yaml#L813
[2] - https://github.com/openshift/cluster-network-operator/blob/release-4.10/bindata/network/ovn-kubernetes/ovnkube-master.yaml#L750

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/cluster-network-operator/pull/1587

Bug OCPBUGS-2202: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-kube-apiserver-operator/pull/1387

Bug OCPBUGS-10491: Prometheus continuously restarts due to slow WAL replay

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-4640~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-4489~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-4168~~. The following is the description of the original issue:
—
Description of problem:

Prometheus continuously restarts due to slow WAL replay

Version-Release number of selected component (if applicable):

openshift - 4.11.13

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/cluster-monitoring-operator/pull/1920

Bug OCPBUGS-4578: Origin tests for bonds - 4.10 backport

View the Description View the linked PRs

Origin tests for the bond-cni

Backport of https://github.com/openshift/origin/pull/27405

https://github.com/openshift/origin/pull/27593

Bug OCPBUGS-6972: Image registry Operator does not use Proxy when connecting to openstack

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-6907~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-6517~~. The following is the description of the original issue:
—
Description of problem:

When the cluster is configured with Proxy the swift client in the image registry operator is not using the proxy to authenticate with OpenStack, so it's unable to reach the OpenStack API. This issue became evident since recently the support was added to not fallback to cinder in case swift is available[1].

[1]https://github.com/openshift/cluster-image-registry-operator/pull/819

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1. Deploy a cluster with proxy and restricted installation
2. 
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/cluster-image-registry-operator/pull/837

Bug OCPBUGS-11239: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/jenkins/pull/1642

Bug OCPBUGS-2279: [release-4.10] Add ccoctl option to create private s3 bucket with OIDC configurations served through public CloudFront URL

View the Description View the linked PRs

This bug represents a backport of ~~CCO-222~~ to release-4.10.

https://github.com/openshift/cloud-credential-operator/pull/501

Bug OCPBUGS-11238: Jenkins images based on rhel8 are wrongly tagged with rhel7

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-11182~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-8497~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-8442~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-8437~~. The following is the description of the original issue:
—
Description of problem:

Jenkins images based on rhel8 are wrongly tagged with rhel7

Version-Release number of selected component (if applicable):

How reproducible:

Always

Steps to Reproduce:

1.Check the OS version.
$ podman run -it --rm --entrypoint bash quay.io/openshift/origin-jenkins-agent-base:latest
bash-4.4# cat /etc/redhat-release 
Red Hat Enterprise Linux release 8.6 (Ootpa)

2.Check the image labels.
$ podman inspect quay.io/openshift/origin-jenkins-agent-base:latest | grep rhel7
                    "com.redhat.component": "jenkins-slave-base-rhel7-container",
                    "name": "openshift4/jenkins-slave-base-rhel7",
               "com.redhat.component": "jenkins-slave-base-rhel7-container",
               "name": "openshift4/jenkins-slave-base-rhel7",

Actual results:

OS is rhel8, but labels are rhel7

Expected results:

Labels should match the OS version

Additional info:

https://github.com/openshift/jenkins/pull/1643

Bug OCPBUGS-957: [2087981] PowerOnVM_Task is deprecated use PowerOnMultiVM_Task for DRS ClusterRecommendation

View the Description View the linked PRs

Description of problem:

Sometimes we see VMs fail to power on when the land on a host that does not have enough resources. The current power on does not retry or leverage DRS to power on the node on a suitable host.

https://github.com/vmware/govmomi/issues/1026

Our code is still making calls to PowerOnVM_Task which, according to the vsphere docs, is deprecated and we should use PowerOnMultiVM_Task instead.

PowerOnVM_Task does not return a DRS ClusterRecommendation, no vmotion nor host power operations will be done as part of a DRS-facilitated power on. To have DRS consider such operations use PowerOnMultiVM_Task.

https://vdc-download.vmware.com/vmwb-repository/dcr-public/b50dcbbf-051d-4204-a3e7-e1b618c1e384/538cf2ec-b34f-4bae-a332-3820ef9e7773/vim.VirtualMachine.html#powerOn:
As of vSphere API 5.1, use of this method with vCenter Server is deprecated; use PowerOnMultiVM_Task instead.

Version-Release number of selected component (if applicable):
4.8.x

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:
Sometimes powers on fails requiring manual intervention.

Expected results:
PowerOn should use DRS to ensure it's always successful.

Additional info:

https://github.com/openshift/machine-api-operator/pull/1065

Bug OCPBUGS-1037: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/jenkins/pull/1480

Bug OCPBUGS-5040: [release-4.10] unit-tests flaking on 4.10

View the Description View the linked PRs

Description of problem:

Unit-tests flaking on 4.10 PRs

Version-Release number of selected component (if applicable):

How reproducible:

Sometimes

Steps to Reproduce:

1.
2.
3.

Actual results:

Unit-test job fails with the following error: 

[Fail] OVN Pod Operations during execution [It] should not deallocate in-use and previously freed completed pods IP 
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/pods_test.go:560
 Ran 153 of 153 Specs in 215.549 seconds
 FAIL! -- 152 Passed | 1 Failed | 0 Pending | 0 Skipped
 --- FAIL: TestClusterNode (216.05s)
 FAIL	github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn	228.843s
  {"component":"entrypoint","error":"wrapped process failed: exit status 2","file":"k8s.io/test-infra/prow/entrypoint/run.go:79","func":"k8s.io/test-infra/prow/entrypoint.Options.Run","level":"error","msg":"Error executing test process","severity":"error","time":"2022-12-17T0

Expected results:

All tests pass

Additional info:

https://github.com/openshift/ovn-kubernetes/pull/1462

Story CONSOLE-2975: Migrate from Node Sass to Dart Sass

View the Description View the linked PRs

Node Sass is deprecated. See https://github.com/sass/node-sass

https://github.com/openshift/console/pull/10149

Bug OCPBUGS-450: KubeDaemonSetRolloutStuck alert using incorrect metric in 4.9 and 4.10

View the Description View the linked PRs

Description of problem:

kube_daemonset_updated_number_scheduled got renamed to kube_daemonset_status_updated_number_scheduled in 4.9, but the definition of KubeDaemonSetRolloutStuck didn't get updated up until 4.11 https://github.com/openshift/cluster-monitoring-operator/commit/35ffa690cec23d6a708aafea36a2d8b77f8a8556 .

Version-Release number of selected component (if applicable):

How reproducible:

Always

We suggest backporting the fix at least to 4.10, as it can affect both customers and ours ability to detect and troubleshoot certain issues, both from single-cluster as well as from the fleet-wide longer-term trends perspective point of view.

https://github.com/openshift/cluster-monitoring-operator/pull/1802

Task MON-1964: Make Telemeter receive endpoint request limit configurable

View the Description View the linked PRs

Currently, Telemeter is not equipped with configurable request limit for receive endpoint (for full context see: https://github.com/openshift/cluster-monitoring-operator/pull/1416). It is using the default limit defined in the code base, however it seems this limit might not be suitable for our usage.

As a part of this ticket, it should be:

1) Understood what is the appropriate limit for request size for our use cases

2) Make the limit configurable in Telemeter via a flag

3) Deploy the changes, initially to the staging environment, to enable our team to test it.

Bug OCPBUGS-5345: WriteRequestBodies audit profile records routes/status events at RequestResponse level

View the Description View the linked PRs

This bug is a backport clone of [Bugzilla Bug 2073220](https://bugzilla.redhat.com/show_bug.cgi?id=2073220). The following is the description of the original bug:
—
Description of problem:

https://docs.openshift.com/container-platform/4.10/security/audit-log-policy-config.html#about-audit-log-profiles_audit-log-policy-config

Version-Release number of selected component (if applicable): 4.*

How reproducible: always

Steps to Reproduce:
1. Set audit profile to WriteRequestBodies
2. Wait for api server rollout to complete
3. tail -f /var/log/kube-apiserver/audit.log | grep routes/status

Actual results:

Write events to routes/status are recorded at the RequestResponse level, which often includes keys and certificates.

Expected results:

Events involving routes should always be recorded at the Metadata level, per the documentation at https://docs.openshift.com/container-platform/4.10/security/audit-log-policy-config.html#about-audit-log-profiles_audit-log-policy-config

Additional info:

Story OCPCLOUD-1278: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-autoscaler-operator/pull/226

Bug OCPBUGS-2602: Intermittent DNS Resolution Errors After Upgrading to 4.10.25

View the Description View the linked PRs

When using cluster scaling and querying a URL in a pod on the side. All the while running some custom watches on endpoints and nodes.

When the nodes scale down, it seems a few seconds before an event marks the node as Not Ready and before the dns-default endpoint is removed from the endpoints list a DNS query can fail.

We wrote some simply watcher (see below for details) to log this and got the following events:

DNS lookup failure:

Tue Oct 18 12:33:23 UTC 2022 - Lookup success                                                                                                                                                                                                                                                                                                      
Tue Oct 18 12:33:28 UTC 2022 - DNS failure                                                                                                                                                                                                                                                                                                         
Tue Oct 18 12:33:41 UTC 2022 - Lookup success

The node was not yet Not Ready and the endpoint was still in the list of endpoints at that time (ntrdy indicates a NotReadyEndpoint):

2022-10-18 12:33:21.712180649 +0000 UTC m=+1047.610174444 - ip-10-0-137-100.ec2.internal -  MemoryPressure - False, DiskPressure - False, PIDPressure - False, Ready - True,
2022-10-18 12:33:39.11806612 +0000 UTC m=+1065.016059955 - ip-10-0-129-193.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:39.525574893 +0000 UTC m=+1065.423568712 - dns-default rdy: 10.128.0.2 rdy: 10.128.10.4 rdy: 10.128.2.5 rdy: 10.129.0.2 rdy: 10.130.0.16 rdy: 10.130.8.4 rdy: 10.131.0.3 ntrdy: 10.131.8.4
2022-10-18 12:33:39.526424974 +0000 UTC m=+1065.424418833 - dns-default rdy: 10.128.0.2 rdy: 10.128.2.5 rdy: 10.129.0.2 rdy: 10.130.0.16 rdy: 10.130.8.4 rdy: 10.131.0.3 ntrdy: 10.128.10.4 ntrdy: 10.131.8.4
2022-10-18 12:33:39.528532869 +0000 UTC m=+1065.426526744 - ip-10-0-129-193.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:39.729859144 +0000 UTC m=+1065.627852917 - ip-10-0-150-205.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:39.936928994 +0000 UTC m=+1065.834922825 - ip-10-0-150-205.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:44.749587947 +0000 UTC m=+1070.647581767 - ip-10-0-188-175.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:44.952196646 +0000 UTC m=+1070.850190469 - dns-default rdy: 10.128.0.2 rdy: 10.128.2.5 rdy: 10.129.0.2 rdy: 10.130.0.16 rdy: 10.131.0.3 ntrdy: 10.128.10.4 ntrdy: 10.130.8.4 ntrdy: 10.131.8.4
2022-10-18 12:33:44.954865089 +0000 UTC m=+1070.852858965 - ip-10-0-188-175.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:45.159460169 +0000 UTC m=+1071.057454007 - ip-10-0-150-205.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:48.641412229 +0000 UTC m=+1074.539406059 - ip-10-0-188-175.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:48.846438064 +0000 UTC m=+1074.744431900 - ip-10-0-129-193.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:54.068542745 +0000 UTC m=+1079.966536563 - ip-10-0-150-205.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:34:31.752294563 +0000 UTC m=+1117.650288381 - ip-10-0-253-198.ec2.internal -  MemoryPressure - False, DiskPressure - False, PIDPressure - False, Ready - True,
2022-10-18 12:34:39.531848219 +0000 UTC m=+1125.429842032 - dns-default rdy: 10.128.0.2 rdy: 10.128.2.5 rdy: 10.129.0.2 rdy: 10.130.0.16 rdy: 10.131.0.3 ntrdy: 10.128.10.4 ntrdy: 10.131.8.4
2022-10-18 12:34:39.736866622 +0000 UTC m=+1125.634860439 - dns-default rdy: 10.128.0.2 rdy: 10.128.2.5 rdy: 10.129.0.2 rdy: 10.130.0.16 rdy: 10.131.0.3 ntrdy: 10.128.10.4
2022-10-18 12:34:39.941934912 +0000 UTC m=+1125.839928742 - dns-default rdy: 10.128.0.2 rdy: 10.128.2.5 rdy: 10.129.0.2 rdy: 10.130.0.16 rdy: 10.131.0.3

So we can observe that the node goes into 'Unknown' at 12:33:39, and the endpoint goes into Not Ready soon after.
Not sure if this is a logic problem of draining a node or an issue with the autoscaler at this point in time, but it fixes itself at the next lookup 5 seconds later.

—

Detailed breakdown of how this was reproduced:
1. A cluster with autoscaling enabled is required.
2. Deploy a daemonset that attempt to use DNS / HTTP in a loop, e.g. the following Daemonset was used to test this:


apiVersion: apps/v1
kind: DaemonSet
metadata: 
  name: dns-tester
  labels: 
    app: dns-tester
spec: 
  selector: 
    matchLabels: 
      app: dns-tester
  template: 
    metadata: 
      labels: 
        app: dns-tester
    spec: 
      containers: 
      - name: dns-tester
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:72f2f7e906c321da6d6a00ce610780e8766e8432f7c553c5d03492f65fe5416c
        command: ["/bin/sh", "-c"]
        args: ['while true; do CURL=$(curl redhat.com 2>&1); if [[ "$CURL" == *"not resolve"* ]]; then echo `date` - "DNS failure"; else echo `date` - "Lookup success"; fi;  sleep 5; done']
        resources: 
          limits: 
            cpu: 100m
            memory: 200Mi

3. Run the following go program against the same cluster (this is what watches the node and endpoint events for the dns-default endpoints):

package main

import (
	"context"
	"fmt"
	"path/filepath"
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	_ "k8s.io/client-go/plugin/pkg/client/auth"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

const dnsNamespace = "openshift-dns"
const dnsEndpoint = "dns-default"

func nodeWatch(clientset *kubernetes.Clientset, waitGroup sync.WaitGroup) {
	ctx := context.Background()
	defer waitGroup.Done()

	var nodes = clientset.CoreV1().Nodes()
	watcher, err := nodes.Watch(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err.Error())
	}
	ch := watcher.ResultChan()

	for {
		event := <-ch
		node, ok := event.Object.(*corev1.Node)
		if !ok {
			fmt.Printf("%v", event)
			panic("Could not cast to nodes")
		}
		fmt.Printf("%v - %s - ", time.Now(), node.Name)
		for _, condition := range node.Status.Conditions {
			fmt.Printf(" %v - %v,", condition.Type, condition.Status)
		}
		fmt.Println()
	}
}

func dnsWatch(clientset *kubernetes.Clientset, waitGroup sync.WaitGroup) {
	ctx := context.Background()
	defer waitGroup.Done()
	var api = clientset.CoreV1().Endpoints(dnsNamespace)
	watcher, err := api.Watch(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err.Error())
	}
	ch := watcher.ResultChan()
	for {
		event := <-ch
		endpoints, ok := event.Object.(*corev1.Endpoints)
		if !ok {
			fmt.Printf("%v", event)
			panic("Could not cast to Endpoint")
		}
		fmt.Printf("%v - %v", time.Now(), endpoints.ObjectMeta.Name)
		for _, endpoint := range endpoints.Subsets {
			for _, address := range endpoint.Addresses {
				fmt.Printf(" rdy: %v", address.IP)
			}
			for _, address := range endpoint.NotReadyAddresses {
				fmt.Printf(" ntrdy: %v", address.IP)
			}
		}
		fmt.Println()
	}
}

func main() {
	// AUTHENTICATE
	var home = homedir.HomeDir()
	var kubeconfig = filepath.Join(home, ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err.Error())
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err.Error())
	}
	wg := sync.WaitGroup{}
	wg.Add(2)
	go dnsWatch(clientset, wg)
	go nodeWatch(clientset, wg)
	wg.Wait()
}

4. Create simulated pressure on the nodes to force a scaleup - e.g. use the following deployment:

apiVersion: apps/v1
kind: Deployment
metadata: 
  name: resource-eater
spec: 
  replicas: 4
  selector: 
    matchLabels: 
      app: resource-eater
  template: 
    metadata: 
      labels: 
        app: resource-eater
    spec: 
      containers: 
      - name: resource-eater
        image: busybox:latest
        command: ["/bin/sh", "-c"]
        args: ["sleep 3600"]
        resources: 
          requests: 
            memory: "8Gi"
            cpu: "1000m"
      affinity: 
        podAntiAffinity: 
          requiredDuringSchedulingIgnoredDuringExecution: 
          - labelSelector: 
              matchExpressions: 
              - key: app
                operator: In
                values: 
                - store
            topologyKey: "kubernetes.io/hostname"

5. Wait for the scale up to happen.
6. Delete the deployment that created the node pressure, so the scale down can happen (this can easily take 15 minutes).
7. Observe the events in the watcher program and the logs for the daemonset - this should show the same behavior as detailed above.

we found a few logged bugs that seemed related to this issue affecting clusters on 4.8 through 4.10. Those bugs are as follows:

https://issues.redhat.com/browse/OCPBUGS-647
https://issues.redhat.com/browse/OCPBUGS-488
https://bugzilla.redhat.com/show_bug.cgi?id=2061244

Using the boave mentioned steps, we have been able to reliably reproduce the issue of DNS failures during autoscale-down in 4.10 clusters.

https://github.com/openshift/cluster-dns-operator/pull/350

Bug OCPBUGS-4614: [4.10] Ipsec pods restart due to liveness probes fail in cluster with more than 150 +

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-4137~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-3824~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-2598~~. The following is the description of the original issue:
—
Description of problem:

Liveness probe of ipsec pods fail with large clusters. Currently the command that is executed in the ipsec container is
ovs-appctl -t ovs-monitor-ipsec ipsec/status && ipsec status
The problem is with command "ipsec/status". In clusters with high node count this command will return a list with all the node daemons of the cluster. This means that as the node count raises the completion time of the command raises too.

This makes the main command

ovs-appctl -t ovs-monitor-ipsec

To hang until the subcommand is finished.

As the liveness and readiness probe values are hardcoded in the manifest of the ipsec container herehttps//github.com/openshift/cluster-network-operator/blob/9c1181e34316d34db49d573698d2779b008bcc20/bindata/network/ovn-kubernetes/common/ipsec.yaml] the liveness timeout of the container probe of 60 seconds start to be insufficient as the node count list is growing. This resulted in a cluster with 170 + nodes to have 15+ ipsec pods in a crashloopbackoff state.

Version-Release number of selected component (if applicable):

Openshift Container Platform 4.10 but i think the same will be visible to other versions too.

How reproducible:

I was not able to reproduce due to an extreamely high amount of resources are needed and i think that there is no point as we have spotted the issue.

Steps to Reproduce:

1. Install an Openshift cluster with IPSEC enabled
2. Scale to 170+ nodes or more
3. Notice that the ipsec pods will start getting in a Crashloopbackoff state with failed Liveness/Readiness probes.

Actual results:

Ip Sec pods are stuck in a Crashloopbackoff state

Expected results:

Ip Sec pods to work normally

Additional info:

We have provided a workaround where CVO and CNO operators are scaled to 0 replicas in order for us to be able to increase the liveness probe limit to a value of 600 that recovered the cluster. 
As a next step the customer will try to reduce the node count and restore the default liveness timeout value along with bringing the operators back to see if the cluster will stabilize.

https://github.com/openshift/cluster-network-operator/pull/1651

Bug OCPBUGS-255: lifecycle.posStart hook does not have network connectivity.

View the Description View the linked PRs

+++ This bug was initially created as a clone of Bug #2081562 +++

Description of problem:

lifecycle.posStart does not have network connectivity on OpenShiftSDN CNI. (OVNKubernetes does not have the issue)

Version-Release number of selected component (if applicable):
4.10

How reproducible:
always

Steps to Reproduce:
1. create statefulset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
$ oc create -f statefulset.yaml
$ cat statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: httpd
spec:
serviceName: "httpd"
replicas: 1
selector:
matchLabels:
app: httpd
template:
metadata:
labels:
app: httpd
spec:
containers:

name: httpd
image: registry.redhat.io/rhel8/httpd-24:1-191
ports:
containerPort: 80
name: web
lifecycle:
postStart:
exec:
command:
/bin/sh
-c
curl -k https://<IP:PORT> > /tmp/urltest.txt
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Actual results:

PostStartHook fails
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
36s Normal Killing pod/httpd-0 FailedPostStartHook
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Expected results:

PostStartHook should not fail.

Additional info:

by adding a dummy initContainers, you can workaround the issue.
something like this:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
spec:
initContainers:

name: init-myservice
image: busybox:1.28
command: ['sh', '-c', 'sleep 2']
containers:
name: httpd
image: registry.redhat.io/rhel8/httpd-24:1-191
ports:
containerPort: 80
name: web
lifecycle:
postStart:
exec:
command:
/bin/sh
-c
curl -k <IP:PORT> > /tmp/urltest.txt
....
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

— Additional comment from rphillips@redhat.com on 2022-05-11 19:48:10 UTC —

crio's contract with networking is to have networking up when the container starts. Moving to the openshift-sdn team to help triage what is going on.

— Additional comment from hyoskim@redhat.com on 2022-06-09 00:40:33 UTC —

Hello,

Is there any update on this issue?

— Additional comment from npinaeva@redhat.com on 2022-06-09 07:53:38 UTC —

Hello, yeah we found the root cause and working on the fix now - PR should be ready by the end of the week

— Additional comment from aos-team-art-private@bot.bugzilla.redhat.com on 2022-07-24 15:21:48 UTC —

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.11 release.

— Additional comment from errata-xmlrpc@redhat.com on 2022-07-27 00:18:40 UTC —

This bug has been added to advisory RHSA-2022:5069 by OpenShift Release Team Bot (ocp-build/buildvm.openshift.eng.bos.redhat.com@REDHAT.COM)

— Additional comment from swasthan@redhat.com on 2022-07-27 05:38:55 UTC —

Hello Team, thank you for the help so far!

May we know if this is going to backport in v4.10.z as well?

Regards,
Swadeep

— Additional comment from zzhao@redhat.com on 2022-07-27 06:40:30 UTC —

this fixed PR is merged to build 4.12.0-0.nightly-2022-07-24-180529
So I update the target version to 4.12 version.

— Additional comment from zzhao@redhat.com on 2022-07-27 06:48:33 UTC —

still failed on build 4.12.0-0.nightly-2022-07-26-131732

Creating above statefulset and pod still cannot be worked with same error

27s Warning FailedPostStartHook pod/httpd-0 Exec lifecycle hook ([/bin/sh -c curl -k https://<IP:PORT> > /tmp/urltest.txt]) for Container "httpd" in Pod "httpd-0_default(7e519841-7092-4513-928b-03c7783ddc7d)" failed - error: command '/bin/sh -c curl -k https://<IP:PORT> > /tmp/urltest.txt' exited with 1: /bin/sh: -c: line 0: syntax error near unexpected token `>'...
85s Normal Killing pod/httpd-0 FailedPostStartHook

— Additional comment from npinaeva@redhat.com on 2022-07-27 12:50:53 UTC —

Hey @zzhao@redhat.com can you share full statefulset yaml you're running?
Doesn't "line 0: syntax error near unexpected token `>'..." mean bash command is wrong?

— Additional comment from zzhao@redhat.com on 2022-07-27 13:36:43 UTC —

(In reply to Nadia Pinaeva from comment #9)
> Hey @zzhao@redhat.com can you share full statefulset yaml you're running?
> Doesn't "line 0: syntax error near unexpected token `>'..." mean bash
> command is wrong?

I'm using the statefulset from comment 0

$ cat statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: httpd
spec:
serviceName: "httpd"
replicas: 1
selector:
matchLabels:
app: httpd
template:
metadata:
labels:
app: httpd
spec:
containers:

name: httpd
image: registry.redhat.io/rhel8/httpd-24:1-191
ports:
containerPort: 80
name: web
lifecycle:
postStart:
exec:
command:
/bin/sh
-c
curl -k https://<IP:PORT> > /tmp/urltest.txt

— Additional comment from npinaeva@redhat.com on 2022-07-27 14:02:54 UTC —

Did you replace <IP:PORT> here "curl -k https://<IP:PORT> > /tmp/urltest.txt"?

— Additional comment from zzhao@redhat.com on 2022-07-28 07:43:25 UTC —

(In reply to Nadia Pinaeva from comment #11)
> Did you replace <IP:PORT> here "curl -k https://<IP:PORT> >
> /tmp/urltest.txt"?

oh my bad

Tested again after replacing the ip and port with following

apiVersion: apps/v1
kind: StatefulSet
metadata:
name: httpd
spec:
serviceName: "httpd"
replicas: 1
selector:
matchLabels:
app: httpd
template:
metadata:
labels:
app: httpd
spec:
containers:

name: httpd
image: registry.redhat.io/rhel8/httpd-24:1-191
ports:
containerPort: 80
name: web
lifecycle:
postStart:
exec:
command:
/bin/sh
-c
curl -k https://172.30.0.1:443 > /tmp/urltest.txt

—

on 4.12.0-0.nightly-2022-07-27-133042

$ oc get pod
NAME READY STATUS RESTARTS AGE
httpd-0 1/1 Running 0 2m28s

— Additional comment from npinaeva@redhat.com on 2022-07-29 13:02:53 UTC —

@swasthan@redhat.com yes, we are going to backport it to 4.10 (hopefully it will be faster than the fix itself )

https://github.com/openshift/sdn/pull/455

Bug OCPBUGS-286: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-storage-operator/pull/311

Bug OCPBUGS-374: openshift-tests: allow -f to match tests for any test suite

View the Description View the linked PRs

https://bugzilla.redhat.com/show_bug.cgi?id=2110723

https://github.com/openshift/origin/pull/27366

Bug OCPBUGS-403: Local Update Server link 404

View the Description View the linked PRs

+++ This bug was initially created as a clone of ~~OCPBUGSM-46761~~ +++

Description of problem:
When using the admin console, under "Cluster Settings" and choosing Upstream Configuration, the "window" that appears has a dead link to documentation for how to create a local (disconnected) update server.

The link in the window is
https://access.redhat.com/documentation/en-us/openshift_container_platform/4.10/html/updating_clusters/installing-update-service

Assume the right link should be something like:
https://docs.openshift.com/container-platform/4.10/updating/updating-restricted-network-cluster.html#update-restricted-network-cluster-update-service

Version-Release number of selected component (if applicable):

4.10.*

— Additional comment from plarsen@redhat.com on 2022-06-17 16:33:53 UTC —

Created attachment 1890947
Screenshot showing the "window" with the 404 link

— Additional comment from rhamilto@redhat.com on 2022-06-28 19:56:17 UTC —

Thank you, Peter, for wonderfully documenting the bug!

https://github.com/openshift/console/pull/11974

Bug OCPBUGS-5990: [4.10] unit test data race with egress ip tests

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-5766~~. The following is the description of the original issue:
—
Description of problem:

Data race seen in unit tests:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1448/pull-ci-openshift-ovn-kubernetes-release-4.11-unit/1604898712423763968/artifacts/test/build-log.txt

https://github.com/openshift/ovn-kubernetes/pull/1480

Bug OCPBUGS-10244: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug OCPBUGS-174: [OVN] New pods unable to establish TCP connections and get constant timeouts causing application downtime

View the Description View the linked PRs

+++ This bug was initially created as a clone of Bug #2118717 +++
https://bugzilla.redhat.com/show_bug.cgi?id=2118717

Description of problem:
This BZ is a spin-off of BZ-2114945 so we can track possible issues with new TCP connections from pods failing to be created on the nodes leading to pods being unable to start or crash.

Version-Release number of selected component (if applicable):
OCP 4.10.24 with OVN-Kubernetes

How reproducible:
Periodically on the customer only so far.

— Additional comment from Andre Costa on 2022-08-10 16:30:00 UTC —

There are 3 must-gathers here that were gathered during the issues and after the restart of OVNk-masters which makes all these issues go away and pods start connections immediately.

This must-gather was taken at 11 AM today when they received a report from one of the customers:

https://attachments.access.redhat.com/hydra/rest/cases/03096770/attachments/3f820e22-3d1e-4c01-b56d-e479a6cc5578

Customer reported the issue again and this time we also got sosreport and inspect from the project.
In the pod they get errors like this (the same we saw on the call last week with them where it seems no TCP connections entries are created at all. First we though it was DNS but even with IPs directly there were issues like this):
-----------------
mx-toni-dev toni-dev-build 0/1 Error 0 18m 10.195.80.253 demchdc5vvx <none> <none>

[z0003rbj-z07@stuart ~]$ oc logs toni-dev-build
time="2022-08-10T10:59:02Z" level=info msg="Start building app with registry type openshift"
time="2022-08-10T10:59:02Z" level=info msg="Adding ssl certificate /etc/ssl/certs/ca-bundle.crt"
time="2022-08-10T10:59:02Z" level=info msg="Certificate /etc/ssl/certs/ca-bundle.crt has been added successfully"
time="2022-08-10T10:59:02Z" level=info msg="Updating docker config with registry credentials"
time="2022-08-10T10:59:02Z" level=info msg="Docker config has been updated with registry credentials"
time="2022-08-10T10:59:02Z" level=info msg="Downloading MDA from https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/eba21059-8896-4cb7-8971-4d61f6756273/71042"
time="2022-08-10T10:59:32Z" level=error msg="Failed to build mendix app, failed to create application layer failed to download MDA from https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/eba21059-8896-4cb7-8971-4d61f6756273/71042, Get \"https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/eba21059-8896-4cb7-8971-4d61f6756273/71042\": proxyconnect tcp: dial tcp: i/o timeout: Get \"https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/eba21059-8896-4cb7-8971-4d61f6756273/71042\": proxyconnect tcp: dial tcp: i/o timeout"
-----------------------

This keeps happening if they continue to run the builds which they did and created the MG and sosreport:

https://attachments.access.redhat.com/hydra/rest/cases/03096770/attachments/d10d9646-5437-4806-b34f-3acab4d83266

https://attachments.access.redhat.com/hydra/rest/cases/03096770/attachments/1e9e0575-66b0-4184-a3d4-e84fa245c9a2

And like we have seen so far restarting the ovnk-master pods makes these connections work immediately again:

https://attachments.access.redhat.com/hydra/rest/cases/03096770/attachments/e03ee9c6-2ce4-40d3-a484-84da61607de5

https://attachments.access.redhat.com/hydra/rest/cases/03096770/attachments/652ebe47-b71f-4d63-8499-cf989b3ebfc7

— Additional comment from Andre Costa on 2022-08-10 16:30:43 UTC —

— Additional comment from Tim Rozet on 2022-08-10 22:36:06 UTC —

Thanks for the must gathers. From Flavio and I examining them, there is definitely a bug here in ovn-kube. The toni-dev-build pod is deleted/recreated multiple times, and during this time it moves to different nodes. However due to a bug in OVNK, this port is updated with the new ip address and information as if it was moving to the new node, but stays on the previous logical switch. So for example, this is what happens:

1. The pod is originally assigned to node demchdc6zax. This node's cluster subnet is 10.195.79.0/24:

2022-08-09T09:22:57.078688727+00:00 stderr F I0809 09:22:57.078632 2239319 cni.go:248] [mx-toni-dev/toni-dev-build b1a4fb0be20ff717f85fd0fffab4fb303bbcb0f8b68aced4852fb7a2465d2df1] ADD finished CNI request [mx-toni-dev/toni-dev-build b1a4fb0be20ff717f85fd0fffab4fb303bbcb0f8b68aced4852fb7a2465d2df1], result "{\"interfaces\":[

{\"name\":\"b1a4fb0be20ff71\",\"mac\":\"a6:68:38:ad:66:c8\"}

,

{\"name\":\"eth0\",\"mac\":\"0a:58:0a:c3:4f:52\",\"sandbox\":\"/var/run/netns/e354d2d5-83cb-406f-a2d9-c5f3e786bae4\"}

],\"ips\":[

{\"version\":\"4\",\"interface\":1,\"address\":\"10.195.79.82/24\",\"gateway\":\"10.195.79.1\"}

],\"dns\":{}}", err <nil

2. Over time this pod is completed, deleted, recreated many times. Until eventually it lands on demchdc5vvx the next day:
2022-08-10T08:43:58.428759111Z I0810 08:43:58.428719 1837017 cni.go:248] [mx-toni-dev/toni-dev-build 5d4f195cbda5269e5451593987be9d69ea828ee549bc447a5bbe50db847c182a] ADD finished CNI request [mx-toni-dev/toni-dev-build 5d4f195cbda5269e5451593987be9d69ea828ee549bc447a5bbe50db847c182a], result "{\"interfaces\":[

{\"name\":\"5d4f195cbda5269\",\"mac\":\"a2:22:6c:1d:43:c1\"}

,

{\"name\":\"eth0\",\"mac\":\"0a:58:0a:c3:50:1e\",\"sandbox\":\"/var/run/netns/bae8f77a-b368-4b3a-86dc-df925330fa26\"}

],\"ips\":[

{\"version\":\"4\",\"interface\":1,\"address\":\"10.195.80.30/24\",\"gateway\":\"10.195.80.1\"}

],\"dns\":{}}", err <nil>

3. Although it lands on a new node, in OVNK we update the old port (somehow the old port is not being removed) that is attached to the old switch:
[root@fedora ~]# ovn-nbctl list logical_switch_port c345cc07-8a89-4e70-beff-d8d9f4dac46a
_uuid : c345cc07-8a89-4e70-beff-d8d9f4dac46a
addresses : ["0a:58:0a:c3:50:1e 10.195.80.30"]
dhcpv4_options : []
dhcpv6_options : []
dynamic_addresses : []
enabled : []
external_ids :

{namespace=mx-toni-dev, pod="true"}

ha_chassis_group : []
name : mx-toni-dev_toni-dev-build
options :

{iface-id-ver="655cc043-53d5-4550-9c0a-74095a0fbde4", requested-chassis=demchdc5vvx}

parent_name : []
port_security : ["0a:58:0a:c3:50:1e 10.195.80.30"]
tag : []
tag_request : []
type : ""
up : false

[root@fedora ~]# ovn-nbctl lsp-list demchdc6zax | grep c345cc07-8a89-4e70-beff-d8d9f4dac46a
c345cc07-8a89-4e70-beff-d8d9f4dac46a (mx-toni-dev_toni-dev-build)
[root@fedora ~]# ovn-nbctl lsp-list demchdc5vvx | grep c345cc07-8a89-4e70-beff-d8d9f4dac46a
[root@fedora ~]#

This will cause the pod not to be able to send any traffic as its IP is in the wrong subnet for this switch.

4. Additionally the default node SNAT for this pod is in the right place:
[root@fedora ~]# ovn-nbctl lr-nat-list GR_demchdc6zax | grep 10.195.80.30
[root@fedora ~]# ovn-nbctl lr-nat-list GR_demchdc5vvx | grep 10.195.80.30
snat 139.25.144.25 10.195.80.30

5. But there is no egress IP reroute or SNAT entry for this pod:
Egress IP:
status:
items:

egressIP: 139.25.144.72
node: demchdc5z6x
name: egress-mx-toni-dev

[root@fedora ~]# ovn-nbctl lr-nat-list GR_demchdc5z6x | grep 10.195.80.30
[root@fedora ~]# ovn-nbctl lr-nat-list GR_demchdc5z6x | grep 139.25.144.72
snat 139.25.144.72 10.195.77.156
snat 139.25.144.72 10.195.80.40
snat 139.25.144.72 10.195.76.65
snat 139.25.144.72 10.195.76.184
snat 139.25.144.72 10.195.80.42
snat 139.25.144.72 10.195.80.156

6. We see in the ovnkube-master logs that ovnk attempts to delete this pod, but it fails because we try to delete a logical switch port that is still bound to the wrong logical switch:
2022-08-10T09:07:32.033303027Z I0810 09:07:32.033270 1 client.go:781] "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.80.30]}}] Timeout:<nil> Where:[where column _uuid ==

{4fd75aab-222c-4f17-9407-3764f9b6760b}

] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[

{GoUUID:c345cc07-8a89-4e70-beff-d8d9f4dac46a}]}}] Timeout:<nil> Where:[where column _uuid == {f5073eaa-3f72-4ec2-94c3-3744d412864a}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {c345cc07-8a89-4e70-beff-d8d9f4dac46a}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]"

2022-08-10T09:07:32.037857343Z I0810 09:07:32.037163 1 pods_retry.go:57] [655cc043-53d5-4550-9c0a-74095a0fbde4/mx-toni-dev/toni-dev-build] teardown retry failed; will try again later

From the engineering side we need to try to reproduce this. Flavio and I think this toni-dev-build must be a stateful set running to completion. We need to investigate further to understand how we ended up trying to add the new pod without first deleting the old. Could be a bug related to completed pods logic or retry logic that was added in 4.10.

— Additional comment from Anurag saxena on 2022-08-11 04:52:37 UTC —

Thanks @trozet@redhat.com for investigation. @huirwang@redhat.com @jechen@redhat.com Probably we can follow these steps to internally repro the issue.

— Additional comment from Andre Costa on 2022-08-11 15:09:25 UTC —

Hi all,

Many thanks for the great work here and customer is relieved that we found something

In the meantime a new example that looks to be another caused by this bug. We have seen this live last week and today the customer saw it again on another project:

- Similar build pod started by their operator:

1h21m Warning ErrorAddingLogicalPort pod/ngat-dev-build deleteLogicalPort failed for pod mx-ngat-dev_ngat-dev-build: cannot delete logical switch port mx-ngat-dev_ngat-dev-build, error in transact with ops [{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.76.185]}}] Timeout:<nil> Where:[where column _uuid == {961a2f8f-14bb-44f9-a945-0dc2211324c7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:d300fdb7-d337-4c64-8e31-7ff02889d9fb}]}}] Timeout:<nil> Where:[where column _uuid == {bc92e86b-a1ae-4771-8e59-0588657bf70e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {d300fdb7-d337-4c64-8e31-7ff02889d9fb}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID}

Rows:[]} {Count:1 Error: Details: UUID:

{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID}

Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s) UUID:

{GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s)
1h21m Warning ErrorAddingLogicalPort pod/ngat-dev-build failed to ensurePod mx-ngat-dev/ngat-dev-build since it is not yet scheduled
Unknown Normal Scheduled pod/ngat-dev-build Successfully assigned mx-ngat-dev/ngat-dev-build to demchdc5vux
1h21m Normal AddedInterface pod/ngat-dev-build Add eth0 [10.195.81.116/24] from ovn-kubernetes
1h21m Normal Pulling pod/ngat-dev-build Pulling image "private-cloud.registry.mendix.com/image-builder:2.2.0"
1h21m Normal Pulled pod/ngat-dev-build Successfully pulled image "private-cloud.registry.mendix.com/image-builder:2.2.0" in 622.636779ms
1h21m Normal Created pod/ngat-dev-build Created container mendix-build
1h21m Normal Started pod/ngat-dev-build Started container mendix-build
1h20m Warning ErrorAddingLogicalPort pod/ngat-dev-build deleteLogicalPort failed for pod mx-ngat-dev_ngat-dev-build: cannot delete logical switch port mx-ngat-dev_ngat-dev-build, error in transact with ops [{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.81.116]}}] Timeout:<nil> Where:[where column _uuid == {961a2f8f-14bb-44f9-a945-0dc2211324c7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:d300fdb7-d337-4c64-8e31-7ff02889d9fb}]}}] Timeout:<nil> Where:[where column _uuid == {95072525-7d91-4f1f-9ad8-38a4171b978e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {d300fdb7-d337-4c64-8e31-7ff02889d9fb}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID}

Rows:[]} {Count:1 Error: Details: UUID:

{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID}

Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s) UUID:

{GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s)
1h2m Warning ErrorAddingLogicalPort pod/ngat-dev-build failed to ensurePod mx-ngat-dev/ngat-dev-build since it is not yet scheduled
Unknown Normal Scheduled pod/ngat-dev-build Successfully assigned mx-ngat-dev/ngat-dev-build to demchdc5vvx
1h2m Normal AddedInterface pod/ngat-dev-build Add eth0 [10.195.80.33/24] from ovn-kubernetes
1h2m Normal Pulling pod/ngat-dev-build Pulling image "private-cloud.registry.mendix.com/image-builder:2.2.0"
1h2m Normal Pulled pod/ngat-dev-build Successfully pulled image "private-cloud.registry.mendix.com/image-builder:2.2.0" in 599.594895ms
1h2m Normal Created pod/ngat-dev-build Created container mendix-build
1h2m Normal Started pod/ngat-dev-build Started container mendix-build
Unknown Normal Scheduled pod/ngat-dev-build Successfully assigned mx-ngat-dev/ngat-dev-build to demchdc6z2x
28m Warning ErrorAddingLogicalPort pod/ngat-dev-build failed to ensurePod mx-ngat-dev/ngat-dev-build since it is not yet scheduled
27m Normal AddedInterface pod/ngat-dev-build Add eth0 [10.195.76.114/24] from ovn-kubernetes
27m Normal Pulling pod/ngat-dev-build Pulling image "private-cloud.registry.mendix.com/image-builder:2.2.0"
27m Normal Pulled pod/ngat-dev-build Successfully pulled image "private-cloud.registry.mendix.com/image-builder:2.2.0" in 575.907145ms
27m Normal Created pod/ngat-dev-build Created container mendix-build
27m Normal Started pod/ngat-dev-build Started container mendix-build
26m Warning ErrorAddingLogicalPort pod/ngat-dev-build deleteLogicalPort failed for pod mx-ngat-dev_ngat-dev-build: cannot delete logical switch port mx-ngat-dev_ngat-dev-build, error in transact with ops [{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.76.114]}}] Timeout:<nil> Where:[where column _uuid == {961a2f8f-14bb-44f9-a945-0dc2211324c7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:d300fdb7-d337-4c64-8e31-7ff02889d9fb}]}}] Timeout:<nil> Where:[where column _uuid == {bc92e86b-a1ae-4771-8e59-0588657bf70e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {d300fdb7-d337-4c64-8e31-7ff02889d9fb}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID}

Rows:[]} {Count:1 Error: Details: UUID:

{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID}

Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s) UUID:

{GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s)
26m Warning ErrorAddingLogicalPort pod/ngat-dev-build failed to ensurePod mx-ngat-dev/ngat-dev-build since it is not yet scheduled
Unknown Normal Scheduled pod/ngat-dev-build Successfully assigned mx-ngat-dev/ngat-dev-build to demchdc6z2x
26m Normal AddedInterface pod/ngat-dev-build Add eth0 [10.195.76.114/24] from ovn-kubernetes
26m Normal Pulling pod/ngat-dev-build Pulling image "private-cloud.registry.mendix.com/image-builder:2.2.0"
26m Normal Pulled pod/ngat-dev-build Successfully pulled image "private-cloud.registry.mendix.com/image-builder:2.2.0" in 606.11715ms
26m Normal Created pod/ngat-dev-build Created container mendix-build
26m Normal Started pod/ngat-dev-build Started container mendix-build
26m Warning ErrorAddingLogicalPort pod/ngat-dev-build deleteLogicalPort failed for pod mx-ngat-dev_ngat-dev-build: cannot delete logical switch port mx-ngat-dev_ngat-dev-build, error in transact with ops [{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.76.114]}}] Timeout:<nil> Where:[where column _uuid == {961a2f8f-14bb-44f9-a945-0dc2211324c7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:d300fdb7-d337-4c64-8e31-7ff02889d9fb}]}}] Timeout:<nil> Where:[where column _uuid == {bc92e86b-a1ae-4771-8e59-0588657bf70e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {d300fdb7-d337-4c64-8e31-7ff02889d9fb}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID}

Rows:[]} {Count:1 Error: Details: UUID:

{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID}

Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s) UUID:

{GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s)
22m Warning ErrorAddingLogicalPort pod/ngat-dev-build failed to ensurePod mx-ngat-dev/ngat-dev-build since it is not yet scheduled
Unknown Normal Scheduled pod/ngat-dev-build Successfully assigned mx-ngat-dev/ngat-dev-build to demchdc6z2x
22m Normal AddedInterface pod/ngat-dev-build Add eth0 [10.195.76.114/24] from ovn-kubernetes
22m Normal Pulling pod/ngat-dev-build Pulling image "private-cloud.registry.mendix.com/image-builder:2.2.0"
22m Normal Pulled pod/ngat-dev-build Successfully pulled image "private-cloud.registry.mendix.com/image-builder:2.2.0" in 534.430683ms
22m Normal Created pod/ngat-dev-build Created container mendix-build
22m Normal Started pod/ngat-dev-build Started container mendix-build
21m Warning ErrorAddingLogicalPort pod/ngat-dev-build deleteLogicalPort failed for pod mx-ngat-dev_ngat-dev-build: cannot delete logical switch port mx-ngat-dev_ngat-dev-build, error in transact with ops [{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.76.114]}}] Timeout:<nil> Where:[where column _uuid == {961a2f8f-14bb-44f9-a945-0dc2211324c7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:d300fdb7-d337-4c64-8e31-7ff02889d9fb}]}}] Timeout:<nil> Where:[where column _uuid == {bc92e86b-a1ae-4771-8e59-0588657bf70e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {d300fdb7-d337-4c64-8e31-7ff02889d9fb}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID}

Rows:[]} {Count:1 Error: Details: UUID:

{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID}

Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s) UUID:

{GoUUID:}

Rows:[]}] and errors []: referential integrity violation: cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s)

Eventually the pod seems to start but then fails again with the same TCP timeout:

2022-08-11T08:26:24.264204783Z time="2022-08-11T08:26:24Z" level=error msg="Failed to build mendix app, failed to create application layer failed to download MDA from https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/ef924516-7121-4372-a306-b0055c445766/71268, Get \"https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/ef924516-7121-4372-a306-b0055c445766/71268\": proxyconnect tcp: dial tcp: i/o timeout: Get \"https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/ef924516-7121-4372-a306-b0055c445766/71268\": proxyconnect tcp: dial tcp: i/o timeout"

Customer took a sosreport from the node where the pod terminated in case is worth it to check:

https://attachments.access.redhat.com/hydra/rest/cases/03096770/attachments/bd3f86b2-6d2a-44f7-bb0f-eec85630586d?usePresignedUrl=true

— Additional comment from Tim Rozet on 2022-08-11 21:19:12 UTC —

We were able to reproduce the issue locally. There are two potential paths that can cause a stale logical switch port to be re-used on the wrong node:
Scenario 1: pod is created, deleted, and recreated on another node extremely quickly. This plays out like this:
Events:
1. pod toni is created on node A
2. pod toni is deleted on node A
3. pod toni is recreated on node B

What happens in ovnk:
1. pod toni is created on node B
2. pod toni is deleted on node A (fails)

This happens because we grab the latest version of the pod to add in event 1, which by the time we grab it is actually the value from event 3. Then after processing event 1, we move to event 2 and when we delete the pod, we use what was given to us in the event, which is now incorrect to remove the pod from node A, because it actually is on node B. The fix is to ignore the value in the pod spec on deletion, and use what we store internally in our internal port cache. If there is no entry in the cache, we search OVN NBDB to find the right switch (this is an expensive operation so we want to avoid it when possible).

Flavio is working on a fix for this.

Scenario 2: pod is created, runs to completion, is deleted very quickly, and then is recreated on another node:
Events:
1. pod toni is created on node A
2. pod toni runs to completion
3. pod toni immediately is deleted
4. pod toni is recreated on node B

What happens in ovnk:
1. pod toni is created on node A
2. completion causes an update event, ovnk does not delete the pod (bug)
3. deletion event is processed, but since it is a completed pod, we ignore it (since we should have already delted it in step 2)
4. pod toni is recreated on node B, ovnk sees there is already an switch port for toni on the wrong switch, and just updates that with the new information

This happens when a pod goes to completed, but is deleted very quickly. In step 2 during update event we try to grab the latest version of the pod, but it doesn't exist anymore since it was deleted. In this case we skip the update, instead of tearing down the pod. The delete code in step 3 assumes that if the pod is completed we must have already handled it in update, so the stale port stays around until an add later re-uses it. Patryk already has a fix for this:

https://github.com/ovn-org/ovn-kubernetes/pull/3071

There may still be more egress IP issues to investigate (tracked in other bugs) after fixing this, and we will look into those after fixing these fundamental pod issues.

— Additional comment from Tim Rozet on 2022-08-11 22:19:30 UTC —

Andre, re: comment 5. This looks like the same issue. If you have a must gather we can confirm. I think the sosreport from the worker node is not enough. The referential integrity violation occurs because we attempt to do these operations:
1. remove the logical switch port from the logical switch
2. delete the logical switch port

In this case:
1. we remove the logical switch port from the logical switch, but the switch port actually exists on a different switch (node) - this is a noop
2. we try to delete the logical switch port- NBDB complains this is an violation and refuses to delete it, because there is another switch that holds a reference to this object

— Additional comment from Anurag saxena on 2022-08-12 21:16:56 UTC —

@rbrattai@redhat.com Can you help verifying this and backports? feel free to re-assign to someone else if needed while I am on PTO. Thanks!

https://github.com/openshift/ovn-kubernetes/pull/1250

Bug OCPBUGS-3615: [4.10] [perf/scale] libovsdb builds transaction logs but throws them away

View the Description View the linked PRs

libovsdb builds transaction log messages for every transaction and then throws them away if the log level is not 4 or above. This wastes a bunch of CPU at scale and increases pod ready latency.

https://github.com/openshift/ovn-kubernetes/pull/1375

Bug OCPBUGS-6538: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/configmap-reload/pull/49

Story CONSOLE-2892: Allow dynamic plugins to proxy to services on the cluster

View the Description View the linked PRs

Goal

We have several use cases where dynamic plugins need to proxy to another service on the cluster. One example is the Helm plugin. We would like to move the backend code for Helm to a separate service on the cluster, and the Helm plugin could proxy to that service for its requests. This is required to make Helm a dynamic plugin. Similarly if we want to have ACM contribute any views through dynamic plugins, we will need a way for ACM to proxy to its services (e.g., for Search).

It's possible for plugins to make requests to services exposed through routes today, but that has several problems:

It requires that the service be exposed outside the cluster, which is not always desired.
It requires the service support CORS headers for the console.
There is no way to specify a CA file for the route if it's not trusted by the browser.
Plugins will not have access to the user's access token on the client, which means that there is no simple way to handle auth.

Plugins need a way to declare in-cluster services that they need to connect to. The console backend will need to set up proxies to those services on console load. This also requires that the console operator be updated to pass the configuration to the console backend.

This work will apply only to single clusters.

Open Questions

What happens when a multitenant isolated network policy is configured on the cluster?

https://docs.openshift.com/container-platform/4.7/networking/network_policy/multitenant-network-policy.html

How do we (and can we?) support this for multi-cluster where console is running on a different hub cluster?
Do we need to auth for all requests?

Acceptance Criteria

Plugins can declare a service to proxy to in the ConsolePlugin resource
Plugins can specify a CA cert for the service
Console falls back to the service signing CA if none is specified
Plugins have a way of specifying whether the user's authentication token is included in requests through the service proxy
Dynamic plugin enhancement is updated with the implementation details
Support for server-side events (SSE) for ACM
Add support, or a flag, if auth is needed for each request.

cc Ali Mobrem [~christianmvogt]

Bug OCPBUGS-2628: Worker creation fails within provider networks (as primary and secondary)

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-2508~~. The following is the description of the original issue:
—
Description of problem:

Installer fails due to Neutron policy error when creating Openstack servers for OCP master nodes.

$ oc get machines -A
NAMESPACE               NAME                          PHASE          TYPE   REGION   ZONE   AGE
openshift-machine-api   ostest-kwtf8-master-0         Running                               23h
openshift-machine-api   ostest-kwtf8-master-1         Running                               23h
openshift-machine-api   ostest-kwtf8-master-2         Running                               23h
openshift-machine-api   ostest-kwtf8-worker-0-g7nrw   Provisioning                          23h
openshift-machine-api   ostest-kwtf8-worker-0-lrkvb   Provisioning                          23h
openshift-machine-api   ostest-kwtf8-worker-0-vwrsk   Provisioning                          23h

$ oc -n openshift-machine-api logs machine-api-controllers-7454f5d65b-8fqx2 -c machine-controller
[...]
E1018 10:51:49.355143       1 controller.go:317] controller/machine_controller "msg"="Reconciler error" "error"="error creating Openstack instance: Failed to create port err: Request forbidden: [POST https://overcloud.redhat.local:13696/v2.0/ports], error message: {\"NeutronError\": {\"type\": \"PolicyNotAuthorized\", \"message\": \"(rule:create_port and (rule:create_port:allowed_address_pairs and (rule:create_port:allowed_address_pairs:ip_address and rule:create_port:allowed_address_pairs:ip_address))) is disallowed by policy\", \"detail\": \"\"}}" "name"="ostest-kwtf8-worker-0-lrkvb" "namespace"="openshift-machine-api"

Version-Release number of selected component (if applicable):

4.10.0-0.nightly-2022-10-14-023020

How reproducible:

Always

Steps to Reproduce:

1. Install 4.10 within provider networks (in primary or secondary interface)

Actual results:

Installation failure:
4.10.0-0.nightly-2022-10-14-023020: some cluster operators have not yet rolled out

Expected results:

Successful installation

Additional info:

Please find must-gather for installation on primary interface link here and for installation on secondary interface link here.

https://github.com/openshift/cluster-api-provider-openstack/pull/249

Bug OCPBUGS-10703: MTU migration configuration is cleaned up prematurely while in progress

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-9986~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-7445~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-7207~~. The following is the description of the original issue:
—
At some point in the mtu-migration development a configuration file was generated at /etc/cno/mtu-migration/config which was used as a flag to indicate to configure-ovs that a migration procedure was in progress. When that file was missing, it was assumed the migration procedure was over and configure-ovs did some cleaning on behalf of it.

But that changed and /etc/cno/mtu-migration/config is never set. That causes configure-ovs to remove mtu-migration information when the procedure is still in progress making it to use incorrect MTU values and either causing nodes to be tainted with "ovn.k8s.org/mtu-too-small" blocking the procedure itself or causing network disruption until the procedure is over.

However, this was not a problem for the CI job as it doesn't use the migration procedure as documented for the sake of saving limited time available to run CI jobs. The CI merges two steps of the procedure into one so that there is never a reboot while the procedure is in progress and hiding this issue.

This was probably not detected in QE as well for the same reason as CI.

https://github.com/openshift/machine-config-operator/pull/3622

Bug OCPBUGS-2013: 4.10: When adding nodes, the overlapped node-subnet can be allocated.

View the Description View the linked PRs

Description of problem:

When adding new nodes to the existing cluster, the newly allocated node-subnet can be overlapped with the existing node.

Version-Release number of selected component (if applicable):

openshift 4.10.30

How reproducible:

It's quite hard to reproduce but  there is a possibility it can happen any time.

Steps to Reproduce:

1. Create a OVN dual-stack cluster
2. add nodes to the existing cluster
3. check the allocated node subnet

Actual results:

Some newly added nodes have the same node-subnet and ovn-k8s-mp0 IP as some existing nodes.

Expected results:

Should have duplicated node-subnet and ovn-k8s-mp0 IP

Additional info:

Additional info can be found at the case 03329155 and the must-gather attached(comment #1) 

% omg logs ovnkube-master-v8crc -n openshift-ovn-kubernetes -c ovnkube-master | grep '2022-09-30T06:42:50.857'
2022-09-30T06:42:50.857031565Z W0930 06:42:50.857020       1 master.go:1422] Did not find any logical switches with other-config
2022-09-30T06:42:50.857112441Z I0930 06:42:50.857099       1 master.go:1003] Allocated Subnets [10.131.0.0/23 fd02:0:0:4::/64] on Node worker01.ss1.samsung.local
2022-09-30T06:42:50.857122455Z I0930 06:42:50.857105       1 master.go:1003] Allocated Subnets [10.129.4.0/23 fd02:0:0:a::/64] on Node oam04.ss1.samsung.local
2022-09-30T06:42:50.857130289Z I0930 06:42:50.857122       1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.131.0.0/23","fd02:0:0:4::/64"]}] on node worker01.ss1.samsung.local
2022-09-30T06:42:50.857140773Z I0930 06:42:50.857132       1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.129.4.0/23","fd02:0:0:a::/64"]}] on node oam04.ss1.samsung.local
2022-09-30T06:42:50.857166726Z I0930 06:42:50.857156       1 master.go:1003] Allocated Subnets [10.128.2.0/23 fd02:0:0:5::/64] on Node oam01.ss1.samsung.local
2022-09-30T06:42:50.857176132Z I0930 06:42:50.857157       1 master.go:1003] Allocated Subnets [10.131.0.0/23 fd02:0:0:4::/64] on Node rhel01.ss1.samsung.local
2022-09-30T06:42:50.857176132Z I0930 06:42:50.857167       1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.128.2.0/23","fd02:0:0:5::/64"]}] on node oam01.ss1.samsung.local
2022-09-30T06:42:50.857185257Z I0930 06:42:50.857157       1 master.go:1003] Allocated Subnets [10.128.6.0/23 fd02:0:0:d::/64] on Node call03.ss1.samsung.local
2022-09-30T06:42:50.857192996Z I0930 06:42:50.857183       1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.131.0.0/23","fd02:0:0:4::/64"]}] on node rhel01.ss1.samsung.local
2022-09-30T06:42:50.857200017Z I0930 06:42:50.857190       1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.128.6.0/23","fd02:0:0:d::/64"]}] on node call03.ss1.samsung.local
2022-09-30T06:42:50.857282717Z I0930 06:42:50.857258       1 master.go:1003] Allocated Subnets [10.130.2.0/23 fd02:0:0:7::/64] on Node call01.ss1.samsung.local
2022-09-30T06:42:50.857304886Z I0930 06:42:50.857293       1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.130.2.0/23","fd02:0:0:7::/64"]}] on node call01.ss1.samsung.local
2022-09-30T06:42:50.857338896Z I0930 06:42:50.857314       1 master.go:1003] Allocated Subnets [10.128.4.0/23 fd02:0:0:9::/64] on Node f501.ss1.samsung.local
2022-09-30T06:42:50.857349485Z I0930 06:42:50.857329       1 master.go:1003] Allocated Subnets [10.131.2.0/23 fd02:0:0:8::/64] on Node call02.ss1.samsung.local
2022-09-30T06:42:50.857371344Z I0930 06:42:50.857354       1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.128.4.0/23","fd02:0:0:9::/64"]}] on node f501.ss1.samsung.local
2022-09-30T06:42:50.857371344Z I0930 06:42:50.857361       1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.131.2.0/23","fd02:0:0:8::/64"]}] on node call02.ss1.samsung.local

https://github.com/openshift/ovn-kubernetes/pull/1332

Bug OCPBUGS-2142: [4.10] Rebase openshift/etcd 4.10 onto 3.5.5

View the Description View the linked PRs

Changelog between 3.5.5 and 3.5.4:
https://github.com/etcd-io/etcd/blob/main/CHANGELOG/CHANGELOG-3.5.md#v355-tbd

Changelog between 3.5.3 and 3.5.4:
https://github.com/etcd-io/etcd/blob/main/CHANGELOG/CHANGELOG-3.5.md#v354-2022-04-24

https://github.com/openshift/etcd/pull/156

Bug OCPBUGS-2746: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ovn-kubernetes/pull/1336

Bug OCPBUGS-4799: OLM generates invalid component selector labels

View the Description View the linked PRs

[This issue is for a backport to 4.10.z for our CI. This issue was already addressed for 4.11+ in https://github.com/openshift/operator-framework-olm/pull/285]

Description of problem:

When installing an operator, OLM creates an "operator" Custom Resource whose status will be updated to contain a list of resources associated with the operator. This is done by labeling each resource associated with an operator with a label based off this code: https://github.com/operator-framework/operator-lifecycle-manager/blob/7eccf5342199b88f4657b6c996d4e66d9fa978fa/pkg/controller/operators/decorators/operator.go#L92-L105

Version-Release number of selected component (if applicable):

4.8

How reproducible:

Always

Steps to Reproduce:

1. Create a subscription named managed-node-metadata-operator in the openshift-managed-node-metadata-operator namespace, which causes the truncated label to end on `-`, which is an illegal character.
2. Watch the OLM Operator logs.

Actual results:

The adoption controller within OLM continuously fails to adopt the subscription due to an illegal label value:

{"level":"error","ts":1670862754.2096953,"logger":"controllers.adoption","msg":"Error adopting Subscription","request":"openshift-managed-node-metadata-operator/managed-node-metadata-operator","error":"Subscription.operators.coreos.com \"managed-node-metadata-operator\" is invalid: metadata.labels: Invalid value: \"operators.coreos.com/managed-node-metadata-operator.openshift-managed-node-metadata-\": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":1670862754.2097518,"logger":"controller.subscription","msg":"Reconciler error","reconciler group":"operators.coreos.com","reconciler kind":"Subscription","name":"managed-node-metadata-operator","namespace":"openshift-managed-node-metadata-operator","error":"Subscription.operators.coreos.com \"managed-node-metadata-operator\" is invalid: metadata.labels: Invalid value: \"operators.coreos.com/managed-node-metadata-operator.openshift-managed-node-metadata-\": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')","errorCauses":[{"error":"Subscription.operators.coreos.com \"managed-node-metadata-operator\" is invalid: metadata.labels: Invalid value: \"operators.coreos.com/managed-node-metadata-operator.openshift-managed-node-metadata-\": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')"}
],"stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func2.2\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}

Expected results:

The adoption control creates a label that can be applied to the subscription so it may be "adopted" by the controller.

Additional info:

This was orginially fixed in 4.11 here: https://github.com/operator-framework/operator-lifecycle-manager/pull/2731

https://github.com/openshift/operator-framework-olm/pull/421

Bug OCPBUGS-7657: Editing Pipeline in the ocp console to get information error

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-7127~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-5016~~. The following is the description of the original issue:
—
Description of problem:

When editing any pipeline in the openshift console, the correct content cannot be obtained (the obtained information is the initial information).

Version-Release number of selected component (if applicable):

How reproducible:

100%

Steps to Reproduce:

Developer -> Pipeline -> select pipeline -> Details -> Actions -> Edit Pipeline -> YAML view -> Cancel ->  Actions -> Edit Pipeline -> YAML view

Actual results:

displayed content is incorrect.

Expected results:

Get the content of the current pipeline, not the "pipeline create" content.

Additional info:

If cancel or save in the "Pipeline Builder" interface after "Edit Pipeline", can get the expected content.
～
Developer -> Pipeline -> select pipeline -> Details -> Actions -> Edit Pipeline -> Pipeline builder -> Cancel ->  Actions -> Edit Pipeline -> YAML view :Display resource content normally
～

https://github.com/openshift/console/pull/12576

Bug OCPBUGS-1568: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug OCPBUGS-200: Exits with a non-zero code in case Etcd backup fails

View the Description View the linked PRs

clone of https://bugzilla.redhat.com/show_bug.cgi?id=2113927

https://github.com/openshift/cluster-etcd-operator/pull/907

Bug OCPBUGS-5294: Backport fix for OCPBUGSM-36848 to 4.10

View the Description View the linked PRs

A bug prevents the package server cert from being rotated. This was fixed for 4.11 with the release of RHSA-2022:5069 but not fixed in 4.10 or earlier. See bz 2020484 or OCPBUGSM-36848 for details.

https://github.com/openshift/operator-framework-olm/pull/428

Bug OCPBUGS-5950: wal: max entry size limit exceeded

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-5876~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-5761~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-5458~~. The following is the description of the original issue:
—
reported in https://coreos.slack.com/archives/C027U68LP/p1673010878672479

Description of problem:

Hey guys, I have a openshift cluster that was upgraded to version 4.9.58 from version 4.8. After the upgrade was done, the etcd pod on master1 isn't coming up and is crashlooping. and it gives the following error: {"level":"fatal","ts":"2023-01-06T12:12:58.709Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"wal: max entry size limit exceeded, recBytes: 13279, fileSize(313430016) - offset(313418480) - padBytes(1) = entryLimit(11535)","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdmain/main.go:40\nmain.main\n\t/remote-source/cachito-gomod-with-deps/app/server/main.go:32\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:225"}

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/etcd/pull/179

Bug OCPBUGS-7827: set default timeouts in etcdcli

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-7409~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-7374~~. The following is the description of the original issue:
—
Originally reported by lance5890 in issue https://github.com/openshift/cluster-etcd-operator/issues/1000

The controllers sometimes get stuck on listing members in failure scenarios, this is known and can be mitigated by simply restarting the CEO.

similar BZ 2093819 with stuck controllers was fixed slightly different in https://github.com/openshift/cluster-etcd-operator/commit/4816fab709e11e0681b760003be3f1de12c9c103

This fix was contributed by lance5890, thanks a lot!

https://github.com/openshift/cluster-etcd-operator/pull/1010

Bug OCPBUGS-912: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/jenkins/pull/1530

Bug OCPBUGS-1759: Dev Catalog taking too much time to load in a complete disconnected cluster

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-1523~~. The following is the description of the original issue:
—
Description of problem:
In a complete disconnected cluster, the dev catalog is taking too much time in loading

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. A complete disconnected cluster
2. In add page go to the All services page
3.

Actual results:
Taking too much time too load

Expected results:
Time taken should be reduced

Additional info:
Attached a gif for reference

https://github.com/openshift/console/pull/12106

Bug OCPBUGS-2898: Backport fix(openshift): use env var instead of clusterversion status to OCP 4.10

View the Description View the linked PRs

This is for the OCP 4.10 backport of the bugfix for this issue.

--------------------------------------------------------------
Description of problem:
Cluster attempted to upgrade from 4.9.29 -> 4.10.15. The upgrade could not progress because of:

Last Transition Time: 2022-06-15T16:08:33Z
Message: ClusterServiceVersions blocking cluster upgrade: redhat-rhoam-operator/managed-api-service.v1.22.0 is incompatible with OpenShift minor versions greater than 4.10
Reason: IncompatibleOperatorsInstalled
Status: False
Type: Upgradeable

The mentioned CSV has olm.maxOpenShiftVersion of 4.10:

*

{ "type": "olm.maxOpenShiftVersion", "value": "4.10" }

*

Actual results:
The upgrade is blocked on this olm.maxOpenShiftVersion

Expected results:
The upgrade should not be blocked on this olm.maxOpenShiftVersion

Additional info:
The particular customer has successfully upgraded twice on different clusters on the same OCP edge and with managed-api-service.v1.22.0 installed

must-gather will be attached in a private comment

https://github.com/openshift/operator-framework-olm/pull/389

Bug OCPBUGS-2802: [release-4.10] Add required parameter --credentials-requests-dir for ccoctl gcp delete

View the Description View the linked PRs

In order to delete the correct GCP cloud resources, the "--credentials-requests-dir" parameter must be passed to "ccoctl gcp delete". This was fixed for 4.12 as part of https://github.com/openshift/cloud-credential-operator/pull/489 but must be backported for previous releases. See https://github.com/openshift/cloud-credential-operator/pull/489#issuecomment-1248733205 for discussion regarding this bug.

To reproduce, create GCP infrastructure with a name parameter that is a subset of another set of GCP infrastructure's name parameter. I will "ccoctl gcp create all" with "name=abutcher-gcp" and "name=abutcher-gcp1".

$ ./ccoctl gcp create-all \
--name=abutcher-gcp \
--region=us-central1 \
--project=openshift-hive-dev \
--credentials-requests-dir=./credrequests

$ ./ccoctl gcp create-all \
--name=abutcher-gcp1 \
--region=us-central1 \
--project=openshift-hive-dev \
--credentials-requests-dir=./credrequests

Running "ccoctl gcp delete --name=abutcher-gcp" will result in GCP infrastructure for both "abutcher-gcp" and "abutcher-gcp1" being deleted.

$ ./ccoctl gcp delete --name abutcher-gcp --project openshift-hive-dev
2022/10/24 11:30:06 Credentials loaded from file "/home/abutcher/.gcp/osServiceAccount.json"
2022/10/24 11:30:06 Deleted object .well-known/openid-configuration from bucket abutcher-gcp-oidc
2022/10/24 11:30:07 Deleted object keys.json from bucket abutcher-gcp-oidc
2022/10/24 11:30:07 OIDC bucket abutcher-gcp-oidc deleted
2022/10/24 11:30:09 IAM Service account abutcher-gcp-openshift-image-registry-gcs deleted
2022/10/24 11:30:10 IAM Service account abutcher-gcp-openshift-gcp-ccm deleted
2022/10/24 11:30:11 IAM Service account abutcher-gcp1-openshift-cloud-network-config-controller-gcp deleted
2022/10/24 11:30:12 IAM Service account abutcher-gcp-openshift-machine-api-gcp deleted
2022/10/24 11:30:13 IAM Service account abutcher-gcp-openshift-ingress-gcp deleted
2022/10/24 11:30:15 IAM Service account abutcher-gcp-openshift-gcp-pd-csi-driver-operator deleted
2022/10/24 11:30:16 IAM Service account abutcher-gcp1-openshift-ingress-gcp deleted
2022/10/24 11:30:17 IAM Service account abutcher-gcp1-openshift-image-registry-gcs deleted
2022/10/24 11:30:19 IAM Service account abutcher-gcp-cloud-credential-operator-gcp-ro-creds deleted
2022/10/24 11:30:20 IAM Service account abutcher-gcp1-openshift-gcp-pd-csi-driver-operator deleted
2022/10/24 11:30:21 IAM Service account abutcher-gcp1-openshift-gcp-ccm deleted
2022/10/24 11:30:22 IAM Service account abutcher-gcp1-cloud-credential-operator-gcp-ro-creds deleted
2022/10/24 11:30:24 IAM Service account abutcher-gcp1-openshift-machine-api-gcp deleted
2022/10/24 11:30:25 IAM Service account abutcher-gcp-openshift-cloud-network-config-controller-gcp deleted
2022/10/24 11:30:25 Workload identity pool abutcher-gcp deleted

https://github.com/openshift/cloud-credential-operator/pull/507

Bug OCPBUGS-1440: [release-4.10] Jenkins install-plugins script does not ignore updates to locked plugin versions

View the Description View the linked PRs

Description of problem:

Jenkins install-plugins.sh script does not ignore update requests for locked versions of plugins, and does not verify that the locked version was actually included in the bundle-plugins.txt file.

Version-Release number of selected component (if applicable):

How reproducible:

Run make plugins-list

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/jenkins/pull/1484

Bug OCPBUGS-1758: [4.11] ETCD Operator goes degraded when a second internal node ip is added

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-1354~~. The following is the description of the original issue:
—
This was originally reported in BZ as https://bugzilla.redhat.com/show_bug.cgi?id=2046335

—

Description of problem:

The issue reported here https://bugzilla.redhat.com/show_bug.cgi?id=1954121 still occur (tested on OCP 4.8.11, the CU also verified that the issue can happen even with OpenShift 4.7.30, 4.8.17 and 4.9.11)

How reproducible:

Attach a NIC to a master node will trigger the issue

Steps to Reproduce:
1. Deploy an OCP cluster (I've tested it IPI on AWS)
2. Attach a second NIC to a running master node (in my case "ip-10-0-178-163.eu-central-1.compute.internal")

Actual results:

~~~
$ oc get node ip-10-0-178-163.eu-central-1.compute.internal -o json | jq ".status.addresses"
[

{ "address": "10.0.178.163", "type": "InternalIP" }

,

{ "address": "10.0.187.247", "type": "InternalIP" }

,

{ "address": "ip-10-0-178-163.eu-central-1.compute.internal", "type": "Hostname" }

,

{ "address": "ip-10-0-178-163.eu-central-1.compute.internal", "type": "InternalDNS" }

]

$ oc get co etcd
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
etcd 4.8.11 True False True 31h

$ oc get co etcd -o json | jq ".status.conditions[0]"

{ "lastTransitionTime": "2022-01-26T15:47:42Z", "message": "EtcdCertSignerControllerDegraded: [x509: certificate is valid for 10.0.178.163, not 10.0.187.247, x509: certificate is valid for ::1, 10.0.178.163, 127.0.0.1, ::1, not 10.0.187.247]", "reason": "EtcdCertSignerController_Error", "status": "True", "type": "Degraded" }

~~~

Expected results:

To have the certificate valid also for the second IP (the newly created one "10.0.187.247")

Additional info:

Deleting the following secrets seems to solve the issue:
~~~
$ oc get secret ~~n openshift-etcd | grep kubernetes.io/tls | grep ^etcd~~
etcd-client kubernetes.io/tls 2 61s
etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2 61s
etcd-peer-ip-10-0-178-163.eu-central-1.compute.internal kubernetes.io/tls 2 61s
etcd-peer-ip-10-0-202-187.eu-central-1.compute.internal kubernetes.io/tls 2 60s
etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2 60s
etcd-serving-ip-10-0-178-163.eu-central-1.compute.internal kubernetes.io/tls 2 59s
etcd-serving-ip-10-0-202-187.eu-central-1.compute.internal kubernetes.io/tls 2 59s
etcd-serving-metrics-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2 58s
etcd-serving-metrics-ip-10-0-178-163.eu-central-1.compute.internal kubernetes.io/tls 2 59s
etcd-serving-metrics-ip-10-0-202-187.eu-central-1.compute.internal kubernetes.io/tls 2 58s

$ oc get secret ~~n openshift-etcd | grep kubernetes.io/tls | grep ^etcd~~ | awk '

{print $1}

' | xargs -I {} oc delete secret {} -n openshift-etcd
secret "etcd-client" deleted
secret "etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-peer-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-peer-ip-10-0-202-187.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-202-187.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-202-187.eu-central-1.compute.internal" deleted

$ oc get co etcd -o json | jq ".status.conditions[0]"

{ "lastTransitionTime": "2022-01-26T15:52:21Z", "message": "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found", "reason": "AsExpected", "status": "False", "type": "Degraded" }

~~~

https://github.com/openshift/cluster-etcd-operator/pull/936

Bug OCPBUGS-10233: [4.10] Remove ETCD liviness probe.

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-7830~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-7729~~. The following is the description of the original issue:
—
Description of problem:

Etcd's liveliness probe should be removed.

Version-Release number of selected component (if applicable):

4.11

Additional info:

When the Master Hosts hit CPU load this can cause a cascading restart loop for etcd and kube-api due to the etcd liveliness probes failing. Due to this loop load on the masters stays high because the api and controllers restarting over and over again..  

There is no reason for etcd to have a liveliness probe, we removed this probe in 3.11 due issues like this.

https://github.com/openshift/cluster-etcd-operator/pull/1026

Bug OCPBUGS-1323: Cluster Autoscaler not scaling down nodes which seem to qualify for scale-down [NEEDINFO]

View the Description View the linked PRs

Description of problem:

The `clusterautoscaler` in his cluster to scale up and down the nodes automatically as per the requirement and load.
However, it has been noticed that when removing test load Pods, nodes do not scale down.

Version-Release number of selected component (if applicable):
OpenShift Version: 4.10.20

How reproducible:
Easily reproducible: https://docs.openshift.com/container-platform/4.10/rest_api/autoscale_apis/clusterautoscaler-autoscaling-openshift-io-v1.html#specification

Expected results:

When there is no load, I waited a very long time for the ClusterAutoscaler to scale down, but this never occurs. The ClusterAutoscaler controller Pod's logs keep saying no nodes are eligible to scale down despite the nodes very quite idle.

Additional info:

This Bugzilla is looking similar:
https://bugzilla.redhat.com/show_bug.cgi?id=2053343

Interestingly, I could not find the OLM and redhat-operator pod scheduled on any node other than the master.
Kindly have a look at the attached file for pod details.

https://github.com/openshift/operator-framework-olm/pull/392

Bug OCPBUGS-2894: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/console/pull/12220

Bug OCPBUGS-2940: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-etcd-operator/pull/961

Bug OCPBUGS-382: [OSD]Should disable the ability to add idp from console and make ClusterVersion YAML editor should be read onl

View the Description View the linked PRs

Description of problem:
OSD cluster, cluster admin is not allowed to update ClusterVersion details, however console is rendering an editable YAML editor

Version-Release number of selected component (if applicable):
4.10.18

How reproducible:
Always

Steps to Reproduce:
1. navigate to ClusterVersion YAML page /k8s/cluster/config.openshift.io~v1~ClusterVersion/version, click on YAML tab
2. cluster-admin is able to do some changes in YAML editor, however when saving the changes it will report
An error occurred
admission webhook "regular-user-validation.managed.openshift.io" denied the request: Prevented from accessing Red Hat managed resources. This is in an effort to prevent harmful actions that may cause unintended consequences or affect the stability of the cluster. If you have any questions about this, please reach out to Red Hat support at https://access.redhat.com/support

Actual results:
2. cluster admin user is able to edit but not allowed to save the changes

Expected results:

ISSUE 2:

Steps to Reproduce:
1.On OSD console, cluster admin user adds idp from "Administration"~~>"Cluster Settings"~~>"Configuration"->"OAuth",
2.
3.

Actual results:
1.Could add idp successfully.

Expected results:
1. Should disable the function to add idp from OSD console.

Created from:

https://github.com/openshift/console/pull/11922

Bug OCPBUGS-4095: Various Jenkins CVEs for October 2022 [openshift-4.10.z]

View the Description View the linked PRs

CVE-2022-45047
CVE-2022-45379
CVE-2022-45381
CVE-2022-45380
CVE-2022-43403
CVE-2022-43409
CVE-2022-43408
CVE-2022-43407
CVE-2022-43404
CVE-2022-43401
CVE-2022-43402
CVE-2022-43405
CVE-2022-43406
CVE-2022-42889
CVE-2022-25857
CVE-2022-36885
CVE-2022-36884
CVE-2022-36883
CVE-2022-34174
CVE-2022-30954
CVE-2022-30953
CVE-2022-30946
CVE-2022-36882
CVE-2022-30948
CVE-2022-30945
CVE-2021-26291
CVE-2022-29047

Bug OCPBUGS-724: Update blueocean-autofavorite to 1.2.5

View the Description View the linked PRs

Update blueocean-autofavorite to version 1.2.5, which does not bundle the groovy-sandbox library

https://github.com/openshift/jenkins/pull/1478

Bug OCPBUGS-6039: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/229

Bug OCPBUGS-647: Prefer local dns does not work expectedly on OCPv4.10

View the Description View the linked PRs

Description of problem:

When queried dns hostname from certain pod on the certain node, responded from random coredns pod, not prefer local one. Is it expected result ?

# In OCP v4.8.13 case
// Ran dig command on the certain node which is running the following test-7cc4488d48-tqc4m pod.
sh-4.4# while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done
:
07:16:33 :172.217.175.238
07:16:34 :172.217.175.238 <--- Refreshed the upstream result
07:16:36 :142.250.207.46
07:16:37 :142.250.207.46

// The dig results is matched with the running node one as you can see the above one.
$ oc rsh  test-7cc4488d48-tqc4m bash -c 'while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done'
:
07:16:35 :172.217.175.238 
07:16:36 :172.217.175.238 <--- At the same time, the pod dig result is also refreshed.
07:16:37 :142.250.207.46
07:16:38 :142.250.207.46


But in v4.10 case, in contrast, the dns query result is various and responded randomly regardless local dns results on the node as follows.

# In OCP v4.10.23 case, pod's response from DNS services are not consistent.
$ oc rsh test-848fcf8ddb-zrcbx  bash -c 'while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done'
07:23:00 :142.250.199.110
07:23:01 :142.250.207.46
07:23:02 :142.250.207.46
07:23:03 :142.250.199.110
07:23:04 :142.250.199.110
07:23:05 :172.217.161.78

# Even though the node which is running the pod keep responding the same IP...
sh-4.4# while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done
07:23:00 :172.217.161.78
07:23:01 :172.217.161.78
07:23:02 :172.217.161.78
07:23:03 :172.217.161.78
07:23:04 :172.217.161.78
07:23:05 :172.217.161.78

Version-Release number of selected component (if applicable):

v4.10.23 (ROSA)
SDN: OpenShiftSDN

How reproducible:

You can always reproduce this issue using "dig google.com" from both any pod and the node the pod running according to the above "Description" details.

Steps to Reproduce:

1. Run any usual pod, and check which node the pod is running on.
2. Run dig google.com on the pod and the node.
3. Check the IP is consistent with the running node each other.

Actual results:

The response IPs are not consistent and random IP is responded.

Expected results:

The response IP is kind of consistent, and aware of prefer local dns.

Additional info:

This issue affects EgressNetworkPolicy dnsName feature.

Bug OCPBUGS-713: Add user coreydaley to list of approvers for openshift/jenkins

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-709~~. The following is the description of the original issue:
—
Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/jenkins/pull/1477

Bug OCPBUGS-11579: startupProbe for UWM prometheus is still 15m

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-11465~~. The following is the description of the original issue:

—

This is a clone of issue ~~OCPBUGS-11404~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-11333~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-10690~~. The following is the description of the original issue:
—

Description of problem:

according to PR: https://github.com/openshift/cluster-monitoring-operator/pull/1824, startupProbe for UWM prometheus/platform prometheus should be 1 hour, but startupProbe for UWM prometheus is still 15m after enabled UWM, platform promethues does not have issue, startupProbe is increased to 1 hour

$ oc -n openshift-user-workload-monitoring get pod prometheus-user-workload-0 -oyaml | grep startupProbe -A20
    startupProbe:
      exec:
        command:
        - sh
        - -c
        - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready;
          elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready;
          else exit 1; fi
      failureThreshold: 60
      periodSeconds: 15
      successThreshold: 1
      timeoutSeconds: 3
...

$ oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep startupProbe -A20
    startupProbe:
      exec:
        command:
        - sh
        - -c
        - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready;
          elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready;
          else exit 1; fi
      failureThreshold: 240
      periodSeconds: 15
      successThreshold: 1
      timeoutSeconds: 3

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-03-19-052243

How reproducible:

always

Steps to Reproduce:

1. enable UWM, check startupProbe for UWM prometheus/platform prometheus
2.
3.

Actual results:

startupProbe for UWM prometheus is still 15m

Expected results:

startupProbe for UWM prometheus should be 1 hour

Additional info:

since startupProbe for platform prometheus is increased to 1 hour, and no similar bug for UWM prometheus, won't fix the issue is OK.

https://github.com/openshift/cluster-monitoring-operator/pull/1942

Bug OCPBUGS-4882: [2117255] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket"

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-3116~~. The following is the description of the original issue:
—
Description of problem:
-----------------------
On dualstack baremetal IPI cluster next error message is present in ovnkube logs:

oc logs -n openshift-ovn-kubernetes ovnkube-node-rvggh -c ovnkube-node
...

E0810 02:12:46.343460 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:13:16.347603 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:13:46.351108 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:14:16.355047 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:14:46.358950 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
I0810 02:15:13.313945 353971 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Service total 9 items received
E0810 02:15:16.362737 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:15:46.366490 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:16:16.369963 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
I0810 02:16:24.306561 353971 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Endpoints total 560 items received
E0810 02:16:46.373482 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:17:16.377497 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:17:46.380726 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
I0810 02:18:15.325871 353971 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Node total 50 items received
E0810 02:18:16.384732 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
I0810 02:18:38.299738 353971 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Pod total 9 items received
E0810 02:18:46.388162 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:19:16.391669 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
~~OCP-4~~.10.26

ovn-2021-21.12.0-58.el8fdp.x86_64
ovn-2021-host-21.12.0-58.el8fdp.x86_64
ovn-2021-central-21.12.0-58.el8fdp.x86_64
ovn-2021-vtep-21.12.0-58.el8fdp.x86_64

How reproducible:
-----------------
so far spotted on 2 different clusters

Steps to Reproduce:
-------------------
1. Deploy dualstack baremetal IPI cluster with OVNKubernetesHybrid network(add next to cluster's config before running cluster install):

defaultNetwork:
type: OVNKubernetes
ovnKubernetesConfig:
hybridOverlayConfig:
hybridClusterNetwork: []

Actual results:
---------------
Error message in logs

Expected results:
-----------------
No error message in logs

Additional info:
----------------
Baremetal dualstack setup with 3 masters and 4 workers, bonding configured for baremetal network on masters and workers

https://github.com/openshift/ovn-kubernetes/pull/1450

Bug OCPBUGS-5425: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-etcd-operator/pull/979

Bug OCPBUGS-466: PDB warning alert when CR replica count is set to zero (edit)

View the Description View the linked PRs

copy of BZ https://bugzilla.redhat.com/show_bug.cgi?id=2053622

Description of problem:

PodDisruptionBudgetAtLimit Warning alert when CR replica count is zero.

Version-Release number of selected component (if applicable):
4.7

How reproducible: Everytime

Steps to Reproduce:
1. oc new-project test

2. oc new-app httpd

3. oc create -f pdb

$ cat pdb.yaml

~~~
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
name: my-pdb
spec:
maxUnavailable: 1
selector:
matchLabels:
deployment: httpd
~~~

$ oc get pdb
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
my-pdb N/A 0 0 3h27m

4. oc scale deployment httpd --replicas=0

5. Wait for some time alert will be triggered at the console.

Actual results: unexpected warning alert

Expected results: As we are intentionally dropping down the replicas it should not generate an alert.

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/648

Bug OCPBUGS-5663: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/installer/pull/6766

Bug OCPBUGS-6814: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ovn-kubernetes/pull/1499

Bug OCPBUGS-1465: E2E: intermittent failure is seen on tests for devfile

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-1410~~. The following is the description of the original issue:
—
Clone of https://bugzilla.redhat.com/show_bug.cgi?id=2106803 to backport the e2e fix to 4.11 and 4.10.

Description of problem: E2E: intermittent failure is seen on tests for devfile due to network call to devfile registry

Deploy git workload with devfile from topology page: A-04-TC01

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_console/11726/pull-ci-openshift-console-master-e2e-gcp-console/1547046629775773696/artifacts/e2e-gcp-console/test/artifacts/gui_test_screenshots/cypress/screenshots/e2e/add-flow-ci.feature/Deploy%20git%20workload%20with%20devfile%20from%20topology%20page%20A-04-TC01%20--%20after%20each%20hook%20(failed).png

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_console/11768/pull-ci-openshift-console-master-e2e-gcp-console/1547061730478133248

Version-Release number of selected component (if applicable):

How reproducible: Intermittent

Steps to Reproduce:
1. Run test for add-flow-ci.feature to test Deploy git workload with devfile from topology page: A-04-TC01

Actual results:

Expected results: Show always pass

Additional info:

https://github.com/openshift/console/pull/12062

Bug OCPBUGS-3656: [release-4.10] DUALSTACK Intermittent service reach-ability failures on OVN-K amid allow from same namespace networkpolicy

View the Description View the linked PRs

Description of problem:

clone of https://bugzilla.redhat.com/show_bug.cgi?id=2076307

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/ovn-kubernetes/pull/1378

Bug OCPBUGS-5864: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-monitoring-operator/pull/1866

Bug OCPBUGS-64: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-network-operator/pull/1617

Bug OCPBUGS-6627: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/jenkins/pull/1624

Bug OCPBUGS-6754: Topology gets stuck loading

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-3235~~. The following is the description of the original issue:
—

Description of problem:

Frequently we see the loading state of the topology view, even when there aren't many resources in the project.

Including an example

Prerequisites (if any, like setup, operators/versions):

Steps to Reproduce

load topology
if it loads successfully, keep trying until it fails to load

Actual results:

topology will sometimes hang with the loading indicator showing indefinitely

Expected results:

topology should load consistently without fail

Reproducibility (Always/Intermittent/Only Once):

intermittent

Build Details:

4.9

Additional info:

https://github.com/openshift/console/pull/12486

Bug OCPBUGS-2546: Remove policy/v1beta1 in 4.10 and later

View the Description View the linked PRs

~~OCPBUGS-1251~~ landed an admin-ack gate in 4.11.z to help admins prepare for Kubernetes 1.25 API removals which are coming in OpenShift 4.12. Poking around in a 4.12.0-ec.2 cluster where APIRemovedInNextReleaseInUse is firing:

$ oc --as system:admin adm must-gather -- /usr/bin/gather_audit_logs
$ zgrep -h v1beta1/poddisruptionbudget must-gather.local.1378724704026451055/quay*/audit_logs/kube-apiserver/*.log.gz | jq -r '.verb + " " + (.user | .username + " " + (.extra["authentication.kubernetes.io/pod-name"] | tostr
ing))' | sort | uniq -c
parse error: Invalid numeric literal at line 29, column 6
     28 watch system:serviceaccount:openshift-machine-api:cluster-autoscaler ["cluster-autoscaler-default-5cf997b8d6-ptgg7"]

Finding the source for that container:

$ oc --as system:admin -n openshift-machine-api get -o json pod cluster-autoscaler-default-5cf997b8d6-ptgg7 | jq -r '.status.containerStatuses[].image'
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f81ab7ce0c851ba5e5169bba717cb54716ce5457cbe89d159c97a5c25fd820ed
$ oc image info quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f81ab7ce0c851ba5e5169bba717cb54716ce5457cbe89d159c97a5c25fd820ed | grep github
             SOURCE_GIT_URL=https://github.com/openshift/kubernetes-autoscaler
             io.openshift.build.commit.url=https://github.com/openshift/kubernetes-autoscaler/commit/1dac0311b9842958ec630273428b74703d51c1c9
             io.openshift.build.source-location=https://github.com/openshift/kubernetes-autoscaler

Poking about in the source:

$ git clone --depth 30 --branch master https://github.com/openshift/kubernetes-autoscaler.git
$ cd kubernetes-autoscaler
$ find . -name vendor
./addon-resizer/vendor
./cluster-autoscaler/vendor
./vertical-pod-autoscaler/e2e/vendor
./vertical-pod-autoscaler/vendor

Lots of vendoring. I haven't checked to see how new the client code is in the various vendor packages. But the main issue seems to be the v1beta1 in:

$ git grep policy cluster-autoscaler/core cluster-autoscaler/utils | grep policy.*v1beta1
cluster-autoscaler/core/scaledown/actuation/actuator_test.go:   policyv1beta1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/core/scaledown/actuation/actuator_test.go:                                   eviction := createAction.GetObject().(*policyv1beta1.Eviction)
cluster-autoscaler/core/scaledown/actuation/drain.go:   policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/core/scaledown/actuation/drain_test.go:      policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/core/scaledown/legacy/legacy.go:     policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/core/scaledown/legacy/wrapper.go:    policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/core/scaledown/scaledown.go: policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/core/static_autoscaler_test.go:      policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/utils/drain/drain.go:        policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/utils/drain/drain_test.go:   policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/utils/kubernetes/listers.go: policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/utils/kubernetes/listers.go: v1policylister "k8s.io/client-go/listers/policy/v1beta1"

The main change from v1beta1 to v1 involves spec.selector; I dunno if that's relevant to the autoscaler use-case or not.

Do we run autoscaler CI? I was poking around a bit, but did not find a 4.12 periodic excercising the autoscaler that might have turned up this alert and issue.

https://github.com/openshift/kubernetes-autoscaler/pull/245

Bug OCPBUGS-1659: Whereabouts should allow non default interfaces to Pod IP list [backport 4.10]

View the Description View the linked PRs

Description of problem:

Whereabouts doesn't allow the use of network interface names that are not preceded by the prefix "net", see https://github.com/k8snetworkplumbingwg/whereabouts/issues/130.

Version-Release number of selected component (if applicable):

How reproducible:

Always

Steps to Reproduce:

1. Define two Pods, one with the interface name 'port1' and the other with 'net-port1':

test-ip-removal-port1:	
              k8s.v1.cni.cncf.io/networks:
                [
                  {
                    "name": "test-sriovnd",
                    "interface": "port1",
                    "namespace": "default"
                  }
                ]

test-ip-removal-net-port1:
              k8s.v1.cni.cncf.io/networks:
                [
                  {
                    "name": "test-sriovnd",
                    "interface": "net-port1",
                    "namespace": "default"
                  }
                ]

2. IP allocated in the IPPool:

kind: IPPool
...
spec:
  allocations:
    "16":
      id: ...
      podref: test-ecoloma-1/test-ip-removal-port1
    "17":
      id: ...
      podref: test-ecoloma-1/test-ip-removal-net-port1

3. When the ip-reconciler job is run, the allocation for the port with the interface name 'port1' is removed:

[13:29][]$ oc get cronjob -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        14m             11d

[13:29][]$ oc get ippools.whereabouts.cni.cncf.io -n openshift-multus   2001-1b70-820d-2610---64 -o yaml
apiVersion: whereabouts.cni.cncf.io/v1alpha1
kind: IPPool
metadata:
...
spec:
  allocations:
    "17":
      id: ...
      podref: test-ecoloma-1/test-ip-removal-net-port1
  range: 2001:1b70:820d:2610::/64

[13:30][]$ oc get cronjob -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        9s              11d

Actual results:

The network interface with a name that doesn't have a 'net' prefix is removed from the ip-reconciler cronjob.

Expected results:

The network interface must not be removed, regardless of the name.

Additional info:

Upstream PR @ https://github.com/k8snetworkplumbingwg/whereabouts/pull/147 master PR @ https://github.com/openshift/whereabouts-cni/pull/94

https://github.com/openshift/whereabouts-cni/pull/98

Bug OCPBUGS-235: ClusterVersion availableUpdates is stale: PromQL conditional risks vs. slow/stuck Thanos

View the Description View the linked PRs

Description of problem:
We are seeing customer's upgrade cannot kickoff due to the availableUpdates is null in clusterversion CR
Version-Release number of the following components:

How reproducible:
sometime
Steps to Reproduce:
1.
2.
3.

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

Expected results:

Additional info:

https://github.com/openshift/cluster-version-operator/pull/825

Bug OCPBUGS-2583: User settings resources (ConfigMap, Role, RB) should be deleted when a user is deleted

View the Description View the linked PRs

This bug is a backport clone of [Bugzilla Bug 2019564](https://bugzilla.redhat.com/show_bug.cgi?id=2019564). The following is the description of the original bug:
—
Description of problem:
On OpenShift Dev Sandbox users are automatically deleted after a month. Also on other clusters we should cleanup resources when the user are deleted.

Version-Release number of selected component (if applicable):
4.8+

How reproducible:
Always

Steps to Reproduce:
1. Create a user
2. Use console (this should automatically create 3 resources in the namespace "openshift-console-user-settings" (a Role, a RoleBinding and a ConfigMap)
3. Delete the user

Actual results:
The three created resources in namespace "openshift-console-user-settings" stays forever

Expected results:
The three created resources in namespace "openshift-console-user-settings" should be automatically removed

Additional info:
None

https://github.com/openshift/console/pull/12189

Bug OCPBUGS-6079: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/jenkins/pull/1639

Bug OCPBUGS-6690: OLM details page crashes on incomplete ClusterServiceVersion resource

View the Description View the linked PRs

Description of problem:
When creating a incomplete ClusterServiceVersion resource the OLM details page crashes (on 4.11).

apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: minimal-csv
  namespace: christoph
spec:
  apiservicedefinitions:
    owned:
      - group: A
        kind: A
        name: A
        version: v1
  customresourcedefinitions:
    owned:
      - kind: B
        name: B
        version: v1
  displayName: My minimal CSV
  install:
    strategy: ''

Version-Release number of selected component (if applicable):
Crashes on 4.8-4.11, work fine from 4.12 onwards.

How reproducible:
Alway

Steps to Reproduce:
1. Apply the ClusterServiceVersion YAML from above
2. Open the Admin perspective > Installed Operator > Operator detail page

Actual results:
Details page crashes on tab A and B.

Expected results:
Page should not crash

Additional info:
Thi is a follow up on https://bugzilla.redhat.com/show_bug.cgi?id=2084287

https://github.com/openshift/console/pull/12481

Story MON-1679: use static authorizer feature of kube-rbac-proxy

View the Description View the linked PRs

The static authorizer feature has landed in upstream kube-rbac-proxy. Lets use it by configuring a static authorizer for all requests that hit a /metrics endpoint.

DoD:

Downstream kube-rbac-proxy is synced.
All CMO operands are configured with static authorization.
Bugzillas created for all non-monitoring components using kube-rbac-proxy for metrics authn/authz.

https://github.com/openshift/cluster-monitoring-operator/pull/1318

Bug OCPBUGS-1517: [OCP 4.10] Fix generate script in CBO

View the Description View the linked PRs

plus smaller fixes
this is a backport of https://github.com/openshift/cluster-baremetal-operator/pull/263

https://github.com/openshift/cluster-baremetal-operator/pull/291

Bug OCPBUGS-1987: Insight Operator keeps rebooting on Openshift 4.10.18 in disconnected environment

View the Description View the linked PRs

Description of problem:

Insight Operator keeps rebooting on Openshift 4.10.18 in disconnected environment

Version-Release number of selected component (if applicable):

Openshift Container Platform 4.10.18

How reproducible:

Install Openshift 4.10.18 in disconnected environment

Steps to Reproduce:

1. Download the global cluster pull secret to your local file system.
 2. In a text editor, edit the .dockerconfigjson file that was downloaded. 3. Remove the cloud.openshift.com JSON entry
4. Observe that insight operator keeps rebooting

Actual results:

Insight Operator keeps rebooting

Expected results:

Insight Operator should be stable

Additional info:

Disconnected Environment

https://github.com/openshift/insights-operator/pull/718

Bug OCPBUGS-249: Incorrect NAT when using cluster networking in control-plane nodes to install a VRRP Cluster

View the Description View the linked PRs

+++ This bug was initially created as a clone of
Bug #2070318
+++

Description of problem:
In OCP VRRP deployment (using OCP cluster networking), we have an additional data interface which is configured along with the regular management interface in each control node. In some deployments, the kubernetes address 172.30.0.1:443 is nat’ed to the data management interface instead of the mgmt interface (10.40.1.4:6443 vs 10.30.1.4:6443 as we configure the boostrap node) even though the default route is set to 10.30.1.0 network. Because of that, all requests to 172.30.0.1:443 were failed. After 10-15 minutes, OCP magically fixes it and nat’ing correctly to 10.30.1.4:6443.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.Provision OCP cluster using cluster networking for DNS & Load Balancer instead of external DNS & Load Balancer. Provision the host with 1 management interface and an additional interface for data network. Along with OCP manifest, add manifest to create a pod which will trigger communication with kube-apiserver.

2.Start cluster installation.

3.Check on the custom pod log in the cluster when the first 2 master nodes were installing to see GET operation to kube-apiserver timed out. Check nft table and chase the ip chains to see the that the data IP address was nat'ed to kubernetes service IP address instead of the management IP. This is not happening all the time, we have seen 50:50 chance.

Actual results:
After 10-15 minutes OCP will correct that by itself.

Expected results:
Wrong natting should not happen.

Additional info:
ClusterID: 24bbde0b-79b3-4ae6-afc5-cb694fa48895
ClusterVersion: Stable at "4.8.29"
ClusterOperators:
clusteroperator/authentication is not available (OAuthServerRouteEndpointAccessibleControllerAvailable: Get "
https://oauth-openshift.apps.ocp-binhle-wqepch.contrail.juniper.net/healthz
": context deadline exceeded (Client.Timeout exceeded while awaiting headers)) because OAuthServerRouteEndpointAccessibleControllerDegraded: Get "
https://oauth-openshift.apps.ocp-binhle-wqepch.contrail.juniper.net/healthz
": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
clusteroperator/baremetal is degraded because metal3 deployment inaccessible
clusteroperator/console is not available (RouteHealthAvailable: failed to GET route (
https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health
): Get "
https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health
": context deadline exceeded (Client.Timeout exceeded while awaiting headers)) because RouteHealthDegraded: failed to GET route (
https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health
): Get "
https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health
": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
clusteroperator/dns is progressing: DNS "default" reports Progressing=True: "Have 4 available DNS pods, want 5."
clusteroperator/ingress is degraded because The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
clusteroperator/insights is degraded because Unable to report: unable to build request to connect to Insights server: Post "
https://cloud.redhat.com/api/ingress/v1/upload
": dial tcp: lookup cloud.redhat.com on 172.30.0.10:53: read udp 10.128.0.26:53697->172.30.0.10:53: i/o timeout
clusteroperator/network is progressing: DaemonSet "openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)

— Additional comment from
bnemec@redhat.com
on 2022-03-30 20:00:25 UTC —

This is not managed by runtimecfg, but in order to route the bug correctly I need to know which CNI plugin you're using - OpenShiftSDN or OVNKubernetes. Thanks.

— Additional comment from
lpbinh@gmail.com
on 2022-03-31 08:09:11 UTC —

Hi Ben,

We were deploying Contrail CNI with OCP. However, this issue happens at very early deployment time, right after the bootstrap node is started
and there's no SDN/CNI there yet.

— Additional comment from
bnemec@redhat.com
on 2022-03-31 15:26:23 UTC —

Okay, I'm just going to send this to the SDN team then. They'll be able to provide more useful input than I can.

— Additional comment from
trozet@redhat.com
on 2022-04-04 15:22:21 UTC —

Can you please provide the iptables rules causing the DNAT as well as the routes on the host? Might be easiest to get a sosreport during initial bring up during that 10-15 min when the problem occurs.

— Additional comment from
lpbinh@gmail.com
on 2022-04-05 16:45:13 UTC —

All nodes have two interfaces:

eth0: 10.30.1.0/24
eth1: 10.40.1.0/24

machineNetwork is 10.30.1.0/24
default route points to 10.30.1.1

The kubeapi service ip is 172.30.0.1:443

all Kubernetes services are supposed to be reachable via machineNetwork (10.30.1.0/24)

To make the kubeapi service ip reachable in hostnetwork, something (openshift installer?) creates a set of nat rules which translates the service ip to the real ip of the nodes which have kubeapi active.

Initially kubeapi is only active on the bootstrap node so there should be a nat rule like

172.30.0.1:443 -> 10.30.1.10:6443 (assuming that 10.30.1.10 is the bootstrap nodes' ip address in the machine network)

However, what we see is
172.30.0.1:443 -> 10.40.1.10:6443 (which is the bootstrap nodes' eth1 ip address)

The rule is configured on the controller nodes and lead to asymmetrical routing as the controller sends a packet FROM machineNetwork (10.30.1.x) to 172.30.0.1 which is then translated and forwarded to 10.40.1.10 which then tries to reply back on the 10.40.1.0 network which fails as the request came from 10.30.1.0 network.

So, we want to understand why openshift installer picks the 10.40.1.x ip address rather than the 10.30.1.x ip for the nat rule. What's the mechanism for getting the ip in case the system has multiple interfaces with ips configured.

Note: after a while (10-20 minutes) the bootstrap process resets itself and then it picks the correct ip address from the machineNetwork and things start to work.

— Additional comment from
smerrow@redhat.com
on 2022-04-13 13:55:04 UTC —

Note from Juniper regarding requested SOS report:

In reference to
https://bugzilla.redhat.com/show_bug.cgi?id=2070318
that @Binh Le has been working on. The mustgather was too big to upload for this Bugzilla. Can you access this link?
https://junipernetworks-my.sharepoint.com/:u:/g/personal/sleigon_juniper_net/ETOrHMqao1tLm10Gmq9rzikB09H5OUwQWZRAuiOvx1nZpQ

Making note private to hide partner link

— Additional comment from
smerrow@redhat.com
on 2022-04-21 12:24:33 UTC —

Can we please get an update on this BZ?

Do let us know if there is any other information needed.

— Additional comment from
trozet@redhat.com
on 2022-04-21 14:06:00 UTC —

Can you please provide another link to the sosreport? Looks like the link is dead.

— Additional comment from
smerrow@redhat.com
on 2022-04-21 19:01:39 UTC —

See mustgather here:
https://drive.google.com/file/d/16y9IfLAs7rtO-SMphbYBPgSbR4od5hcQ
— Additional comment from
trozet@redhat.com
on 2022-04-21 20:57:24 UTC —

Looking at the must-gather I think your iptables rules are most likely coming from the fact that kube-proxy is installed:

[trozet@fedora must-gather.local.288458111102725709]$ omg get pods -n openshift-kube-proxy
NAME READY STATUS RESTARTS AGE
openshift-kube-proxy-kmm2p 2/2 Running 0 19h
openshift-kube-proxy-m2dz7 2/2 Running 0 16h
openshift-kube-proxy-s9p9g 2/2 Running 1 19h
openshift-kube-proxy-skrcv 2/2 Running 0 19h
openshift-kube-proxy-z4kjj 2/2 Running 0 19h

I'm not sure why this is installed. Is it intentional? I don't see the configuration in CNO to enable kube-proxy. Anyway the node IP detection is done via:
https://github.com/kubernetes/kubernetes/blob/f173d01c011c3574dea73a6fa3e20b0ab94531bb/cmd/kube-proxy/app/server.go#L844
Which just looks at the IP of the node. During bare metal install a VIP is chosen and used with keepalived for kubelet to have kapi access. I don't think there is any NAT rule for services until CNO comes up. So I suspect what really is happening is your node IP is changing during install, and kube-proxy is getting deployed (either intentionally or unintentionally) and that is causing the behavior you see. The node IP is chosen via the node ip configuration service:
https://github.com/openshift/machine-config-operator/blob/da6494c26c643826f44fbc005f26e0dfd10513ae/templates/common/_base/units/nodeip-configuration.service.yaml
This service will determine the node ip via which interfaces have a default route and which one has the lowest metric. With your 2 interfaces, do they both have default routes? If so, are they using dhcp and perhaps its random which route gets installed with a lower metric?

— Additional comment from
trozet@redhat.com
on 2022-04-21 21:13:15 UTC —

Correction: looks like standalone kube-proxy is installed by default when the provider is not SDN, OVN, or kuryr so this looks like the correct default behavior for kube-proxy to be deployed.

— Additional comment from
lpbinh@gmail.com
on 2022-04-25 04:05:14 UTC —

Hi Tim,

You are right, kube-proxy is deployed by default and we don't change that behavior.

There is only 1 default route configured for the management interface (10.30.1.x) , we used to have a default route for the data/vrrp interface (10.40.1.x) with higher metric before. As said, we don't have the default route for the second interface any more but still encounter the issue pretty often.

— Additional comment from
trozet@redhat.com
on 2022-04-25 14:24:05 UTC —

Binh, can you please provide a sosreport for one of the nodes that shows this behavior? Then we can try to figure out what is going on with the interfaces and the node ip service. Thanks.

— Additional comment from
trozet@redhat.com
on 2022-04-25 16:12:04 UTC —

Actually Ben reminded me that the invalid endpoint is actually the boostrap node itself:
172.30.0.1:443 -> 10.30.1.10:6443 (assuming that 10.30.1.10 is the bootstrap nodes' ip address in the machine network)

vs
172.30.0.1:443 -> 10.40.1.10:6443 (which is the bootstrap nodes' eth1 ip address)

So maybe a sosreport off that node is necessary? I'm not as familiar with the bare metal install process, moving back to Ben.

— Additional comment from
lpbinh@gmail.com
on 2022-04-26 08:33:45 UTC —

Created attachment 1875023 [details]sosreport

— Additional comment from
lpbinh@gmail.com
on 2022-04-26 08:34:59 UTC —

Created attachment 1875024 [details]sosreport-part2

Hi Tim,

We observe this issue when deploying clusters using OpenStack instances as our infrastructure is based on OpenStack.

I followed the steps here to collect the sosreport:
https://docs.openshift.com/container-platform/4.8/support/gathering-cluster-data.html
Got the sosreport which is 22MB which exceeds the size permitted (19MB), so I split it to 2 files (xaa and xab), if you can't join them then we will need to put the collected sosreport on a share drive like we did with the must-gather data.

Here are some notes about the cluster:

First two control nodes are below, ocp-binhle-8dvald-ctrl-3 is the bootstrap node.

[core@ocp-binhle-8dvald-ctrl-2 ~]$ oc get node
NAME STATUS ROLES AGE VERSION
ocp-binhle-8dvald-ctrl-1 Ready master 14m v1.21.8+ed4d8fd
ocp-binhle-8dvald-ctrl-2 Ready master 22m v1.21.8+ed4d8fd

We see the behavior that wrong nat'ing was done at the beginning, then corrected later:

sh-4.4# nft list table ip nat | grep 172.30.0.1
meta l4proto tcp ip daddr 172.30.0.1 tcp dport 443 counter packets 3 bytes 180 jump KUBE-SVC-NPX46M4PTMTKRN6Y
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
chain KUBE-SVC-NPX46M4PTMTKRN6Y

{ counter packets 3 bytes 180 jump KUBE-SEP-VZ2X7DROOLWBXBJ4 }

}
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4

{ ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 3 bytes 180 dnat to 10.40.1.7:6443 }

}
sh-4.4#
sh-4.4#
<....after a while....>
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
chain KUBE-SVC-NPX46M4PTMTKRN6Y

{ counter packets 0 bytes 0 jump KUBE-SEP-X33IBTDFOZRR6ONM }
}
sh-4.4# nft list table ip nat | grep 172.30.0.1
meta l4proto tcp ip daddr 172.30.0.1 tcp dport 443 counter packets 0 bytes 0 jump KUBE-SVC-NPX46M4PTMTKRN6Y
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
chain KUBE-SVC-NPX46M4PTMTKRN6Y { counter packets 0 bytes 0 jump KUBE-SEP-X33IBTDFOZRR6ONM }

}
sh-4.4# nft list chain ip nat KUBE-SEP-X33IBTDFOZRR6ONM
table ip nat {
chain KUBE-SEP-X33IBTDFOZRR6ONM

{ ip saddr 10.30.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 0 bytes 0 dnat to 10.30.1.7:6443 }

}
sh-4.4#

— Additional comment from
lpbinh@gmail.com
on 2022-05-12 17:46:51 UTC —

@
trozet@redhat.com
May we have an update on the fix, or the plan for the fix? Thank you.

— Additional comment from
lpbinh@gmail.com
on 2022-05-18 21:27:45 UTC —

Created support Case 03223143.

— Additional comment from
vkochuku@redhat.com
on 2022-05-31 16:09:47 UTC —

Hello Team,

Any update on this?

Thanks,
Vinu K

— Additional comment from
smerrow@redhat.com
on 2022-05-31 17:28:54 UTC —

This issue is causing delays in Juniper's CI/CD pipeline and makes for a less than ideal user experience for deployments.

I'm getting a lot of pressure from the partner on this for an update and progress. I've had them open a case [1] to help progress.

Please let us know if there is any other data needed by Juniper or if there is anything I can do to help move this forward.

[1]
https://access.redhat.com/support/cases/#/case/03223143
— Additional comment from
vpickard@redhat.com
on 2022-06-02 22:14:23 UTC —

@
bnemec@redhat.com
Tim mentioned in
https://bugzilla.redhat.com/show_bug.cgi?id=2070318#c14
that this issue appears to be at BM install time. Is this something you can help with, or do we need help from the BM install team?

— Additional comment from
bnemec@redhat.com
on 2022-06-03 18:15:17 UTC —

Sorry, I missed that this came back to me.

(In reply to Binh Le from
comment #16
)> We observe this issue when deploying clusters using OpenStack instances as
> our infrastructure is based on OpenStack.This does not match the configuration in the must-gathers provided so far, which are baremetal. Are we talking about the same environments?

I'm currently discussing this with some other internal teams because I'm unfamiliar with this type of bootstrap setup. I need to understand what the intended behavior is before we decide on a path forward.

— Additional comment from
rurena@redhat.com
on 2022-06-06 14:36:54 UTC —

(In reply to Ben Nemec from
comment #22
)> Sorry, I missed that this came back to me.
>
> (In reply to Binh Le from comment #16)
> > We observe this issue when deploying clusters using OpenStack instances as
> > our infrastructure is based on OpenStack.
>
> This does not match the configuration in the must-gathers provided so far,
> which are baremetal. Are we talking about the same environments?
>
> I'm currently discussing this with some other internal teams because I'm
> unfamiliar with this type of bootstrap setup. I need to understand what the
> intended behavior is before we decide on a path forward.I spoke to the CU they tell me that all work should be on baremetal. They were probably just testing on OSP and pointing out that they saw the same behavior.

— Additional comment from
bnemec@redhat.com
on 2022-06-06 16:19:37 UTC —

Okay, I see now that this is an assisted installer deployment. Can we get the cluster ID assigned by AI so we can take a look at the logs on our side? Thanks.

— Additional comment from
lpbinh@gmail.com
on 2022-06-06 16:38:56 UTC —

Here is the cluster ID, copied from the bug description:
ClusterID: 24bbde0b-79b3-4ae6-afc5-cb694fa48895

In regard to your earlier question about OpenStack & baremetal (2022-06-03 18:15:17 UTC):

We had an issue with platform validation in OpenStack earlier. Host validation was failing with the error message “Platform network settings: Platform OpenStack Compute is allowed only for Single Node OpenShift or user-managed networking.”

It's found out that there is no platform type "OpenStack" available in [
https://github.com/openshift/assisted-service/blob/master/models/platform_type.go#L29
] so we set "baremetal" as the platform type on our computes. That's the reason why you are seeing baremetal as the platform type.

Thank you

— Additional comment from
ercohen@redhat.com
on 2022-06-08 08:00:18 UTC —

Hey, first you are currect, When you set 10.30.1.0/24 as the machine network, the bootstrap process should use the IP on that subnet in the bootstrap node.

I'm trying to understand how exactly this cluster was installed.
You are using on-prem deployment of assisted-installer (podman/ACM)?
You are trying to form a cluster from OpenStack Vms?
You set the platform to Baremetal where?
Did you set user-managed-netwroking?

Some more info, when using OpenStack platform you should install the cluster with user-managed-netwroking.
And that's what the failing validation is for.

— Additional comment from
bnemec@redhat.com
on 2022-06-08 14:56:53 UTC —

Moving to the assisted-installer component for further investigation.

— Additional comment from
lpbinh@gmail.com
on 2022-06-09 07:37:54 UTC —

@Eran Cohen:

Please see my response inline.

You are using on-prem deployment of assisted-installer (podman/ACM)?
--> Yes, we are using on-prem deployment of assisted-installer.

You are trying to form a cluster from OpenStack Vms?
--> Yes.

You set the platform to Baremetal where?
--> It was set in the Cluster object, Platform field when we model the cluster.

Did you set user-managed-netwroking?
--> Yes, we set it to false for VRRP.

— Additional comment from
itsoiref@redhat.com
on 2022-06-09 08:17:23 UTC —

@
lpbinh@gmail.com
can you please share assisted logs that you can download when cluster is failed or installed?
Will help us to see the full picture

— Additional comment from
ercohen@redhat.com
on 2022-06-09 08:23:18 UTC —

OK, as noted before when using OpenStack platform you should install the cluster with user-managed-netwroking (set to true).
Can you explain how you workaround this failing validation? “Platform network settings: Platform OpenStack Compute is allowed only for Single Node OpenShift or user-managed networking.”
What does this mean exactly? 'we set "baremetal" as the platform type on our computes'

To be honest I'm surprised that the installation was completed successfully.

@
oamizur@redhat.com
I thought installing on OpenStack VMs with baremetal platform (user-managed-networking=false) will always fail?

— Additional comment from
lpbinh@gmail.com
on 2022-06-10 16:04:56 UTC —

@
itsoiref@redhat.com
: I will reproduce and collect the logs. Is that supposed to be included in the provided must-gather?
@
ercohen@redhat.com
:

user-managed-networking set to true when we use external Load Balancer and DNS server. For VRRP we use OpenShift's internal LB and DNS server hence it's set to false, following the doc.
As explained OpenShift returns platform type as 'none' for OpenStack:
https://github.com/openshift/assisted-service/blob/master/models/platform_type.go#L29
, therefore we set the platformtype as 'baremetal' in the cluster object for provisioning the cluster using OpenStack VMs.

— Additional comment from
itsoiref@redhat.com
on 2022-06-13 13:08:17 UTC —

@
lpbinh@gmail.com
you will have download_logs link in UI. Those logs are not part of must-gather

— Additional comment from
lpbinh@gmail.com
on 2022-06-14 18:52:02 UTC —

Created attachment 1889993 [details]cluster log per need info request - Cluster ID caa475b0-df04-4c52-8ad9-abfed1509506

Attached is the cluster log per need info request.
Cluster ID: caa475b0-df04-4c52-8ad9-abfed1509506
In this reproduction, the issue is not resolved by OpenShift itself, wrong NAT still remained and cluster deployment failed eventually

sh-4.4# nft list table ip nat | grep 172.30.0.1
meta l4proto tcp ip daddr 172.30.0.1 tcp dport 443 counter packets 2 bytes 120 jump KUBE-SVC-NPX46M4PTMTKRN6Y
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
chain KUBE-SVC-NPX46M4PTMTKRN6Y

{ counter packets 2 bytes 120 jump KUBE-SEP-VZ2X7DROOLWBXBJ4 }
}
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4 { ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 2 bytes 120 dnat to 10.40.1.7:6443 }
}
Tue Jun 14 17:40:19 UTC 2022
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4 { ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 2 bytes 120 dnat to 10.40.1.7:6443 }
}
Tue Jun 14 17:59:19 UTC 2022
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4 { ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 9 bytes 540 dnat to 10.40.1.7:6443 }
}
Tue Jun 14 18:17:38 UTC 2022
sh-4.4#
sh-4.4#
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4 { ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 7 bytes 420 dnat to 10.40.1.7:6443 }
}
Tue Jun 14 18:49:28 UTC 2022
sh-4.4#

— Additional comment from
lpbinh@gmail.com
on 2022-06-14 18:56:06 UTC —

Created attachment 1889994 [details]cluster log per need info request - Cluster ID caa475b0-df04-4c52-8ad9-abfed1509506

Please find the cluster-log attached per your request. In this deployment the wrong NAT was not automatically resolved by OpenShift hence the deployment failed eventually.

sh-4.4# nft list table ip nat | grep 172.30.0.1
meta l4proto tcp ip daddr 172.30.0.1 tcp dport 443 counter packets 2 bytes 120 jump KUBE-SVC-NPX46M4PTMTKRN6Y
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
chain KUBE-SVC-NPX46M4PTMTKRN6Y { counter packets 2 bytes 120 jump KUBE-SEP-VZ2X7DROOLWBXBJ4 }

}
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4

{ ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 2 bytes 120 dnat to 10.40.1.7:6443 }

}
Tue Jun 14 17:40:19 UTC 2022
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4

{ ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 2 bytes 120 dnat to 10.40.1.7:6443 }

}
Tue Jun 14 17:59:19 UTC 2022
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4

{ ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 9 bytes 540 dnat to 10.40.1.7:6443 }

}
Tue Jun 14 18:17:38 UTC 2022
sh-4.4#
sh-4.4#
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4

{ ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 7 bytes 420 dnat to 10.40.1.7:6443 }

}
Tue Jun 14 18:49:28 UTC 2022
sh-4.4#

— Additional comment from
itsoiref@redhat.com
on 2022-06-15 15:59:22 UTC —

@
lpbinh@gmail.com
just for the protocol, we don't support baremetal ocp on openstack that's why validation is failing

— Additional comment from
lpbinh@gmail.com
on 2022-06-15 17:47:39 UTC —

@
itsoiref@redhat.com
as explained it's just a workaround on our side to make OCP work in our lab, and from my understanding on OCP perspective it will see that deployment is on baremetal only, not related to OpenStack (please correct me if I am wrong).

We have been doing thousands of OCP cluster deployments in our automation so far, if it's why validation is failing then it should be failing every time. However it only occurs occasionally when nodes have 2 interfaces, using OCP internal DNS and Load balancer, and sometime resolved by itself and sometime not.

— Additional comment from
itsoiref@redhat.com
on 2022-06-19 17:00:01 UTC —

For now i can assume that this endpoint is causing the issue:
{
"apiVersion": "v1",
"kind": "Endpoints",
"metadata": {
"creationTimestamp": "2022-06-14T17:31:10Z",
"labels":

{ "endpointslice.kubernetes.io/skip-mirror": "true" }

,
"name": "kubernetes",
"namespace": "default",
"resourceVersion": "265",
"uid": "d8f558be-bb68-44ac-b7c2-85ca7a0fdab3"
},
"subsets": [
{
"addresses": [

{ "ip": "10.40.1.7" }

],
"ports": [
{
"name": "https",
"port": 6443,
"protocol": "TCP"
}
]
}
]
},

— Additional comment from
itsoiref@redhat.com
on 2022-06-21 17:03:51 UTC —

The issue is that kube-api service advertise wrong ip but it does it cause kubelet chooses the one arbitrary and we currently have no mechanism to set kubelet ip, especially in bootstrap flow.

— Additional comment from
lpbinh@gmail.com
on 2022-06-22 16:07:29 UTC —

@
itsoiref@redhat.com
how do you perform OCP deployment in setups that have multiple interfaces if letting kubelet chooses an interface arbitrary instead of configuring a specific IP address for it to listen on? With what you describe above chance of deployment failure in system with multiple interfaces would be high.

— Additional comment from
dhellard@redhat.com
on 2022-06-24 16:32:26 UTC —

I set the Customer Escalation flag = Yes, per ACE EN-52253.
The impact is noted by the RH Account team: "Juniper is pressing and this impacts the Unica Next Project at Telefónica Spain. Unica Next is a critical project for Red Hat. We go live the 1st of July and this issue could impact the go live dates. We need clear information about the status and its possible resolution.

— Additional comment from
itsoiref@redhat.com
on 2022-06-26 07:28:44 UTC —

I have sent an image with possible fix to Juniper and waiting for their feedback, once they will confirm it works for them we will proceed with the PRs.

— Additional comment from
pratshar@redhat.com
on 2022-06-30 13:26:26 UTC —

=== In Red Hat Customer Portal Case 03223143 ===
— Comment by Prateeksha Sharma on 6/30/2022 6:56 PM —

//EMT note//

Update from our consultant Manuel Martinez Briceno -

====
on 28th June, 2022 the last feedback from Juniper Project Manager and our Partner Manager was that they are testing the fix. They didn't give an Estimate Time to finish, but we will be tracking this closely and let us know of any news.
====

Thanks & Regards,
Prateeksha Sharma
Escalation Manager | RHCSA
Global Support Services, Red Hat

https://github.com/openshift/installer/pull/6231

Bug OCPBUGS-550: Input values in Instantiate Template are disappeared randomly in the developer console

View the Description View the linked PRs

#Description of problem:

Developer Console > +ADD > Develoeper Catalog > Service > select Types Templates > Initiate Template

Input values in Instantiate Template are disappeared randomly.

#Version-Release number of selected component (if applicable):

Customer ENV

OCP4.10.9
Developer Console
Edge 88x. / Edge 85.0.x / Chrome 97.x /Chrome 88.x
Internet Disconnected OCP cluster

quicklab test ENV

Developer Console
OCP4.10.12
Chrome 100

#How reproducible:

I reproduced this issue in ocp410ovn shared cluster in the quicklab

https://console-openshift-console.apps.sharedocp4upi410ovn.lab.upshift.rdu2.redhat.com

Select Apache HTTP Server > Input name "test" in Application Hostname box
After several seconds, the value has disappeared in the web console.

#Steps to Reproduce:

0. Developer Console > +ADD > Develoeper Catalog > Service > select Types Templates > Initiate Template

1. Input values in the box of template menu.

2. The values are disappeared after several seconds later. (20s~ or randomly)

3. Many users have experienced this issue.

The web browser version of users experiencing this issue.

Customer: Edge 88x. / Edge 85.0.x / Chrome 97.x /Chrome 88.x
My browser: Chrome 102.x

==> the browser version doesn't matter.

#Actual results:

Input values in "Instantiate Template" are disappeared randomly.
Users can't use the Initiate Template feature in the Dev console.

#Expected results:
Input values remain in the web console and users creat the object by the "Instantiate Template"

#Additional info:

See "Application Name" has disappeared in the video I attached.

https://github.com/openshift/console/pull/11983

Task MON-1890: update openshift/kube-state-metric to 2.2.0

View the Description View the linked PRs

New release https://github.com/kubernetes/kube-state-metrics/releases

https://github.com/openshift/kube-state-metrics/pull/61

Bug OCPBUGS-3161: "CatalogSource not found" in the virtualization overview

View the Description View the linked PRs

Description of problem:

In the WebUI for the Virtualization Overview, the details of "Service name", "Provider", and "Update Channel" have no value displayed. The "OpenShift Virtulization version" is showing "Cannot update CatalogSource not found"

Version-Release number of selected component (if applicable):

v4.10.4

How reproducible:

In 3 environments that have recently been deployed all show the same thing. 100%

Steps to Reproduce:

1. Install the OpenShift Virtualization operator from the WedUI
2. Use the suggested options
3.

Actual results:

The Details card is showing a warning that it cannot update

Expected results:

The Details card should have all values provided.

Additional info:

https://github.com/openshift/console/pull/12202

Bug OCPBUGS-4810: MCO does not sync kubeAPIServerServingCAData to controllerconfig if there are not ready nodes

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-3882~~. The following is the description of the original issue:
—
This bug is a backport clone of [Bugzilla Bug 2034883](https://bugzilla.redhat.com/show_bug.cgi?id=2034883). The following is the description of the original bug:
—
Description of problem:

Situation (starting point):

There is an ongoing change to the machine-config-daemon daemonset being applied by the machine-config-operator pod. It is waiting for the daemonset to roll out.
There are some nodes not ready, so daemonset rollout never ends and waiting on that ends in timeout error.

Problem:

Machine-config-operator pods stops trying to reconcile stuff whenever it finds timeout error in waiting for the machine-config-daemon rollout
This implies that the `spec.kubeAPIServerServingCAData` field of controllerconfig/machine-config-controller object is not updated when the kube-apiserver-operator updates kube-apiserver-to-kubelet-client-ca configmap.
Without that field updated, a kube-apiserver-to-kubelet-client-ca change is never rolled out to the nodes.
That ultimately leads to cluster-wide unavailability of "oc logs", "oc rsh" etc. commands when the kube-apiserver-operator starts using a client cert signed by the new kube-apiserver-to-kubelet-client-ca to access kubelet ports.

Version-Release number of MCO (Machine Config Operator) (if applicable):

4.7.21

Platform (AWS, VSphere, Metal, etc.): (not relevant)

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)?
(Y/N/Not sure): Y

How reproducible:

Always if the said conditions are met.

Steps to Reproduce:
1. Have some nodes not ready
2. Force a change that requires machine-config-daemon daemonset rollout (I think that changing proxy settings would work for this)
3. Wait until a new kube-apiserver-to-kubelet-client-ca is rolled out by kube-apiserver-operator

Actual results:

New kube-apiserver-to-kubelet-client-ca not forwarded to controllerconfig, kube-apiserver-to-kubelet-client-ca not deployed on nodes

Expected results:

kube-apiserver-to-kubelet-client-ca forwarded to controllerconfig, kube-apiserver-to-kubelet-client-ca deployed to nodes.

Additional info:

In comments

https://github.com/openshift/machine-config-operator/pull/3454

Bug OCPBUGS-8091: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/sdn/pull/512

Bug OCPBUGS-1676: Improve prometheus-adapter consistency

View the Description View the linked PRs

Description of problem:

This bugs purpose is to enable a feature backport of https://issues.redhat.com/browse/MON-1949.

https://github.com/openshift/cluster-monitoring-operator/pull/1777

Bug OCPBUGS-5112: Fails to deprovision cluster when swift omits 'content-type' and there are empty containers

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-5078~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-5019~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-4941~~. The following is the description of the original issue:
—
Description of problem: This is a follow-up to ~~OCPBUGS-3933~~.

The installer fails to destroy the cluster when the OpenStack object storage omits 'content-type' from responses, and a container is empty.

Version-Release number of selected component (if applicable):

4.8.z

How reproducible:

Likely not happening in customer environments where Swift is exposed directly. We're seeing the issue in our CI where we're using a non-RHOSP managed cloud.

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/installer/pull/6726

Bug OCPBUGS-2717: Devfile Catalog and Import a Devfile on a fully disconnected cluster should fail directly instead of timeout after 30sec

View the Description View the linked PRs

Description of problem:
Users on a fully-disconnected cluster could not see Devfiles in the developer catalog or import a Devfiles. That's fine.

But the API calls /api/devfile/samples/ and /api/devfile/ takes 30 seconds until they fail with a 504 Gateway timeout error.

If possible they should fail immediately.

Version-Release number of selected component (if applicable):
This might happen since 4.8

Tested this yet only on 4.12.0-0.nightly-2022-09-07-112008

How reproducible:
Always

Steps to Reproduce:

Start a disconnected cluster with a proxy
Open the browser network inspector and filter for /api/devfile
Switch to Developer perspective
1. Navigate to Add > Developer Catalog (All Services) > Devfiles
2. Or Add > Import from Git > and enter https://github.com/devfile-samples/devfile-sample-go-basic.git

Actual results:

Network call fails after 30 seconds
Import doesn't work

Expected results:

Network calls should fail immediately
We doesn't expect that the import will work

Additional info:
The console Pod log contains this error:

E0909 10:28:18.448680 1 devfile-handler.go:74] Failed to parse devfile: failed to populateAndParseDevfile: Get "https://registry.devfile.io/devfiles/go": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

https://github.com/openshift/console/pull/12191

Bug OCPBUGS-5012: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-provider-openstack/pull/158

Bug OCPBUGS-660: [release-4.10] OVN master trying to deleteLogicalPort for object which is already gone

View the Description View the linked PRs

Description of problem:
This is a clone of https://issues.redhat.com/browse/OCPBUGS-658

Description of problem: Numerous erroreneous logs in OVN master

I0823 18:00:11.163491 1 obj_retry.go:1063] Retry object setup: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k
I0823 18:00:11.163546 1 obj_retry.go:1096] Removing old object: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k
I0823 18:00:11.163555 1 pods.go:124] Deleting pod: openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k
I0823 18:00:11.163631 1 obj_retry.go:1103] Retry delete failed for *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k, will try again later: deleteLogicalPort failed for pod openshift-operator-lifecycle-manager_collect-profiles-27687900-hlp6k: unable to locate portUUID+nodeName for pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k: error getting logical port <nil>: object not found
W0823 18:00:41.163633 1 obj_retry.go:1031] Dropping retry entry for *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k: exceeded number of failed attempts

Must-gather: http://shell.lab.bos.redhat.com/~anusaxen/must-gather.local.2234927131259452300/

Version-Release number of selected component (if applicable): 4.12.0-0.nightly-2022-08-23-031342

How reproducible: Always

Steps to Reproduce:
1. Bring up OVN cluster on 4.12
2.
3.

Actual results: deleteLogicalPort failed for already gone object

Expected results: deleteLogicalPort should not keep retrying post object deletion

Additional info:

https://github.com/openshift/ovn-kubernetes/pull/1257

Bug OCPBUGS-6697: Uninstall fails with Observed a panic: runtime.boundsError

View the Description View the linked PRs

+++ This bug was initially created as a clone of
Bug #2102632
+++

Version:
4.11.0-0.nightly-2022-06-28-160049

$ ./openshift-install version
./openshift-install 4.11.0-0.nightly-2022-06-28-160049
built from commit 6daed68b9863a9b2ecebdf8a4056800aa5c60ad3
release image registry.ci.openshift.org/ocp/release@sha256:b79b1be6aa4f9f62c691c043e0911856cf1c11bb81c8ef94057752c6e5a8478a
release architecture amd64

Platform:
GCP

IPI (automated install with `openshift-install`.

What happened?
During uninstall, the cluster uninstall I received:

E0630 13:17:58.830361 271713 runtime.go:78] Observed a panic: runtime.boundsError{x:22, y:21, signed:true, code:0x1} (runtime error: slice bounds out of range [:22] with length 21)
goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x41d43c0?, 0xc0010637e8})
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x86
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x18?})
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x75
panic({0x41d43c0, 0xc0010637e8})
/usr/lib/golang/src/runtime/panic.go:838 +0x207
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).formatClusterIDForStorage(...)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:25
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).storageIDFilter(...)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:29
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).storageLabelOrClusterIDFilter(0xc000f22540)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:39 +0x1fe
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).listDisks(0xc0015dc900?)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:43 +0x1e
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).destroyDisks(0xc000f22540)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:116 +0x36
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).destroyCluster(0xc000f22540)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/gcp.go:174 +0x78e
k8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1({0x18, 0xc000700000})
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:220 +0x1b
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext({0x19f06638?, 0xc0000721c0?}, 0xc00047d888?)
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233 +0x57
k8s.io/apimachinery/pkg/util/wait.poll({0x19f06638, 0xc0000721c0}, 0xc8?, 0x1108485?, 0x10?)
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:580 +0x38
k8s.io/apimachinery/pkg/util/wait.PollImmediateInfiniteWithContext({0x19f06638, 0xc0000721c0}, 0x40d687?, 0x10?)
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:566 +0x49
k8s.io/apimachinery/pkg/util/wait.PollImmediateInfinite(0x19f06670?, 0xc00008b8c0?)
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:555 +0x46
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).Run(0xc000f22540)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/gcp.go:130 +0x519
main.runDestroyCmd({0x7fffe6a88d87, 0x9}, 0x0)
/go/src/github.com/openshift/installer/cmd/openshift-install/destroy.go:67 +0x92
main.newDestroyClusterCmd.func1(0xc000536780?, {0xc000906100?, 0x2?, 0x2?})
/go/src/github.com/openshift/installer/cmd/openshift-install/destroy.go:53 +0x7f
github.com/spf13/cobra.(*Command).execute(0xc000536780, {0xc0009060c0, 0x2, 0x2})
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:860 +0x663
github.com/spf13/cobra.(*Command).ExecuteC(0xc00098db80)
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:974 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:902
main.installerMain()
/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:60 +0x29e
main.main()
/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:38 +0xff
panic: runtime error: slice bounds out of range [:22] with length 21 [recovered]
panic: runtime error: slice bounds out of range [:22] with length 21

goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x18?})
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0xd8
panic({0x41d43c0, 0xc0010637e8})
/usr/lib/golang/src/runtime/panic.go:838 +0x207
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).formatClusterIDForStorage(...)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:25
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).storageIDFilter(...)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:29
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).storageLabelOrClusterIDFilter(0xc000f22540)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:39 +0x1fe
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).listDisks(0xc0015dc900?)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:43 +0x1e
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).destroyDisks(0xc000f22540)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:116 +0x36
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).destroyCluster(0xc000f22540)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/gcp.go:174 +0x78e
k8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1({0x18, 0xc000700000})
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:220 +0x1b
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext({0x19f06638?, 0xc0000721c0?}, 0xc00047d888?)
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233 +0x57
k8s.io/apimachinery/pkg/util/wait.poll({0x19f06638, 0xc0000721c0}, 0xc8?, 0x1108485?, 0x10?)
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:580 +0x38
k8s.io/apimachinery/pkg/util/wait.PollImmediateInfiniteWithContext({0x19f06638, 0xc0000721c0}, 0x40d687?, 0x10?)
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:566 +0x49
k8s.io/apimachinery/pkg/util/wait.PollImmediateInfinite(0x19f06670?, 0xc00008b8c0?)
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:555 +0x46
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).Run(0xc000f22540)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/gcp.go:130 +0x519
main.runDestroyCmd({0x7fffe6a88d87, 0x9}, 0x0)
/go/src/github.com/openshift/installer/cmd/openshift-install/destroy.go:67 +0x92
main.newDestroyClusterCmd.func1(0xc000536780?, {0xc000906100?, 0x2?, 0x2?})
/go/src/github.com/openshift/installer/cmd/openshift-install/destroy.go:53 +0x7f
github.com/spf13/cobra.(*Command).execute(0xc000536780, {0xc0009060c0, 0x2, 0x2})
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:860 +0x663
github.com/spf13/cobra.(*Command).ExecuteC(0xc00098db80)
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:974 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:902
main.installerMain()
/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:60 +0x29e
main.main()
/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:38 +0xff

Anything else we need to know?

Uninstall with openshift-install binary from OCP 4.10.16 worked fine.

— Additional comment from
jmencak@redhat.com
on 2022-06-30 12:15:49 UTC —

Created attachment 1893636 [details]Install/uninstall directory tar ball.

Adding install/uninstall directory tar ball.

— Additional comment from
padillon@redhat.com
on 2022-07-01 14:11:22 UTC —

Can we get an install config for the failing destroy?

— Additional comment from
padillon@redhat.com
on 2022-07-01 14:13:30 UTC —

Sorry. I see the install config is in the attachment. I thought that was only the destroy log.

— Additional comment from
padillon@redhat.com
on 2022-07-01 14:28:10 UTC —

Marking this as blocker+. It looks like
https://github.com/openshift/installer/pull/5976
introduced a regression when destroying disks. We should have a PR to fix up today.

— Additional comment from
eparis@redhat.com
on 2022-07-01 15:00:11 UTC —

This bug sets blocker+ without setting a Target Release. This is an invalid state as it is impossible to determine what is being blocked. Please be sure to set Priority, Severity, and Target Release before you attempt to set blocker+

— Additional comment from
padillon@redhat.com
on 2022-07-01 17:56:37 UTC —

For QE: This error would occur after installing and provisioning PV.

— Additional comment from
aos-team-art-private@redhat.com
on 2022-07-05 04:27:36 UTC —

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.11 release.

— Additional comment from
jmencak@redhat.com
on 2022-07-07 08:38:59 UTC —

I still see the same issue with the latest nightly 4.11.0-0.nightly-2022-07-06-145812. Is the fix included there?

— Additional comment from
padillon@redhat.com
on 2022-07-07 12:51:25 UTC —

We didn't cherry-pick this fix into 4.11 so it is not in the nightlies. You should be able to check it against a master build. We will cherry-pick to 4.11 now.

$ oc adm release extract --tools registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-07-06-145812
$ tar -xvf openshift-install-linux-4.11.0-0.nightly-2022-07-06-145812.tar.gz
README.md
openshift-install
$ ./openshift-install version
./openshift-install 4.11.0-0.nightly-2022-07-06-145812
built from commit b2e7be726e400022e71ef3b8bd01a2093e53bc5a
release image registry.ci.openshift.org/ocp/release@sha256:616c5fefa87d116dd2440c75d9832c462078d635ed155c8d6cd486dd09540184
release architecture amd64

$ git show b2e7be726e400022e71ef3b8bd01a2093e53bc5a
commit b2e7be726e400022e71ef3b8bd01a2093e53bc5a (upstream/release-4.11)
Merge: 6daed68b9 2426260d5
Author: openshift-ci[bot] <75433959+openshift-ci[bot]@users.noreply.github.com>
Date: Thu Jun 30 22:11:44 2022 +0000

Merge pull request #6060 from mike-nguyen/dnm_411_test
Bug 2093126
: bump RHCOS 4.11 boot image metadata

https://github.com/openshift/installer/pull/6808

Bug OCPBUGS-8513: [4.10] egress firewall acls are deleted on restart

View the Description View the linked PRs

Description of problem:

During restart egress firewall acls will be deleted and re-created from scratch, meaning that egress firewall rules won't be applied for some time during restart

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/ovn-kubernetes/pull/1562

Bug OCPBUGS-1446: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-monitoring-operator/pull/1778

Bug OCPBUGS-2464: Add unit-test and gofmt support for ovn-kubernetes

View the Description View the linked PRs

Description of problem:

Add the ability to run utests and linter jobs in downstream ovn-kubernetes

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/ovn-kubernetes/pull/1299

Bug OCPBUGS-2106: intra namespace allow network policy doesn't work after applying ingress&egress deny all network policy

View the Description View the linked PRs

Description of problem:

intra namespace allow network policy doesn't work after applying ingress&egress deny all network policy

Version-Release number of selected component (if applicable):

OpenShift 4.10.12

How reproducible:

Always

Steps to Reproduce:
1. Define deny all network policy for egress an ingress in a namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

2. Define the following network policy to allow the traffic between the pods in the namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-intra-namespace-001
spec:
  egress:
  - to:
    - podSelector: {}
  ingress:
  - from:
    - podSelector: {}
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

3. Test the connectivity between two pods from the namespace.

Actual results:

The connectivity is not allowed

Expected results:

The connectivity should be allowed between pods from the same namespace.

Additional info:

After performing a test and analyzing SDN flows for the namespace:

sh-4.4# ovs-ofctl dump-flows -O OpenFlow13 br0 | grep --color 0x964376 
 cookie=0x0, duration=99375.342s, table=20, n_packets=14, n_bytes=588, priority=100,arp,in_port=21,arp_spa=10.128.2.20,arp_sha=00:00:0a:80:02:14/00:00:ff:ff:ff:ff actions=load:0x964376->NXM_NX_REG0[],goto_table:30
 cookie=0x0, duration=1681.845s, table=20, n_packets=11, n_bytes=462, priority=100,arp,in_port=24,arp_spa=10.128.2.23,arp_sha=00:00:0a:80:02:17/00:00:ff:ff:ff:ff actions=load:0x964376->NXM_NX_REG0[],goto_table:30
 cookie=0x0, duration=99375.342s, table=20, n_packets=135610, n_bytes=759239814, priority=100,ip,in_port=21,nw_src=10.128.2.20 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
 cookie=0x0, duration=1681.845s, table=20, n_packets=2006, n_bytes=12684967, priority=100,ip,in_port=24,nw_src=10.128.2.23 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
 cookie=0x0, duration=99375.342s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.128.2.20 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
 cookie=0x0, duration=1681.845s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.128.2.23 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
 cookie=0x0, duration=975.129s, table=27, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=goto_table:30
 cookie=0x0, duration=99375.342s, table=70, n_packets=145260, n_bytes=11722173, priority=100,ip,nw_dst=10.128.2.20 actions=load:0x964376->NXM_NX_REG1[],load:0x15->NXM_NX_REG2[],goto_table:80
 cookie=0x0, duration=1681.845s, table=70, n_packets=2336, n_bytes=191079, priority=100,ip,nw_dst=10.128.2.23 actions=load:0x964376->NXM_NX_REG1[],load:0x18->NXM_NX_REG2[],goto_table:80
 cookie=0x0, duration=975.129s, table=80, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=output:NXM_NX_REG2[]

We see that the following rule doesn't match because `reg1` hasn't been defined:

 cookie=0x0, duration=975.129s, table=27, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=goto_table:30

https://github.com/openshift/sdn/pull/465

Bug OCPBUGS-2902: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/coredns/pull/80

Bug OCPBUGS-4341: oc get dc fails when AllRequestBodies audit-profile is set in apiserver

View the Description View the linked PRs

This is a clone of issue ~~OCPBUGS-501~~. The following is the description of the original issue:
—
Description of problem:

Version-Release number of selected component (if applicable): 4.10.16

How reproducible: Always

Steps to Reproduce:
1. Edit the apiserver resource and add spec.audit.customRules field

$ oc get apiserver cluster -o yaml
spec:
audit:
customRules:

group: system:authenticated:oauth
profile: AllRequestBodies
group: system:authenticated
profile: AllRequestBodies
profile: Default

2. Allow the kube-apiserver pods to rollout new revision.
3. Once the kube-apiserver pods are in new revision execute $ oc get dc

Actual results:

Error from server (InternalError): an error on the server ("This request caused apiserver to panic. Look in the logs for details.") has prevented the request from succeeding (get deploymentconfigs.apps.openshift.io)

Expected results: The command "oc get dc" should display the deploymentconfig without any error.

Additional info:

https://github.com/openshift/openshift-apiserver/pull/337

Bug OCPBUGS-6498: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/operator-framework-olm/pull/432

Bug OCPBUGS-1638: Increase default value of cache TTL to 6 hours

View the Description View the linked PRs

Description of problem:

Running the discovery cache every 10 minutes has a significant productivity impact on using kubectl on clusters with many CRDs as it is taking time to run these unnecessary requests.
The discovery cache doesn't really have to run every 10 minutes, as CRDs don't change that often. A lot of unnecessary load is created on clients and servers.

Version-Release number of selected component (if applicable):

4.10.z

How reproducible:

Cluster with a lot of CRDs

Steps to Reproduce:

https://github.com/kubernetes/kubernetes/issues/107130

Actual results:

Significant grow of kubectl request completion every 10 minutes

Expected results:

Significant grow of kubectl request completion every 24 hours

Additional info:

https://github.com/kubernetes/kubernetes/issues/107130

https://github.com/openshift/oc/pull/1241

Question	Outcome

4.10.57

Changes from 4.9.59

Complete Features

Problem Alignment

The Problem

High-Level Approach

Goal & Success

Solution Alignment

Key Capabilities

Key Flows

Feature Overview

Requirements

Questions to answer…

Out of Scope

Background, and strategic fit

Documentation Considerations

Epic Goal

Why is this important?

Scenarios

Acceptance Criteria

Dependencies (internal and external)

Previous Work (Optional):

Open questions::

Done Checklist

User Story

Acceptance Criteria

Docs Impact

QE Impact

PX Impact

Notes

User Story

Acceptance Criteria

Docs Impact

Notes

Feature Overview

Goals

Requirements

(Optional) Use Cases

Out of Scope

Background, and strategic fit

Assumptions

Customer Considerations

Documentation Considerations

Questions

Problem:

Goal:

Why is it important?

Use cases:

Acceptance criteria:

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description

Acceptance Criteria

Additional Details:

Incomplete Features

Feature Overview

Goals

Out of Scope

Requirements

Feature Overview

Goals

Requirements

(Optional) Use Cases

Questions to answer…

Out of Scope

Background, and strategic fit

Assumptions

Customer Considerations

Documentation Considerations

Goals

Requirements

(Optional) Use Cases

Questions to answer…

Out of Scope

Background, and strategic fit

Assumptions

Customer Considerations

Documentation Considerations