NSX – BEWARE ! VTEP Port can NOT be used by ANY virtual machine on an ESXi host running NSX

I would call this a bug/design issue, but VMware have simply noted it in a KB, so BEWARE of any use of the VXLAN Tunnel End Point (VTEP) Port (UDP 8472 or 4789 by default) by ANY virtual machine hosted on an NSX cluster, regardless of whether it sits on a VXLAN segment or a standard Port Group: the traffic will be dropped by the host with a VLAN mismatch. This affects all outbound traffic, i.e. connections from machines inside ESXi with a destination port that matches the VTEP Port (e.g. UDP 4789).
VMware today updated KB2079386 to state “VXLAN port 8472 is reserved or restricted for VMware use, any virtual machine cannot use this port for other purpose or for any other application.” This was the result of a very long-running support call in which a VM on a VLAN-backed Port Group had its traffic on UDP 8472 silently dropped without explanation. The KB is not quite accurate; it should read “VTEP Port is reserved or restricted for VMware use, any virtual machine cannot use this port for other purpose or for any other application.” This is because the hypervisor drops outbound packets whose destination port is the configured VTEP Port, regardless of whether that is 8472, 9871 or anything else.

Why is this an issue ?

The VXLAN standard (described initially in RFC 7348) has been implemented by a number of vendors for things other than NSX. One such example is physical Sophos Wireless Access Points, which use the VXLAN standard for “Wireless Zones” and communicate with the Sophos UTM (which can be a virtual machine) on UDP Port 8472. If the UTM is running on a host that has NSX deployed it simply won’t work, even if it is running on a Port Group that has nothing to do with NSX.

There are surely other products using this port, which raises the question: as a cloud provider, or even as an IT group, how do you explain to customers that they can’t host that product on ESXi if NSX is deployed, even when the traffic is not on the VLAN that carries the VXLAN encapsulation ?!? The feedback from VMware support regarding this issue has been that these are reserved ports and should not be used…

What’s going on ?


As a proof of concept I ran up the following lab:

  • Host is ESXi 6.5a running NSX 6.3
  • My VTEP Port is set to 4789 (UDP)
  • NSX is enabled on cluster “NSX”
  • I have a VM “VLANTST-INSIDE” running on host labesx1 (which is part of the NSX cluster) on dvSwitch Port Group “dv-VLAN-101”, which is a VLAN-backed (non-VXLAN) Port Group
  • I have a VM “VLANTEST” running outside of ESXi on the same VLAN

With a UDP server running on UDP 4789 on the test machine inside ESXi, the external machine can connect without dramas.

When the roles are reversed the behaviour changes: with a UDP server running on UDP 4789 on the machine external to ESXi, the initial connection can be seen but no traffic is observed.

When attempting on any other port there are no issues.
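If you want to repeat the test without any special tooling, a rough sketch along these lines should do. It is not the tool used in the lab above; the function names are made up for illustration, and the demo at the bottom only runs against loopback on an ephemeral port. To reproduce the lab, run the server on one machine with the port set to your VTEP Port and the probe on the other.

```python
import socket
import threading


def start_udp_echo_server(host="127.0.0.1", port=0):
    """Bind a UDP socket and echo a single datagram back to its sender.

    port=0 asks the OS for a free port; pass 4789 (or your VTEP Port)
    to mirror the lab. Returns the bound port and the worker thread.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    sock.settimeout(5.0)
    bound_port = sock.getsockname()[1]

    def serve():
        try:
            data, addr = sock.recvfrom(2048)
            sock.sendto(data, addr)
        except OSError:
            pass  # timed out or socket error: nothing arrived
        finally:
            sock.close()

    thread = threading.Thread(target=serve, daemon=True)
    thread.start()
    return bound_port, thread


def udp_probe(host, port, payload=b"ping", timeout=2.0):
    """Send one datagram and report whether the same payload came back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        client.settimeout(timeout)
        try:
            client.sendto(payload, (host, port))
            reply, _ = client.recvfrom(2048)
        except OSError:
            return False  # no reply: dropped, filtered or nothing listening
        return reply == payload


if __name__ == "__main__":
    port, worker = start_udp_echo_server()
    print("reply received:", udp_probe("127.0.0.1", port))
    worker.join(timeout=5)
```

A probe that succeeds on an arbitrary port but times out when the destination is the VTEP Port is the symptom described above.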

So if we run pktcap-uw --capture Drop on the ESXi host labesx1.pigeonnuggets.com we can see that the packets are being dropped by the hypervisor with the Drop Reason ‘VlanTag Mismatch’.

It appears that the network stack is inspecting packets for the VTEP UDP Port and filtering them if they are not on the VLAN which is carrying VXLAN, regardless of what the payload actually contains: if the destination port is the VTEP Port, the packet is treated as a VXLAN packet and dropped.
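To make the observed rule concrete, here is a toy model of it. This is purely my reading of the behaviour seen in the lab, not VMware’s implementation; the class, the function and the transport VLAN number (200) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutboundFrame:
    vlan_id: int    # VLAN of the Port Group the VM sits on
    protocol: str   # "udp" or "tcp"
    dst_port: int   # destination port of the packet


def observed_verdict(frame, vtep_port, vxlan_transport_vlan):
    """Model of the behaviour seen in the lab (an assumption, not vendor logic).

    Any outbound UDP packet aimed at the VTEP Port is treated as VXLAN,
    so it is dropped unless it is on the VLAN that carries VXLAN.
    """
    looks_like_vxlan = frame.protocol == "udp" and frame.dst_port == vtep_port
    if looks_like_vxlan and frame.vlan_id != vxlan_transport_vlan:
        return "drop: VlanTag Mismatch"
    return "forward"


if __name__ == "__main__":
    # VM on dv-VLAN-101 talking to UDP 4789, transport VLAN assumed to be 200
    print(observed_verdict(OutboundFrame(101, "udp", 4789), 4789, 200))
    # Same VM, any other port
    print(observed_verdict(OutboundFrame(101, "udp", 5000), 4789, 200))
```

Note that the payload never appears in the model, which is the whole problem: the port number alone decides the outcome.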

What are the options ?

The only option I have found to resolve this is to change your VTEP Port, which is not ideal, but there are not really many options at this time. So if a product is conflicting, log on to vCenter and select Networking & Security > Installation > Logical Network Preparation > VXLAN Transport > Change.

This is a non-disruptive change and won’t affect your VXLAN payloads. Hopefully this will be fixed at some point….

BUGFIX : Federated Identity in vCloud Director – Cannot remove Entity ID for SAML identification for org when regenerating the certificate

UPDATED 20/05/2017: This issue has now been fixed in vCloud Director for Service Providers 8.20.0.1 – upgrading to this version will solve this issue.

Happy Friday; a quick write-up on a bug affecting the vCloud Director SAML Identity Provider component. The bug manifested after an Identity Provider was configured for one ADFS Server and then changed to another. After the change, attempting to perform the Regenerate Certificate function threw “Cannot remove Entity ID for SAML identification for org”, and accessing the Metadata for Federation returned HTTP 500 ERROR java.servlet.ServletException : Error initializing metadata.

 

A bug exists that is known to occur if Federation has been configured previously and then changed to a new identity provider.

Known Affected: vCloud Director for Service Providers (all versions including 8.20)

VMware Support have advised that this is a known issue and Engineering have a fix which will be implemented in the next release. For now the following will get you back up and running.

The following assumes your vCloud database is running on MSSQL and named vcloud; substitute queries as required to meet your environment.

Workaround:
Step 1. Take a backup of the vCloud Director database
Step 2. Log on to the tenancy and uncheck Use SAML Identity Provider

Step 3. Execute the following query to get the OrgId for the affected Organization

SELECT [org_id] ,[name],[description]
FROM [vcloud].[dbo].[organization]
  

Step 4. Identify the SAML Policy Id by executing the following query against the Identity_Provider table

SELECT [id], [org_id], [provider_type],[provider_definition_id],[is_enabled]
FROM [vCloud].[dbo].[identity_provider]
WHERE [org_id] = <OrgId>

Step 5. Set the metadata to a blank value for the provider definition id by executing the following:

UPDATE saml_id_provider_settings SET metadata = '' WHERE id = <Provider_definition_id from Step 4.>
 

Verify by executing the query

SELECT [id], [metadata]
FROM [vCloud].[dbo].[saml_id_provider_settings]
WHERE id = <Provider_definition_id from Step 4.>

Step 6. Execute the following query and verify that the entity_id for the Organisation is set to a blank value rather than NULL

SELECT [org_id], [expiration_date],[is_cert_expiry_notified],[entity_id],[role_attribute]
FROM [vCloud].[dbo].[federation_settings]
WHERE [org_id] = <OrgId from Step 3>

Step 7. Set the value to NULL by performing an UPDATE

UPDATE federation_settings SET entity_id = NULL where org_id = <OrgId from Step 3>

Step 8. Log back into vCloud Director and click Regenerate on the affected Org

Step 9. Verify the change has been successful by clicking the Metadata link; the metadata should generate correctly and all functions should now be restored without throwing an HTTP 500

Step 10. Set up your SAML Identity Provider; QED
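For anyone who would rather see the database part (Steps 3 to 7) as one sequence, here is a rough sketch. It runs against an in-memory SQLite stand-in, not a real vCloud database: the demo schema holds only the columns the queries above name, the sample rows and the function names are made up, and filtering the organisation by name is my own addition. Treat it as an illustration of the order of operations only, and take that backup first.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in with only the columns the steps touch."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE organization (org_id TEXT, name TEXT, description TEXT);
        CREATE TABLE identity_provider (id TEXT, org_id TEXT, provider_definition_id TEXT);
        CREATE TABLE saml_id_provider_settings (id TEXT, metadata TEXT);
        CREATE TABLE federation_settings (org_id TEXT, entity_id TEXT);
        INSERT INTO organization VALUES ('org-1', 'DemoOrg', 'hypothetical tenant');
        INSERT INTO identity_provider VALUES ('idp-1', 'org-1', 'saml-1');
        INSERT INTO saml_id_provider_settings VALUES ('saml-1', '<stale metadata/>');
        INSERT INTO federation_settings VALUES ('org-1', '');
        """
    )
    return conn


def clear_saml_federation(conn, org_name):
    """Apply Steps 3 to 7 for one organisation and return what was touched."""
    cur = conn.cursor()

    # Step 3: find the org_id of the affected organisation
    row = cur.execute(
        "SELECT org_id FROM organization WHERE name = ?", (org_name,)
    ).fetchone()
    if row is None:
        raise LookupError(f"no organisation named {org_name!r}")
    org_id = row[0]

    # Step 4: find the provider definition id(s) for that organisation
    provider_ids = [
        r[0]
        for r in cur.execute(
            "SELECT provider_definition_id FROM identity_provider WHERE org_id = ?",
            (org_id,),
        )
    ]

    # Step 5: blank the stored metadata (empty string, not NULL)
    for provider_id in provider_ids:
        cur.execute(
            "UPDATE saml_id_provider_settings SET metadata = '' WHERE id = ?",
            (provider_id,),
        )

    # Step 7: set the entity_id to NULL so Regenerate can succeed
    cur.execute(
        "UPDATE federation_settings SET entity_id = NULL WHERE org_id = ?",
        (org_id,),
    )
    conn.commit()
    return org_id, provider_ids


if __name__ == "__main__":
    demo = build_demo_db()
    print(clear_saml_federation(demo, "DemoOrg"))
```

The queries are parameterised rather than pasted together, which is worth keeping if you adapt this to a real MSSQL connection.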

ADFS 4.0 Nuggets/Gotchas

Today I had my first ADFS 4.0 (Windows Server 2016) deployment for a customer and found a few little gotchas that you might run into, all with some pretty quick fixes.

Issue 1: IdpInitiatedSignonPage is disabled by default
This is usually the first test performed to check if ADFS is working as expected: fire up a browser and navigate to https://domain.tld/adfs/ls/IdpInitiatedSignon.aspx. This will throw “An error occurred”. On your ADFS Server you will see Event 364 in the Event log with the critical piece of information in the Exception: “IdPInitiatedSignonPageDisabledException”.

Resolution: Log on to the Farm Primary and execute Set-AdfsProperties -EnableIdpInitiatedSignonPage $true

Issue 2: When attempting to add a new ADFS Server to the Farm, during the pre-requisite check you receive “The HTTP request was forbidden with client authentication scheme ‘Anonymous’” and “Unable to retrieve configuration from the primary server. The HTTP request was forbidden with client authentication scheme ‘Anonymous’”

Resolution: Something is inspecting or intercepting the traffic; in my case there was an HTTP proxy configured on the server. Remove the proxy and the issue goes away. Don’t forget to check netsh winhttp show proxy as well.

Issue 3: The next one came about after upgrading/migrating ADFS to a new Windows Server 2016 server: ADFS throws 400 errors and Kerberos errors appear in the event log (Event ID 4)

Resolution: As the error indicates this is an SPN issue; find the Service Account and update the servicePrincipalName attribute to include the value which is causing the issue (http/XXXXXXXXX)

Otherwise ADFS 4.0 is generally very similar to ADFS in Windows Server 2012 R2 and is a pretty straightforward deployment/upgrade. Happy implementing !

vCloud Director for SP 8.X – Unable to add disks – PBM error occurred during PreMigrateCheckCallback: vmodl.fault.InvalidArgument

UPDATE: 6/5/2018 : A script that can report on and resolve this issue is now available at the end of this post.

So a bug exists in the upgrade from vCenter 5.5 to vCenter 6.0 which affects vCloud; for a large number of virtual machines, attempting to add disks through vCloud Director fails with “A general system error occurred: PBM error occurred during PreMigrateCheckCallback: vmodl.fault.InvalidArgument”. This seems to affect a bunch of machines at random and it didn’t seem to matter which Storage Tier or which vCenter they were hosted on.

After some further investigation it was discovered that this is due to a configuration mismatch between the Storage Profile on the VM in vCenter and the storage profile in vCloud Director. It appears that vCloud is attempting to Storage vMotion the machine to the correct tier, which fails and shouldn’t occur as it’s already in the right spot.

To confirm this, log into the vCenter Server and open vpxd.log (on Windows, under C:\ProgramData\VMware\vCenterServer\logs\vmware-vpx) and look for the Exception “Invalid home profile setting. Host and datastore are not changed. May use VM reconfig instead”.
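If the log is large, a few lines of script save some scrolling. The sketch below is a plain text search for the first sentence of that exception; the function name is made up, and nothing about the surrounding log line format is assumed.

```python
import sys

# First sentence of the exception text quoted above
MARKER = "Invalid home profile setting"


def find_profile_errors(log_path, marker=MARKER):
    """Return (line_number, line) pairs for every log line containing marker."""
    hits = []
    # errors="replace" keeps the scan going if the log holds odd bytes
    with open(log_path, "r", encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if marker in line:
                hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Usage: python find_profile_errors.py <path to vpxd.log>
    for number, line in find_profile_errors(sys.argv[1]):
        print(f"{number}: {line}")
```

Copy vpxd.log somewhere safe first and point the script at the copy rather than the live file.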

The Fix

The VMware vCloud support team has confirmed this as a bug affecting multiple customers, caused by the vCenter upgrade, but no official guidance exists to resolve it. A solution that I have tested and verified actually resolves this is to perform the following:

  1. From vCloud Director check the Default Storage Profile (General tab) of the affected Virtual Machine and whether there are any overrides present (Hardware tab)
  2. Log on to the vCenter Server which hosts the machine and from the Manage menu of the affected VM select Policies and click Edit VM Storage Policies
  3. Select the correct Storage Profile and click Apply to all. NOTE: If it was already set to the correct Tier, select something else (e.g. change it to Datastore Default) then click Apply to all, followed by reselecting the correct settings; this is to ensure the UI recognizes that a change has actually occurred and makes the calls to the Virtual Center Service to apply the changes
  4. Go back into vCloud and add the disk; and voilà !

Automated Report and Fix

I have developed a script which can be run against vCloud Director to determine which machines are affected and also automate resolving this issue. Enjoy.

 

vCloud Director Edge Gateway – High Availability

Ok so this is just a quick write-up explaining at a high level the process of enabling the High Availability feature of an Edge Gateway in vCloud Director, and some things that you should know if deploying them. vCloud Edges are fundamentally the same as NSX Edges, however they are controlled by vCloud and are nowhere near as feature-rich as full NSX Edges (although they are catching up). They do however have the High Availability flag exposed, allowing for device redundancy that is pretty essential for these devices. When enabled, if a fault occurs and the Edge crashes or becomes unavailable, a redundant device will seamlessly take over after 15 seconds.

How do I enable it?

High Availability is enabled from the Properties of the Edge Gateway (Administration > Org VDC > Edge Gateways): select Enable High Availability.

How does it work ?

Edge Gateway peers exchange heartbeat messages with each other using one of the internal interfaces; this is important as at least one internal interface/network must be configured (discussed later). vCloud does not expose the HA configuration parameters, so the NSX default dead time applies (15 seconds in NSX 6.3.0), which means that in the event of a failure it takes 15 seconds for the secondary to kick in.

When they are deployed the process that is happening behind the scenes is;

  1. vCloud Director makes an API call to NSX to enable High Availability on the Edge
  2. NSX will deploy an edge under the System vDC Resource Pool vse-EdgeGatewayName-1 which will initially be named based on the Edge Id in NSX
  3. The Edge will be set up and Powered On in vCenter
  4. Finally the Edge will be renamed in vSphere and there will be two Edges in the System vDC in vCenter, labelled “-0” and “-1”

So once HA has been enabled it is important to note that this does not mean that it is doing anything. There may be two VMs deployed but that doesn’t mean anything. The HA Status of the Edge Gateway has three states:

  • Disabled
  • Not Active – This means that the High Availability checkbox has been checked but HA is not actually running (discussed later)
  • Up – When HA is actually configured correctly

Until an Org VDC Network is added the Secondary node just sits there un-configured and will just consume CPU cycles.

After an Org VDC Network created with the option Create a routed network through an existing Edge Gateway is added, the status will change from Not Active to Up.

At this point HA will be operating.
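The three states boil down to a very small decision, sketched below. This is only my summary of the behaviour described above, not how vCloud Director or NSX compute the status; the function and parameter names are made up.

```python
def edge_ha_status(ha_enabled: bool, routed_org_networks: int) -> str:
    """Summarise the HA Status described above for an Edge Gateway.

    ha_enabled          -- the Enable High Availability checkbox
    routed_org_networks -- number of routed Org VDC Networks attached
    """
    if not ha_enabled:
        return "Disabled"
    if routed_org_networks < 1:
        # Checkbox ticked, but no internal interface to heartbeat over
        return "Not Active"
    return "Up"


if __name__ == "__main__":
    for enabled, networks in [(False, 0), (True, 0), (True, 1)]:
        print(enabled, networks, "->", edge_ha_status(enabled, networks))
```

The middle case is the one to watch for: two Edge VMs deployed and consuming resources, with no failover behind them.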

How do I verify that it is running?

Log on to the Edge Gateway console and from the CLI execute show service highavailability; this will show the status of the node (Standby for the non-Active node and Active for the current master) as well as the status of the cluster and the configuration.

When a failover occurs, the surviving node takes over once the dead time expires.

When no vApp Networks are present show service highavailability will show Disabled; when it’s disabled, if the Edge dies the surviving Edge does not update its configuration and just sits there doing nothing.

High Availability does nothing unless a vApp Network is connected; why does this matter ?

Ok so this seems fairly logical right; if you just have External Networks attached to the Edge and no vApp Org Networks then you don’t need High Availability…but there is a use case for an Edge Gateway with only External Networks, and the way it’s displayed an admin might think that HA is working even though it’s not ! The reason why it doesn’t operate is because the heart-beating is done via the Internal NICs, and if there isn’t one then obviously it can’t operate.

In 99% of use cases you will always have a vApp Network connected to an Edge Gateway, however Edge Gateways have a bunch of awesome network features that can be leveraged without connection to an Org VDC network.

One such use case (which is how this post came about) is a customer consuming IaaS on a vCloud who needs some physical servers, installed in VLAN-backed physical networks, to be plumbed into vCloud behind a firewall. An Edge is a great fit for this as the customer can manage the firewall rules for this service and two Org Networks can be bound with the Edge acting as a firewall. There are other ways to do this but an Edge is a cheap and easy way to achieve it.

So if you use Edges in this manner and require HA, create a dummy vApp Org Network (e.g. one labelled HA-Heartbeat) and attach it to the Edge.

Summary

  • Edge Gateway HA only operates if a VDC Org Network is attached to the Edge
  • Deadtime/failover in the event of a failure is by default 15 seconds in NSX 6.2.4/NSX 6.3.0
  • If you do need it; it’s pretty low maintenance, set and forget
  • Don’t enable it if you don’t need it; it consumes CPU and Memory

Official Launch !

Welcome to the official launch of PigeonNuggets.com ! I know how excited you must be, I mean who wouldn’t be ! So basically the premise of the site is that, let’s be honest, at some point in the near future chickens will be extinct.

Think about it, we eat chickens and their eggs ! They really don’t have a fighting chance ! When that happens this domain will be worth millions as the world searches for suitable poultry to replace chicken and invariably pigeons are quickly integrated into our diets. 

“The content of the nuggets, like a chicken salad a restaurant serves, must be taken with blind faith or it loses its flavor” – No-one ever

Until then, Pigeon Nuggets will be used to host various technical content and ramblings of the author. Watch this space !