BUGFIX : Federated Identity in vCloud Director – Cannot remove Entity ID for SAML identification for org when Regenerate of Certificate

UPDATED 20/05/2017: This issue has now been fixed in vCloud Director for Service Providers 8.20.0.1 – upgrading to this version will solve this issue.

Happy Friday; a quick write up on a bug affecting vCloud Director SAML Identity Provider component. The bug manifested after an Identity Provider was configured for one ADFS Server and then changed to another. After the change when attempting to perform the Regenerate Certificate function Cannot remove Entity ID for SAML identification for org was thrown and HTTP 500 ERROR java.servlet.ServletException : Error initializing metadata when accessing Metadata for Federation

 

A bug exists that is known to occur if Federation has been configured previously and then changed to a new identity provider.

Known Affected: vCloud Director for Service Providers (all versions including 8.20)

VMWare Support have advised that this is a known issue and Engineering have a fix which will be implemented in the next release. For now the following will get you back up and running.

The following assumes your vCloud database is running on MSSQL and named vcloud; substitute queries as required to meet your environment.

Workaround:
Step 1. Take a backup of the vCloud Director database
Step 2. Logon to the tenancy and uncheck the Use SAML Identity Provider

Step 3. Execute the following query to get the OrgId for the affected Organization

SELECT [org_id] ,[name],[description]
FROM [vcloud].[dbo].[organization]
  

Step 4. Identify the SAML Policy Id by executing the following query against the Identity_Provider table

SELECT [id], [org_id], [provider_type],[provider_definition_id],[is_enabled]
FROM [vCloud].[dbo].[identity_provider]
WHERE [org_id] = <OrgId>

Step 5. Set the metadata to A blank value for the provider definition id by executing the following:

UPDATE saml_id_provider_settings set metadata = ” where id = <Provider_definition_id from Step 4.>
 

Verify by executing the query

SELECT [id], [metadata]
FROM [vCloud].[dbo].[saml_id_provider_settings]
WHERE id = <Provider_definition_id from Step 4.>

Step 6. Execute the following query and verify that the entity_id is set to a blank value and not set to NULL for the Organisation  

SELECT [org_id], [expiration_date],[is_cert_expiry_notified],[entity_id],[role_attribute]
FROM [vCloud].[dbo].[federation_settings]
WHERE [org_id] = <Org Id>

Step 7. Set the value to NULL by performing an UPDATE

UPDATE federation_settings SET entity_id = NULL where org_id = <OrgId from Step 3>

Step 8. Log back into vCloud Director and click Regenerate on the affected Org

Step 9. Verify the change has been successful by clicking the Metadata link; the metadata should generate correctly and all functions should now be restored without throwing a HTTP 500

Step 10. Setup your SAML Identity Provider; QED

vCloud Director for SP 8.X – Unable to add disks – PBM error occurred during PreMigrationCheckCallback: vmodl.fault.InvalidArgument

UPDATE: 6/5/2018 : A script is now available that can report/resolve this issue which is available at the end of this post.

So a bug exists in the upgrade from vCenter 5.5 to vCenter 6.0 which affects vCloud; for a large of virtual machines when attempting to add disks through vCloud Director the task would fail with A general system error occurred: PBM error occurred during PreMigrateCheckCallback: vmodl.fault.InvalidArgument. This seems to affect a bunch of machines at random and didn’t seem to matter which Storage Tier or which vCenter they were hosted on. 

So after some further investigation it has been discovered that this is due to a configuration mismatch between the Storage Profile on the VM in vCenter and the storage profile in vCenter. It appears that vCloud is attempting to Storage vMotion the machine to the correct tier which fails and shouldn’t occur as its already in the right spot.

To confirm this log into the vCenter and open the vpxd.log (C:\ProgramData\VMware\vCenterServer\logs\vmware-vpx) for Windows and look for the Exception “Invalid home profile setting. Host and datastore are not changed. May use VM reconfig instead”

The Fix

The VMWare vCloud support team has confirmed this as a bug affecting multiple customers caused by the vCenter upgrade but no official guidance exists to resolve. A tested solution that I have verified actually resolves this is to perform the following;

  1.  From vCloud Director check the Default Storage Profile (General tab) of the  on the affected Virtual Machine and if there are any overrides present (Hardware tab)
  2. Logon to the vCenter Server which host the machine and from the Manage menu of the affected VM select Policies and click Edit VM Storage Policies
  3. Select the correct Storage Profile and click Apply to all NOTE: If it was already set to the correct Tier; select something else (eg. Change it to Datastore Default) then click Apply to all followed be reselecting the correct settings; this is to ensure the UI recognizes a change has actually occurred and make the calls to the Virtual Center Service to apply the changes
  4. Go back into vCloud and add the disk; and volla !

Automated Report and Fix

I have developed a script which can be run against vCloud Director to determine which machines are affected and also automate resolving this issue. Enjoy.

 

vCloud Director Edge Gateway – High Availability

Ok so this is just a quick write explaining at a high level the process of enabling the High Availability feature of an Edge Gateways in vCloud Director and some things that you should know if deploying them. vCloud Edges are fundamentally the same as NSX edges however they are controlled by vCloud and are nowhere near (although they are catching up) as rich as full NSX edges. They do however have the High Availability flag exposed  allowing for device redundancy that is pretty essential for these devices. When enabled if a fault occurs and the Edge crashes or becomes unavailable; a redundant device will seamlessly take over after 15 seconds.

How do I enable it?

So to enable High Availability on an Edge Gateway this is done from the Properties of the Edge Gateway (Administration > Org VDC > Edge Gateways) and select Enable High Availability

How does it work ?

Edge Gateway peers communicate with each other for heartbeat messages using one of the internal interfaces; this is important as at least one internal interface/network be configured (discussed later). vCloud does not expose the HA configuration parameters and as such in NSX 6.3.0 the default dead time is 15 seconds which means that in the event of a failure it takes 15 seconds for the secondary to kick in.

When they are deployed the process that is happening behind the scenes is;

  1. vCloud Director makes an API call to NSX to enable High Availability on the Edge
  1. NSX will deploy an edge under the System vDC Resource Pool vse-EdgeGatwayName-1 which will initially be named based on the Edge Id in NSX 
  1. The Edge will be setup and Powered On in vCenter
  1. Finally the Edge will be renamed in vSphere and there will be two Edges in the System vDC in vCenter and labelled “-0″ and “-1″

So once HA has been enabled it is important to note that this does not mean that it is doing anything. There may be two VMs deployed but that doesn’t mean anything. The Edge Gateway HA Status of the Edge has three Status:

  • Disabled
  • Not Active – This means that High Availability checkbox has been checked but HA is disabled (discussed later)
  • Up – When HA is actually configured correctly  

Until the Org VDC Networks are added the Secondary node just sits there un-configured and will just consume CPU cycles.

After a Org VDC Network which is set as a Create a routed network through an existing Edge Gateway is added the status will change from Not Active to Up

At this point HA will be operating.

How do I verify that it is running?

Logon to the Edge Gateway Console and from the CLI execute show service highavailability this will show the status of the node (Standby for the non-Active Node and Active for the current master) as well as the status of the cluster and the configuration.

 When a failover occurs; after the dead time expires the surviving node will take over as per the below 

When no vApp Networks are present show service highavailability will show Disabled ; when its disabled if the Edge dies; the surviving Edge does not update its configuration and just sits there doing nothing.

High Availability does nothing unless a vApp Network is connected; why does this matter ?

Ok so this seems fairly logical right; if you just have External Networks attached to the Edge and no vApp Org Networks then you don’t need High Availability…but there is a use case for an Edge Gateway with External Networks and the way its displayed an admin might thing that HA is working even though its not ! The reason why it doesn’t operate is because the heart-beating is done via the Internal NICs and if there isn’t one then obviously it can’t operate.

In 99% of use case you will always have a vApp Network connected to an Edge Gateway however Edge Gateways have a bunch of awesome network features that can be leveraged without connection to an Org VDC network.

One such use case (which is how this post came about) is if a customer consuming IaaS using a vCloud has some requirement for some physical servers to be installed in VLAN backed physical networks to be plumbed into vCloud with a firewall. An Edge is a great use case for this as the customer can manage the firewall rules for this service and two Org Networks can be bound with the Edge acting as a firewall.  There are other ways to do this but an Edge is a cheap and easy way to achieve this.

So if you use Edges in this manner and require HA create a dummy vApp Org Network (eg. Just a dummy network labelled HA-Heartbeat) and attach it to the Edge.

Summary

  • Edge Gateway HA only operates if a VDC Org Network is attached to the Edge
  • Deadtime/failover in the event of a failure is by default 15 seconds in NSX 6.2.4/NSX 6.3.0
  • If you do need it; its pretty low maintenance set and forget
  • Don’t enable it if you don’t need it; consumes CPU and Memory