Configuring Active Directory Integration with Lightwave on the VMware Photon Platform 1.2

So for the month of June I have decided to lab the Photon Platform 1.2 by VMWare and will be posting a bunch of content related to the product. The Photon Platform leverages Project Lightwave as its directory service.

Project Lightwave is an open source project comprised of enterprise-grade, identity and access management services targeting critical security, governance, and compliance challenges for Cloud-Native Apps within the enterprise. For vSphere Admins Lightwave performs many of the same functions as the Platform Services Controller;

  • Lightwave Directory Service – standards based, multi-tenant, multi-master, highly scalable LDAP v3 directory service
  • Lightwave Certificate Authority – directory integrated certificate authority helps to simplify certificate-based operations and key management across the infrastructure.
  • Lightwave Certificate Store – endpoint certificate store to store certificate credentials.
  • Lightwave Authentication Services – cloud authentication services with support for Kerberos, OAuth 2.0/OpenID Connect, SAML and WSTrust
  • Lightwave Domain Name Services – directory integrated domain name service to ensure Kerberos Authentication to the Directory Service and Authentication Service (STS)
The following outlines how to configure the Lightwave Domain Controllers to use Active Directory Domain Services as an identity provider which will enable users to leverage their Active Directory credentials to access and administer the Platform.

Before you begin you will need the following:

  1. A service account (just a Domain User account with ability to read the Active Directory Domain)
  2. The domain LDAPS certificate for the domain controller; to obtain this open the Local Computer\Personal\Certificates store on the domain controller and export the Certificate (without the private key) for the Certificate using the Certificate Template Domain Controller in Base64 Format

Step 1. Navigate to the LightWave Domain Controller administration page (https://lightwavefqdn/lightwaveui/) and enter the LightWave domain and when prompted enter the LightWave administrator account

Step 2. Select Identify Sources from the side menu and click Add

Step 3. Select Active Directory as LDAP and click Next

Please Note: At the time of writing the option Active Directory (Integrated Windows Authentication) does not appear to function/there is no UI options to add the machine to the domain; I will investigate further at a later time but I imagine that the Lightwave machines need to be added to the domain via the CLI first.

Step 4. Enter the details for the LDAPS service and the Base DN for the Users and Groups; I have just used the root of the domain however you can scope these to Containers further down your tree as per your requirements

Step 5. Select Choose File and select the certificate for the Domain Controllers LDAPS service and click Next

Step 6. Enter the Service Account credentials (Username in the UPN format) and click Test Connection if the connection is successful click Next

Step 7. Finally review and click Save to complete the configuration


Step 8.
Next select Users & Groups from the side-menu and under Groups select Administrators and click Membership

Step 9. Select the domain from the drop-down menu and locate the User or Group to grant the permissions (in the below example the group R-Photon-Admins), check the checkbox next to the object and select Add Member followed by Save

Finally; sign out of the Platform and Sign back in with the Active Directory account entering the Domain Account Username and Password and clicking Login

NOTE: Do not use the “Use Windows session authentication” it doesn’t work during testing (throws “Internal Processing error”). And voila your Active Directory environment can be leveraged for identity. 

Copy-VMGuestFile “The request was aborted” Exception – Mystery Solved !

Today I thought I would examine an issue that has come up from time to time over the past  6 or 7 years or so which I have never sat down and examined properly; when running the PowerCLI Copy-VMGuestFile cmdlet for larger files an exception is thrown “The request was aborted: The request was cancelled”.

The fix is simple: The WebOperationTimeoutSeconds parameter for PowerCLI needs to be increased from the Default of 300 seconds to a larger value to allow enough time for the transfer to complete or set to infinite.  A big thanks to @vmkdaily on the VMWare {Code} Slack for the info for the solution.

To set to infinite simply execute Set-PowerCLIConfiguration -WebOperationTimeoutSeconds -1 as an Administrator, restart PowerCLI and then rerun the cmdlet.

Some further information
Below is just a summary of the mechanism of how the GuestFileManager and other Guest Processes works for transferring files between the Guest and the local machine via the API. The GuestOperationsManager is a really important and versatile library as it allows for two-way execution and information passing between guests and Orchestration platforms WITHOUT network connectivity between the management plain and the virtual machine which becomes particularly for operations of scale. I use the Invoke-VMScript and Copy-VMGuestFile quite a bit for everything to automating VM Hardware Upgrades and guest reconfiguration/customisation to shipping and executing patches or build automation for customers in a Service Provider environment where I have no network inter-connectivity to their environments.

So the VIX API operations were moved into the vSphere API in vSphere 5 and the process flow for a call is as follows;

  1. A vSphere Web services client program calls a function in the vSphere API.
  2. The client sends a SOAP command over https (port 443) to vCenter Server.
  3. The vCenter Server passes the command to the host agent process hostd, which sends it to VMX.
  4. VMX relays the command to VMware Tools in the guest.
  5. VMware Tools has the Guest OS execute the guest operation.

In PowerCLI this is exposed via the GuestOperationsManager. For the Copy-VMGuestFile cmdlet a call is made to the $GuestOperationsManager.FileManager.InitateFileTransferXXXX methods with the Virtual Machine object, the Guest Authentication details, the destination where the file will be placed on the guest (and filename and attributes) and the size of the file. The returned value is a unique URI with a Token/Session ID which is used via a HTTP PUT to upload the file. The below is an example implementation if you wish to have a play around:

Further information

PowerCLI : Get/Set cmdlets for DNS Configuration of the vCenter Server Appliance

This week after responding to a post on the VMWare Community forums I started playing around with the vSphere REST API. I have quite a bit of experience with the vCloud REST API which is the basis for all of the vCloud PowerCLI code but not a lot with JSON or the vSphere API. So I have thrown together some methods with an aim for reuse for all vSphere APIs that I thought I would share. The module is named Module-vCSA-Administration.psm1 and contains at present three main methods;

  • Connect-VIServerREST : Establishes a connection to the vSphere REST API on a vCenter Server
  • Get-VCSANetworkConfigDNS : Get the DNS Networking configuration for the vCSA
  • Set-VCSANetworkConfigDNS : Amend the DNS Networking configuration for the vCSA

The support methods in the module (Get-vSphereRESTResponseJSON, Update-vSphereRESTDataJSON, Publish-vSphereRESTDataJSON) provide a very simple wrapper that can be used to make any REST API call and extend the functionality. I hope someone finds these useful; I will probably extend these modules further but was a good introduction into playing with JSON based REST API for vSphere.

Usage
I generally try to document as much as possible so you can use the get-help cmdlet command to review the help for each of the cmdlets (e.g. Get-help Connect-VIServerREST).

In order to use the module you must load the module using the Import-Module <Path to Module> cmdlet

Use the Connect-VIServerREST to connect to the REST API. The REST API uses the vSphere SSO Service to authenticate sessions so use the credentials (in the format you using use) to logon to the REST API. The method will connect to the REST API and stores the session information in a Global Variable $DefaultVIServerREST for use by other methods.



Once the connection is established you can use the other methods to retrieve or set the DNS configuration either setting to DHCP or setting static servers using the Get-VCSANetworkConfigDNS or Set-VCSANetworkConfigDNS cmdlets.

Module is available from: https://raw.githubusercontent.com/AdrianBegg/vSphere/master/Module-vCSA-Administration.psm1

 

vCloud Director 8.20.0.1 released

It’s been a big week of announcements and probably the most exciting for me is the release of vCloud 8.20.0.1 yesterday not because of any massive set of new features but because the maintenance release addresses a number of known bugs that have existed in the product for some time which I have previously written on this blog about. One notable new feature is support for VVol (Virtual Volume) datastores which is another great inclusion. The release of vCloud Director 8.20 increasing capability and a maintenance release within 3 months addressing a swagger of bugs is a very positive sign for the product allowing providers to remain competitive and hopefully greater integration between the vCloud product and the rest of the VMWare product suite. A full list of release notes is available here.

Upgrade:

  • Download vCloud Director for Service Providers 8.20.0.1
  • Review the release notes and upgrade guide
  • Review the VMware Product Interoperability Matrices for your environment (if your running 8.20 already there are no changes)
  • Take a backup of your vCloud Director database and cells
  • Change the permissions on the binary to allow execution (chmod +x vmware-vcloud-director-distribution-8.20.0-5515092.bin)
  • Stop user access to the cells by executing the /opt/vmware/vcloud-director/bin/vmware-vcd-cell maintenance command
  • Execute the installer to upgrade the cells
  • Stop the vCloud Director cells if they are running and run the database schema upgrade tool (/opt/vmware-vcloud-director/bin/upgrade) and walk through the wizard
  • Restart cells and monitor the cell start (tail -f /opt/vmware/vcloud-director/logs/cell.log) and once the load is complete log onto vCloud Director and check the version

#longlivevcd

PowerCLI module to manage Organisation Rights in vCloud Director 8.20

UPDATE: 16/09/2017 – Updated the Module and added new cmdlets and improved error checking. Tested on vCloud Director 8.20.01 and available from GitHub - see this post for new cmdlets.

Good day ! Today I wanted to quickly write up a post about some modules I have been working on for PowerCLI to expose/automate/simplify manipulation of the vCloud Organisation Rights. A really cool addition to vCloud Director 8.20 is the NSX feature parity and the HTML5 User Interface for Edge Gateways that comes with it and a change to how the vCloud Organisation roles work which is now granular for each Organisation (use to be rights enabled for all or none).

So presently if you wish to turn on Distributed Firewall and Advanced Networking Services Rights you have to do this via the API for each organisation in vCloud. (Presently no GUI access to turn these features On) which is described in VMWare KB2149016. These modules (available on GitHub) are designed to make this process a bit easier.

Some quick notes about enabling these Services:

  1. Check your licensing  – the Advanced Networking capabilities might be there but you might not be able to provide them to customers depending on your Service Provider entitlements
  2. The new features require an Edge to be upgraded to an Advanced Edge – you may for many reasons not want customers to be able to do this however by default customers have the ability to Convert Edges (Org Admin)

This code need work and I will be making a revision and extending these as I explore the new APIs and the gaps in the builtin cmdlets however please enjoy any bugs/feedback is appreciated. Load via “Import-Module Module-vCloud-RightsManagement.psm1″

Below are a quick summary of the main user functions;

Export-CIOrgRights
The cmdlet is designed to allow the Org Rights to be exported to a CSV for manipulation by other tools which can be later imported back into vCloud. This could be used for third party compliance/reporting or just for loading into Excel to enable/disable selected services in a nicer interface then direct API calls. The CSV is just in the format of the Role Name and if it is enabled (true) or disabled (False) for the Org.

Import-CIOrgRights
The cmdlet will replace the OrgRights with those set to enabled in the provided CSV (which should be generated from an Export-CIOrgRights cmdlet.

Get-CIOrgRights
Returns a collection of the available rights in the cloud and if they are enabled for the provided Organisation

Get-CIRights
Returns a collection of the available rights in the Global cloud infrastructure.

Add-CIOrgRight
Adds a single vCloud Director right to an Organisation

Remove-CIOrgRight
Removes a single vCloud Director right from an Organisation

 

Increasing VMware vCloud Director for Service Providers REST API XML element limit

vCloud Director for Service Providers has a pre-configured limit for REST API calls that restricts POST calls to 4,096 XML elements. The behaviour is designed as a security precaution and is not a technical/addressing limitation and as such it can be increased if required. Most API calls will not exceed this limit however if you are modifying Edge Gateway services or similar and have large firewall rulesets you will quickly exceed the limit and be presented with HTTP/1.1 500 Server Error with an exception com.vmware.vcloud.api.rest.toolkit.exceptions.InternalServerErrorRestApiException: com.vmware.vcloud.security.filters.ValidityException: The provided XML has too many elements: XXXX. The maximum is 4,096.

The setting is not exposed in the UI  and is not documented anywhere to my knowledge but the process is pretty straightforward to amend; to increase the limit the following query is required to be run against the vCloud DB.

Step 1. Take a backup of the vCloud Director database
Step 2. Logon to your DBMS and execute the following query against your vCloud Database to determine if a custom value has been set; if no value is returned the default will be used (4,096)

Step 3a. If no value is returned execute the following query to set a higher limit replacing the value with the desired element limit; in the below example I will set the allowed XML elements for a REST API call to 24,384:

Step 3b. If a value was returned by the query in set 2 then the setting has been previously set; to increase or decrease the value use the following query; in the below example I will set the allowed XML elements for a REST API call to 6,096:

Once the setting has been added/amended the change will take effect immediately; there is no requirement to restart the cells just re-run the REST API call and it should succeed.

Identify VMs affected by VMware Tools and RSS Incompatibility Issue

Last month VMWare published a blog post (https://blogs.vmware.com/apps/2017/03/rush-post-vmware-tools-rss-incompatibility-issues.html) about an issue affecting Windows Server 2012/2012 R2 and newer Guest OS’s running VMware Tools versions 9.10.0 up to 10.1.5, the paravirtual VMXNet3 driver. Basically Windows Receive Side Scaling (RSS) is not functional. RSS enables network adapters to distribute the kernel-mode network processing load across multiple processor cores in multi-core computers. The distribution of this processing makes it possible to support higher network traffic loads than would be possible if only a single core were to be used. This still hasn’t been fixed but when it does you may need to determine which machines are still impacted.

In order to determine which machines are exposed/potentially impacted in your environment by this bug (and also check if RSS is flagged as enabled in guest) the following script can be used. Once this bug is fixed; if you have machines that don’t have RSS enabled that have multicore you should consider enabling it !

For the latest version please refer to https://github.com/AdrianBegg/vSphere/

 

NSX – BEWARE ! VTEP Port can NOT be used by ANY virtual machine on an ESXi host running NSX

So I would call this a bug/design issue however VMWare have just noted it in a KB but – BEWARE of use of the VXLAN Tunnel End Point Port (UDP 8472 or 4789 by default) by ANY virtual machine that is hosted on a NSX cluster (regardless of if it is on a VXLAN segment or a standard Port Group) as the traffic will be dropped by the host with a VLAN mismatch. This affects all outbound traffic (i.e. connections from machines inside ESXi with a destination Port that matches the VTEP Port e.g.. UDP 4789)
VMWare today have updated KB2079386 to state “VXLAN port 8472 is reserved or restricted for VMware use, any virtual machine cannot use this port for other purpose or for any other application.” This was the result of a very long running support call involving a VM on a VLAN backed Port Group was having traffic on UDP 8472 being silently dropped without explanation – the KB is not quite accurate; it should read “VTEP Port is reserved or restricted for VMware use, any virtual machine cannot use this port for other purpose or for any other application.” – this is because the hypervisor will drop outbound packets with the destination set to the VTEP Port regardless of if its 8472 or 9871 etc.

Why is this an issue ?

The VXLAN standard (described initially in RFC 7344 has been implemented by a number of vendors for things other than NSX; one such example is physical Sophos Wireless Access points which use the VXLAN standard for “Wireless Zones” and communicates with the Sophos UTM (which can be a virtual machine) on UDP Port 8472. If the UTM is running on a host that has NSX deployed it simply won’t work even if it is running on a Port Group that has nothing to do with NSX.

There are surely other products using this port which begs the question; as a cloud provider or even as an IT group how do you explain to customers that they can’t host that product on ESXi if NSX is deployed when NSX VXLAN even if the traffic is not even on the VLAN with the VXLAN encapsulation operating ?!? The feedback from VMWare support regarding this issue has been that these are reserved ports and should not be used…

What’s going On ?


As a proof of concept I ran up the following lab:

  • Host is ESXi 6.5a running NSX 6.3
  • My VTEP Port is set to 4789 (UDP)
  • NSX is enabled on cluster “NSX”
  • I have a VM “VLANTST-INSIDE” running on host labesx1 (which is part of NSX cluster) running on dvSwitch Port Group “dv-VLAN-101″ which is a VLAN backed (non-VXLAN) Port Group
  • I have a VM “VLANTEST” running outside of ESXi on the same VLAN

With a UDP Server running on the test machine inside ESXi on UDP 4789 the external machine can connect without dramas:

When the roles are reversed the behaviour changes; with a UDP Server running on the machine running External to ESXi on UDP 4789 the initial connection can be initially seen but no traffic observed:
 
When attempting on any other port; no issues:

So if we run pktcap-uw –capture Drop on the ESXi host labesx1.pigeonnuggets.com we can see that the packets are being dropped by the Hypervisor with the Drop Reason ‘VlanTag Mismatch’

It appears that the Network stack is inspecting packets for VTEP UDP Port and filtering them if they do not match the VLAN which is carrying VXLAN regardless of if the payload matches; if the Port Number is the VTEP Port and it’s a VXLAN packet it will be dropped.

What are the options ?

So the only option I have found to resolve this is to change your VTEP Port which is not ideal but there is not really many options at this time. So if a product is conflicting; logon to vCenter and select Networking & Security > Installation > Logical Network Preparation > VXLAN Transport > Change

This is a non-disruptive change and won’t affect your VXLAN payloads. Hopefully this will be fixed at some point….

BUGFIX : Federated Identity in vCloud Director – Cannot remove Entity ID for SAML identification for org when Regenerate of Certificate

UPDATED 20/05/2017: This issue has now been fixed in vCloud Director for Service Providers 8.20.0.1 – upgrading to this version will solve this issue.

Happy Friday; a quick write up on a bug affecting vCloud Director SAML Identity Provider component. The bug manifested after an Identity Provider was configured for one ADFS Server and then changed to another. After the change when attempting to perform the Regenerate Certificate function Cannot remove Entity ID for SAML identification for org was thrown and HTTP 500 ERROR java.servlet.ServletException : Error initializing metadata when accessing Metadata for Federation

 

A bug exists that is known to occur if Federation has been configured previously and then changed to a new identity provider.

Known Affected: vCloud Director for Service Providers (all versions including 8.20)

VMWare Support have advised that this is a known issue and Engineering have a fix which will be implemented in the next release. For now the following will get you back up and running.

The following assumes your vCloud database is running on MSSQL and named vcloud; substitute queries as required to meet your environment.

Workaround:
Step 1. Take a backup of the vCloud Director database
Step 2. Logon to the tenancy and uncheck the Use SAML Identity Provider

Step 3. Execute the following query to get the OrgId for the affected Organization

SELECT [org_id] ,[name],[description]
FROM [vcloud].[dbo].[organization]
  

Step 4. Identify the SAML Policy Id by executing the following query against the Identity_Provider table

SELECT [id], [org_id], [provider_type],[provider_definition_id],[is_enabled]
FROM [vCloud].[dbo].[identity_provider]
WHERE [org_id] = <OrgId>

Step 5. Set the metadata to A blank value for the provider definition id by executing the following:

UPDATE saml_id_provider_settings set metadata = ” where id = <Provider_definition_id from Step 4.>
 

Verify by executing the query

SELECT [id], [metadata]
FROM [vCloud].[dbo].[saml_id_provider_settings]
WHERE id = <Provider_definition_id from Step 4.>

Step 6. Execute the following query and verify that the entity_id is set to a blank value and not set to NULL for the Organisation  

SELECT [org_id], [expiration_date],[is_cert_expiry_notified],[entity_id],[role_attribute]
FROM [vCloud].[dbo].[federation_settings]
WHERE [org_id] = <Org Id>

Step 7. Set the value to NULL by performing an UPDATE

UPDATE federation_settings SET entity_id = NULL where org_id = <OrgId from Step 3>

Step 8. Log back into vCloud Director and click Regenerate on the affected Org

Step 9. Verify the change has been successful by clicking the Metadata link; the metadata should generate correctly and all functions should now be restored without throwing a HTTP 500

Step 10. Setup your SAML Identity Provider; QED

ADFS 4.0 Nuggets/Gotchas

Today I had my first ADFS 4.0 (Windows Server 2016) deployment for a customer and found a few little gotcha’s that you might run into all with some pretty quick fixes;

Issue 1: IdpInitiatedSignonPage is disabled by default
This is usually the first test performed to check if ADFS is working as expected; to fire-up a browser and navigate to  https://domain.tld/adfs/ls/IdpInitiatedSignon.aspx – this will throw “An error occurred”. On your ADFS Server you will see Event 364 in the Event log with the critical piece of information in the Exception “IdPInitiatedSignonPageDisabledException”

Resolution: Logon to the Farm Primary and execute Set-AdfsProperties -EanbleIdpInitiatedSignOnPage $true

Issue 2: When attempting to add a new ADFS Server to the Farm during the Pre-requisite check you receive “The HTTP request was forbidden with client authentication scheme ‘Anonymous’” and “Unable to retrieve configuration from the primary server. The HTTP request was forbidden with client authentication scheme ‘Anonymous’

Resolution: There is some kind of introspection of the traffic; in my case there was a HTTP Proxy configured on the server, remove the proxy and no issues. Don’t forget to check netsh winhttp proxy as well

Issue 3: The next one came about after an upgrade/migrating of ADFS to a new Windows Server 2016 server ADFS throwing 400 and Kerberos errors in the event log (Event ID 4)

Resolution: As the error indicates this is an SPN issue; find the Service Account and update the servicePricnipalName attribute to include the value which is causing the issue (http/XXXXXXXXX)

Otherwise speaking ADFS 4.0 is generally very similar to ADFS in Windows Server 2012 R2 and is a pretty straight forward deployment/upgrade. Happy implementing !