Admins Experience

Friday, November 23, 2012

Windows Server 2008 R2 Clustering Technologies

Windows Server 2008 R2 provides two clustering technologies, both of which are included in the Enterprise and Datacenter Editions. Clustering is the grouping of independent server nodes that are accessed and viewed on the network as a single system. When a service or application runs on a cluster, the end user can connect to a single cluster node to perform their work, or requests can be distributed across multiple nodes in the cluster. When the data is read-only, the client might request data from one server in the cluster and make the next request to a different server without ever noticing the difference. Also, if a single node in a multiple-node cluster fails, the remaining nodes continue to service client requests. Only the clients that were connected to the failed node might notice a slight interruption in service or might need to restart their entire session, depending on the service or application and on the clustering technology in use for that cluster.

The first clustering technology provided with the Enterprise and Datacenter Editions of Windows Server 2008 R2 is failover clustering. Failover clusters provide system fault tolerance through a process called failover. When a system or node in the cluster fails or is unable to respond to client requests, the clustered services or applications that were running on that particular node are taken offline and moved to another available node, where functionality and access are restored. Failover clusters, in most deployments, require access to shared data storage and are best suited, but not necessarily limited to, the deployment of the following services and applications:

File services— File services deployed on failover clusters provide much of the same functionality a standalone Windows Server 2008 R2 system can provide, but when deployed as clustered file services, a single data storage repository can be presented and accessed by clients through the currently assigned and available cluster node without replicating the file data.
Print services— Print services deployed on failover clusters have one main advantage over a standalone print server: If the active print server fails, each of the shared printers is made available to clients using another designated print server in the cluster. Although deploying and replacing printers to computers and users is easily managed using Group Policy deployed printers, when standalone print servers fail, the impact can be huge, especially when servers, devices, services, and applications that cannot be managed with group policies access these printers.
Database services— When large organizations deploy line-of-business applications, e-commerce, or any other critical services or applications that require a highly available back-end database system, deploying database services on failover clusters is the preferred method. Also, in many cases configuring enterprise database services can take hours, and the databases, indexes, and logs can be huge, so a system failure on a standalone database server can result in several hours of undesired downtime during repair or restore, instead of the quick recovery that a failover cluster provides.
Back-end enterprise messaging systems— For many of the same reasons as cited previously for deploying database services, enterprise messaging services have become critical to many organizations and are best deployed in failover clusters.
Hyper-V virtual machines— As many organizations move toward server consolidation and conversion of physical servers to virtual servers, providing a means to also maintain high availability and reliability has become even more essential when a single physical Hyper-V host has several critical virtual machines running on it.
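The failover process described above can be sketched as a small simulation. This is only an illustration in Python, not how the Failover Clustering feature is implemented. The `Cluster` class, the `fail_node` method, the node and group names, and the rule of moving a group to the surviving node with the fewest groups are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model: each group runs on exactly one node at a time."""

    nodes: set[str]                                        # available nodes
    owners: dict[str, str] = field(default_factory=dict)   # group -> owner node

    def fail_node(self, failed: str) -> list[str]:
        """Mark a node as failed and move its groups to a surviving node.

        Returns the names of the groups that were moved.
        """
        self.nodes.discard(failed)
        moved = []
        for group, owner in sorted(self.owners.items()):
            if owner != failed:
                continue
            if not self.nodes:
                raise RuntimeError("no surviving node can host " + group)
            # Assumption: pick the surviving node hosting the fewest groups.
            target = min(
                sorted(self.nodes),
                key=lambda n: sum(1 for o in self.owners.values() if o == n),
            )
            # The group goes offline on the failed node and online on target.
            self.owners[group] = target
            moved.append(group)
        return moved


cluster = Cluster(
    nodes={"NODE1", "NODE2"},
    owners={"FileServices": "NODE1", "PrintServices": "NODE2"},
)
moved = cluster.fail_node("NODE1")
print(moved, cluster.owners)
```

In this toy model the group name and its data stay the same after the move, which mirrors the point that clients reconnect to the same clustered service on a different node.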

The second Windows Server 2008 R2 clustering technology is Network Load Balancing (NLB), which is best suited to provide fault tolerance for front-end web applications and websites, Remote Desktop Services Session Host server systems, VPN servers, streaming media servers, and proxy servers. NLB provides fault tolerance by having each server in the cluster individually run the network services or applications, removing any single points of failure. Depending on the particular needs of the service or application deployed on an NLB cluster, there are different configuration or affinity options to determine how clients will be connected to the back-end NLB cluster nodes. For example, on a read-only website, client requests can be directed to any of the NLB cluster nodes; during a single visit to a website, a client might be connected to different NLB cluster nodes. As another example, when a client attempts to utilize an e-commerce application to purchase goods or services provided through a web-based application on an NLB cluster, the client session should be initiated and serviced by a single node in the cluster, as this session will most likely be using Secure Sockets Layer (SSL) encryption and will also contain specific session data, including the contents of the shopping cart and the end-user specific information.
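The effect of the affinity options can be illustrated with a short Python sketch. It assumes the three NLB affinity settings (None, Single, and Network) and IPv4 clients, and it uses a SHA-256 hash purely for illustration; it is not the algorithm NLB actually uses, and `pick_host` and the host names are hypothetical.

```python
import hashlib
import ipaddress


def pick_host(client_ip: str, client_port: int, hosts: list, affinity: str) -> str:
    """Map a client connection to one host by hashing an affinity key."""
    if affinity == "none":
        # Different connections from one client may land on different hosts.
        key = f"{client_ip}:{client_port}"
    elif affinity == "single":
        # All connections from the same client IP go to the same host.
        key = client_ip
    elif affinity == "network":
        # All clients in the same /24 (Class C) network go to the same host.
        net = ipaddress.ip_network(f"{client_ip}/24", strict=False)
        key = str(net.network_address)
    else:
        raise ValueError(f"unknown affinity: {affinity}")
    digest = hashlib.sha256(key.encode()).digest()
    return hosts[int.from_bytes(digest[:8], "big") % len(hosts)]


hosts = ["HOST1", "HOST2", "HOST3"]
# A shopping-cart session needs the same node for every connection.
first = pick_host("192.0.2.10", 50000, hosts, "single")
second = pick_host("192.0.2.10", 50001, hosts, "single")
print(first == second)
```

With "single" affinity the source port does not influence the choice, so an SSL session with shopping-cart state stays on one node, while "none" suits the read-only website case.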

Note

Microsoft does not support running failover clusters and Network Load Balancing on the same Windows Server 2008 R2 system.

Windows Server 2008 R2 Cluster Terminology

Before failover or NLB clusters can be designed and implemented, the administrator deploying the solution should be familiar with the general terms used to define the clustering technologies. The following list contains many of the terms associated with Windows Server 2008 R2 clustering technologies:

Cluster— A cluster is a group of independent servers (nodes) that are accessed and presented to the network as a single system.
Node— A node is an individual server that is a member of a cluster.
Cluster resource— A cluster resource is a service, application, IP address, disk, or network name defined and managed by the cluster. Within a cluster, cluster resources are grouped and managed together using cluster resource groups, now known as Services and Applications groups.
Services and Applications group— Cluster resources are contained within a cluster in a logical set called a Services and Applications group or historically referred to as a cluster group. Services and Applications groups are the units of failover within the cluster. When a cluster resource fails and cannot be restarted automatically, the Services and Applications group this resource is a part of will be taken offline, moved to another node in the cluster, and the group will be brought back online.
Client Access Point— A Client Access Point is a term used in Windows Server 2008 R2 failover clusters that represents the combination of a network name and associated IP address resource. By default, when a new Services and Applications group is defined, a Client Access Point is created with a name and an IPv4 address. IPv6 is supported in failover clusters but an IPv6 resource either needs to be added to an existing group or a generic Services and Applications group needs to be created with the necessary resources and resource dependencies.
Virtual cluster server— A virtual cluster server is a Services and Applications group that contains a Client Access Point, a disk resource, and at least one additional service or application-specific resource. Virtual cluster server resources are accessed either by the domain name system (DNS) name or a NetBIOS name that references an IPv4 or IPv6 address. A virtual cluster server can in some cases also be directly accessed using the IPv4 or IPv6 address. The name and IP address remain the same regardless of which cluster node the virtual server is running on.
Active node— An active node is a node in the cluster that is currently running at least one Services and Applications group. A Services and Applications group can only be active on one node at a time and all other nodes that can host the group are considered passive for that particular group.
Passive node— A passive node is a node in the cluster that is currently not running any Services and Applications groups.
Active/passive cluster— An active/passive cluster is a cluster that has at least one node running a Services and Applications group and additional nodes that can host the group but are currently in a waiting state. This is a typical configuration when only a single Services and Applications group is deployed on a failover cluster.
Active/active cluster— An active/active cluster is a cluster in which each node is actively hosting or running at least one Services and Applications group. This is a typical configuration when multiple groups are deployed on a single failover cluster to maximize server or system usage. The downside is that when an active system fails, the remaining system or systems need to host all of the groups and provide the services and/or applications on the cluster to all necessary clients.
Cluster heartbeat— The cluster heartbeat is a term used to represent the communication that is kept between individual cluster nodes that is used to determine node status. Heartbeat communication can occur on a designated network but is also performed on the same network as client communication. Due to this internode communication, network monitoring software and network administrators should be forewarned of the amount of network chatter between the cluster nodes. The amount of traffic that is generated by heartbeat communication is not large based on the size of the data but the frequency of the communication might ring some network alarm bells.
Cluster quorum— The cluster quorum maintains the definitive cluster configuration data and the current state of each node, each Services and Applications group, and each resource and network in the cluster. Furthermore, when each node reads the quorum data, depending on the information retrieved, the node determines if it should remain available, shut down the cluster, or activate any particular Services and Applications groups on the local node. To extend this even further, failover clusters can be configured to use one of four different cluster quorum models and essentially the quorum type chosen for a cluster defines the cluster. For example, a cluster that utilizes the Node and Disk Majority Quorum can be called a Node and Disk Majority cluster.
Cluster witness disk or file share— The cluster witness disk or the witness file share is used to store the cluster configuration information and to help determine the state of the cluster when some, if not all, of the cluster nodes cannot be contacted.
Generic cluster resources— Generic cluster resources were created to define and add new or undefined services, applications, or scripts that are not already included as available cluster resources. Adding a custom resource provides the ability for that resource to be failed over between cluster nodes when another resource in the same Services and Applications group fails. Also, when the group the custom resource is a member of moves to a different node, the custom resource will follow. One disadvantage or lack of functionality with custom resources is that the Failover Clustering feature cannot actively monitor the resource and, therefore, cannot provide the same level of resilience and recoverability as with predefined cluster resources. Generic cluster resources include the generic application, generic script, and generic service resource.
Shared storage— Shared storage is a term used to represent the disks and volumes presented to the Windows Server 2008 R2 cluster nodes as LUNs. In particular, shared storage can be accessed by each node on the cluster, but not simultaneously.
Cluster Shared Volumes— A Cluster Shared Volume is a disk or LUN defined within the cluster that can be accessed by multiple nodes in the cluster simultaneously. This is unlike any other cluster volume, which normally can only be accessed by one node at a time. Currently, the Cluster Shared Volume feature is only used on Hyper-V clusters, but its usage will be extended in the near future to any failover cluster that will support live migration.
LUN— LUN stands for Logical Unit Number. A LUN is used to identify a disk or a disk volume that is presented to a host server or multiple hosts by a shared storage array or a SAN. LUNs provided by shared storage arrays and SANs must meet many requirements before they can be used with failover clusters but when they do, all active nodes in the cluster must have exclusive access to these LUNs.
Failover— Failover is the process of a Services and Applications group moving from the current active node to another available node in the cluster when a cluster resource fails. Failover occurs when a server becomes unavailable or when a resource in the cluster group fails and cannot recover within the failure threshold.
Failback— Failback is the process of a cluster group automatically moving back to a preferred node after the preferred node resumes operation. Failback is a nondefault configuration that can be enabled within the properties of a Services and Applications group. The cluster group must have a preferred node defined and a failback threshold defined as well, for failback to function. A preferred node is the node you would like your cluster group to be running or hosted on during regular cluster operation when all cluster nodes are available. When a group is failing back, the cluster is performing the same failover operation but is triggered by the preferred node rejoining or resuming cluster operation instead of by a resource failure on the currently active node.
Live Migration— Live Migration is a new feature of Hyper-V that is enabled when Virtual Machines are deployed on a Windows Server 2008 R2 failover cluster. Live Migration enables Hyper-V virtual machines on the failover cluster to be moved between cluster nodes without disrupting communication or access to the virtual machine. Live Migration utilizes a Cluster Shared Volume that is accessed by all nodes in the group simultaneously and it transfers the memory between the nodes during active client communication to maintain availability. Live Migration is currently only used with Hyper-V failover clusters but will most likely extend to many other Microsoft services and applications in the near future.
Quick Migration— With Hyper-V virtual machines on failover clusters, Quick Migration provides the option for failover cluster administrators to move the virtual machine to another node without shutting the virtual machine off. This utilizes the virtual machine’s shutdown settings options and if set to Save, the default setting, performing a Quick Migration will save the current memory state, move the virtual machine to the desired node, and resume operation shortly. End users should only encounter a short disruption in service and should reconnect without issue depending on the service or application hosted within that virtual machine. Quick Migration does not require Cluster Shared Volumes to function.
Geographically dispersed clusters— These are clusters that span physical locations and sometimes networks to provide failover functionality in remote buildings and data centers, usually across a WAN link. These clusters can now span different networks and can provide failover functionality, but network response and throughput must be good and data replication is not handled by the cluster.
Multisite cluster— Geographically dispersed clusters are commonly referred to as multisite clusters as cluster nodes are deployed in different Active Directory sites. Multisite clusters can provide access to resources across a WAN and can support automatic failover of Services and Applications groups defined within the cluster.
Stretch clusters— A stretch cluster is a common term that, in some cases, refers to geographically dispersed clusters in which different subnets are used but each of the subnets is part of the same Active Directory site—hence, the term stretch, as in stretching the AD site across the WAN. In other cases, this term is used to describe a geographically dispersed cluster, as in the cluster stretches between geographic locations.
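The majority-based quorum models mentioned in the cluster quorum entry come down to simple vote counting. The following Python sketch assumes one vote per node plus one optional vote for a witness disk or witness file share, and it does not cover the No Majority: Disk Only model. The function name `has_quorum` and its parameters are hypothetical.

```python
def has_quorum(
    nodes_total: int,
    nodes_online: int,
    witness_configured: bool = False,
    witness_online: bool = False,
) -> bool:
    """Return True if the online votes form a strict majority of all votes."""
    if not 0 <= nodes_online <= nodes_total:
        raise ValueError("nodes_online must be between 0 and nodes_total")
    total_votes = nodes_total + (1 if witness_configured else 0)
    online_votes = nodes_online
    if witness_configured and witness_online:
        online_votes += 1
    # More than half of all configured votes must be reachable.
    return online_votes > total_votes // 2


# Node Majority with five nodes: three nodes are enough, two are not.
print(has_quorum(5, 3), has_quorum(5, 2))
# Node and Disk Majority with four nodes: two nodes plus the witness suffice.
print(has_quorum(4, 2, witness_configured=True, witness_online=True))
```

The witness vote is what lets an even-numbered cluster survive the loss of half of its nodes, which is why the quorum model chosen effectively defines the cluster.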

Troubleshooting Network Load Balancing Clusters

Applies To: Windows Server 2008, Windows Server 2008 R2, Windows Server 2012

This section lists some common issues that you might encounter when using Network Load Balancing (NLB) clusters.

Note

The NLB functionality in Windows Server 2012 is generally the same as in Windows Server 2008 R2. However, some task details are changed in Windows Server 2012. For information on new ways to do tasks in Windows Server 2012, see Common Management Tasks and Navigation.

What problem are you having?

After installing Network Load Balancing and restarting a cluster host, a message appears: "The system has detected an IP address conflict with another system on the network..."

Cause: The same IP address already exists on the network.
Solution: Choose a new IP address, or remove the duplicate address.
Cause: You have configured different cluster operation modes (Unicast or Multicast) on the hosts, which causes two different MAC addresses to map to the same IP address.
Solution: Ensure that all hosts are configured with the same cluster operation mode.
Cause: You configured the cluster's IP address before NLB was bound to the network adapter.
Solution: Remove the cluster's IP address from TCP/IP properties, enable NLB on the proper adapter, and then configure the cluster's IP address.
Cause: You added the cluster's IP address to a network adapter that has not been enabled for NLB.
Solution: Remove the cluster's IP address from the incorrect adapter's TCP/IP properties, enable NLB on the proper adapter, and then configure the cluster's IP address.

For more information about enabling NLB, see Installing Network Load Balancing.

There is no response when you use ping to access the cluster's IP address from an outside network.

Verify that you can use ping to access the dedicated IP addresses for the cluster hosts from a computer outside the router. If this test fails, and you are using multiple network adapters, the issue is not related to NLB. If you are using a single network adapter for the dedicated and cluster IP addresses, consider the following causes:

Cause: If you are using multicast support, you might find that your router has difficulty resolving the primary IP address into a multicast media access control (MAC) address by using the Address Resolution Protocol (ARP).
Solution: Verify that you can use ping to access the cluster from a client on the cluster's subnet and to access the cluster hosts' dedicated IP addresses from a computer outside the router. If these tests work properly, the router is probably at fault. You should be able to add a static ARP entry to the router to circumvent the issue. You can also turn off NLB multicast support and use a unicast network address without a hub.
Cause: When using NLB in multicast or unicast mode, routers need to accept proxy ARP responses (IP-to-network address mappings that are received with a different network source address in the Ethernet frame).
Solution: Make sure that your router has proxy ARP support turned on. You can also set a static ARP entry to keep proxy ARP support disabled in the router.
Cause: Internet Control Message Protocol (ICMP) traffic to the cluster is blocked by a router or firewall.
Solution: Allow ICMP traffic through the router or firewall. Be aware that this may expose your system to additional security risk.
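The ping checks for this issue follow a simple decision sequence, which can be written down as a small Python helper. This is a sketch that only encodes the reasoning in the text above; it does not send any pings itself, and the function name, parameter names, and result labels are hypothetical.

```python
def diagnose_cluster_ping(
    cluster_ip_from_outside: bool,
    dedicated_ips_from_outside: bool,
    cluster_ip_from_same_subnet: bool,
    multiple_adapters: bool,
) -> str:
    """Interpret four manual ping results (True means a reply was received)."""
    if cluster_ip_from_outside:
        return "ok"
    if not dedicated_ips_from_outside:
        # With multiple adapters, unreachable dedicated IPs are not an NLB issue.
        return "not-nlb" if multiple_adapters else "inconclusive"
    if cluster_ip_from_same_subnet:
        # Hosts and cluster are fine locally, so the router is the likely
        # culprit: ARP resolution, proxy ARP support, or ICMP filtering.
        return "router"
    return "inconclusive"


print(diagnose_cluster_ping(False, True, True, False))
```

A result of "router" corresponds to the cases where a static ARP entry, proxy ARP support, or an ICMP rule on the router or firewall is the next thing to check.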

There is no response when using ping to access a host's dedicated IP addresses from another cluster host.

Cause: When using NLB in multicast or unicast mode, routers need to accept proxy ARP responses (IP-to-network address mappings that are received with a different network source address in the Ethernet frame).
Solution: Make sure that your router has proxy ARP support turned on. You can also set a static ARP entry to keep proxy ARP support disabled in the router.
Cause: Internet Control Message Protocol (ICMP) traffic to the cluster is blocked by a router or firewall.
Solution: Allow ICMP traffic through the firewall or router. Be aware that this may expose your system to additional security risk.

When attempting to use Network Load Balancing Manager to connect to a host in your cluster, you receive the error "Host unreachable."

Cause: Internet Control Message Protocol (ICMP) traffic to the host is either blocked by a router or firewall or disabled on the host's network adapter.
Solution: Enable ICMP on the host's network adapter or allow ICMP traffic through the firewall or router. Be aware that this may expose your system to additional security risk. You can also use NLB Manager's /noping option.

When using Telnet or attempting to browse a computer outside the cluster from a cluster host, there is no response.

Cause: The host's dedicated IP address might not be listed first in the TCP/IP properties. To check, verify that you can use ping to access the computer outside the cluster; if this test succeeds, the order of the IP addresses is the likely cause.
Solution: List the host's dedicated IP address first in the TCP/IP properties. If ping fails to access the computer outside of the cluster, refer to the ping-related issues described earlier in this Troubleshooting topic.

When invoking the Network Load Balancing remote control commands from a computer outside the cluster, there is no response from one or more cluster hosts.

Cause: Remote control commands are not being sent to the cluster's IP address.
Solution: Commands must be sent to the cluster's primary IP address, which was assigned in the Network Load Balancing Properties dialog box. Be sure that you send remote commands to the correct IP address.
Cause: The remote control traffic is being encrypted by Internet Protocol security (IPSec). NLB remote control commands will not work correctly if they are sent from a computer that has IPSec configured so that the remote control traffic is encrypted by IPSec.
Solution: Disable IPSec.

For more information, see the Internet Protocol Security (IPSec) Help content.
Cause: NLB UDP control ports are protected incorrectly by a firewall. By default, remote control commands are sent to UDP ports 1717 and 2504 at the cluster IP address.
Solution: Be sure that these ports have not been blocked incorrectly by a router or firewall. You can also change the port number by modifying the corresponding NLB parameter.

There is no reply when you use the dedicated IP address of a host to specify it as a target for a remote control command. However, specifying the host by its priority (ID) works.

Cause: None of the hosts have a dedicated IP address.
Solution: Assign a dedicated IP address to each host. For more information, see Configure Network Load Balancing Host Parameters.

Connectivity to the cluster is denied to some users, but not all.

Cause: An application that is being load balanced is not responding.
Solution: This is an application-specific issue that is not related to NLB. Refer to your application's documentation to correct this issue. You may need to stop and restart the application.
Cause: If your cluster is configured for unicast mode, a switch might have learned the NLB network adapter's MAC address.
Solution: Clear the switch's port to MAC address mapping.
Cause: The cluster's IP address was not added to TCP/IP on one or more of the hosts.
Solution: If you do not use NLB Manager to configure your cluster, you must manually configure TCP/IP with the cluster's IP address.
Cause: A host is leaving the cluster because of a drainstop or stop command, but convergence did not complete correctly.
Solution: Wait for the convergence to complete. If the convergence does not complete, see the following issue later in this Troubleshooting topic:

After the cluster hosts start, they begin converging, but they never complete convergence.

You cannot view or change the Network Load Balancing properties by using net config and Windows Management Instrumentation (WMI).

Cause: To view or change Network Load Balancing properties, you must be a member of the Administrators group.
Solution: Log on as a user who is in the local Administrators group of the computer that is running NLB.

An unusual number of TCP connections to the cluster's IP address are being reset by the server or the client.

Cause: HTTP keep-alive values are enabled on the NLB hosts, and clients with keep-alive values enabled are connecting to the cluster.
Solution: Disable HTTP keep-alive values. For more information about HTTP keep-alive values and Internet Information Services (IIS), refer to the IIS documentation set.

To view the IIS documentation set from your desktop, install IIS, then click Start, click Run, and type the following command in the Open text box:

%windir%\help\iisrv.chm
Cause: Low system resources on the server are causing TCP to reject the connections.
Solution: Free system resources by, for example, adding additional system memory or closing unnecessary applications.
Cause: The cluster has diverged into two separately converged clusters, which causes more than one node to claim ownership of every connection.
Solution: Remove the two clusters, then recreate a single cluster.

Virtual Private Network (VPN) calls fail when you make a change that causes convergence (such as adding a host, removing a host, or draining a host).

Cause: When using NLB to load balance VPN traffic, you must configure the port rules that govern the ports handling the VPN traffic (TCP port 1723 for PPTP/GRE and UDP port 500 for IPSEC/L2TP) to use either Single or Network affinity.
Solution: Configure the port rules that govern ports 500 and 1723 to use Single or Network affinity. For more information, see Network Load Balancing Manager Properties.
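The affinity requirement for VPN traffic can be checked mechanically against a list of port rules. The Python sketch below is an assumption-laden illustration: the `PortRule` class and `vpn_rule_problems` function are hypothetical, and the rules are typed in by hand rather than read from a real NLB cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortRule:
    start: int       # first port in the range
    end: int         # last port in the range
    protocol: str    # "TCP", "UDP", or "Both"
    affinity: str    # "None", "Single", or "Network"


# Ports named in the text: TCP 1723 (PPTP) and UDP 500 (IPsec/L2TP).
VPN_PORTS = [(1723, "TCP"), (500, "UDP")]


def vpn_rule_problems(rules):
    """Return a list of problems with the rules that cover the VPN ports."""
    problems = []
    for port, proto in VPN_PORTS:
        matching = [
            r for r in rules
            if r.start <= port <= r.end and r.protocol in (proto, "Both")
        ]
        if not matching:
            problems.append(f"no port rule covers {proto} {port}")
        elif any(r.affinity not in ("Single", "Network") for r in matching):
            problems.append(f"{proto} {port} must use Single or Network affinity")
    return problems


rules = [
    PortRule(1723, 1723, "TCP", "Single"),
    PortRule(500, 500, "UDP", "None"),
]
print(vpn_rule_problems(rules))
```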

After the cluster hosts start, they begin converging, but they never complete convergence.

Cause: A different number of port rules or incompatible port rules on different cluster hosts were entered. This will inhibit convergence.
Solution: Open the Network Load Balancing Properties dialog box on each cluster host and verify that all hosts have identical port rules.
Cause: You have a bad network adapter or cable.
Solution: Use the ping command to test connectivity. Enter the host's fully qualified domain name. You can also learn more about the issue by using the ping command to reach your domain controller by IP address and other network servers by name and IP address.
Cause: Duplex settings on a switch or hub are mismatched.
Solution: Confirm that the duplex settings in each of your switches and hubs are configured appropriately.
Cause: The dedicated IP address that you used for one of the hosts already exists on the network.
Solution: Choose a new IP address, or remove the duplicate address.
Cause: Your cluster contains hosts that are running Windows 2000.
Solution: Your cluster must be running Windows Server 2008 on all hosts. An NLB cluster environment that contains hosts with Windows Server 2003 and Windows Server 2008 is supported only when performing a rolling upgrade to Windows Server 2008. Mixing Windows Server 2003 and Windows Server 2008 in the same cluster is not supported for long periods of time.
Cause: You have configured different cluster operation modes (unicast and multicast) on the hosts.
Solution: Use NLB Manager to ensure that all hosts are configured with the same cluster operation mode.

Note

You can also view the Windows event logs to check for errors and warnings. For more information see Installing Network Load Balancing.
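Several of the convergence causes above are configuration mismatches between hosts, which are easy to miss when checking each host by hand. The following Python sketch compares host settings that have been collected into plain dictionaries. The `find_config_problems` function, the dictionary keys, and the sample values are hypothetical, and nothing here queries a real NLB host.

```python
from __future__ import annotations


def find_config_problems(hosts: dict[str, dict]) -> list[str]:
    """Compare every host against the first one and report mismatches."""
    problems = []
    names = sorted(hosts)
    reference_name = names[0]
    reference = hosts[reference_name]
    for name in names[1:]:
        cfg = hosts[name]
        # All hosts must use the same cluster operation mode.
        if cfg["mode"] != reference["mode"]:
            problems.append(
                f"{name}: operation mode {cfg['mode']} differs from {reference_name}"
            )
        # All hosts must have identical port rules.
        if sorted(cfg["port_rules"]) != sorted(reference["port_rules"]):
            problems.append(f"{name}: port rules differ from {reference_name}")
    # Dedicated IP addresses must be unique on the network.
    seen = {}
    for name in names:
        ip = hosts[name]["dedicated_ip"]
        if ip in seen:
            problems.append(f"{name}: dedicated IP {ip} also used by {seen[ip]}")
        else:
            seen[ip] = name
    return problems


hosts = {
    "HOST1": {
        "mode": "unicast",
        "dedicated_ip": "10.0.0.11",
        "port_rules": [(80, 80, "TCP", "Single")],
    },
    "HOST2": {
        "mode": "multicast",
        "dedicated_ip": "10.0.0.11",
        "port_rules": [(80, 80, "TCP", "Single"), (443, 443, "TCP", "Single")],
    },
}
for problem in find_config_problems(hosts):
    print(problem)
```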

The cluster moves in and out of a converged state.

Cause: Heartbeats are being missed due to intermittent network connectivity caused by a bad network adapter or cable or other network problems.
Solution: Use the ping command to test connectivity. Enter the host's fully qualified domain name. You can also learn more about the issue by using the ping command to reach your domain controller by IP address and other network servers by name and IP address.

After the cluster hosts start, Network Load Balancing reports that convergence has finished, but more than one host is a default host.

Cause: The cluster hosts have become members of different subnets, so all the hosts are not accessible on the same network.
Solution: Be sure that all cluster hosts can communicate with each other.
Cause: A layer-three switch is being used.
Solution: Put a layer-two switch between the hosts and the layer-three switch.
Cause: A break in a redundant switch caused the cluster to separate into two clusters, creating two default hosts.
Solution: Remove the two clusters, then create a single cluster.
Cause: Your switch is configured to reject broadcast packets.
Solution: Configure your switch to accept broadcast packets (be aware that this might introduce certain security risks), or configure your NLB cluster to use multicast mode.
Cause: One host is unable to send or receive heartbeats.
Solution: Use the ping command to test connectivity to each of the hosts. Enter each host's fully qualified domain name.
Cause: A host is plugged into the wrong port on the switch.
Solution: Use the correct port on the switch.

Network Load Balancing is not load balancing applications, and the default host handles all the network traffic.

Cause: A port rule is missing. By default, NLB directs all incoming network traffic that is not governed by port rules to the default host—this ensures that applications that you do not want load balanced behave properly.
Solution: To load balance an application across the cluster, create a port rule on every cluster host for the TCP/IP ports that are serviced by the application.
Cause: You added a second host to a single host cluster, but the second host is not configured correctly. The cluster never converges and the original host continues to handle all of the traffic.
Solution: Carefully review (and if necessary, correct) each of the settings on the second host—for example, the cluster IP address, dedicated IP address, and port rules.
Cause: If your cluster is configured for unicast mode, a switch might have learned the NLB network adapter's MAC address.
Solution: Clear the switch's port to MAC address mapping.
Cause: A proxy server is sending all connections that are using a single IP address to your cluster in single affinity mode.
Solution: Configure your proxy server to use multiple IP addresses.
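The role of port rules in this issue can be shown with a short Python sketch: traffic on a port that no rule covers goes to the default host instead of being load balanced. The sketch assumes the default host is the one with the lowest priority number, and the function `handling_for_port`, the rule tuples, and the host names are hypothetical.

```python
def handling_for_port(port, port_rules, host_priorities):
    """Describe how traffic to a port would be handled.

    port_rules is a list of (start, end) port ranges; host_priorities maps
    each host name to its unique priority ID.
    """
    for start, end in port_rules:
        if start <= port <= end:
            return "load balanced across the cluster hosts"
    # Assumption: the lowest priority number identifies the default host.
    default_host = min(host_priorities, key=host_priorities.get)
    return f"handled by default host {default_host}"


port_rules = [(80, 80), (443, 443)]
host_priorities = {"HOST1": 1, "HOST2": 2}
print(handling_for_port(80, port_rules, host_priorities))
print(handling_for_port(8080, port_rules, host_priorities))
```

If an application listens on a port such as 8080 in this example, adding a matching port rule on every host is what moves it from the default host to the whole cluster.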

Traffic alternates unexpectedly between the cluster hosts, and it breaks TCP connections.

Cause: Unicast network addresses are causing issues with the switching hub. If you are using a switching hub to interconnect the cluster hosts, you must use NLB multicast support. Otherwise, the switch can behave erratically when the same unicast network address is used on multiple switch ports.
Solution: Check that you have selected multicast support in the Network Load Balancing Properties dialog box. If you do not want to use multicast support, you can interconnect the cluster hosts with a hub or coaxial cable instead of with a switch.

Network traffic does not appear to load balance evenly among the cluster hosts.

Cause: The network traffic is coming from a limited number of IP addresses, possibly due to the setting on a proxy server.
Solution: Configure your proxy server to use multiple IP addresses.

When you are using Network Load Balancing with Microsoft Internet Security and Acceleration (ISA) Server, one cluster host logs blocked packets that are directed to the dedicated Internet Protocol (IP) address of another host.

Cause: One of the cluster hosts is configured with a host priority identifier equal to 1.
Solution: Do not configure any cluster host with a host priority identifier of 1. Use numbers that are greater than 1. For more information, see Configure Network Load Balancing Host Parameters.

You are unable to create a Network Load Balancing cluster in a 64-bit version environment.

Cause: You might not be running the appropriate NLB version for your environment. NLB cannot form a cluster when the 32-bit version of NLB is used on a 64-bit version computer. This issue might have gone undetected because 32-bit NLB components (nlb.exe, wlbs.exe, and nlbmgr.exe) appear to run correctly in the 64-bit version environment.
Solution: If you plan to use a 64-bit version computer environment, you must use the 64-bit NLB version.

Note

The following topics describe several common issues that you might encounter when installing and initially using NLB. The topics describe the likely reasons for each issue and one or more suggested remedies. These topics assume that your system and applications meet the minimum requirements for NLB. For more information, see: Overview of Network Load Balancing and Installing Network Load Balancing.

You should test your network and all network adapters for proper operation before installing NLB. Be sure to follow all installation steps, and check that the cluster parameters and port rules are identically set for all cluster hosts. If an issue occurs, always check the Windows event log for a message from the NLB driver. For more information, see the sections titled Cluster parameters, Host parameters, and Port rules in Network Load Balancing Manager Properties.