Admins Experience: MICROSOFT WINDOWS SERVER 2008 FAILOVER CLUSTER

Microsoft Windows Server 2008 Failover Cluster

This template assesses the status and overall performance of a Microsoft Windows 2008 Failover Cluster by retrieving information from performance counters and the Windows System Event Log. For more information, refer to the following Microsoft article: http://technet.microsoft.com/en-us/library/cc720058%28WS.10%29.aspx.

Prerequisites: WMI access to the target server.

Credentials: Windows Administrator on the target server.

Note: All Windows Event Log monitors should return zero values. A returned value other than zero indicates an abnormality. Examining the Windows System event log should provide information about the issue. Detailed information about these events can be found here: http://technet.microsoft.com/en-us/library/dd353290(WS.10).aspx.
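As a rough illustration of what an event log monitor looks for, the sketch below builds a Windows Event Log XPath filter from a list of Event IDs and event types. It is not taken from the template itself: the function name build_event_query and its parameters are hypothetical, and the optional provider filter is an assumption you would need to confirm for your environment. The level numbers (2 for Error, 3 for Warning) are the standard Windows event levels.

```python
# Standard Windows event level numbers for the event types used in this template.
LEVELS = {"Error": 2, "Warning": 3}


def build_event_query(event_ids, event_types=("Error",), provider=None):
    """Build an XPath filter for the Windows Event Log (hypothetical helper).

    event_ids   -- Event IDs to match, for example [1541, 1542, 1543]
    event_types -- names from LEVELS, for example ("Warning", "Error")
    provider    -- optional event source name (an assumption, not from the template)
    """
    if not event_ids:
        raise ValueError("at least one Event ID is required")
    clauses = []
    if provider:
        clauses.append(f"Provider[@Name='{provider}']")
    clauses.append("(" + " or ".join(f"Level={LEVELS[t]}" for t in event_types) + ")")
    clauses.append("(" + " or ".join(f"EventID={i}" for i in event_ids) + ")")
    return "*[System[" + " and ".join(clauses) + "]]"


# Event IDs of the Backup and Restore Functionality Problems monitor.
query = build_event_query([1541, 1542, 1543])
print(query)
```

A filter like this can typically be passed to a tool that accepts Event Log XPath queries against the System log, for example wevtutil with its /q option. Any events it returns correspond to a non-zero monitor value.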

Monitored Components

Note: You need to set thresholds for counters according to your environment. It is recommended to monitor counters for some period of time to understand potential value ranges and then set the thresholds accordingly. For more information, see http://knowledgebase.solarwinds.com/kb/questions/2415.
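One possible way to turn a baseline period into thresholds is sketched below. The approach (mean plus a multiple of the standard deviation) and the names suggest_thresholds, warn_sigma, and crit_sigma are assumptions chosen for illustration, not a method prescribed by the template. Adjust the multipliers to suit your environment.

```python
import statistics


def suggest_thresholds(samples, warn_sigma=2.0, crit_sigma=3.0):
    """Suggest warning and critical thresholds from baseline counter samples.

    Uses the mean plus a multiple of the population standard deviation.
    The multipliers are illustrative defaults, not recommended values.
    """
    if len(samples) < 2:
        raise ValueError("need at least two baseline samples")
    mean = statistics.mean(samples)
    spread = statistics.pstdev(samples)
    return {
        "warning": mean + warn_sigma * spread,
        "critical": mean + crit_sigma * spread,
    }


# Hypothetical baseline values collected for one counter.
baseline = [2, 4, 4, 4, 5, 5, 7, 9]
thresholds = suggest_thresholds(baseline)
print(thresholds)
```

For this sample baseline the mean is 5 and the standard deviation is 2, so the suggested warning threshold is 9 and the critical threshold is 11.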

Service: Windows Time

This monitor returns the CPU and memory usage of the Windows Time service. This service maintains date and time synchronization on all clients and servers in the network. If this service is stopped, date and time synchronization will be unavailable. If this service is disabled, any services that explicitly depend on it will fail to start.

Service: Cluster Service

This monitor returns the CPU and memory usage of the Cluster service. This service enables servers to work together as a cluster to keep server-based applications highly available, regardless of individual component failures. If this service is stopped, clustering will be unavailable. If this service is disabled, any services that explicitly depend on it will fail to start.

Network Reconnections: Reconnect Count

This monitor returns the number of times the nodes have reconnected.

Note: The instance field is installation-specific. You need to specify the hostname of your cluster node (for example: node1). By default, this component monitor is disabled and should only be enabled for troubleshooting purposes.

Network Reconnections: Normal Message Queue Length

This monitor returns the number of normal messages that are in the queue waiting to be sent. Normally this number is 0, but if the TCP connection breaks, you might observe it going up until the TCP connection is re-established, thereby allowing all messages to be sent.

Note: The instance field is installation-specific. You need to specify the hostname of your cluster node (for example: node1). By default, this component monitor is disabled and should only be enabled for troubleshooting purposes.

Network Reconnections: Urgent Message Queue Length

This monitor returns the number of urgent messages that are in the queue waiting to be sent. Normally this number is 0, but if the TCP connection breaks, you might observe it going up until the TCP connection is re-established, thereby allowing all messages to be sent.

Messages Outstanding

This monitor returns the number of cluster MRR outstanding messages. The returned value should be near zero.

Resource Control Manager: Groups Online

This monitor returns the number of online cluster resource groups on this node. The returned value should be above zero at all times.

Resource Control Manager: RHS Processes

This monitor returns the number of running resource host subsystem processes (rhs.exe). The returned value should be above zero at all times.

Resource Control Manager: RHS Restarts

This monitor returns the number of resource host subsystem process (rhs.exe) restarts.

Note: By default, this component monitor is disabled and should only be enabled for troubleshooting purposes.

Resources: Resource Failure

This monitor returns the number of resource failures. The returned value should be as low as possible.

Resources: Resource Failure Access Violation

This monitor returns the number of resource failures caused by access violation. The returned value should be as low as possible.

Note: By default, this component monitor is disabled and should only be enabled for troubleshooting purposes.

Resources: Resource Failure Deadlock

This monitor returns the number of resource failures caused by deadlock. Deadlocks are usually caused by the resource taking too long to execute certain operations. The returned value should be as low as possible.

Note: By default, this component monitor is disabled and should only be enabled for troubleshooting purposes.
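The expectations stated for the counters above ("above zero at all times", "near zero", "as low as possible") can be expressed as simple rules. The following is a minimal sketch under stated assumptions: the function evaluate_counters is hypothetical, and low_limit is an arbitrary example value that you would replace with a threshold derived from your own baseline.

```python
# Counters that the descriptions above say should stay above zero at all times.
MUST_BE_POSITIVE = (
    "Resource Control Manager: Groups Online",
    "Resource Control Manager: RHS Processes",
)
# Counters that should be near zero or as low as possible.
SHOULD_BE_LOW = (
    "Messages Outstanding",
    "Resources: Resource Failure",
)


def evaluate_counters(values, low_limit=5):
    """Return a list of findings for one node; counters not in 'values' are skipped.

    low_limit is an illustrative assumption, not a value from the template.
    """
    findings = []
    for name in MUST_BE_POSITIVE:
        if name in values and values[name] <= 0:
            findings.append(f"{name}: expected above zero, got {values[name]}")
    for name in SHOULD_BE_LOW:
        if name in values and values[name] > low_limit:
            findings.append(f"{name}: expected at most {low_limit}, got {values[name]}")
    return findings


# Hypothetical counter readings from one cluster node.
sample = {
    "Resource Control Manager: Groups Online": 3,
    "Resource Control Manager: RHS Processes": 0,
    "Messages Outstanding": 12,
    "Resources: Resource Failure": 1,
}
findings = evaluate_counters(sample)
for line in findings:
    print(line)
```

With the sample readings, the RHS Processes and Messages Outstanding counters are reported, while the other two are within the example limits.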

Backup and Restore Functionality Problems

This monitor returns the number of events that occur when:

o The backup operation for the cluster configuration data has been aborted because quorum for the cluster has not yet been achieved;

o The restore request for the cluster configuration data has failed during the "pre-restore" or "post-restore" stage.

Type of event: Error. Event ID: 1541, 1542, 1543.

Check for the following pre-conditions to make sure they have been met, and then retry the backup or restore operation:

o The cluster must achieve quorum. In other words, enough nodes must be running and communicating (perhaps with a witness disk or witness file share, depending on the quorum configuration) that the cluster has achieved a majority, that is, quorum.

o The account used by the person performing the backup must be in the local Administrators group on each clustered server, and must be a domain account, or must have been delegated the equivalent authority.

During a restore, the restore software must obtain exclusive access to the cluster configuration database on a given node. If other software has access (open handles to the database), the restore cannot be performed.

Cluster Network Connectivity Problems

This monitor returns the number of events that occur when:

o The Cluster network interface for a cluster node on a specific network failed;

o The Cluster network is partitioned and some attached failover cluster nodes cannot communicate with each other over the network;

o The Cluster network is down;

o The Cluster IP address resource failed to come online;

o Attempting to use IPv4 for a specific network adapter failed.

Type of event: Warning and Error. Event ID: 1127, 1129, 1130, 1360, 1555.

Run the Validate a Configuration Wizard, selecting only the network tests. Also check network devices (adapters, cables, hubs, switches, etc.) and quorum configuration.

Compare the properties of the IP Address resource with the properties of the corresponding network to ensure that the network and subnet information match. If this is an IPv6 resource, make sure that the cluster network for this resource has at least one IPv6 prefix that is not link-local or tunnel.
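The comparison described above can be sketched with Python's standard ipaddress module. The function check_ip_resource and the sample addresses are hypothetical. The sketch only checks subnet membership and the link-local condition. It does not attempt to detect tunnel prefixes, which would need additional information about the network.

```python
import ipaddress


def check_ip_resource(address, network_prefixes):
    """Compare an IP Address resource with the prefixes of its cluster network.

    Returns a list of problem descriptions (empty if none were found).
    Tunnel prefixes are not detected by this simplified check.
    """
    addr = ipaddress.ip_address(address)
    networks = [ipaddress.ip_network(p) for p in network_prefixes]
    same_version = [n for n in networks if n.version == addr.version]
    problems = []
    if not any(addr in n for n in same_version):
        problems.append("address is not inside any prefix of the cluster network")
    if addr.version == 6 and all(n.is_link_local for n in same_version):
        problems.append("cluster network has no IPv6 prefix that is not link-local")
    return problems


# Hypothetical resource and network values.
print(check_ip_resource("192.168.10.50", ["192.168.10.0/24"]))
print(check_ip_resource("192.168.11.50", ["192.168.10.0/24"]))
print(check_ip_resource("fe80::1", ["fe80::/64"]))
```

The first call reports no problems, the second reports a subnet mismatch, and the third reports that the only IPv6 prefix on the network is link-local.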

Cluster Service Startup Problems

This monitor returns the number of events that occur when:

o The Cluster service suffered an unexpected fatal error;

o The Cluster service was halted due to incomplete connectivity with other cluster nodes;

o The Cluster service was halted to prevent an inconsistency within the failover cluster;

o The Cluster resource host subsystem (RHS) stopped unexpectedly;

o The Cluster resource either crashed or deadlocked;

o The Cluster service encountered an unexpected problem and will be shut down;

o The Cluster service has prevented itself from starting on this node. (This node does not have the latest copy of cluster configuration data.)

o The membership engine detected that the arbitration process for the quorum device has stalled.

Type of event: Error. Event ID: 1000, 1006, 1073, 1146, 1230, 1556, 1561, 1178.

There are various software or hardware related causes that can prevent the Cluster service from starting on a node. Sometimes the Cluster service can restart successfully after it has been interrupted by one of those causes. Review the event logs for indications of the problem.

Check network hardware and configuration. Use the Validate a Configuration Wizard to review the network configuration.

Check to see which resource DLL is causing the issue and report the problem to the resource vendor. Consider configuring the resource to run in its own Resource Monitor. Note that while a problem with a resource DLL will not stop the Cluster service from running, it can prevent other resource DLLs from running unless the resource runs in its own Resource Monitor.

Try starting the Cluster service on all other nodes in the cluster. If the Cluster service can be started on a node with the latest copy of the cluster configuration data, then the node that previously could not be started will probably be able to obtain the latest copy and then join the cluster successfully.

Cluster Shared Volume Functionality Problems

This monitor returns the number of events that occur when:

o The Cluster Shared Volume is no longer available on this node;

o The Cluster Shared Volume is no longer directly accessible from this cluster node;

o The Cluster service failed to create the Cluster Shared Volumes root directory;

o The Cluster service failed to set the permissions (ACL) on the Cluster Shared Volumes root directory;

o The Cluster Shared Volume is no longer accessible from this cluster node;

o The Cluster service failed to create a cluster identity token for Cluster Shared Volumes.

Type of event: Error. Event ID: 5120, 5121, 5123, 5134, 5135, 5142, 5200.

Review events related to communication with the volume.

Check storage and network configuration.

Check Cluster Shared Volumes folder creation and permissions.

Check communication between domain controllers and nodes.

Cluster Storage Functionality Problems

This monitor returns the number of events that occur when:

o The Cluster Physical Disk resource cannot be brought online because the associated disk could not be found;

o While the disk resource was being brought online, access to one or more volumes failed with an error;

o The file system for one or more partitions on the disk for the resource may be corrupt;

o The Cluster disk resource indicates corruption for a specific volume;

o The Cluster disk resource contains an invalid mount point.

Type of event: Error. Event ID: 1034, 1035, 1037, 1066, 1208.

Confirm that the affected disk is available.

Check the underlying storage hardware and confirm that the device is being presented correctly to the cluster nodes.

If you have problems with partitions on the disk or corruption, we recommend that you run Chkdsk so that it can correct any problems with the file system.

Confirm that the mounted disk is configured according to the following guidelines:

o Clustered disks can only be mounted onto clustered disks (not local disks);

o The mounted disk and the disk it is mounted onto must be part of the same clustered service or application. They cannot be in two different clustered services or applications, and they cannot be in the general pool of Available Storage in the cluster.

Cluster Witness Problems

This monitor returns the number of events that occur when:

o The Cluster service failed to update the cluster configuration data on the witness resource due to resource inaccessibility;

o The Cluster service detected a problem with the witness resource;

o The File Share Witness resource failed a periodic health check;

o The File Share Witness resource failed to come online;

o The File Share Witness resource failed to arbitrate for the specific file share;

o The node failed to form a cluster because the witness was not accessible.

Type of event: Error. Event ID: 1557, 1558, 1562, 1563, 1564, 1573.

Confirm witness accessibility by viewing the quorum configuration of a failover cluster and the status of a witness disk.

Configuration Availability Problems

This monitor returns the number of events that occur when:

o The cluster configuration database could not be loaded or unloaded;

o The cluster service cannot start due to failed attempts to read configuration data.

Type of event: Error. Event ID: 1057, 1090, 1574, 1575, 1593.

When the cluster configuration on a node is missing or corrupt, the Cluster service cannot load the configuration and therefore cannot start. Where possible, the Cluster service will obtain the latest cluster configuration from other nodes in the cluster. Ensure that other nodes are started. If the only node or nodes that can be started appear to have a missing or corrupt cluster configuration database, you will probably need to restore one of the nodes from a system state backup. (For a failover cluster node, the system state backup includes the cluster configuration.) Sometimes when the node attempts to unload the cluster configuration database, the action does not fully complete. Try stopping and restarting the Cluster service. If this does not succeed, restart the operating system on the affected node.

DFS Namespace Resource Availability Problems

This monitor returns the number of events that occur when:

o The creation of the DFS namespace root failed with an error;

o The resynchronization of the DFS root target failed with an error;

o The cluster file share resource for the DFS Namespace cannot be brought online due to an error.

Type of event: Error. Event ID: 1138, 1141, 1142.

Check DFS namespace configuration.

Encrypted Settings for Cluster Resource Could not be Applied

This monitor returns the number of events that occur when encrypted settings for a cluster resource could not be successfully applied to the container on this node.

Type of event: Error. Event ID: 1121.

Close any application that might have an open handle to the registry checkpoint indicated by the event. This will allow the registry key to be replicated as configured with the resource properties. If necessary, contact the application vendor about this problem. You can use a utility called Handle with the -a option to view handles to the registry.

Failed to Form Cluster

This monitor returns the number of Failed to Form cluster events.

Type of event: Error. Event ID: 1092, 1009.

You might be able to correct this issue by restarting the Cluster service.

File Share Resource Availability Problems

This monitor returns the number of events that occur when:

o The Cluster File Share cannot be brought online because a file share could not be created;

o Retrieving information for a specific share returned an error code;

o Retrieving information for a specific share indicated that the share does not exist;

o The creation of a file share failed due to an error;

o The Cluster file share resource has detected shared folder conflicts;

o The Cluster file server resource failed a health check because some of its shared folders were inaccessible.

Type of event: Warning and Error. Event ID: 1053, 1054, 1055, 1068, 1560, 1585, 1586, 1587, 1588.

Confirm that the share exists and that the permissions allow access to the share.

If possible, determine whether the path to the share has been changed. If so, recreate the share with the correct name.

View all the resources in the clustered file server instance to ensure that they are coming online, and review the dependencies among the resources. Reconfigure as necessary to correct any problems.

Ensure that no two shared folders have the same share name.

Check shared folder accessibility and the state of the Server service.

Generic Application Could not be Brought Online

This monitor returns the number of events that occur when a generic application could not be brought online because the attempt to create the process failed. Possible causes are that the application is not present on this node, the path name is incorrect, or the binary name is incorrect.

Type of event: Error. Event ID: 1039.

Confirm that the following are true for the application used by the clustered Generic Application instance:

o The application is fully installed on all nodes that are possible owners of the Generic Application resource;

o The configuration for the Generic Application resource specifies the correct application and path;

o The configuration for the Generic Application resource specifies the appropriate parameters and settings for registry replication.

Generic Service Resource Availability Problems

This monitor returns the number of events that occur when:

o The generic service is either not installed or the specified service name is invalid;

o The specified generic service parameters might be invalid;

o The generic service failed with an error.

Type of event: Error. Event ID: 1040, 1041, 1042.

Confirm that the correct service is specified in the configuration for the Generic Service resource and confirm that the service is fully installed on all nodes that are possible owners of the resource.

Check service operation and examine the application event log.

IP Address Resource Availability Problems

This monitor returns the number of events that occur when:

o The Cluster IP address resource cannot be brought online because the subnet mask value is invalid;

o The Cluster IP address resource cannot be brought online because the address value is invalid;

o The configuration data for the network adapter corresponding to the cluster network interface could not be determined;

o The Cluster IP address resource cannot be brought online because a duplicate IP address was detected on the network;

o The Cluster IP address resource cannot be brought online because WINS registration failed;

o The lease of the IP address associated with the cluster IP address resource has expired or is about to expire, and currently cannot be renewed;

o The IPv6 Tunnel address resource failed to come online because it does not depend on an IP Address (IPv4) resource;

o The Cluster network associated with dependent IP address (IPv4) resource does not support ISATAP tunneling.

Type of event: Error. Event ID: 1046, 1047, 1048, 1049, 1078, 1242, 1361, 1363.

Check the address, subnet, and network properties of the IP Address resource.

If the resource is an IPv6 Tunnel address resource, make sure it depends on at least one IP Address (IPv4) resource. Also make sure the network supports Intra-Site Automatic Tunnel Addressing Protocol (ISATAP) tunneling.

If the IP Address resource appears to be configured correctly, check the condition of network adapters and other network components used by the cluster.

Network Connectivity and Configuration Problems

This monitor returns the number of events that occur when:

o The Cluster Service was unable to access the network adapter or the cluster node has no network connectivity;

o The Cluster node has no network connectivity;

o The Cluster node has lost all network connectivity;

o The failover cluster virtual adapter failed to initialize the miniport adapter.

Type of event: Error. Event ID: 1289, 1553, 1554, 4871.

Correct any problems with the physical network adapters and the cluster virtual adapter. If a previous change in the configuration is interfering with the function of the cluster virtual adapter, it might be necessary to reinstall the failover clustering feature on the node. Also, use the Validate a Configuration Wizard to review the network configuration.

Node Failed to Join Cluster

This monitor returns the number of events that occur when the node failed to join the failover cluster due to an error.

Type of event: Error. Event ID: 1070.

You might be able to correct this issue by restarting the Cluster service.

Problems with Cluster Service

This monitor returns the number of events that occur when:

o The cluster resource in the Clustered service or application failed;

o The Cluster service failed to bring the Clustered service or application completely online or offline and one or more resources may be in a failed state.

Type of event: Warning and Error. Event ID: 1039, 1205.

Check and correct any problems with the application or service associated with the resource.

Check and correct any problems with cables or cluster-related devices.

Adjust the properties for the resource in the cluster configuration, especially the value for the Pending Timeout for the resource. This value must allow enough time for the associated application or service to start.

Check the state of all resources in the clustered service or application.

Quorum was Lost

This monitor returns the number of events that occur when the Cluster service is shutting down because quorum was lost.

Type of event: Error. Event ID: 1177.

This can occur when network connectivity is lost between some or all of the nodes in the cluster, or the witness disk fails over. It can also occur if you make a change in the cluster configuration such as increasing the number of nodes, when the number of nodes currently online is too few to achieve quorum in the new configuration. Run the Validate a Configuration Wizard, selecting only the network tests. Also check network devices (adapters, cables, hubs, switches, etc.) and quorum configuration.
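The effect of adding nodes on quorum can be illustrated with a simplified vote count. This sketch assumes a plain majority model in which every node has one vote and an optional witness adds one more. The function has_quorum is hypothetical, and the sketch does not cover the "No Majority: Disk Only" configuration or any other special cases.

```python
def has_quorum(total_nodes, nodes_online, witness_configured=False, witness_online=False):
    """Return True if the votes present form a majority (simplified model).

    Each node counts as one vote; a configured witness adds one vote to the
    total and, if it is reachable, one vote to the votes present.
    """
    total_votes = total_nodes + (1 if witness_configured else 0)
    votes_present = nodes_online + (1 if witness_configured and witness_online else 0)
    votes_needed = total_votes // 2 + 1
    return votes_present >= votes_needed


# Two of three nodes online: majority is reached.
print(has_quorum(total_nodes=3, nodes_online=2))
# Same two nodes online after the cluster is grown to five nodes: majority is lost.
print(has_quorum(total_nodes=5, nodes_online=2))
# Four nodes with a witness, two nodes and the witness online: three of five votes.
print(has_quorum(total_nodes=4, nodes_online=2, witness_configured=True, witness_online=True))
```

The second call shows the situation described above: the number of nodes online has not changed, but the larger configuration requires more votes than are present.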

Registry Checkpoint Could not be Restored to Registry Key

This monitor returns the number of events that occur when the Registry Checkpoint for Cluster resource could not be restored to a registry key.

Type of event: Error. Event ID: 1024.

System is not being Responsive

This monitor returns the number of events that occur when the Failover cluster virtual adapter has lost contact with the process.

Type of event: Error. Event ID: 4869, 4870.

Use Resource Monitor to determine, in real time, how many system resources a service or application is utilizing. This may take several minutes if the system is critically low on resources.
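When reviewing the System event log by hand, it can help to look up which monitor an Event ID belongs to. The table below is collected from the Event IDs listed for each monitor in this document. The helper names build_index and monitors_for are hypothetical, and the sketch is an illustration, not part of the template.

```python
from collections import defaultdict

# Event IDs per monitor, as listed in this document.
MONITOR_EVENT_IDS = {
    "Backup and Restore Functionality Problems": [1541, 1542, 1543],
    "Cluster Network Connectivity Problems": [1127, 1129, 1130, 1360, 1555],
    "Cluster Service Startup Problems": [1000, 1006, 1073, 1146, 1230, 1556, 1561, 1178],
    "Cluster Shared Volume Functionality Problems": [5120, 5121, 5123, 5134, 5135, 5142, 5200],
    "Cluster Storage Functionality Problems": [1034, 1035, 1037, 1066, 1208],
    "Cluster Witness Problems": [1557, 1558, 1562, 1563, 1564, 1573],
    "Configuration Availability Problems": [1057, 1090, 1574, 1575, 1593],
    "DFS Namespace Resource Availability Problems": [1138, 1141, 1142],
    "Encrypted Settings for Cluster Resource Could not be Applied": [1121],
    "Failed to Form Cluster": [1092, 1009],
    "File Share Resource Availability Problems": [1053, 1054, 1055, 1068, 1560, 1585, 1586, 1587, 1588],
    "Generic Application Could not be Brought Online": [1039],
    "Generic Service Resource Availability Problems": [1040, 1041, 1042],
    "IP Address Resource Availability Problems": [1046, 1047, 1048, 1049, 1078, 1242, 1361, 1363],
    "Network Connectivity and Configuration Problems": [1289, 1553, 1554, 4871],
    "Node Failed to Join Cluster": [1070],
    "Problems with Cluster Service": [1039, 1205],
    "Quorum was Lost": [1177],
    "Registry Checkpoint Could not be Restored to Registry Key": [1024],
    "System is not being Responsive": [4869, 4870],
}


def build_index(monitors):
    """Invert the table so each Event ID maps to the monitors that count it."""
    index = defaultdict(list)
    for monitor, event_ids in monitors.items():
        for event_id in event_ids:
            index[event_id].append(monitor)
    return dict(index)


EVENT_INDEX = build_index(MONITOR_EVENT_IDS)


def monitors_for(event_id):
    """Return the monitors that count the given Event ID (empty list if none)."""
    return EVENT_INDEX.get(event_id, [])


print(monitors_for(1177))
print(monitors_for(1039))
```

Note that Event ID 1039 is listed for two monitors in this document, so the lookup returns both.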

Friday, November 23, 2012
