Microsoft Windows Server 2008
Failover Cluster
This template assesses the status and overall
performance of a Microsoft Windows 2008 Failover Cluster by retrieving
information from performance counters and the Windows System Event Log. For
more information, refer to the following Microsoft article: http://technet.microsoft.com/en-us/library/cc720058%28WS.10%29.aspx.
Prerequisites: WMI access to the target
server.
Credentials: Windows Administrator on
the target server.
Note: All Windows Event Log
monitors should return zero values. Returned values other than zero indicate
an abnormality. Examining the Windows system log files should provide
information pertaining to the issue. Detailed information about these events
can be found here: http://technet.microsoft.com/en-us/library/dd353290(WS.10).aspx.
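The zero-baseline rule in the note above can be sketched in Python. This is only an illustration, not the template's implementation: the count_matching_events helper and the record layout (dicts with event_id and level keys) are hypothetical, and the sample event IDs are the ones listed for the Backup and Restore Functionality Problems monitor.

```python
def count_matching_events(events, event_ids, levels=("Error",)):
    """Count log records whose ID and level match a monitor's filter."""
    wanted_ids = set(event_ids)
    wanted_levels = set(levels)
    return sum(
        1
        for event in events
        if event["event_id"] in wanted_ids and event["level"] in wanted_levels
    )


# Hypothetical records, as if already read from the System event log.
sample_log = [
    {"event_id": 1541, "level": "Error"},
    {"event_id": 7036, "level": "Information"},
    {"event_id": 1543, "level": "Error"},
]

backup_restore_count = count_matching_events(sample_log, [1541, 1542, 1543])
# Any non-zero count is treated as an abnormality worth investigating.
status = "OK" if backup_restore_count == 0 else "Investigate"
print(backup_restore_count, status)
```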
Monitored
Components
Note: You need to set
thresholds for counters according to your environment. It is recommended to
monitor counters for some period of time to understand potential value ranges
and then set the thresholds accordingly.
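One possible way to turn an observation period into thresholds is sketched below. The rule used here (mean plus a multiple of the standard deviation) is an assumed heuristic, not something the template prescribes, and the suggest_thresholds helper, its parameters, and the sample values are hypothetical.

```python
import statistics


def suggest_thresholds(samples, warning_sigmas=2.0, critical_sigmas=3.0):
    """Derive warning/critical thresholds from baseline counter samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate spread")
    mean = statistics.mean(samples)
    spread = statistics.stdev(samples)
    return {
        "warning": mean + warning_sigmas * spread,
        "critical": mean + critical_sigmas * spread,
    }


# Hypothetical counter values collected during a normal period.
baseline = [0, 1, 0, 2, 1, 0, 1, 3, 0, 2]
thresholds = suggest_thresholds(baseline)
print(thresholds)
```

Review the suggested values against what you know about the environment before applying them; a short or unrepresentative baseline will produce misleading thresholds.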
Service: Windows Time
This monitor returns the CPU and memory usage
of the Windows Time service. This service maintains date and time
synchronization on all clients and servers in the network. If this service is
stopped, date and time synchronization will be unavailable. If this service is
disabled, any services that explicitly depend on it will fail to start.
Service: Cluster Service
This monitor returns the CPU and memory usage
of the Cluster service. This service enables servers to work together as a
cluster to keep server-based applications highly available, regardless of
individual component failures. If this service is stopped, clustering will be
unavailable. If this service is disabled, any services that explicitly depend
on it will fail to start.
Network Reconnections: Reconnect Count
This monitor returns the number of times the
nodes have reconnected.
Note: The instance field is
installation-specific. You need to specify the hostname of your cluster node
(for example: node1). By default, this component monitor is disabled and should
only be enabled for troubleshooting purposes.
Network Reconnections: Normal Message Queue
Length
This monitor returns the number of normal
messages that are in the queue waiting to be sent. Normally this number is 0,
but if the TCP connection breaks, you might observe it going up until the TCP
connection is re-established, thereby allowing all messages to be sent.
Note: The instance field is installation-specific.
You need to specify the hostname of your cluster node (for example: node1). By
default, this component monitor is disabled and should only be enabled for
troubleshooting purposes.
Network Reconnections: Urgent Message Queue
Length
This monitor returns the number of urgent
messages that are in the queue waiting to be sent. Normally this number is 0,
but if the TCP connection breaks, you might observe it going up until the TCP
connection is re-established, thereby allowing all messages to be sent.
Note: The instance field is installation-specific.
You need to specify the hostname of your cluster node (for example: node1). By
default, this component monitor is disabled and should only be enabled for
troubleshooting purposes.
Messages Outstanding
This monitor returns the number of cluster
MRR outstanding messages. The returned value should be near zero.
Resource Control Manager: Groups Online
This monitor returns the number of online
cluster resource groups on this node. The returned value should be above zero
at all times.
Resource Control Manager: RHS Processes
This monitor returns the number of running
resource host subsystem processes (rhs.exe). The returned value should be above
zero at all times.
Resource Control Manager: RHS Restarts
This monitor returns the number of resource
host subsystem process (rhs.exe) restarts.
Note: By default, this component
monitor is disabled and should only be enabled for troubleshooting purposes.
Resources: Resource Failure
This monitor returns the number of resource
failures. The returned value should be as low as possible.
Resources: Resource Failure Access Violation
This monitor returns the number of resource
failures caused by access violation. The returned value should be as low as
possible.
Note: By default, this component
monitor is disabled and should only be enabled for troubleshooting purposes.
Resources: Resource Failure Deadlock
This monitor returns the number of resource
failures caused by deadlock. Deadlocks are usually caused by the resource
taking too long to execute certain operations. The returned value should be as
low as possible.
Note: By default, this component
monitor is disabled and should only be enabled for troubleshooting purposes.
Backup and Restore Functionality Problems
This monitor returns the number of events
that occur when:
o The backup operation for the cluster configuration data
has been aborted because quorum for the cluster has not yet been achieved;
o The restore request for the cluster configuration data
has failed during the "pre-restore" or "post-restore" stage.
Type of event: Error. Event ID: 1541, 1542,
1543.
Check for the following pre-conditions to
make sure they have been met, and then retry the backup or restore operation:
o The cluster must achieve quorum. In other words, enough
nodes must be running and communicating (perhaps with a witness disk or witness
file share, depending on the quorum configuration) that the cluster has
achieved a majority, that is, quorum.
o The account used by the person performing the backup must
be in the local Administrators group on each clustered server, and must be a
domain account, or must have been delegated the equivalent authority.
During a restore, the restore software must
obtain exclusive access to the cluster configuration database on a given node.
If other software has access (open handles to the database), the restore cannot
be performed.
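The majority rule behind the quorum pre-condition can be illustrated with a simplified vote count. This sketch assumes one vote per node plus one optional witness vote; it does not model every quorum configuration, and the has_quorum helper and its parameters are hypothetical.

```python
def has_quorum(nodes_online, total_nodes, has_witness=False, witness_online=False):
    """Return True if the votes present form a strict majority of all votes."""
    total_votes = total_nodes + (1 if has_witness else 0)
    votes_present = nodes_online + (1 if has_witness and witness_online else 0)
    # Strict majority: more than half of all configured votes must be present.
    return votes_present * 2 > total_votes


# Three nodes, no witness: two running nodes are a majority of three votes.
print(has_quorum(nodes_online=2, total_nodes=3))
# Four nodes, no witness: two running nodes are only half of four votes.
print(has_quorum(nodes_online=2, total_nodes=4))
# Four nodes plus a witness: two nodes and the witness hold three of five votes.
print(has_quorum(nodes_online=2, total_nodes=4, has_witness=True, witness_online=True))
```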
Cluster Network Connectivity Problems
This monitor returns the number of events
that occur when:
o The Cluster network interface for some cluster node on a
specific network failed;
o The Cluster network is partitioned and some attached
failover cluster nodes cannot communicate with each other over the network;
o The Cluster network is down;
o The Cluster IP address resource failed to come online;
o Attempting to use IPv4 for a specific network adapter
failed.
Type of event: Warning and Error. Event ID:
1127, 1129, 1130, 1360, 1555.
Run the Validate a Configuration Wizard,
selecting only the network tests. Also check network devices (adapters, cables,
hubs, switches, etc.) and quorum configuration.
Compare the properties of the IP Address
resource with the properties of the corresponding network to ensure that the
network and subnet information match. If this is an IPv6 resource, make sure
that the cluster network for this resource has at least one IPv6 prefix that is
not link-local or tunnel.
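The comparison described above can be sketched with Python's standard ipaddress module. The helper names and sample addresses are hypothetical, and the sketch only checks for link-local prefixes; detecting tunnel prefixes is left out.

```python
import ipaddress


def address_matches_network(address, network):
    """True if the resource address falls inside the cluster network's subnet."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(network)


def has_usable_ipv6_prefix(prefixes):
    """True if at least one IPv6 prefix is not link-local (fe80::/10)."""
    for prefix in prefixes:
        net = ipaddress.ip_network(prefix)
        if net.version == 6 and not net.is_link_local:
            return True
    return False


# Hypothetical IP Address resource and cluster network values.
print(address_matches_network("10.0.0.25", "10.0.0.0/24"))
print(address_matches_network("10.0.1.25", "10.0.0.0/24"))
print(has_usable_ipv6_prefix(["fe80::/64"]))
print(has_usable_ipv6_prefix(["fe80::/64", "2001:db8:1::/64"]))
```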
Cluster Service Startup Problems
This monitor returns the number of events
that occur when:
o The Cluster service suffered an unexpected fatal error;
o The Cluster service was halted due to incomplete connectivity
with other cluster nodes;
o The Cluster service was halted to prevent an
inconsistency within the failover cluster;
o The Cluster resource host subsystem (RHS) stopped
unexpectedly;
o The Cluster resource either crashed or deadlocked;
o The Cluster service encountered an unexpected problem and
will be shut down;
o The Cluster service has prevented itself from starting on
this node. (This node does not have the latest copy of cluster configuration
data.)
o The membership engine detected that the arbitration
process for the quorum device has stalled.
Type of event: Error. Event ID: 1000, 1006,
1073, 1146, 1230, 1556, 1561, 1178.
There are various software or hardware
related causes that can prevent the Cluster service from starting on a node.
Sometimes the Cluster service can restart successfully after it has been
interrupted by one of those causes. Review the event logs for indications of
the problem.
Check network hardware and configuration. Use
the Validate a Configuration Wizard to review the network configuration.
Check to see which resource DLL is causing
the issue and report the problem to the resource vendor. Consider configuring
the resource to run in its own Resource Monitor. Note that while a problem with
a resource DLL will not stop the Cluster service from running, it can prevent
other resource DLLs from running unless the resource runs in its own Resource
Monitor.
Try starting the Cluster service on all other
nodes in the cluster. If the Cluster service can be started on a node with the
latest copy of the cluster configuration data, then the node that previously
could not be started will probably be able to obtain the latest copy and then
join the cluster successfully.
Cluster Shared Volume Functionality Problems
This monitor returns the number of events
that occur when:
o The Cluster Shared Volume is no longer available on this
node;
o The Cluster Shared Volume is no longer directly
accessible from this cluster node;
o The Cluster service failed to create the Cluster Shared
Volumes root directory;
o The Cluster service failed to set the permissions (ACL)
on the Cluster Shared Volumes root directory;
o The Cluster Shared Volume is no longer accessible from
this cluster node;
o The Cluster service failed to create a cluster identity
token for Cluster Shared Volumes.
Type of event: Error. Event ID: 5120, 5121,
5123, 5134, 5135, 5142, 5200.
Review events related to communication with
the volume.
Check storage and network configuration.
Check Cluster Shared Volumes folder creation
and permissions.
Check communication between domain
controllers and nodes.
Cluster Storage Functionality Problems
This monitor returns the number of events
that occur when:
o The Cluster Physical Disk resource cannot be brought
online because the associated disk could not be found;
o While the disk resource was being brought online, access
to one or more volumes failed with an error;
o The file system for one or more partitions on the disk
for the resource may be corrupt;
o The Cluster disk resource indicates corruption for a
specific volume;
o The Cluster disk resource contains an invalid mount
point.
Type of event: Error. Event ID: 1034, 1035,
1037, 1066, 1208.
Confirm that the affected disk is available.
Check the underlying storage hardware and
confirm that the device is being presented correctly to the cluster nodes.
If you have problems with partitions on the
disk or corruption, we recommend that you run Chkdsk so that it can correct any
problems with the file system.
Confirm that the mounted disk is configured
according to the following guidelines:
Clustered disks can only be mounted onto
clustered disks (not local disks);
The mounted disk and the disk it is mounted
onto must be part of the same clustered service or application. They cannot be
in two different clustered services or applications, and they cannot be in the
general pool of Available Storage in the cluster.
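The mount point guidelines above can be expressed as a small check. This is a sketch under assumptions: the dict layout (clustered and group keys) and the check_mount_point helper are hypothetical and do not correspond to any cluster API.

```python
def check_mount_point(mounted, host):
    """Return a list of guideline violations for one mount point."""
    problems = []
    # Clustered disks can only be mounted onto clustered disks.
    if not host["clustered"]:
        problems.append("host disk is a local (non-clustered) disk")
    # Neither disk may remain in the general Available Storage pool.
    if "Available Storage" in (mounted["group"], host["group"]):
        problems.append("a disk is still in Available Storage")
    # Both disks must belong to the same clustered service or application.
    elif mounted["group"] != host["group"]:
        problems.append("disks belong to different clustered services or applications")
    return problems


# Hypothetical disk descriptions.
host_disk = {"clustered": True, "group": "FileServer1"}
mounted_disk = {"clustered": True, "group": "FileServer2"}
print(check_mount_point(mounted_disk, host_disk))
```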
Cluster Witness Problems
This monitor returns the number of events
that occur when:
o The Cluster service failed to update the cluster
configuration data on the witness resource due to resource inaccessibility;
o The Cluster service detected a problem with the witness
resource;
o The File Share Witness resource failed a periodic health
check;
o The File Share Witness resource failed to come online;
o The File Share Witness resource failed to arbitrate for
the specific file share;
o The node failed to form a cluster because the witness was
not accessible.
Type of event: Error. Event ID: 1557, 1558,
1562, 1563, 1564, 1573.
Confirm witness accessibility by viewing the
quorum configuration of a failover cluster and the status of a witness disk.
Configuration Availability Problems
This monitor returns the number of events
that occur when:
o The cluster configuration database could not be loaded or
unloaded;
o The Cluster service cannot start due to failed attempts
to read configuration data.
Type of event: Error. Event ID: 1057, 1090,
1574, 1575, 1593.
When the cluster configuration on a node is
missing or corrupt, the Cluster service cannot load the configuration and
therefore cannot start. Where possible, the Cluster service will obtain the
latest cluster configuration from other nodes in the cluster. Ensure that other
nodes are started. If the only node or nodes that can be started appear to have
a missing or corrupt cluster configuration database, you will probably need to
restore one of the nodes from a system state backup. (For a failover cluster
node, the system state backup includes the cluster configuration.) Sometimes
when the node attempts to unload the cluster configuration database, the action
does not fully complete. Try stopping and restarting the Cluster service. If
this does not succeed, restart the operating system on the affected node.
DFS Namespace Resource Availability Problems
This monitor returns the number of events that
occur when:
o The creation of the DFS namespace root failed with an error;
o The resynchronization of the DFS root target failed with
an error;
o The cluster file share resource for DFS Namespace cannot
be brought online due to an error.
Type of event: Error. Event ID: 1138, 1141,
1142.
Check DFS namespace configuration.
Encrypted Settings for Cluster Resource Could
not be Applied
This monitor returns the number of events
when encrypted settings for a cluster resource could not be successfully
applied to the container on this node.
Type of event: Error. Event ID: 1121.
Close any application that might have an open
handle to the registry checkpoint indicated by the event. This will allow the
registry key to be replicated as configured with the resource properties. If
necessary, contact the application vendor about this problem. You can use a
utility called Handle with the -a option to view handles to the registry.
Failed to Form Cluster
This monitor returns the number of Failed to
Form cluster events.
Type of event: Error. Event ID: 1092, 1009.
You might be able to correct this issue by
restarting the Cluster service.
File Share Resource Availability Problems
This monitor returns the number of events
that occur when:
o The Cluster File Share cannot be brought online because a
file share could not be created;
o Retrieving information for a specific share
returned an error code;
o Retrieving information for a specific share
indicated that the share does not exist;
o The creation of a file share failed due to an error;
o The Cluster file share resource has detected shared
folder conflicts;
o The Cluster file server resource failed a health check
because some of its shared folders were inaccessible.
Type of event: Warning and Error. Event ID:
1053, 1054, 1055, 1068, 1560, 1585, 1586, 1587, 1588.
Confirm that the share exists and that the
permissions allow access to the share.
If possible, determine whether the path to
the share has been changed. If so, recreate the share with the correct name.
View all the resources in the clustered file
server instance to ensure that they are coming online, and review the
dependencies among the resources. Reconfigure as necessary to correct any
problems.
Ensure that no two shared folders have the
same share name.
Check shared folder accessibility and the
state of the Server service.
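The duplicate share name check mentioned above can be sketched as follows. The find_duplicate_share_names helper and the sample names are hypothetical, and the sketch assumes share names are compared without regard to case.

```python
from collections import Counter


def find_duplicate_share_names(share_names):
    """Return share names (case-folded) that appear more than once."""
    counts = Counter(name.casefold() for name in share_names)
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical share names gathered from the clustered file server.
shares = ["Data", "Logs", "data", "Public"]
print(find_duplicate_share_names(shares))
```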
Generic Application Could not be Brought
Online
This monitor returns the number of events
that occur when a generic application could not be brought online because the
attempt to create the process failed. Possible causes are that the application
is not present on this node, the path name is incorrect, or the binary name is
incorrect.
Type of event: Error. Event ID: 1039.
Confirm that the following are true for the
application used by the clustered Generic Application instance:
o The application is fully installed on all nodes that are
possible owners of the Generic Application resource;
o The configuration for the Generic Application resource
specifies the correct application and path;
o The configuration for the Generic Application resource
specifies the appropriate parameters and settings for registry replication.
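The first check in the list above (the application is installed on every possible owner) can be sketched like this. The inventory layout, node names, path, and the nodes_missing_application helper are all hypothetical; the paths are compared case-insensitively as an assumption about Windows file systems.

```python
def nodes_missing_application(possible_owners, installed_paths, app_path):
    """Return the possible owner nodes on which app_path is not installed."""
    wanted = app_path.casefold()
    missing = []
    for node in possible_owners:
        node_paths = {path.casefold() for path in installed_paths.get(node, [])}
        if wanted not in node_paths:
            missing.append(node)
    return missing


# Hypothetical inventory: node name -> executables found on that node.
inventory = {
    "node1": [r"C:\Apps\Billing\billing.exe"],
    "node2": [],
}
print(nodes_missing_application(["node1", "node2"], inventory, r"C:\apps\billing\billing.exe"))
```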
Generic Service Resource Availability
Problems
This monitor returns the number of events
that occur when:
o The generic service is either not installed or the
specified service name is invalid;
o The specified generic service parameters might be
invalid;
o The generic service failed with an error.
Type of event: Error. Event ID: 1040, 1041,
1042.
Confirm that the correct service is specified
in the configuration for the Generic Service resource and confirm that the
service is fully installed on all nodes that are possible owners of the
resource.
Check service operation and examine the
application event log.
IP address Resource Availability Problems
This monitor returns the number of events
that occur when:
o The Cluster IP address resource cannot be brought online
because the subnet mask value is invalid;
o The Cluster IP address resource cannot be brought online
because the address value is invalid;
o The configuration data for the network adapter
corresponding to the cluster network interface could not be determined;
o The Cluster IP address resource cannot be brought online
because a duplicate IP address was detected on the network;
o The Cluster IP address resource cannot be brought online
because WINS registration failed;
o The lease of the IP address associated with the cluster
IP address resource has expired or is about to expire, and currently cannot be
renewed;
o The IPv6 Tunnel address resource failed to come online
because it does not depend on an IP Address (IPv4) resource;
o The Cluster network associated with the dependent IP address
(IPv4) resource does not support ISATAP tunneling.
Type of event: Error. Event ID: 1046, 1047,
1048, 1049, 1078, 1242, 1361, 1363.
Check the address, subnet, and network
properties of the IP Address resource.
If the resource is an IPv6 Tunnel address
resource, make sure it depends on at least one IP Address (IPv4) resource. Also
make sure the network supports Intra-Site Automatic Tunnel Addressing Protocol
(ISATAP) tunneling.
If the IP Address resource appears to be
configured correctly, check the condition of network adapters and other network
components used by the cluster.
Network Connectivity and Configuration
Problems
This monitor returns the number of events
that occur when:
o The Cluster service was unable to access the network
adapter or the cluster node has no network connectivity;
o The Cluster node has no network connectivity;
o The Cluster node has lost all network connectivity;
o The failover cluster virtual adapter failed to initialize
the miniport adapter.
Type of event: Error. Event ID: 1289, 1553,
1554, 4871.
Correct any problems with the physical
network adapters and the cluster virtual adapter. If a previous change in the
configuration is interfering with the function of the cluster virtual adapter,
it might be necessary to reinstall the failover clustering feature on the node.
Also, use the Validate a Configuration Wizard to review the network
configuration.
Node Failed to Join Cluster
This monitor returns the number of events
that occur when the node failed to join the failover cluster due to an error.
Type of event: Error. Event ID: 1070.
You might be able to correct this issue by
restarting the Cluster service.
Problems with Cluster Service
This monitor returns the number of events
that occur when:
o The cluster resource in the Clustered service or
application failed;
o The Cluster service failed to bring the Clustered service
or application completely online or offline and one or more resources may be in
a failed state.
Type of event: Warning and Error. Event ID:
1069, 1205.
Check and correct any problems with the
application or service associated with the resource.
Check and correct any problems with cables or
cluster-related devices.
Adjust the properties for the resource in the
cluster configuration, especially the value for the Pending Timeout for the
resource. This value must allow enough time for the associated application or
service to start.
Check the state of all resources in the
clustered service or application.
Quorum was Lost
This monitor returns the number of events
that occur when the Cluster service is shutting down because quorum was lost.
Type of event: Error. Event ID: 1177.
This can occur when network connectivity is
lost between some or all of the nodes in the cluster, or the witness disk fails
over. It can also occur if you make a change in the cluster configuration such
as increasing the number of nodes, when the number of nodes currently online is
too few to achieve quorum in the new configuration. Run the Validate a
Configuration Wizard, selecting only the network tests. Also check network
devices (adapters, cables, hubs, switches, etc.) and quorum configuration.
Registry Checkpoint Could not be Restored to
Registry Key
This monitor returns the number of events
that occur when the Registry Checkpoint for Cluster resource could not be
restored to a registry key.
Type of event: Error. Event ID: 1024.
Close any application that might have an open
handle to the registry checkpoint indicated by the event. This will allow the
registry key to be replicated as configured with the resource properties. If
necessary, contact the application vendor about this problem. You can use a
utility called Handle with the -a option to view handles to the registry.
System is not being Responsive
This monitor returns the number of events
that occur when the Failover cluster virtual adapter has lost contact with the
process.
Type of event: Error. Event ID: 4869, 4870.
Use Resource Monitor to determine, in real
time, how many system resources a service or application is utilizing. This may
take several minutes if the system is critically low on resources.