volwatch(8)
NAME
volwatch - Monitors the Logical Storage Manager (LSM) for failure
events and performs hot sparing
SYNOPSIS
/usr/sbin/volwatch [-m] [-s] [-o] [mail-addresses...]
OPTIONS
-m  Runs volwatch with mail notification support to notify root (by
    default) or other specified users when a failure occurs. This
    option is enabled by default.
-s  Runs volwatch with hot-spare support.
-o  Specifies an argument to pass directly to volrecover if it is
    running and hot-spare support is enabled.
DESCRIPTION
The volwatch command monitors LSM, waiting for exception events to
occur. When an exception event occurs, the volwatch command uses
mailx(1) to send mail to:
* The root account.
* The user accounts specified when you use the rcmgr command to set the
  VOLWATCH_USERS variable in the /etc/rc.config.common file.
* The user account that you specify on the command line with the
  volwatch command.
The volwatch command uses the volnotify command to wait for events to
occur. When an event occurs, there is a 15-second delay before the
failure is analyzed and the message is sent. This delay allows a group
of related events to be collected and reported in a single mail
message. By default, the volwatch command automatically starts when the
system boots.
You can enter the volwatch -s command to start volwatch with hot-spare
support. Hot-spare support:
* Detects LSM events resulting from the failure of a disk, plex, or
  RAID5 subdisk.
* Sends mail to the root account (and other specified accounts) with
  notification about the failure and identifies the affected LSM
  objects.
* Determines which subdisks to relocate, finds space for those subdisks
  in the disk group, relocates the subdisks, and notifies the root
  account (and other specified accounts) of these actions and their
  success or failure.
  When a partial disk failure occurs (that is, a failure affecting only
  some subdisks on a disk), redundant data on the failed portion of the
  disk is relocated, and the existing volumes composed of the
  unaffected portions of the disk remain accessible.
Note
Hot-sparing is only performed for redundant (mirrored or RAID5) sub‐
disks on a failed disk. Non-redundant subdisks on a failed disk are not
relocated, but you are notified of the failure.
Only one volwatch daemon can be running on a system or cluster node at
any time.
Hot-sparing does not guarantee the same layout of data or the same per‐
formance after relocation. You may want to make some configuration
changes after hot-sparing occurs.
Mail Notification Support
The following is a sample mail notification when a failure is detected:
    Failures have been detected by the Logical Storage Manager:
    failed disks:
     medianame
     ...
    failed plexes:
     plexname
     ...
    failed log plexes:
     plexname
     ...
    failing disks:
     medianame
     ...
    failed subdisks:
     subdiskname
     ...
    The Logical Storage Manager will attempt to find spare disks, relocate
    failed subdisks and then recover the data in the failed plexes.
The following describes the sections of the mail message:
* The medianame list under failed disks specifies disks that appear to
  have completely failed.
* The medianame list under failing disks indicates a partial disk
  failure or a disk that is in the process of failing. When a disk has
  failed completely, the same medianame list appears under both failed
  disks and failing disks.
* The plexname list under failed plexes shows plexes that have been
  detached due to I/O failures experienced while attempting to do I/O
  to the subdisks they contain.
* The plexname list under failed log plexes indicates RAID5 or dirty
  region log (DRL) plexes that have experienced failures.
* The subdiskname list under failed subdisks specifies subdisks in
  RAID5 volumes that have been detached due to I/O errors.
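The section layout described above lends itself to simple scripted
processing of a saved notification. The following is a minimal sketch,
not part of volwatch or LSM. It assumes that each section heading is on
its own line and ends with a colon, and that each name is a single word
on its own line. The function name list_section and the file path in
the usage line are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not part of LSM: print the names listed under one
# section heading (for example "failed disks") of a saved notification.
# Assumes one heading per line ending in ":" and one name per line.
# Usage: list_section HEADING FILE
list_section() {
    awk -v want="$1" '
        { line = $0; gsub(/^[ \t]+|[ \t]+$/, "", line) }  # trim whitespace
        line ~ /:$/ { in_section = (line == (want ":")); next }
        in_section && NF == 1 { print line; next }
        { in_section = 0 }  # a blank line or prose ends the section
    ' "$2"
}
```

For example, list_section "failed disks" /tmp/volwatch.mail would print
one disk media name per line, if the saved message follows the assumed
layout.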
Enabling Hot-Sparing
By default, hot-sparing is disabled. To enable hot-sparing, enter the
volwatch command with the -s option, for example:
    # volwatch -s
To use hot-spare support, you should configure a disk as a spare, which
identifies the disk as an available site for relocating failed sub‐
disks. Disks that are identified as spares are not used for normal
allocations unless you explicitly specify otherwise. This ensures that
there is a pool of spare disk space available for relocating failed
subdisks and that this disk space is not consumed by normal operations.
Spare disk space is the first space used to relocate failed subdisks.
However, if no spare disk space is available or if the available spare
disk space is not suitable or sufficient, free disk space is used.
You must initialize a spare disk and place it in a disk group as a
spare before it can be used for replacement purposes. If no disks are
designated as spares when a failure occurs, LSM automatically uses any
available free disk space in the disk group in which the failure
occurs. If there is not enough spare disk space, a combination of spare
disk space and free disk space is used.
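The order of preference described above (spare space first, then free
space, or a combination) can be modeled with simple arithmetic. The
following is a simplified sketch only, not LSM code: it treats space as
a single block count per pool and ignores whether a given disk is
suitable for the relocation. The function name plan_space and its
arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical model, not LSM code: decide how many blocks to take from
# spare space and how many from free space for one relocation.
# Usage: plan_space NEEDED SPARE_AVAILABLE FREE_AVAILABLE
plan_space() {
    need=$1 spare=$2 free=$3
    if [ "$need" -gt $((spare + free)) ]; then
        echo "insufficient"      # not enough space in the disk group
        return 1
    fi
    if [ "$need" -le "$spare" ]; then
        from_spare=$need         # spare space alone is enough
    else
        from_spare=$spare        # use all spare space first
    fi
    # Whatever spare space cannot cover comes from free space.
    echo "spare=$from_spare free=$((need - from_spare))"
}
```

For example, plan_space 100 40 500 reports that 40 blocks would come
from spare space and 60 from free space.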
When hot-sparing selects a disk for relocation, it preserves the redun‐
dancy characteristics of the LSM object to which the relocated subdisk
belongs. For example, hot-sparing ensures that subdisks from a failed
plex are not relocated to a disk containing a mirror of the failed
plex. If redundancy cannot be preserved using available spare disks
and/or free disk space, hot-sparing does not take place. If relocation
is not possible, mail is sent indicating that no action was taken.
When hot-sparing takes place, the failed subdisk is removed from the
configuration database and LSM takes precautions to ensure that the
disk space used by the failed subdisk is not recycled as free disk
space.
Initializing and Removing Hot-Spare Disks
Although hot-sparing does not require you to designate disks as spares,
HP recommends that you initialize at least one disk as a spare within
each disk group; this gives you control over which disks are used for
relocation. If no spare disks exist, LSM uses available free disk space
within the disk group. When free disk space is used for relocation,
performance is likely to degrade after the relocation.
Follow these guidelines when choosing a disk to configure as a spare:
* The hot-spare feature works best if you specify at least one spare
  disk in each disk group containing mirrored or RAID5 volumes.
* If a given disk group spans multiple controllers and has more than
  one spare disk, set up the spare disks on different controllers (in
  case one of the controllers fails).
* For a mirrored volume, the disk group must have at least one disk
  that does not already contain one of the volume's mirrors. This disk
  should either be a spare disk with some available space or a regular
  disk with some free space.
* For a mirrored and striped volume, the disk group must have at least
  one disk that does not already contain one of the volume's mirrors or
  another subdisk in the striped plex. This disk should either be a
  spare disk with some available space or a regular disk with some free
  space.
* For a RAID5 volume, the disk group must have at least one disk that
  does not already contain the volume's RAID5 plex or one of its log
  plexes. This disk should either be a spare disk with some available
  space or a regular disk with some free space.
* If a mirrored volume has a DRL log subdisk as part of its data plex
  (for example, volprint does not list the plex length as LOGONLY),
  that plex cannot be relocated. Therefore, place log subdisks in
  plexes that contain no data (log plexes). By default, the volassist
  command creates log plexes.
* For mirroring the root disk, the rootdg disk group should contain an
  empty spare disk that satisfies the restrictions for mirroring the
  root disk.
* Although it is possible to build LSM objects on spare disks, it is
  preferable to use spare disks for hot-sparing only.
* When relocating subdisks off a failed disk, LSM attempts to use a
  spare disk large enough to hold all data from the failed disk.
To initialize a disk that has no associated subdisks as a spare, use
the voldiskadd command and enter y at the following prompt:
    Add disk as a spare disk for newdg? [y,n,q,?] (default: n) y
To initialize an existing LSM disk as a spare disk, enter:
    # voledit set spare=on medianame
For example, to initialize a disk called test03 as a spare disk, enter:
    # voledit set spare=on test03
To remove a disk as a spare, enter:
    # voledit set spare=off medianame
For example, to make a disk called test03 available for normal use,
enter:
    # voledit set spare=off test03
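When several disks need the same setting, the voledit command shown
above can be generated once per disk. The following is a minimal
dry-run sketch: it only prints the commands so that they can be
reviewed before being run. The function name spare_cmds is
hypothetical, and the sketch uses only the voledit syntax shown in this
reference page.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print (do not execute) the voledit
# command for each disk media name. The first argument is on or off.
# Usage: spare_cmds on|off medianame...
spare_cmds() {
    case ${1:-} in
        on|off) state=$1; shift ;;
        *) echo "usage: spare_cmds on|off medianame..." >&2; return 2 ;;
    esac
    for disk in "$@"; do
        echo "voledit set spare=$state $disk"
    done
}
```

For example, spare_cmds on test03 test04 prints one voledit set
spare=on line for each of the two disks.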
Replacement Procedure
In the event of a disk failure, mail is sent, and if volwatch was con‐
figured to run with hot sparing support with the -s option, volwatch
attempts to relocate any subdisks that appear to have failed. This
involves finding appropriate spare disk or free disk space in the same
disk group as the failed subdisk.
To determine which disk from among the eligible spare disks to use,
volwatch tries to use the disk that is closest to the failed disk. The
value of closeness depends on the controller, target, and disk number
of the failed disk. For example, a disk on the same controller as the
failed disk is closer than a disk on a different controller; a disk
under the same target as the failed disk is closer than one under a
different target.
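The closeness rule can be illustrated with a small ranking sketch. This
is not the algorithm volwatch implements, which is not documented here.
The scoring is an assumption: 2 for a disk on the same controller and
target as the failed disk, 1 for the same controller only, 0 otherwise.
The disk number is ignored, and ties go to the first candidate listed.
The function name pick_closest and the input format (one candidate per
line as "name controller target") are hypothetical.

```shell
#!/bin/sh
# Hypothetical illustration of closeness ranking, not volwatch code.
# Reads candidate spares on standard input, one per line, in the form
#     name controller target
# and prints the name of the closest candidate.
# Usage: pick_closest FAILED_CONTROLLER FAILED_TARGET
pick_closest() {
    awk -v fc="$1" -v ft="$2" '
        {
            score = 0
            if ($2 == fc) {
                score = 1                   # same controller
                if ($3 == ft) score = 2     # same controller and target
            }
            # Keep the first candidate with the highest score.
            if (NR == 1 || score > best) { best = score; name = $1 }
        }
        END { if (name != "") print name }
    '
}
```

For example, with the failed disk on controller 0, target 2, a
candidate on controller 0, target 2 is chosen over one on controller 0,
target 3, which in turn is chosen over one on controller 1.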
If no spare or free disk space is found, the following mail message is
sent explaining the disposition of volumes on the failed disk:
    Relocation was not successful for subdisks on disk dm_name in
    volume v_name in disk group dg_name. No replacement was made and
    the disk is still unusable.
    The following volumes have storage on medianame:
    volumename ...
    These volumes are still usable, but the redundancy of those
    volumes is reduced. Any RAID-5 volumes with storage on the failed
    disk may become unusable in the face of further failures.
If non-RAID5 volumes are made unusable due to the failure of the disk,
the following is included in the mail message:
    The following volumes:
    volumename ...
    have data on medianame but have no other usable mirrors on other
    disks. These volumes are now unusable and the data on them is
    unavailable. These volumes must have their data restored.
If RAID5 volumes are made unavailable due to the disk failure, the
following is included in the mail message:
    The following RAID-5 volumes:
    volumename ...
    have storage on medianame and have experienced other failures.
    These RAID-5 volumes are now unusable and data on them is
    unavailable. These RAID-5 volumes must have their data restored.
If spare disk space is found, LSM attempts to set up a subdisk on the
spare disk and use it to replace the failed subdisk. If this is
successful, the volrecover command runs in the background to recover
the data in volumes on the failed disk.
If the relocation fails, the following mail message is sent:
    Relocation was not successful for subdisks on disk dm_name in
    volume v_name in disk group dg_name. No replacement was made and
    the disk is still unusable.
    error message
If the relocation fails after the plexes associated with the failing
disk were detached, the following mail message is sent:
    The following plex(s) in volume v_name in diskgroup dg_name were
    targetted for replacement but the relocation did not succeed due
    to not enough spare space. The plex(s) will be left in the
    DETACHED state.
    detachedpl_name
    If disk dm_name is not faulty, re-attach plex(s) using the volplex
    att command. Otherwise add additional spare space and re-invoke
    the volwatch -s command to replace faulty plex(s).
If any volumes (RAID5 or otherwise) are rendered unusable due to the
failure, the following is included in the mail message:
    The following volumes:
    volumename ...
    have data on dm_name but have no other usable mirrors on other
    disks. These volumes are now unusable and the data on them is
    unavailable. These volumes must have their data restored.
If the relocation procedure completes successfully and recovery is
under way, the following mail message is sent:
    Volume v_name Subdisk sd_name relocated to newsd_name, but not yet
    recovered.
Once recovery has completed, a message is sent relaying the outcome of
the recovery procedure. If the recovery was successful, the following
is included in the mail message:
    Recovery complete for volume v_name in disk group dg_name.
If the recovery was not successful, the following is included in the
mail message:
    Failure recovering v_name in disk group dg_name.
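The three status messages above have fixed wording, so a script that
reads saved notifications could summarize the outcome for each volume.
The following is a minimal sketch, not part of LSM. It assumes that
each message arrives unwrapped on a single line that starts in the
first column, which may not hold for real mail. The function name
classify_outcome and the labels it prints (pending, recovered, failed)
are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not part of LSM: read notification text on
# standard input and print one "status volume" line per status message.
# Assumes each message is unwrapped and starts at the beginning of a line.
classify_outcome() {
    awk '
        /^Recovery complete for volume / { print "recovered", $5; next }
        /^Failure recovering /           { print "failed", $3; next }
        /^Volume .* relocated to .* but not yet recovered/ {
            print "pending", $2; next
        }
    '
}
```

For example, a line reading "Recovery complete for volume vol01 in disk
group rootdg." is summarized as "recovered vol01".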
SEE ALSO
mailx(1), rcmgr(8), voldiskadm(8), voledit(8), volintro(8),
volrecover(8), volrootmir(8)