Notes and synthesis from meeting on communicating for maintenance and outages
September 26, 2006
Present: Tim Rhodes, William Dougherty, Ron Keller, Marshall Fisher, Christine Morrison, Joyce Landreth, Laurie Zirkel, Mark Hoppe, Mary Dunker, Jeff Bevis, Wanda Baber, Lylah Shelor, Jeff Kidd, Susan Brooker-Gross
Communications strategies need to improve in three areas: scheduling maintenance, notifying staff of planned maintenance, and dealing with problems that result from maintenance or from other sources.
Several projects would seem to be converging: the post-mortem of the recent power outage; a proposal within NI&S to improve communications; the report from the communications committee of the Security Task Force; and the work within IT management for change management. The group primarily discussed the first two topics.
The scope focused on the Data Center, acknowledging that other systems and applications have parallel issues. Three situations for improving the communications are collaboration in scheduling, communications about the scheduled event, and communications during unplanned outages or substandard performance.
Improved mapping of interdependencies will help us ensure that appropriate communications occur.
The remainder of this document is a first attempt to synthesize the discussion and the proposal for "IT System Event Publicity." Please critique!
Action items are embedded in the document. In addition:
Action item: Mike Naff and Jeff Bevis (and anyone else with relevant information) will provide information on the ways that planned maintenance is currently communicated.
This section deals with planned maintenance and other outages that can potentially affect any or all activities in the Data Center. Issues include, but are not limited to, maintenance and accidents with fire suppression systems; maintenance and outages involving power and power backups; network connectivity; building safety and security; and commonly used systems such as monitoring systems and backup systems.
Departments housing any systems or services located in the Data Center (also known as the machine room) must agree to maintain an up-to-date list of contact information in case of system issues. Failure to provide contact information will hamper or preclude the ability to notify principals of issues within their systems. Prolonged outages may result, either from failure to restart systems or from systems being taken off-line to protect other resources.
Action item: Joyce will review the current sources for contact information---Big Brother for appropriate resources, and other contact systems or lists for other resources.
Question: Do we have the tools in place for contact individuals to maintain the information?
Question: Should there be a management-level review of contact information?
Consultation with critical interdependencies is essential. At a minimum, Data Center issues include:
Question: What is the right list? Should the flow of communications be from managers through the chain of command and/or from system administrators to applications administrators? Should applications users (some? All?) be notified? By whom?
Question: Even without a Marketing and Communications Department for IT, there can be a designated person to assist with communication needs.
Not all interdependencies may be readily obvious. Therefore, once a planned maintenance event is scheduled, notification should be shared widely.
Assisting in the notification is the "Editor" role (from "IT System Event Publicity Proposal"). Even without a marketing and/or communications 'department' for Information Technology, a communications contact can be established to help with the notification texts.
All Data Center planned maintenance must be posted on the official calendar. The official calendar must support a subscription by interested university affiliates. The official calendar ideally supports the ability to push messages to selected affiliates.
Question: What is the official calendar? Critical dates? DBMS calendar? Wiki-based calendar?
Action item: Susan and Mary will research appropriate maintenance and options for the Critical Dates calendar. Bill Dooley will convey information on the DBMS calendar. William and Jeff Bevis will explore the wiki options.
All Data Center planned maintenance must be communicated via e-mail to
Timing. "Pushed" notices should be sent via e-mail and posted on established, advertised web outlets--
Content: The content of notification e-mail should convey some sense of risk, particularly to the wider communities (VT-DNET listserv and Techsupport listserv). Typically, conveying the risk means reassuring the recipients that minimal disruption is anticipated, with little impact on users.
The content should also include reminders of where to look for status updates:
Question: I was asked not to include www.4help.vt.edu in this list for the printed publication---do we want it here or not?
Question: Should there be a special 'status' phone line similar to the inclement weather university phone line?
Action item: Joyce to provide more information on previous discussions regarding a status telephone line.
If the planned event included system downtime, notification should be sent at the end of the maintenance to:
The procedures are also to be used in the case of sub-standard performance of systems.
Within the first 15 minutes, anyone noticing the outage or subpar performance should report it to the VTOC.
If the best estimate is that resolution of the outage may take 30 minutes or longer, the following contacts should be made (in addition, of course, to the personnel directly required for resolution).
VTOC personnel will initiate the notification process, notifying the individuals below.
Question: the VTOC procedure still needs to be clarified, including lines of responsibility.
Question: Will there be situations in which added staff will be needed to conduct the notification?
1. THE ON-CALL EXECUTIVE
If there is any ambiguity as to whether informing the entire university community is necessary, then the executive who is on call will help decide if there is a need to inform higher university officials, University Relations, or the entire university community, and whether to use alternate channels of communication (PhoneMail, Cable TV, mass e-mail if available, other means identified). The executive who is on call will assist in decisions as to whether to create alternative communication devices such as redirected websites.
Question: A plan to establish an on-call executive requires the Vice President and the IT Line Managers to also have a means to maintain 24x7 contact with a delegated individual. The duties of this individual overlap to some extent with the "Publisher" role in the "IT System Event Publicity Proposal."
2. THE CRITICAL INTERDEPENDENCIES LIST (ABOVE) AND THE ON-CALL COMMUNICATIONS CONTACT.
Question: Is Big Brother sufficient to keep contact information up-to-date? Are pager and/or cell-phone numbers included? Is there a reasonable system for on-call contacts that may not be "naturals" for Big Brother (for example, the communications contact)?
Question: Can the IT Line Managers come off the list for this purpose, replaced by the executive on call instead?
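The 30-minute rule above could be captured in a few lines for anyone building a checklist or tool around it. This is only a sketch: the function and constant names are hypothetical, and the contact labels are taken from the two numbered items above rather than from any existing VTOC procedure.

```python
# Hypothetical sketch of the escalation rule described above.
# The threshold comes from the text ("30 minutes or longer").
ESCALATION_THRESHOLD_MINUTES = 30

# Labels taken from the two numbered items above; not an official list.
ESCALATION_CONTACTS = [
    "on-call executive",
    "critical interdependencies list",
    "on-call communications contact",
]


def contacts_to_notify(estimated_resolution_minutes):
    """Return the contacts the VTOC would notify, given the best estimate
    of time to resolution in minutes."""
    if estimated_resolution_minutes >= ESCALATION_THRESHOLD_MINUTES:
        return list(ESCALATION_CONTACTS)
    # Shorter outages: only the personnel directly required for resolution.
    return []
```

For example, `contacts_to_notify(45)` returns all three contacts, while `contacts_to_notify(10)` returns an empty list.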
Question: Can we create a first-notice template that anyone could use? Something to the effect of "__system, or preferably, application(s)__ is/are experiencing difficulties. Work is underway to correct the problem. Date and time." If the problem is resolved in the first 30 minutes, another posting to the effect of "The difficulties with ____ system, or application(s)____ have been resolved. Date and time."
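As one possible answer to the template question, the two notices could be generated from a pair of small functions so that the wording stays uniform no matter who posts them. This is a sketch under assumptions: the function names, the `plural` flag, and the date format are all hypothetical choices, not part of the proposal.

```python
from datetime import datetime

# Assumed timestamp format; the notes only say "Date and time."
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M"


def first_notice(affected, when, plural=False):
    """Build the first notice for an outage or substandard performance.

    affected: name of the system or, preferably, the application(s).
    plural:   True when several applications are named (selects "are").
    """
    verb = "are" if plural else "is"
    return (
        f"{affected} {verb} experiencing difficulties. "
        "Work is underway to correct the problem. "
        f"{when.strftime(TIMESTAMP_FORMAT)}"
    )


def resolved_notice(affected, when):
    """Build the follow-up notice once the problem is resolved."""
    return (
        f"The difficulties with {affected} have been resolved. "
        f"{when.strftime(TIMESTAMP_FORMAT)}"
    )


if __name__ == "__main__":
    print(first_notice("Banner", datetime(2006, 9, 26, 14, 5)))
    print(resolved_notice("Banner", datetime(2006, 9, 26, 14, 30)))
```

Keeping the wording in one place means a change to the agreed text needs to be made only once.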
If the outage or substandard performance is unresolved, complete any notifications that have not yet been accomplished under the "Immediate" step.
A status message MUST be posted on the designated websites (above) if operable and/or at the designated telephone number, or by alternate means identified by the executive who is on call. The communications contact will assist in constructing and posting the message. It MUST contain:
The communications contact, in collaboration with an appointed contact in the resolution effort, MUST post a status message on the designated websites (above) if operable and/or at the designated telephone number, or by alternate means identified by the executive who is on call. This and each subsequent status message must contain:
The communications contact MUST post a new status message, even if the only update is "the resolution effort continues."
Any time a partial resolution (or degradation of service) that could impact users occurs, the communications contact must post a status message.
When the outage or serious substandard performance is resolved, the communications contact MUST post a new status message noting the resolution in the same locations/formats as prior messages.
In addition, information about the outage in some detail must be communicated to the critical interdependencies list.
Information must also be communicated to the VT-DNET listserv and the Techsupport listserv, as appropriate to the problem and its resolution.
The resolution should also be posted to the official calendar.
The communications contact will check the status messages that have been posted to be sure all continue to be appropriate. Most will need to be replaced or updated by a "systems normal" message.
Each system, application, and/or location should develop a communication plan modeled after the Data Center plan.
Question: What more needs to be said?
Action item: Repeating from page 1---please send models that are currently in existence for Banner, network, other systems and applications.