Notes: Communications for scheduled maintenance (and beyond)

Notes and synthesis from a meeting on communicating about maintenance and outages

September 26, 2006

Present: Tim Rhodes, William Dougherty, Ron Keller, Marshall Fisher, Christine Morrison, Joyce Landreth, Laurie Zirkel, Mark Hoppe, Mary Dunker, Jeff Bevis, Wanda Baber, Lylah Shelor, Jeff Kidd, Susan Brooker-Gross

Communication strategies need to improve in three areas: scheduling maintenance, notifying staff of planned maintenance, and dealing with problems that result from maintenance or from other sources.

Several projects would seem to be converging: the post-mortem of the recent power outage; a proposal within NI&S to improve communications; the report from the communications committee of the Security Task Force; and the work within IT management for change management. The group primarily discussed the first two topics.

The scope focused on the Data Center, acknowledging that other systems and applications have parallel issues. Three situations call for improved communications: collaboration in scheduling, communications about the scheduled event, and communications during unplanned outages or substandard performance.

Improved mapping of interdependencies will help us ensure that appropriate communications occur.

The remainder of this document is a first attempt to synthesize the discussion and the proposal for "IT System Event Publicity." Please critique!

Action items are embedded in the document. In addition---

Action item: Mike Naff and Jeff Bevis (and anyone else with relevant information) will provide information on the ways that planned maintenance is currently communicated.

Communication of maintenance and outages

I. Scope: Data Center

This section deals with planned maintenance and other outages that can potentially affect any or all activities in the Data Center. Issues include--but are not limited to--maintenance and accidents with fire suppression systems, maintenance and outages involving power and power backups, network connectivity, building safety and security, and commonly used systems such as monitoring systems and backup systems.

Agreements regarding contact information

Departments housing any systems or services located in the Data Center (aka the machine room) must agree to maintain an up-to-date list of contact information in case of system issues. Failure to provide contact information will hamper or preclude the ability to notify principals of issues within their systems. Prolonged outages may result, either from failure to restart systems or from systems being taken off-line to protect other resources.

Action item: Joyce will review the current sources for contact information---Big Brother for appropriate resources, and other contact systems or lists for other resources.

Question: Do we have the tools in place for contact individuals to maintain the information?

Question: Should there be a management-level review of contact information?

When maintenance is being planned and scheduled

Consultation with critical interdependencies is essential. At a minimum, the critical interdependencies for Data Center issues include:

  • Network engineering
  • Systems administration
  • Facilities management
  • IT Line Managers
  • IT Support (VTOC)
  • Individuals outside of Information Technology with Data Center resources
  • Primary support for critical enterprise-level applications
    • E-mail
    • Banner
    • Enterprise Directory

Question: What is the right list? Should the flow of communications be from managers through the chain of command and/or from system administrators to applications administrators? Should applications users (some? All?) be notified? By whom?

Question: Even without a Marketing and Communications Department for IT, there can be a designated person to assist with communication needs.

Once maintenance is scheduled (or re-scheduled)

Not all interdependencies may be readily obvious. Therefore, once a planned maintenance event is scheduled, notification should be shared widely.

Assisting in the notification is the "Editor" role (from "IT System Event Publicity Proposal"). Even without a marketing and/or communications 'department' for Information Technology, a communications contact can be established to help with the notification texts.

All Data Center planned maintenance must be posted on the official calendar. The official calendar must support a subscription by interested university affiliates. The official calendar ideally supports the ability to push messages to selected affiliates.

Question: What is the official calendar? Critical dates? DBMS calendar? Wiki-based calendar?

Action item: Susan and Mary will research appropriate maintenance and options for the Critical Dates calendar. Bill Dooley will convey information on the DBMS calendar. William and Jeff Bevis will explore the wiki options.

All Data Center planned maintenance must be communicated via e-mail to

  • The critical interdependencies list (above)
  • VT-DNet listserv
  • Techsupport listserv
  • Possible wider communications depending on the scope and duration of the planned maintenance

Timing. "Pushed" notices should be sent via e-mail and posted on established, advertised web outlets--

  • As soon as the event is scheduled (or re-scheduled)
  • One month in advance, if applicable
  • With a reminder 3 to 5 working days in advance
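The timing rules above can be sketched as a small date calculation. This is only an illustration under stated assumptions: "one month" is taken as 30 calendar days, working days are Monday through Friday with no holiday calendar, and the function names (`subtract_working_days`, `notice_dates`) are hypothetical rather than part of any existing tool.

```python
from datetime import date, timedelta


def subtract_working_days(day: date, n: int) -> date:
    """Step back n working days (Monday-Friday); holidays are ignored."""
    current = day
    remaining = n
    while remaining > 0:
        current -= timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


def notice_dates(scheduled_on: date, event_day: date,
                 reminder_working_days: int = 5) -> list[tuple[str, date]]:
    """Return (label, date) pairs for the pushed notices of one event.

    scheduled_on is the day the event was scheduled or re-scheduled.
    """
    # A notice always goes out as soon as the event is scheduled.
    notices = [("initial notice", scheduled_on)]

    # Assumption: "one month in advance" means 30 calendar days.
    month_ahead = event_day - timedelta(days=30)
    if month_ahead > scheduled_on:  # "if applicable"
        notices.append(("one-month notice", month_ahead))

    # Reminder 3 to 5 working days ahead; the default here is 5.
    reminder = subtract_working_days(event_day, reminder_working_days)
    if reminder > scheduled_on:
        notices.append(("reminder", reminder))
    return notices
```

For an event scheduled at short notice, this sketch sends only the initial notice, since the one-month and reminder dates would already have passed. Whether a same-day reminder is wanted in that case is a policy choice the notes leave open.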

Content: The content of notification e-mail should convey some sense of risk, particularly to the wider communities (VT-DNet listserv and Techsupport listserv). Typically, conveying the risk means reassuring the recipients that minimal disruption is anticipated, with little impact on users.

The content should also include reminders of where to look for status updates:

  • my.vt.edu
  • computing.vt.edu
  • www.cns.vt.edu
  • learn.vt.edu
  • and by telephoning 540/231-HELP---or alternate phone as appropriate for particular projects.

Question: I was asked not to include www.4help.vt.edu in this list for the printed publication---do we want it here or not?

Question: Should there be a special 'status' phone line similar to the inclement weather university phone line?

Action item: Joyce to provide more information on previous discussions regarding a status telephone line.

When the planned maintenance is successfully completed

If the planned event included system downtime, notification should be sent at the end of the maintenance to:

  • The critical interdependencies list (above)
  • VT-DNet listserv
  • Techsupport listserv

In the event of an unplanned outage---whether from a planned event or unplanned event

The procedures are also to be used in the case of sub-standard performance of systems.

Immediately

Within the first 15 minutes, anyone noticing the outage or subpar performance should report it to the VTOC.

If the best estimate is that resolution of the outage may take 30 minutes or longer, the following contacts should be made (in addition, of course, to the personnel directly required for resolution).

VTOC personnel will initiate the notification process, notifying the individuals below.

Question: the VTOC procedure still needs to be clarified, including lines of responsibility.

Question: Will there be situations in which added staff will be needed to conduct the notification?

1. THE ON-CALL EXECUTIVE

If there is any ambiguity as to whether informing the entire university community is necessary, then the executive who is on-call will help decide if there is a need to inform higher university officials, University Relations, or the entire university community, and whether to use alternate channels of communication (PhoneMail, Cable TV, mass-e-mail if available, other means identified). The executive who is on call will assist in decisions as to whether to create alternative communication devices such as redirected websites.

Question: A plan to establish an on-call executive requires the Vice President and the IT Line Managers to also have a means to maintain 24x7 contact with a delegated individual. Duties of this individual overlap to some extent with the "Publisher" role in the "IT System Event Publicity Proposal."

2. THE CRITICAL INTERDEPENDENCIES LIST (ABOVE) AND THE ON-CALL COMMUNICATIONS CONTACT.

Question: Is Big Brother sufficient to keep contact information up-to-date? Are pager and/or cell-phone numbers included? Is there a reasonable system for on-call contacts that may not be "naturals" for Big Brother (for example, the communications contact)?

Question: Can the IT Line Managers come off the list for this purpose, replaced by the executive on-call instead?

Question: Can we create a first notice template that anyone could use? Something to the effect of "__system, or preferably, application(s)__ is/are experiencing difficulties. Work is underway to correct the problem. Date and time." If the problem is resolved in the first 30 minutes, another posting to the effect of "The difficulties with ____ system, or application(s)____ have been resolved. Date and time."
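One way to make such a template usable by anyone is a pair of fill-in functions. The sketch below is an assumption-laden illustration: the function names, the timestamp format, and the rule for joining several application names are all hypothetical, and only the sentence wording follows the template proposed above.

```python
from datetime import datetime


def join_names(names: list[str]) -> str:
    """Join application names into a readable phrase."""
    if not names:
        raise ValueError("at least one system or application name is required")
    if len(names) == 1:
        return names[0]
    if len(names) == 2:
        return f"{names[0]} and {names[1]}"
    return ", ".join(names[:-1]) + ", and " + names[-1]


def first_notice(names: list[str], when: datetime) -> str:
    """First posting: something is wrong and work has started."""
    verb = "is" if len(names) == 1 else "are"
    stamp = when.strftime("%Y-%m-%d %H:%M")  # assumed timestamp format
    return (f"{join_names(names)} {verb} experiencing difficulties. "
            f"Work is underway to correct the problem. {stamp}")


def resolved_notice(names: list[str], when: datetime) -> str:
    """Follow-up posting when the problem is resolved."""
    stamp = when.strftime("%Y-%m-%d %H:%M")
    return f"The difficulties with {join_names(names)} have been resolved. {stamp}"
```

For example, `first_notice(["Banner"], datetime(2006, 9, 26, 14, 5))` yields "Banner is experiencing difficulties. Work is underway to correct the problem. 2006-09-26 14:05".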

30 minutes into the outage

If the outage or substandard performance is unresolved, complete any notifications that have not yet been accomplished under the "Immediate" step.

A status message MUST be posted on the designated websites (above) if operable and/or at the designated telephone number, or by alternate means identified by the executive who is on-call. The communications contact will assist in constructing and posting the message. It MUST contain:

  • A time stamp
  • A brief description of the nature of the outage
  • Expected resolution time---only if there is a high level of confidence in the estimate. Alternatively, it would say, "time to resoolution has not been determined. Updates will be posted approximately once each hour (or half-hour)."
  • Notice that people are working on the problem
  • As appropriate, any suggested work-around procedures
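The required contents listed above could be captured in a small structure so that no element is forgotten under pressure. The following is a sketch only: the class name `StatusMessage`, its field names, and the exact output wording are assumptions made for illustration, not an agreed format.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class StatusMessage:
    posted_at: datetime                        # time stamp
    description: str                           # brief nature of the outage
    expected_resolution: Optional[str] = None  # set only when confidence is high
    workaround: Optional[str] = None           # included only when one exists
    update_interval: str = "hour"              # or "half-hour"

    def render(self) -> str:
        """Build the message text, enforcing the required elements."""
        if not self.description.strip():
            raise ValueError("a brief description of the outage is required")
        lines = [
            self.posted_at.strftime("Posted %Y-%m-%d %H:%M"),
            self.description.strip(),
        ]
        if self.expected_resolution:
            lines.append(f"Expected resolution: {self.expected_resolution}")
        else:
            # Fallback wording when no confident estimate is available.
            lines.append(
                "Time to resolution has not been determined. "
                f"Updates will be posted approximately once each {self.update_interval}."
            )
        lines.append("Staff are working on the problem.")
        if self.workaround:
            lines.append(f"Suggested work-around: {self.workaround}")
        return "\n".join(lines)
```

Leaving `expected_resolution` empty by default matches the guidance that an estimate appears only when confidence is high; the fallback sentence is used otherwise.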

1 hour into the outage

The communications contact, in collaboration with an appointed contact in the resolution effort, MUST post a status message on the designated websites (above) if operable and/or at the designated telephone number, or by alternate means identified by the executive who is on call. This and each subsequent status message must contain:

  • A time stamp
  • A brief description of the outage
  • A time-to-resolution should be provided only if there is a high level of confidence in the estimate (same addendum as for the 30-minute message, above)
  • Notice that people are working on the problem, and any high-level changes in status
  • As appropriate, any suggested work-around procedures

At each subsequent hour into the outage

The communications contact MUST post a new status message, even if the only update is "the resolution effort continues."
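The posting cadence described in these sections (30 minutes, 1 hour, then every hour) can be expressed as a short calculation, which might help whoever is tracking whether a posting is overdue. This is a sketch under assumptions: the function name `status_due_times` is hypothetical, and the outage start time is taken as known.

```python
from datetime import datetime, timedelta


def status_due_times(outage_start: datetime, now: datetime) -> list[datetime]:
    """Times by which a status message should have been posted, up to now."""
    due = []

    # First status message: 30 minutes into the outage.
    first = outage_start + timedelta(minutes=30)
    if first <= now:
        due.append(first)

    # Then at 1 hour and at each subsequent hour.
    hours = 1
    while outage_start + timedelta(hours=hours) <= now:
        due.append(outage_start + timedelta(hours=hours))
        hours += 1
    return due
```

Comparing the returned list with the messages actually posted shows at a glance whether an hourly update was missed. Postings triggered by a partial resolution or a degradation of service would be additional to this schedule.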

Any step impacting users

Any time a partial resolution (or degradation of service) that could impact users occurs, the communications contact must post a status message.

When the outage is resolved

When the outage or serious substandard performance is resolved, the communications contact MUST post a new status message noting the resolution in the same locations/formats as prior messages.

In addition, information about the outage in some detail must be communicated to the critical interdependencies list.
Information must also be communicated to the VT-DNet listserv and the Techsupport listserv, as appropriate to the problem and its resolution.

The resolution should also be posted to the official calendar.

Three business days later

The communications contact will check the status messages that had been posted to be sure all continue to be appropriate. Most will need to be replaced or updated by a "systems normal" message.

II. Scope: Other systems, applications, locations

Each system, application, and/or location should develop a communication plan modeled after the Data Center plan.

Question: What more needs to be said?

Action item: Repeating from page 1---please send models that are currently in existence for Banner, network, other systems and applications.


3 Comments

  1. Christine Morrison

    Per Jeff Kidd regarding "status" phone line - William Dougherty had a line set up back when there was extensive maintenance on, I think, the Exchange servers. 1-1600???. It was left in place for future use of a similar kind.

  2. Christine Morrison

    Per Jeff Kidd regarding added staff to conduct notification - There may be such situations, but the more likely need is to have primary and alternate individuals for managing the publicity response/notification effort.

  3. Mike Naff

    Action item: Mike Naff and Jeff Bevis (and anyone else with relevant information) will provide information on the ways that communication about planned maintenance is communicated.

    For planned Banner maintenance, AIS will notify key people in the central departments. These central areas have email lists they use to send information to their users. Generally 4-Help is notified of the outage as well as any new features or changes for which they are likely to get support calls. Depending on the scope of the maintenance outage, we may also contact the TECHSUPPORT and AIS-ORG listservs.