Forever Resilient: Continuity of Operations Case Study

In this post, we'll examine a recent incident at a client location where a "catastrophic failure" occurred during a server backup/swap.

Incident Summary: In the wee hours of a Tuesday night/Wednesday morning, the client's IT Department was changing Microsoft Exchange Servers. During this change, one of the servers failed. The vast majority of the data was lost or corrupted. The entire email system stayed down until Monday morning, when Outlook services were restored to about 80-90% of the staff. Some of those staff members lost all data from before the incident, including emails, personal folders (PST), and contact lists. They only had email and data from the incident forward. When the IT Department restored their pre-event data, they lost their post-event data. One member had his old data restored but lost the past seven days' worth of emails. Some people lost their entire profiles/desktops.

To date, approximately one third of the staff is still without pre-event capabilities. Outlook Web Access (OWA) was fully restored on Wednesday, a full week after the event. Additionally, the staff members who lost all pre-incident data were assigned new email addresses. Since they also lost their contact lists, they were, and still are, unable to inform their contacts of the address change.

Lessons Learned:

Communication. There was none. The client, as a whole, did not know a critical software and/or hardware change was going to take place. In the past, the IT Department notified the organization that it would be completing an update and told staff to back up their data. Unfortunately, this time the client was left in the dark and the organization was caught completely off guard. A few emails ahead of time, informing the staff that there was a risk of data loss and that they should back up all data, might have mitigated some of the risk. Additionally, while the IT Department was fixing and restoring the staff's accounts, the general staff was not informed about the recovery status. In fact, it was the Operations Department that pushed for organization-wide updates from the IT Department. Imagine what the public response would have been if emergency management officials had neglected public updates during Hurricane Katrina or Deepwater Horizon. Operating in this kind of information vacuum is bad business and leads to the next lesson learned.
Reputation. Due to a lack of communication and a perceived indifference towards the organization's general reliance upon its systems, the IT Department has a terrible reputation in the organization. In discussions with the client staff, it became clear that the IT systems fail to work correctly at a rather high frequency. In fact, there are several initiatives being held up because the leadership does not want to further burden the IT Department. Some of the incidents were preventable; others were not. However, if the general staff trusted and believed in the IT Department, the failure would have been seen as a bump in the road rather than another domino in a long line of shortcomings. The IT Department's reputation will only be repaired by rebuilding trust through a series of successful events and open communication. Provide that layer of transparency to the general staff. Let the staff know a software or hardware change is coming and what to do. If the update or change does not occur as planned, then provide regular and honest updates, even if the answer does not contain a timeline.
Training. This client has many redundant IT capabilities in the event the primary method or system fails. After interviewing many client staff members, it became clear that many of them are unaware that these redundant capabilities exist. The members who know about them are unfamiliar and uncomfortable with their use, especially in a contingency environment. This is easily fixed through training. Had the IT Department explained and demonstrated the capabilities to the staff and given them a working knowledge of the tools, the impact of the downed system would not have been as halting as it was. Almost all of the organizational members encountered said they sat around for days, unable to begin working because they had relied upon one system and were either unaware of or uncomfortable with the redundant systems. When the system failed, they were dead in the water. The IT Department should have taken the time to prepare the staff by demonstrating and advocating the secondary and tertiary systems. After all, continuity does not exist if the general staff does not know how to implement it.
Personal Responsibility. This post has been rather hard on the IT Department. However, there are lessons to be learned outside of its area of responsibility. For months, even years, leadership and the IT Department have told staff to back up vital files in multiple locations, e.g. the shared drives, the SharePoint Portal pages, or a CD. Out of about twenty staff members interviewed, only two had properly backed up their information. While it is frustrating to lose this much critical data, this incident did not come out of the blue. It is neither a 'Black Swan Incident' nor an isolated incident. This crisis could be called the culmination of a long line of mishaps on the part of the IT Department. It is baffling that people who saw the incident looming on the horizon are surprised that they lost everything! It would be no different if they had stood on the shoreline watching a hurricane approach and then had the audacity to be surprised when their house was flooded. If your job requires you to rely upon email archives and massive amounts of files to conduct everyday operations, then have the foresight to regularly back them up in multiple locations. That way, when the next failure occurs, you will be able to continue your job because you have properly prepared. Staff members should take an interest in their own resiliency.
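
For staff who want to act on the "multiple locations" advice, the copying can be scripted. What follows is only a minimal Python sketch under stated assumptions, not something the client used: the function names (sha256_of, back_up) and any folder paths passed to them are hypothetical, and it assumes the backup locations are reachable as ordinary folders (for example, a mapped shared drive). It copies one file to several folders and confirms each copy by comparing checksums.

```python
from __future__ import annotations

import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, destinations: list[Path]) -> dict[Path, bool]:
    """Copy `source` into each destination folder and verify every copy.

    Returns a mapping of destination folder -> True if the copy succeeded
    and its checksum matches the original, False otherwise.
    (Hypothetical helper for illustration only.)
    """
    expected = sha256_of(source)
    results: dict[Path, bool] = {}
    for folder in destinations:
        try:
            folder.mkdir(parents=True, exist_ok=True)
            copy = Path(shutil.copy2(source, folder / source.name))
            results[folder] = sha256_of(copy) == expected
        except OSError:
            # An unreachable location (e.g. a shared drive that is down)
            # is recorded as a failure without stopping the other copies.
            results[folder] = False
    return results
```

Calling back_up(Path("archive.pst"), [Path("S:/backups"), Path("D:/backups")]) with paths of your own would return one True/False entry per location, so a failed or corrupted copy is noticed before the next outage and not after it.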

Unfortunately, it sometimes takes a major incident to start using best practices in Business Continuity. The best way to move forward is to learn from the mistakes of the past. Take these lessons, apply them to your organization, and instill resiliency.

1 comment:

Kevin Schaller, March 21, 2012 at 8:33 AM
You address a number of points that center on managerial oversight, namely the complacency of senior leadership in failing to insist on accountability by IT management. This reinforces the importance of executive engagement in the BCM program in any organization.

Wednesday, March 21, 2012
