03/19/2009 15:32
AIS SDTC Datacenter Post Incident Report - 03-19-09
This is the email we received from AIS, our datacenter. These are the details surrounding the power outage that led to a great many issues with our servers and client outages.
------------------------------------------
Dear Valued Customer:
As a follow-up to the power event which occurred on the morning of March 18th, 2009 at the 9725 Scranton Data Center (SDTC), American Internet Services has compiled the following post incident report for our customer base. As always, our Account Relations and Management team members are available to discuss specific customer issues or concerns, while this report is intended to provide a comprehensive overview of the event itself.
At approximately 08:15AM PDT, March 18th, the SDTC datacenter suffered a complete power failure for approximately 30 seconds while we were conducting routine maintenance on the critical datacenter systems. The work that was being performed is part of AIS’ Standard Operating Procedure. This procedure is in alignment with industry guidelines, and our commitment to provide customers with the highest availability in data center solutions. As we have informed our customers in the past, all critical systems are tested bi-monthly by our team of mechanical engineers in conjunction with our outside contractors under service agreements. Standard maintenance is performed during normal business hours and is carefully planned to incorporate the strictest test procedures to ensure the success of the work performed. Our SOP incorporates escalation processes and back out procedures in the unlikely event of an alert or anomaly during the standard maintenance.
Regretfully, during our maintenance yesterday, we encountered a mechanical failure. The Powerware 9515 UPS plant failed during the transition of building load from street power to generator power. Approximately 30 seconds after the failure of the UPS plant, our CTO, Richard Sears, who was present for the maintenance, restored power to the data center by manually moving the building to generator, quickly isolated the failure to the UPS plant, reset all four UPS modules, and brought all four UPS modules back online. Following that, he moved the UPS plant from bypass mode to normal operational mode.
At that time, senior management called to initiate the Emergency Response Plan (ERP) and made a decision not to move the data center back to street power until our mechanical engineers and external contractors had an opportunity to perform diagnostics of all datacenter systems to determine what caused the failure of the UPS plant, as well as to test the general state of health of all critical systems.
Within approximately 15 minutes of initiating ERP, we had mobilized 18 Customer Service Engineers, 5 Networking Engineers and Facilities and HVAC teams to the datacenter, in an effort to assist our customers with recovery. We also had UPS, battery and power experts from Eaton Powerware, CPD and Emerson there to assist in the investigation of the issue. As part of our emergency communication plan, all customers were proactively contacted and informed of the situation and were provided multiple progress updates throughout the day.
Upon review of the findings, it was determined that one of our battery strings failed, leaving it unable to hold system load once the UPS plant went fully to battery. This caused a critically low battery voltage condition across the entire UPS plant, and the plant protected itself by bypassing its system load to the main bus. This was during the time the building was being transferred from street power to generator power, so the main busses were both dead. In order to prevent a dead-head of the generator and utility systems, the SEL electrical system has a failsafe that prevents the main breakers from closing after the emergency breakers have been commanded to close and there is power on the emergency bus. This condition prevented us from closing the main breakers, while we were still able to close the emergency breakers.
As with all of our critical datacenter systems, we have external contractors under maintenance agreements to provide system maintenance. JT Packard is responsible for system maintenance on our entire UPS and battery plant at the SDTC datacenter. We rely on our vendor to test each battery at specific intervals to determine if and when our batteries are approaching the threshold that requires replacement. JT Packard has been performing this system maintenance on a regular basis for several years now, and most recently reported 100% System Health.
The result of the investigation, in the opinion of both Eaton and CPD, who conducted their investigations separately, is that the battery string in question failed due to bad batteries that were not identified during the latest battery tests by JT Packard.
Upon validation of the findings, we mobilized 160 replacement batteries from Orange County to our datacenter and proceeded to schedule a three-hour Emergency Maintenance window to start at 7:15PM PDT in order to replace the batteries and perform the load transfer back to street power. The evening's emergency maintenance window was completed successfully at approximately 10:00PM PDT and all critical systems were again checked and diagnosed to be operating at 100%.
We sincerely apologize for the inconvenience yesterday’s event caused you. We want to assure you that we spare no expense when it comes to designing, deploying and maintaining our datacenter systems in order to meet the industry’s highest levels of reliability which our customers have come to expect. If you would like to receive any more detailed information regarding this matter, or would like a detailed layout of our power infrastructure, please let us know. We are here to be of assistance.
We want to thank all of our customers for their continued support while we worked together to mitigate this critical event.
Sincerely,
Alessandra M. Carrasco
Chief Executive Officer
American Internet Services
9305 Lightwave Avenue
San Diego, CA 92123
(858)576-4272 main Ext. 145
(858) 427-2475 fax
acarrasco@americanis.net
This electronic mail message and any attached files contain information intended for the exclusive use of the individual or the entity to whom it is addressed and may contain information that is proprietary, privileged, confidential and/or exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any viewing, copying, disclosure or distribution of this information may be subject to legal restriction or sanction. Please notify the sender, by electronic mail or by telephone, of any unintended recipients and delete the original message without making any copies.
____________________________________________________________
If you have any questions or concerns please address them
to support@americanis.net or give us a call at 858-576-4272, and reference the event number 50159.