Preventing disasters with communication, knowledge and experience.
Friday, October 10th, 2008 by Tim Greer
Let’s face it: problems happen, people aren’t perfect and some things are inevitable. However, it is interesting to notice how many problems and disasters could have been prevented or dealt with before things escalated to a point of no return. Consider the Chernobyl nuclear reactor disaster and the 1989 crash of British Midland Flight 92 near Kegworth, Leicestershire, UK. Looking at the historical facts, you realize how these tragic events were part design, part equipment failure and, ultimately, part human failure, which played the final role in the chain of events.
The same holds true in the technology industry. One must have the experience to properly recognize and deal with an issue, the knowledge to take preventative measures up front through design and preparedness and to choose the best methods for handling the situation, and above all, communication policies to ensure everyone knows what’s happening, what you’re doing and what they are doing. Poor management, and poor decisions by owners about whom to place in certain positions or which tasks to assign, can add to the problem as well. An example is putting a friend from high school into a general manager position, rather than choosing someone who possesses the related skills, experience or even a level head and enough common sense to properly manage people and delegate tasks. Communication isn’t offered or encouraged within some companies, and their servers are just running on a wing and a prayer.
The Chernobyl incident unfolded because management demanded that a less qualified and less experienced night shift crew run vital tests, and those crew members didn’t properly communicate with each other. One crew member was performing pre-test checks by reducing the power and balancing the heat and steam of the system with the cooling tank in relation to the steam output (which involved reducing the number of control rods to get the turbines to turn enough to generate more steam, to in turn generate the power to pump water into the system). Meanwhile, the other crew member was adding too much cold water to the system and wasn’t getting enough steam as a result, so he cut off the water supply to the cooling tank. That tank was missing far too many control rods to ensure a safe balance, so it quickly overheated beyond their control and caused a meltdown. Had the first crew member communicated that he had reduced the number of control rods well below the minimum for safe operation, the crew member operating the water cooling would have known that there weren’t enough control rods to keep the cooling tank in check, and would have kept water flowing instead of cutting it to get the steam needed. Simple communication amid an array of many variables could have prevented a major disaster.
The British Midland Flight 92 crash mentioned above could have been prevented by communication, as well as by proper flight simulator training for the new Boeing 737-400 the pilots were flying, in which each had under 50 hours of experience. With that training, when a problem happened they could have diagnosed the issue more accurately and prevented a disaster. The model was new and its instruments were different. Key differences in this model gave the experienced pilots reason to believe that they had resolved the problem. In fact, they had a bad engine and, due to a lack of communication and knowledge of the system, they shut down the wrong one. Due to how the fuel system was designed, they thought the issue was resolved immediately after disengaging the good engine.
They were unable to see which engine was the problem, and while diagnosing the issue initially, they had concluded that it was the right side (good) engine, because they could smell smoke in the air conditioning system. On earlier models, the air for that system always came from the right engine. Unfortunately, this had changed, and the newer model used both engines. They informed the passengers that they had resolved the problem by shutting down the faulty engine on the right hand side and were preparing to land at the next available airport. Ironically, the passengers saw the flames coming out of the left side engine, but no one said anything or alerted the pilots, because the passengers simply assumed that the pilot was the professional in this case and must know what he was doing. In this case, he did, but he wasn’t familiar with the new model and lacked training for dealing with emergencies like this on it, because things had been changed around. Had someone informed the pilots that the passengers could see the other engine was the problem, the pilots could have quickly reviewed and remedied the issue. By the time the bad engine completely broke down, they didn’t have time to start up the good engine (which they thought was bad). The initial report was that both engines must have failed, which has odds of about 100 million to 1 against happening on a new plane. Unfortunately, not realizing any of these things while the incident unfolded was what ultimately caused the crash.
Problems will happen for a variety of reasons, but how familiar are the staff with the technology, the service, the equipment, the protocols and the policies, and how good is their communication? I’ve personally worked with companies that had terrible communication, and that was the case the entire time I was working with them. No one was willing to fix anything, or even to consider a review or consultation on what could be improved, whether in communication or in how tasks were performed. As a result, you had support staff unaware of what the admins were doing, and admins unaware of what support staff were telling clients. Worse, you had owners and managers without any technical knowledge or common sense making reckless announcements to the public, and only then informing the admins to act on it and “make it happen”, regardless of how uninformed their decision was or where it came from.
Support staff have clients asking questions they don’t know how to answer, because an admin was told to “make this happen right now” and the change adversely affects clients across the server farm. A client gets rightfully upset when asking about a major change, only to find a support staff member who appears clueless because, by all accounts, no information about this major change was passed on to them. In fact, only the admin making the change was filled in on it and no other admins were. This wasn’t even a situation where one admin was more senior; that admin was simply the one the owner picked at random to make the change. Meanwhile, another admin on a newly started shift receives hundreds of reports of problems, sees the cause and fixes it, only to break what the other admin had done earlier at the request of the owner. This is one of many examples of how poor planning and a lack of infrastructure cause problems.
Any major change needs to be discussed with staff that have the technical knowledge. If the owner has an idea, they should run it by the people they hired because of their knowledge and experience. Find out what’s possible, what’s likely, what options there are, and how it will affect the clients, the servers and the services run. Will it pose a security issue? Will it pose a stability issue? Will it cause a disaster? Will the company be able to recover from such a disaster if it does? Should it even be considered? Is there a real, valid and relevant reason to offer this, or is the owner just in a mood that day? Has the owner lost their mind, or were they even thinking? Rather than just coming up with an idea and acting on it, discussions need to be encouraged. Is this something the staff have a lot of experience in, or any experience in? What are the benefits, and what are the drawbacks? Should this be implemented immediately, soon, or in the future? How much notice can the company give to the clients?
Does the change warrant advance notice? Will this force clients to make changes in their site’s code or how their programs run? How much time should be given, and what discussions should take place, before we plan a major change? And there are many more questions. Somehow, these things simply aren’t considered or a concern for owners and managers. Why? Because they are truly not qualified. You hire people for their experience, skills, knowledge, insight, ideas and trustworthiness. If they are hired because they are more experienced, more skilled and more knowledgeable than the owner, the owner needs to trust them, and trust their input. Any staff member, on any level at all, should have an opportunity to be heard, and any relevant suggestion, concern or experience they can offer should be considered. Instead, if there are any discussions at all, they usually involve only one person. What happens when that admin doesn’t know as much as another admin about a topic? What happens if they get hit by a bus? What happens if they don’t know everything? At this point, should the owners really be relying on the one admin that’s their best drinking buddy, or should they ensure that all admins are equally aware of the decisions and have an opportunity to offer input, and perhaps a superior skill set and experience in that area? This is if they even go that far. You might be surprised at just how many companies, even very large ones, operate in this way.
The more eyes, ears and brains looking at and hearing ideas and considering the options, benefits and issues they could bring up, the better. A company that operates on the basis that its leaders know best because they own the thing or are friends with the owner, to the point where smug and ignorant managers and owners dismiss all of the good things their staff can bring to discussions and ignore their value, has a recipe for disaster. It could be on a technical level, a customer service level, or even right down to how the company represents itself, the services it offers and how it offers those services. Everything in business, especially when you are dealing with so many services and with the reliability and security ramifications of everything you say or do, affects your clients and conveys to the world how your company operates. That sometimes calls for more consideration, and for policies that give people an opportunity to be properly trained, to offer input and, above all, to use good communication.
Sometimes a lack of these things causes issues here or there that can be resolved, assuming it’s not a constant and ongoing problem, but other times it causes significant and major issues that are effectively disasters. These range from a damaged reputation to systems wiped because security was lax, with no proper backups to fall back on. Now what are they going to do? I suppose if they are large enough, they’ll survive, and there will always be enough people tolerant enough of anything to simply proclaim that no one is perfect and problems happen. True enough, but could these things have been prevented? Are the people in their positions because they are qualified to be? Is the company being responsible? And do they ultimately care?















