
Saturday, December 13, 2014

ADKAR – Model for Change

Awareness – why the change is needed
Desire – to participate in and support the change
Knowledge – how to change
Ability – to implement the required skills and behaviors
Reinforcement – to sustain the change




Saturday, November 22, 2014

General Understanding of Project Life Cycle / Methodologies to be Practiced in Project Management.

Once you are assigned to a project as PM and given a Charter / PRF (Project Request Form), perform the following in sequence.

i) First, prepare the Stakeholder Register / Organization Chart.
ii) Meet with the Project Sponsor to understand the expectations for this project.
iii) Meet with the Customer / business leads / vendor's architects and my organization's Solution Architects to understand the business requirements and to set up ground rules for the project. Plan the design phase until a solution is prepared and agreed between the parties, and set up a weekly checkpoint meeting to monitor progress. Simultaneously, I prepare the Design Phase Plan, MOMs, financial tracking sheet, Quality Plan, Risk & Issue Logs, project status report, Org Chart, and Lessons Learned document; create a central repository and upload all the project documents; and update the Stakeholder Register and RACI chart.
iv) Once the solution is prepared, I submit it to the customer's design board along with the budget estimate for review of the proposed solution.
v) Once the design and budget are approved, I acquire the project team for the execution phase. A kick-off meeting is set up with new and existing resources, and all participants are briefed on the project scope, solution, and delivery.
vi) The next meeting is to identify the tasks (WBS), their ownership, duration, and sequence. The next big thing is to discuss risks / assumptions / constraints and update the Risk & Issue Logs. Once all the tasks are identified, I prepare the Project Plan and send it to all the stakeholders for their consent, and baseline the scope, schedule, and financials.
vii) I then schedule a recurring weekly status meeting with all the stakeholders to discuss project status, review the project plan, review the past week's tasks and upcoming tasks, and go over risks and issues.
viii) I also schedule a monthly meeting with the team's functional managers / related service tower leads / SDM to update them on project progress, team performance, and risks and issues.
ix) Update the project reports / plan / financial tracking every week and send them to higher management. Receive and manage project change requests, and update the project baselines accordingly.
x) Send task / milestone completion / status mails to stakeholders, receive sign-offs, and upload this evidence to the central repository.
xi) I use EVM to monitor project health for schedule and budget (a worked sketch follows this list).
xii) Coordinate with vendors and the finance department for any procurement from vendors / third parties and manage their approval and delivery.
xiii) Once the project has received all the sign-offs and delivers what was in scope, I begin the project closure process.
xiv) Confirm that all tickets / tasks are completed and closed. Confirm delivery of the product with the customer and save this communication. Release all resources from the project. Close the project financials. Prepare the project closure report (noting any open item / task with proper reasoning, a new owner, and an estimated completion date) and the Lessons Learned document, and obtain sign-off.
xv) Obtain project closure approval / sign-offs, close the project in the project management / PPM system, and close all the tasks in the project plans.
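
As referenced in step xi, here is a minimal sketch of the standard EVM indicators (SV, CV, SPI, CPI, EAC); the project figures in the example are hypothetical, and the forecast assumes the current CPI continues.

# A minimal sketch of the EVM health check mentioned in step xi.
# The sample figures are hypothetical, for illustration only.
def evm_health(bac, pv, ev, ac):
    """Return basic Earned Value Management indicators."""
    sv = ev - pv            # Schedule Variance
    cv = ev - ac            # Cost Variance
    spi = ev / pv           # Schedule Performance Index (>1 means ahead of schedule)
    cpi = ev / ac           # Cost Performance Index (>1 means under budget)
    eac = bac / cpi         # Estimate At Completion (assumes current CPI continues)
    etc = eac - ac          # Estimate To Complete
    vac = bac - eac         # Variance At Completion
    return {"SV": sv, "CV": cv, "SPI": spi, "CPI": cpi,
            "EAC": eac, "ETC": etc, "VAC": vac}

if __name__ == "__main__":
    # Hypothetical project: $100k budget, 40% of work planned, 35% earned, $38k spent.
    print(evm_health(bac=100_000, pv=40_000, ev=35_000, ac=38_000))

In practice these inputs come from the baselined schedule and the financial tracking sheet mentioned above.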

Tuesday, November 11, 2014

What change management process should we use to ensure that change is introduced properly?


Change management deals with how changes to the system are managed so they don't degrade system performance and availability.

In effective change management, all changes should be identified and planned for prior to implementation. Back-out procedures should be established in case changes create problems. Then, after changes are applied, they should be thoroughly tested and evaluated.

Step 1: Define change management process and practices
As you would with other systems management disciplines, you must first craft a plan for handling changes. This plan should cover:
A. Procedures for handling changes—how changes are requested, how they are processed and scheduled for implementation, how they are applied, and what the criteria are for backing out changes that cause problems
B. Roles and responsibilities of the IT support staff—who receives the change request, who tracks all change requests, who schedules change implementations, and what each entity is supposed to do
C. Measurements for change management—what will be tracked to monitor the efficiency of the change management discipline
D. Tools to be used
E. Types of changes to be handled and how to assign priorities—priority assignment methodology and escalation guidelines
F. Back-out procedures—Actions to take if applied changes do not perform as expected or cause problems to other components of the system
Step 2: Receive change requests
Receive all requests for changes, ideally through a single change coordinator. Change requests can be submitted on a change request form that includes the date and time of the request.
Step 3: Plan for implementation of changes
Examine all change requests to determine:
A. Change request prioritization
B. Resource requirements for implementing the change
C. Impact to the system
D. Back-out procedures
E. Schedule of implementation
Step 4: Implement and monitor the changes; back out changes if necessary
At this stage, apply the change and monitor the results. If the desired outcome is not achieved, or if other systems or applications are negatively affected, back out the changes.
Step 5: Evaluate and report on changes implemented
Provide feedback on all changes to the change coordinator, whether they were successful or not. The change coordinator is responsible for examining trends in the application of changes, to see if:
A. Change implementation planning was sufficient.
B. Changes to certain resources are more prone to problems.
When a change has been successfully made, it is crucial that the corresponding system information store be updated to reflect it.
Step 6: Modify change management plan if necessary
You may need to modify the entire change management process to make it more effective. Consider reexamining your change management disciplines if:
A. Changes are not being applied on time.
B. Not enough changes are being processed.
C. Too many changes are being backed out.
D. Changes are affecting the system availability.
E. Not all changes are being covered.
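
As a rough illustration of Steps 2 through 6, here is a minimal Python sketch of a change-request record and the coordinator's trend report; the field names, statuses, and metrics are assumptions for illustration, not a prescribed tool.

# A minimal sketch of the request-tracking and trend-reporting steps described
# above (Steps 2-6). Names, fields, and statuses are assumptions for illustration.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ChangeRequest:
    request_id: str
    description: str
    priority: str                      # e.g. "emergency", "high", "normal" (Step 1E)
    requested_at: datetime = field(default_factory=datetime.now)
    status: str = "received"           # received -> planned -> implemented / backed_out
    backout_plan: str = ""             # required before scheduling (Step 3D)

def trend_report(changes):
    """Step 5: feedback for the change coordinator."""
    done = [c for c in changes if c.status in ("implemented", "backed_out")]
    backed_out = [c for c in done if c.status == "backed_out"]
    rate = len(backed_out) / len(done) if done else 0.0
    return {"processed": len(done),
            "backed_out": len(backed_out),
            "backout_rate": rate}

A rising back-out rate in this report is exactly the kind of signal Step 6 uses to trigger a review of the change management process.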

Monday, November 10, 2014

Centralized Vs. Decentralized PMO

PMs usually work under one of two models in most companies:
  1. A centralized PMO, which is accountable for delivery of projects, has HR responsibility for the PMs, and owns the process/metrics/tools. In this model, PMs are dotted-lined into business segments.
  2. A decentralized model, which has PMs directly aligned into business segments but with dotted-line responsibility back to the PMO to ensure standards.
I have worked in both models and think there are pros and cons of each. Some benefits of a centralized PMO include:
  1. Ability to build a sense of community because everyone is together
  2. Ease of rolling out standards because there is a direct alignment of standards and the PMs
  3. Ability to rotate resources to different areas in the case of volatile demand across business segments
  4. Consistency of work
Some benefits of a decentralized PMO include:
  1. Closer alignment of PMs to business segments and the ability to become an expert in the business of that segment
  2. PMs can report into segment managers so there is not as much overhead as a PMO
I think there is validity in either model and it really depends on the organizational culture and values. If consistency is a top priority then it makes sense to centralize. If being nimble and operating at a segment level is a priority, then decentralization makes more sense. Having a hybrid model is probably the best scenario – for example having a centralized organization but then keeping the PMs aligned to specific segments and not moving them around too much.



Friday, October 31, 2014

ITIL and PMP Framework Similarities

While ITIL addresses how IT organizations as a whole should operate, PMP addresses how individual projects within the organization should be executed. PMP applies to projects throughout the entire organization, not just IT. Both frameworks rely heavily on process and the use of tools to enable consistent execution of processes. The ITIL framework and the project management framework support each other in a way that propels services and operations to a greater level of proficiency. Furthermore, both frameworks address the need to manage quality, risk, and accountability. Most importantly, however, both ITIL and PMP consistently help improve efficiency and effectiveness within the organization. ITIL describes the ideal end state that an organization would like to achieve. There are those who believe that if the ITIL framework behaves according to the ideal model, all will go according to plan. Unfortunately, this ideal end state is rarely achievable all at once in the business world, and an organization must implement a framework that allows individual projects to be completed over months in order to reach the desired end result.

Tuesday, October 14, 2014

Differences between ITIL and PMP

The differences between the ITIL framework and the project management framework are inconsequential when compared to the overall effectiveness of combining the two. Similarities aside, project management is not specific to IT. The PMP framework, focusing on effective execution of projects, can be applied to any area of any organization. Unlike ITIL, the project management framework does not operate on a lifecycle approach, but is organized into nine key knowledge areas: project integration, scope, time, cost, quality, human resources, communications, risk, and procurement management. As previously mentioned, rather than analyzing the breakdown of each project, the ITIL framework examines the whole picture - a key difference. By taking a larger view of services in the organization as a whole via a lifecycle approach, ITIL sets out to examine service strategy, service design, service transition, service operation, and continual service improvement. Take, for example, an organization that is building and deploying an email service - on one level, ITIL will evaluate what is needed; PMP will then take this information and further break it down into easier-to-manage increments.

Friday, October 3, 2014

Data Center Implementation and build

I was a bit fortunate to participate in a presentation on DC implementation and build.
After the presentation, I was excited about its content and knowledge, which is not freely or easily available over the internet, so I decided to capture it here along with some of my own understanding. Most IT managers look for the deep secrets of building a DATA CENTER; the following points are the most important when building one.
First, how many types of data centers are there? There are four types (Tier 1–4; details are easily available over the internet). The Tier 1–4 classification is simply a standardized methodology used to define the uptime of a DC. Primarily, the tiers are useful for measuring DC performance, investment, and ROI.
A Tier 4 data center is considered the most robust and the least prone to failures (downtime). A data center is essentially power + cooling + space where we keep different networked devices. While constructing a data center, we should ensure that it is scalable, modular, flexible, and simple in design.
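
To make the tier comparison concrete, here is a small sketch that converts availability into allowable downtime per year; the availability percentages are the commonly cited Uptime Institute figures and are an assumption here rather than something from the presentation.

# Converts tier availability into allowable downtime per year.
# Availability figures are the commonly cited Uptime Institute values (assumption).
HOURS_PER_YEAR = 24 * 365

tier_availability = {
    "Tier 1": 0.99671,
    "Tier 2": 0.99741,
    "Tier 3": 0.99982,
    "Tier 4": 0.99995,
}

for tier, availability in tier_availability.items():
    downtime_hours = (1 - availability) * HOURS_PER_YEAR
    print(f"{tier}: ~{downtime_hours:.1f} hours of downtime per year")
# Tier 1 works out to roughly 28.8 hours/year; Tier 4 to roughly 26 minutes/year.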
Which are the most important elements of any data center? The following elements during planning and design of a new data center will direct the course of action....
A. Stakeholders
B. Scope
C. Space
D. Cost
E. Requirement (individual)
F. Compliance
What precautions should we follow while building a data center? While building a new DC, many things matter a great deal; we should follow the design guidelines below during the course....
A. Label everything (golden rule)
B. Use RLU (rack location unit) costing
C. Be flexible, for future needs
D. Think modular, for future expand plan
E. Worry about weight; most DCs are not built on the ground floor, so plan carefully for the weight of the devices.
F. Keep things covered or bundled to avoid dust
G. Build a raised floor of at least 4 feet
Let's look at a data center build from the perspective of project management (PMI). I was once part of a team involved in the planning and design phase of a DC build project. There are several major areas in any data center build project, i.e., civil work, electrical work, fire protection, security, and cooling arrangements. I learned most of this on that project; following are some golden learnings....
a. A complete scope, an approved budget, & an experienced team (the keys to building a DC successfully)
b. Insurance and local/ state /country building codes
c. Determining the viability of the project
d. Realistic project & full budget
e. Considerable redundancies of power/services/HVAC/UPS
f. Proper management of the account (how funds will be distributed)
g. Future modifications, upgrades, changes
h. Factor in running costs, services, and maintenance contracts with suppliers
i. Are redundant systems really necessary?
j. How much downtime (failure) is projected?
k. Separate NOC/command center
l. What is the best time to bring the facility online (schedule management)
m. Physical space and weight capacity for equipment/device
n. Power availability/grid power
o. Cooling system redundancy to increase uptime
p. ISP Bandwidth type & amount
And to support all of the above, we should have the following in place....
A. Location of floor & weight of rack
B. Sufficient power to run racks
C. Humidity in the data center should be between 40–45%, and temperature should be between 18–27°C
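
As a small illustration of point C, here is a minimal sketch that checks sensor readings against the ranges quoted above; the function name and sample readings are hypothetical.

# Checks data center environmentals against the ranges quoted above (assumed targets).
def check_environment(temp_c: float, humidity_pct: float) -> list:
    alerts = []
    if not (18 <= temp_c <= 27):
        alerts.append(f"Temperature {temp_c}°C outside 18–27°C range")
    if not (40 <= humidity_pct <= 45):
        alerts.append(f"Humidity {humidity_pct}% outside 40–45% range")
    return alerts

# Hypothetical readings from a cold-aisle sensor.
print(check_environment(temp_c=29.5, humidity_pct=42) or ["Within range"])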
OK, that's not all; we have something bigger to worry about. Let's talk about risk in these types of projects. So far I have identified the following major risks....
A. Ambiguity in the scope of work
B. Not having clear contract/sourcing process
C. Queries from vendors that turn out to be additional work
D. Change management
E. Finance
F. No involvement of delivery team/people during planning or design phase
That's all I have; I will post more on this topic if I learn or discover new things in this regard.

Sunday, September 21, 2014

A systematic approach to problem solving by SWAT

Summary: Problem determination is not an exact science, but it's also not rocket science. A methodical approach will help your problem-solving techniques become more organized, systematic, and, ultimately, more effective.

How do you approach a new problem? When you encounter a new problem, how do you decide what to do? Where do you start? What do you look for? How can you become more effective at troubleshooting? What you need is a methodology for problem determination. By its very nature, problem determination is about dealing with the unknown and the unexpected. If you knew in advance everything about all the problems that you could encounter and exactly how they manifest themselves, then you would take measures to prevent them and wouldn’t have to investigate them. You cannot expect problem determination to be a perfectly predictable process, but there are a number of common approaches that can make the process go more smoothly and be more effective. This blog is based on the experience and observations of members of the IBM Support, Serviceability, and SWAT organizations from years of helping our clients, and from seeing both best and worst practices in action. This is an evolving work, as we continue to look for new ways to further enhance the investigative process.

Common challenges
Looking back, we can identify several common challenges that can make problem determination exercises difficult to resolve:
  • Need for direction, or "What do I do next?" Sometimes the people involved in resolving a problem simply don’t know where to start or what to do at each step. Problems can be complex and it is not always obvious how to approach them. This blog will provide some general guidance to help you get started finding and deciding what to do at each step of the process for a broad range of problems.
  • Need for information, or "What does it mean?" Sometimes what’s missing is simply information: you see some sort of diagnostic message or diagnostic file, but you don’t know how to interpret it or can’t understand how it relates to the problem at hand. You need good sources of information and tools to help you interpret all the clues discovered in the course of an investigation.
  • Miscommunication and lack of organization, or "What was I doing? What were you doing?" Sometimes time is wasted or important clues are lost because of miscommunication, or because the investigation has dragged on for so long, making the collected information more difficult to manage. Events, timelines, and artifacts that are often invaluable to determining next steps of an investigation and communicating progress can easily get lost or forgotten.
  • Dealing with multiple unrelated problems or symptoms, or "What are we looking for?" A particular challenge in complex situations is not knowing whether you are dealing with a single problem or with multiple independent problems that happen to occur at the same time. You might see a variety of symptoms, some of which relate to one problem, some to another, and others that are simply incidental and benign. Being able to distinguish between the “noise” and the real problems can go a long way toward a timely resolution.
Characterizing the problem
A common mistake when troubleshooting is to jump to specific analysis or isolation steps without taking the time to properly characterize the problem. Many investigations take longer than necessary or go off on the wrong track because they look for the wrong problem or miss critical elements that would have helped direct the research. In other cases, time is wasted because miscommunication has caused various parties in the investigation to have different understandings of the situation. When studying journalism, aspiring reporters are taught to approach every news story by asking Who, What, When, Where, and Why. You can use a variation of this principle when investigating software problems:
  • What happened?
    • What are the main symptoms that led you to determine that there is a problem, as opposed to all other ancillary symptoms?
    • Exactly how would the system have to look or act for you to consider that the problem is no longer present?
    • How would you recognize that the same problem happened again?
    • Be cautious about using vague terms like “hang,” “crash,” and “fail,” which are often generalizations, inaccurate, and distract attention away from important symptoms.
    • Be aware that in real-world situations, there can be several independent problems rather than just one. You need to recognize them and prioritize them.
    • Be conscious of tangential problems that are consistent and well-known (for example, an application error that is always written to the same log in which the original problem occurs). Sometimes these problems are incidental and not worth investigating, and sometimes they can be related to the problem.
  • Where did it happen?
    • Be precise about which machine, which application, which processes, and so on, the problem was observed on.
    • Which logs and which screens should you look at to see the problems?
    • Know the overall environment surrounding the problem (for example, find out the system topology, network topology, application overview, software versions, and so on).
  • When did it happen?
    • Track time stamps for where to look in the logs. Note time zone offsets and whether or not multiple systems have synchronized clocks.
    • Are there any special timing circumstances:
      • Every day at a particular time of day?
      • Every time you try to perform a particular operation, or every time a particular system process executes?
      • Every time a particular user or batch process starts processing?
    • Did the problem happen only once or does it occur regularly? Is the problem repeatable at will?
    • Has the problem been reproduced before, during load or stress testing?
  • Why did it happen?
    • Why did the problem happen now, and not earlier? What has changed?
    • Why does the problem happen here, on this system, and not on other similar systems? What is different?
    • Has it ever worked correctly in the past?
  • What do you know? Finally, it helps to clearly summarize all the information available. Make a list of all the symptoms and anomalies, whether or not they seem related to the problem.
Not every problem is difficult
Before jumping into more complex techniques, it is often useful to approach each new problem in two phases:
  • In Phase 1, you perform a broad scan of the entire system after a problem has occurred to find any major error messages or dump files that have been generated in the recent past. Each of these errors and dumps constitutes an initial symptom for the investigation. Search these symptoms across one or more knowledge bases of known problems.
  • In Phase 2, you do everything else: Select one or more initial symptoms for additional investigation beyond a simple search for known problems, perform specific diagnostic actions to generate additional symptoms or information (analysis or isolation approach, described below) that are typically specific to the problem under investigation, and then repeat the process as needed until a solution is found.
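
As a rough sketch of the Phase 1 "broad scan" idea, the snippet below walks recently modified files looking for error lines and fresh dump or trace files; the paths, file extensions, and keywords are assumptions for illustration, not IBM tooling.

# Phase 1 sketch: broad scan for recent errors and dump files (assumed paths/patterns).
import time
from pathlib import Path

def broad_scan(root_dirs, hours=24, keywords=("ERROR", "FATAL", "OutOfMemory")):
    cutoff = time.time() - hours * 3600
    findings = []
    for root in root_dirs:
        for path in Path(root).rglob("*"):
            if not path.is_file() or path.stat().st_mtime < cutoff:
                continue
            if path.suffix in (".hprof", ".dmp", ".phd", ".trc"):
                findings.append((path, "recent dump/trace file"))
            elif path.suffix in (".log", ".out"):
                hits = [line for line in path.read_text(errors="ignore").splitlines()
                        if any(k in line for k in keywords)]
                if hits:
                    findings.append((path, f"{len(hits)} error lines"))
    return findings   # each finding is an initial symptom to search in knowledge bases

# Example (hypothetical path): broad_scan(["/opt/logs/appserver"])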
Phase 1 is clearly easier than Phase 2, since it is a single step rather than an iterative process, and does not typically require significant prior knowledge that is specific to a particular problem. In practice, a large percentage of problems can be resolved by a systematic execution of Phase 1 type activities, so it is well worth starting here in most if not all situations. Moreover, even if a solution is not found at the end of Phase 1, the set of initial symptoms collected during this phase is usually just the kind of information that you need to start Phase 2. Use this collected information to populate the table of problems, symptoms, and actions described below. Conversely, if you omit Phase 1, you can miss important clues that might take much longer to find during a later, focused investigation. For example, if you focus too soon on one particular server that is responding abnormally, you might not notice that another server in the system has been going down or that there have been network errors, both of which are conditions that could indirectly affect the abnormal behavior that is occurring on your server. The concept of “Phase 1 problem determination" has been steadily gaining acceptance, and IBM Support is developing specialized tools to facilitate this activity. For example, the Log Analyzer tool (available in IBM Support Assistant), coupled with a set of broad automated log collection scripts (also in IBM Support Assistant), can be used for this purpose.
Balancing resolution and relief
Before launching a more in-depth investigation, you need to consider the business context in which the problems -- and your investigation -- occur. In every investigation, you must balance two related, but distinct goals:
  • Resolution: Finding the root cause and ensuring that it will not happen again.
  • Relief: Enabling your users to resume productive work.
Unfortunately, these two goals are not always simultaneously attainable. A simple reboot is often enough to bring the system back, but it can destroy information needed to debug the problem. Even actions that simply gather or report information, such as running a system dump, could cause the system to be down for an unacceptable period of time. Finally, the resources needed for a complete investigation might also be needed for higher priority work elsewhere. In some situations, such as system or integration testing, driving the problem to full resolution is required and relief is not an issue. In other situations, such as business-critical production systems, immediate relief is paramount. Once the system is back up, the investigation begins with whatever data is available or salvaged. The team working on the project must be clear about the business priorities and tradeoffs involved, and revisit the question periodically. Finding that the problem recurs frequently might raise the importance of full resolution; however, finding that the scope of the problem is quite limited may tilt the scales toward simple relief.
Fundamental approaches: analysis and isolation
Phase 2 involves more advanced problem determination activities and techniques, but fundamentally, all troubleshooting exercises boil down to this: you watch a software system that exhibits an abnormal or undesirable behavior, make observations and perform a sequence of steps to obtain more information to understand the contributing factors, then devise and test a solution. Within this very broad context, it is useful to recognize two distinct but complementary approaches:
  • In the analysis approach, you pick one or more specific symptoms or anomalies observed in the system, and then drill down to gain more detailed information about these items and their causes. These, in turn, might lead to new, more specific symptoms or anomalies, which you further analyze until you reach a point where they are so simple or fundamental that it is clear what caused them. Typical techniques used in this approach include searching knowledge bases to help interpret symptoms, and using traces or system dumps to find more specific symptoms that you can then research further with knowledge bases. For example, consider a problem in which an application server appears to be crashing. By analyzing the diagnostic material from the crash, you can determine that the crash originated in a library called mylib.so. By looking at the source code for the library and taking the native stack trace information from the gathered diagnostic material, you can see that a bit of code creates a pointer to a memory location, but does not handle it correctly. This results in an illegal operation and subsequent crash.
  • In the isolation approach, rather than focusing on one particular symptom and analyzing it in ever greater detail, you look at the context in which each symptom occurs within the overall system and its relation to other symptoms, and then attempt to simplify and eliminate factors until you are left with a set of factors so small and simple that, again, it is clear what caused the problem. Typical techniques used in this approach include performing specific experiments to observe how the system’s behavior changes in response to specific inputs or stimuli, or breaking down the system or some operations performed by the system into smaller parts to attempt to see which of these parts is contributing to the problem. For example, consider a large WebSphere Application Server environment, consisting of many nodes across several physical machines, in which an accounting application, deployed into two clusters, is having long response times. Using the isolation approach, you might opt to trace the application servers that are involved along with the network links between the servers. This method would enable you to isolate the cause of the slowdown between the network and the application servers, making it possible for more in-depth investigation to be performed on the affected component.
At first glance, it could seem that the steps followed by someone trying to troubleshoot a complex problem are random and chaotic, fueled by the hope of stumbling upon the solution. In reality, a skilled troubleshooter rarely performs any step without a very specific objective that is rooted in one of these two approaches. Understanding these approaches will help you formulate steps to follow in each of your own investigations. Conversely, if you can’t justify a step based on one or both of these approaches, chances are good that you might be relying a little too much on luck. Now, although a clear distinction is made here between the analysis and isolation approaches, in practice they are not mutually exclusive. In the course of a complex investigation, you will often take steps from both approaches, either in succession or in parallel, to pursue multiple avenues of investigation at the same time.
Organizing the investigation
As mentioned earlier, many investigations suffer from imperfect communication and organization. For non-trivial problems (generally, anything requiring Phase 2 problem determination), you should generally keep four types of information:
  • Executive summary This is a short paragraph that summarizes the status of the highest priority issues, what has been done in the last interval, and what the top action items are, with owners and approximate dates. This enables both stakeholders and those who are only marginally involved to understand the current status. Not exclusively for “executives,” this information helps focus the team, explains progress and next steps, and should highlight any important discoveries, dependencies, and constraints.
  • Table of problems, symptoms, and actions Phase 2 investigations can sometimes suffer when the number of problems to be resolved grows prodigiously. It is important to keep a written list of these additional problems and not rely on the collective memory. This table is a crucial piece of record-keeping and should be kept for all situations, no matter how simple they seem at first. When the situation is simplistic, the table is simple as well and easy to maintain. The effort to track this information will pay off on those (hopefully rare) occasions when things are much more complicated than originally believed. By the time that complexity is realized, it is almost impossible to recreate all of the information that you will want to have kept. The actual format of this table, its level of detail, and how rigorously it is used will vary between each individual troubleshooter and each situation. Regardless of the exact format, this table should contain:
    • Problems: One entry for each problem that you are attempting to resolve (or each thing that needs to be changed).
    • Symptoms: External manifestations of the underlying problem and anomalies that might provide a clue about the underlying problem. A symptom might be an error message observed in a log, or a particular pattern noticed in a system dump; the problem is the error condition or crash itself. Problems can be fixed; symptoms go away (or not) as changes are made. Sometimes new symptoms appear during an investigation.
    • Actions: Tasks to be performed that may or may not be directly related to a particular symptom or problem, such as upgrading the software or preparing a new test environment.
    • Fixes: Alternatives to be tried to achieve a resolution or workaround. (Some troubleshooters list these as actions.)
    • Theories: It is useful to track ideas about why the problem is occurring or how it might be fixed, along with actions that could be taken to test them. Noting which symptoms the theories are derived from can help rule out theories or draft new ones for the investigation.
    Regardless of the actual format, you should constantly review and update the table with your team so that it reflects the current state of the investigation (that is, what is known and what is not known, what to do next, and so on). When there are multiple problems, it is important to group symptoms with their corresponding problem, and to review these relationships frequently. Finally, each entry should be prioritized. This is really the key to staying organized and methodical throughout a complex investigation. Do not attempt to use the table as a historical record of the progress and activities in the investigation. This table will typically be complex enough just keeping track of the current state of things. The timeline (covered next) will contain the historical information needed for most investigations. If you really wish, you can keep an archive of past tables.
    Timeline of events
  • In any investigation that lasts more than a few days, or that involves more than a few individuals, there will invariably be questions about the results of some earlier experiment or trace, where some file was saved, and so on. Keep a written document or log book in which you record a timeline of all major events that occurred during the investigation. The exact format of the timeline and the level of detail might vary between individuals and between different situations, but a timeline will typically contain:
    • One entry for each occurrence of any problem being investigated.
    • One entry for each significant change made to the system (such as software upgrades, reinstalled applications, and so on).
    • One entry for each major diagnostic step taken (such as a test to reproduce the problem or experiment with a solution, a trace, and so on).
    • A precise date and time stamp.
    • A note of the systems (machines, servers, and so on) that were involved.
    • A note of where any diagnostic artifacts (logs, traces, dumps, and so on) were saved.
    Inventory of the diagnostic artifacts
  • Over the course of an investigation, you will end up collecting a large number of diagnostic artifacts and files, such as log files, traces, dumps, and so on. Some of these artifacts could be the result of multiple spontaneous occurrences of the problem over time, and others could be the result of specific experiments conducted to try to solve the problem or to gather additional information. Just as the timeline of events is important to keep track of what happened over time, it is also very important to manage all the various diagnostic artifacts collected during these various events so that you can consult them when you need additional information to pursue a line of investigation. For each artifact, you should be able to tell:
    • Which event in the timeline does it correspond with?
    • Which specific machine or process (in all of the machines and processes involved in a given event) did the artifact come from?
    • What system or process configuration was in effect at the time the artifact was generated?
    • What options were used specifically to generate that artifact, if appropriate (for example, which trace settings)?
    • If the artifact is a log or trace file that could contain entries covering a long period of time, exactly at which time stamp(s) did something noteworthy happen in the overall system, that you might wish to correlate with entries in this log or trace file?
    As for other aspects of the organizational devices described here, there are several equally good ways and formats suitable to maintain this information. One straightforward approach favored by many experienced troubleshooters is to organize all artifacts into subdirectories, with one subdirectory corresponding to each major event from the timeline, and to give each artifact within a directory a meaningful file name that reflects its type and source (machine, process, and so on). When necessary, you could also create a small README file associated with each artifact or directory that provides additional details about the circumstances when that artifact was generated (for example, configuration, options, detailed timestamps, and so on). While it is perfectly acceptable to simply use any manually-created directory structure you wish to organize these artifacts, the Case Manager facility provided in IBM Support Assistant includes several features that help organize artifacts precisely along the principles outlined above.
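
Here is a minimal sketch of the directory-per-event layout described above; the layout, file naming, and README format are assumptions for illustration rather than a prescribed structure.

# Creates one subdirectory per timeline event and appends a README line describing
# each artifact's circumstances (assumed layout, for illustration only).
from pathlib import Path

def store_artifact(case_dir, event_name, source_host, artifact_path, notes=""):
    event_dir = Path(case_dir) / event_name          # e.g. "2014-09-20_crash_prod1"
    event_dir.mkdir(parents=True, exist_ok=True)
    artifact = Path(artifact_path)
    # Meaningful name: <host>_<original file name>
    target = event_dir / f"{source_host}_{artifact.name}"
    target.write_bytes(artifact.read_bytes())
    with open(event_dir / "README.txt", "a") as readme:
        readme.write(f"{target.name}: from {source_host}; {notes}\n")
    return target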
Avoiding tunnel vision
Regardless of the approach you take, it’s important to always keep an open mind for out-of-the-box thinking and theories. Avoid having “tunnel vision,” focusing on one possibility for root cause and attempting to solve only that one cause relentlessly. Failing to see the whole picture and to methodically evaluate all potential solutions can result in prolonged relief and recovery times. Here are some ideas to help you avoid this condition:
  • Utilize checkpoints: During your investigation, take time to create checkpoints for all team members to share all relevant findings since the last checkpoint. With this method, team members can create and maintain creative synergy among themselves and assure that no details are left aside.
  • Work in parallel: When a team member has an alternative theory to the root cause, it can benefit the investigation by enabling one or more team members to work on proving that theory in parallel with the main investigative effort.
  • Regularly (re)ask the “big picture” questions: Keep asking “what is the problem?” and “are we solving the right problem?” One classic example of tunnel vision is thinking that you have reproduced a problem in a test environment only to discover later that it was actually a slight permutation of a production problem. Asking big picture questions can help contextualize problems and avoid chasing those that are of lesser priority.
Peeling the onion: The nature of complex problems
Often problems are relatively straightforward, with a single symptom or a small cluster of symptoms that lead more or less directly to an understanding of a single problem which, when fixed, resolves the entire situation. Sometimes, though, you have to deal with more complex situations, with a series of related symptoms or problems that one-by-one must be “peeled back” to get to the root cause. It is important to understand this concept so that you can address it effectively when conducting a complex investigation. The phenomenon of “peeling the onion” might manifest itself in a few different variations:
  • Multiple problems or symptoms can be linked by a cause-and-effect relationship. For example:
    • A total lack of response from a Web site might be due to overload in a Web server...
    • ...which might itself be due to the slow response of an application server that happens to serve only one type of request on the entire Web site but ties-up excessive resources on the Web server...
    • ... which might itself be due to database access slowdown from another application server, which ties resources needed by this application server...
    • ...which might itself be due to problems on the underlying network that connects the application server to the database...
    • ... and so on, until you get to the so-called root cause of this sequence of problems.
  • Encountering one problem might cause the system to enter a particular error recovery path, and during the execution of that path another problem might manifest itself.
  • In other cases, one problem might simply mask another; the first problem does not let the system proceed past a certain point in its processing of requests, but once that first problem is resolved, you get further into the processing and encounter a second independent problem.
In all these cases, you have no choice but to address one problem at a time, in the order that they are encountered, while observing the operation of the system. Each problem could itself require a lengthy sequence of investigative steps before it is understood. Once one problem is understood, you can proceed to the next problem in the sequence, and so on, peeling away each imaginary onion skin, until you finally reach the core. This method can be very frustrating, especially to those unfamiliar with the troubleshooting process, but effective communication can help keep morale high and build trust in the process by showing concrete progress and minimizing confusion. Maintaining and publishing a clear executive summary helps set the context of the overall situation, helps highlight each specific problem when it has been resolved so that progress is evident, and helps identify new (major) problems as they are discovered. The table of problems, symptoms and actions helps to keep track of the various layers and clarify the relationship between similar problems and symptoms.
Using IBM Guided Activity Assistant
As a complement to the various techniques outlined in this article, you might want to consider using the IBM Guided Activity Assistant to help you conduct your investigation. Its most visible function is to provide information and step-by-step guidance for the specific tasks that should be performed to diagnose a variety of problems. It also embodies many of the general principles presented in this article, helps you characterize your problem, keeps track of the state of the investigation and the various diagnostic artifacts through its Case Manager, and guides you through initial steps that are similar to Phase 1, which also support the information gathering necessary to launch Phase 2.
Summary
Problem determination is about dealing with the unknown and unexpected, and so it will probably never be an exact science -- but it’s also not rocket science. By following the recommendations and techniques outlined in this blog, you can take steps to make your problem determination work more organized, systematic, and, in the end, more effective and rewarding.

Saturday, August 2, 2014

How to minimize server consolidation mistakes!

It's all too easy for even a knowledgeable and experienced IT veteran to make mistakes while managing a complex server-consolidation project. You have to think about everything; it can be a minefield.
Server virtualization projects are usually easy to justify on both financial and operational grounds, but that doesn't make them foolproof to execute. Pitfalls such as inadequate planning, faulty assumptions, or failure to quickly detect post-deployment glitches can entrap consolidation project leaders and team members at almost every stage. Every time, or most of the time, we felt that we had covered every base and that every single thing had been looked at... that's when the danger started.
Avoiding disaster while keeping a complicated consolidation project on schedule and within budget isn't easy. In fact, a few mistakes along the way are inevitable. Things will go wrong: be prepared. On the other hand, planning and learning from others will keep you from making the big and obvious mistakes.
Plan for success - While even the most thorough, painstaking planning can't completely eliminate project mistakes, building a detailed virtualization design and deployment strategy will help minimize the number of errors. Planning is really key for server consolidation.
Thorough planning creates a road map that helps managers gather the knowledge required to avoid most major problems. I think people aren't spending enough time thinking about the issues of the existing workloads, how you migrate those into a virtual environment, and what that means in terms of cost structure, ongoing expense, and high availability. Consolidation planning also needs to address an organization's future needs. Look at what you're going to do in a year, 3 years, and 5 years from today. Servers, software, and other system elements need to be planned with an eye toward anticipated growth. You don't want to get yourself into a situation where you do this whole big upgrade and then find you need more server capacity later on.
I also agree that every consolidation plan needs to address scalability issues. From the standpoint of server virtualization, it's very important to have a system that scales and meets the performance needs of the load you're putting on it. We often hear about organizations that either didn't allocate enough storage or simply didn't correctly anticipate the amount of server power that was going to be needed to facilitate their server consolidation project.
Experts advise that it's extremely common to overestimate the physical-to-virtual consolidation ratio (as per KBC); a rough sizing sketch follows.
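
To see why naive consolidation ratios tend to overshoot, here is a minimal sizing sketch that reserves headroom and lets the tighter of the CPU and RAM constraints win; the capacity figures, vCPU-per-core ratio, headroom factor, and function name are all hypothetical assumptions, not vendor guidance.

# Rough physical-to-virtual sizing: how many "average" VMs fit on one host once
# headroom is reserved. All numbers below are hypothetical.
def vms_per_host(host_cores, host_ram_gb, vm_vcpu, vm_ram_gb,
                 vcpu_per_core=4.0, headroom=0.25):
    usable_cores = host_cores * (1 - headroom)
    usable_ram = host_ram_gb * (1 - headroom)
    by_cpu = (usable_cores * vcpu_per_core) // vm_vcpu
    by_ram = usable_ram // vm_ram_gb
    return int(min(by_cpu, by_ram))   # the tighter constraint wins (often RAM or storage I/O)

# Hypothetical host (16 cores, 128 GB) and an "average" VM (2 vCPU, 8 GB):
print(vms_per_host(16, 128, 2, 8))   # 12 -- fewer than a naive CPU-only estimate suggests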

Wednesday, July 9, 2014

Your biggest Risk is Poor Project Management!

One of our most highly sought-after single-discipline classes within project management is risk management. Top management is hot on it; they feel their project managers are weak in risk management because of a common pattern they witness. Projects are not being completed as expected, and the main culprit, it seems, is unexpected events, unforeseen risks, which plague these projects. Top management believe that if PMs receive training in risk management they will be better able to foresee these events and avoid them. What top management do not fully know is that these unforeseen events are not due to poor risk management skills but to poor fundamental project management skills. This is what happens behind the scenes on a mismanaged project: The project manager gets surprised when he becomes aware that his project's schedule or the quality of the deliverable is missing the mark. In an effort to draw attention away from himself, the PM creates a story based on events that could not have been foreseen and therefore could not have been mitigated. These stories are not total fiction; they often contain a good portion of fact but are told in a manner that supports the project manager's position that he was the victim of these unforeseen events. What probably happened is that the PM did not have a handle on the details of the project's current state. He also did not pay attention to the future work that might be impacted by the current reality. As a result, the current reality was vague and its impact on the future was even more so. In this situation everything becomes a surprise. Symptoms of poor project management display themselves in several specific ways. Here are a few:
  • Not breaking down the deliverables into a detailed set of tasks
  • Not identifying the dependencies between tasks
  • Not knowing what tasks have been fully completed
  • Not knowing how much more remains to be completed on incomplete tasks
  • Not curtailing scope expansion based on the project’s scope statement
  • Not adjusting future estimates based on estimating error trends
  • Not adjusting future work based on current reality and its impact on project completion
  • Not tracking external dependencies
  • Not knowing if the promised availability of partial resources is being met
If we apply the risk management process to the risk above it would go like this: Risk identification tells us poor project management is a potential risk, risk assessment tells us the potential impact is extremely high, and the risk mitigation plan is to ensure PM’s practice good project management fundamentals. This is easier said than done, but still very doable. Five actions must be implemented to ensure PM’s practice good project management fundamentals:
  1. Project managers must receive training in the fundamentals. The training must cover project planning, execution and control, and it must be sufficiently thorough to give them the practical understanding of the concepts.
  2. There must be a standard methodology that project managers can follow. It must be appropriately balanced between planning, execution, and control, and it must be detailed enough to give them a routine to follow that will help overcome the practices of poor project management defined above.
  3. Project managers must have access to software tools that enable them to easily follow the training and methodology provided.
  4. Project managers must deliver management reports on a routine basis that are metric driven and allow management to verify that good project management practices are being followed as well as an accurate status of the project.
  5. Management must support the practice of project management. This is much more involved than you think. It means acquiring a working knowledge of project management fundamentals so you can talk the PM’s language. It also means giving PM’s the time to do their job right and not asking them to take a short cut or disregard portions of the methodology.
Where projects are consistently delivered as expected, the reason is that the five actions above were instituted within the PM's organization and were sustained. Next time a PM plays the victim and blames unforeseen risks, audit him. You will most likely learn that it isn't unforeseen risk that hurt his project; it is poor project management.







Wednesday, June 25, 2014

DATA MIGRATION OUTLOOK

This is more or less related to DC migration; however, it mainly concentrates on data migration. It gave me a clearer understanding of this once-in-a-lifetime activity. The following are the first requirements in this regard:
1.    Clear definition of requirements for all data
2.    Funding constraints
3.    Required expertise in heterogeneous storage  / Server H/W / Network environment / cooling / HVAC
4.    Detail security & availability requirements
5.    Define migration requirements -- which determine success criteria -- including SLAs, expectations for the new storage infrastructure, and objectives such as reduced management cost, reduced storage expenditure, greater insight into expenditure, a simplified vendor model, or greater technical flexibility or stability.
The prospect of data migration can be overwhelming. Some of the common conditions we find among our clients are familiar to many IT leaders:
  • Lack of clear definition of requirements for all data. Data rules should focus on security, availability and recoverability. Documents that mix temporary data and permanent data can be confusing, making it difficult to determine which data is important and which isn't.
  • Distributed islands of data. Often, a business unit will implement a new application and request that the infrastructure for it remain close at hand.
  • Funding constraints. Tight budgets may limit technology decisions and options.
  • Lack of expertise in heterogeneous storage environments. With each vendor's support limited to its own products, incompatibility between storage technologies becomes the problem of the IT manager.

Data migration has much in common with storage consolidation. However, storage consolidation tends to be a better-organized project because it's usually backed by executive sponsorship with specific goals for cost reduction. Data migration, on the other hand, tends to be departmental in scope and limited to tactical objectives, which minimize project size and potential returns. We should focus on the similarities between storage consolidation and data migration, to ensure that the work that's done has strategic value.
  1. Detail security and availability requirements. Sometimes called data classification, this requires the security and infrastructure teams to jointly identify the needs of the IT environment. Data classification describes conditions for data access, retention requirements and security measures such as encryption. Even a very limited set of classifications will have tremendous benefit.
  2. Define migration requirements -- which determine success criteria. These may include new service-level agreements, expectations for the new storage infrastructure, and objectives such as reduced management costs, reduced storage expenditures, greater insight into expenditure, a more simple vendor model or greater technical flexibility or stability.
  3. Survey the IT environment. IT departments often use tools and scripts to do this. But migration requires a complete understanding of the infrastructure technology involved, including the networks and file servers. The location of data is just as important: without sufficient bandwidth capacity to support heavy network access, relocating data to a centralized repository could have adverse effects (see the bandwidth sketch after this list). An internal survey could provide the following:
    • The location of the data, its capacity and growth requirements.
    • Data usage as a measurement of the I/O load.
    • Data criticality rating, which can reveal potential effect on network load and influence retention and availability requirements.
    • Data classification. IT managers can decide what data requires the most expensive storage and stringent protection and what could be restored easily from an archive (backup). They can then make well-informed, strategic decisions about future systems.
    • Management costs for the current environment. This often-overlooked step offers the best opportunity to define benefits of migration.
  4. Design the appropriate consolidation or replacement platform, including technology, management and backup tools, and procedures. Data classification and a good survey reduce the chance of an over-engineered system and contribute to a platform design that will accommodate growth, availability and performance.
  5. As with any IT project, it's important to remember that communication is vital. The actual movement of data from Point A to Point B will affect the organization, and it's crucial to minimize downtime -- especially in situations where data could change even as databases are restored from tape.
    Ideally, standards for data should be defined and communicated in advance of the migration, and personnel should be alerted to changes in the way they will access data. However, costs also need to be communicated so that business managers understand how noncompliance will affect the project.
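
To illustrate the bandwidth point in item 3, here is a minimal sketch that estimates how long a bulk copy would take over a given link; the data size, link speed, and efficiency factor are hypothetical.

# Rough migration-window estimate: time to move a data set over a network link.
# All inputs are hypothetical; real migrations also contend with change rates,
# verification passes, and cutover windows.
def migration_hours(data_tb, link_gbps, efficiency=0.7):
    data_bits = data_tb * 8 * 10**12          # terabytes -> bits (decimal TB)
    effective_bps = link_gbps * 10**9 * efficiency
    return data_bits / effective_bps / 3600

# 50 TB over a 10 Gbps link at ~70% efficiency:
print(f"{migration_hours(50, 10):.1f} hours")   # roughly 16 hours, before retries/verification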

Data Center Relocation Coordination

It is essential to have the IT equipment migration project team assembled and organized early in the design process. As construction drawings are completed and the construction begins, the team should be very hard at work with equipment planning and migration activities.
  • Pre-design relocation project cost estimating.
  • Identification and cataloging of all equipment within IT spaces.
  • Production of project drawings, schedules, meeting documentation and other documentation for distribution to all teams.
  • Team Participation in design and construction meetings. 
  • Monitoring of the building construction schedule against planned events.
  • Coordination of the installation of voice and data circuits from the carriers.
  • Supervision of the physical relocation of all equipment.
That’s all from me. Take good care of your valuable data.





Monday, May 19, 2014

Data Center basics - Definition and Solutions

What is a data center?
Known as the server farm or the computer room, the data center is where the majority of an enterprise's servers and storage are located, operated, maintained and managed. There are four primary components to a data center:
1. White space: This typically refers to the usable raised floor environment measured in square feet (anywhere from a few hundred to a hundred thousand square feet). For data centers that don't use a raised floor environment, the term "white space" may still be used to show usable square footage.
2. Support infrastructure: This refers to the additional space and equipment required to support data center operations — including power transformers, your UPS, generators, computer room air conditioners (CRACs), remote transmission units (RTUs), chillers, air distribution systems, etc. In a high-density, Tier 3 class data center (i.e. a concurrently maintainable facility), this support infrastructure can consume 4-6 times more space than the white space and must be accounted for in data center planning.
3. IT equipment: This includes the racks, cabling, servers, storage, management systems and network gear required to deliver computing services to the organization.
4. Operations: The operations staff assures that the systems (both IT and infrastructure) are properly operated, maintained, upgraded and repaired when necessary. In most companies, there is a division of responsibility between the Technical Operations group in IT and the staff responsible for the facilities support systems.
How are data centers managed?
Operating a data center at peak efficiency and reliability requires the combined efforts of facilities and IT.
IT systems: Servers, storage and network devices must be properly maintained and upgraded. This includes things like operating systems, security patches, applications and system resources (CPU, memory, and storage).
Facilities infrastructure: All the supporting systems in a data center face heavy loads and must be properly maintained to continue operating satisfactorily. These systems include cooling, humidification, air handling, power distribution, backup power generation and much more.
Monitoring: When a device, connection or application fails, it can take down mission-critical operations. Often, one system's failure will cascade to applications on other systems that rely on the failed unit's data or services, so a failure in one compromises the others. Modern applications typically have a high degree of device and connection interdependence. Ensuring maximum uptime therefore requires 24/7 monitoring of the applications, systems and key connections involved in all of an enterprise's various workflows.
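To make the monitoring point concrete, here is a minimal Python sketch of the idea; the host names, ports and dependency map are made-up placeholders (not any particular monitoring product). It checks whether each service answers on its port and, when one is down, lists the dependent services that may also be affected.

    # Minimal availability check you might run from cron or a scheduler.
    # Host names, ports and the dependency map are illustrative placeholders.
    import socket

    SERVICES = {
        "database":   ("db.example.internal", 5432),
        "app-server": ("app.example.internal", 8080),
        "web-front":  ("web.example.internal", 443),
    }
    DEPENDS_ON = {"app-server": ["database"], "web-front": ["app-server"]}

    def is_up(host, port, timeout=3):
        """Return True if a TCP connection to host:port succeeds within the timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    status = {name: is_up(host, port) for name, (host, port) in SERVICES.items()}
    for name, up in status.items():
        if up:
            print(f"UP   {name}")
        else:
            impacted = [s for s, deps in DEPENDS_ON.items() if name in deps]
            print(f"DOWN {name} -> may also impact: {', '.join(impacted) or 'none'}")

A real deployment would of course use a proper monitoring platform, but the principle is the same: check every critical component continuously and map each failure to the workflows it affects.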
Building Management System: For larger data centers, the building management system (BMS) will allow for constant and centralized monitoring of the facility, including temperature, humidity, power and cooling.
The management of IT and data center facilities is often outsourced to third party companies that specialize in the monitoring, maintenance and remediation of systems and facilities on a shared services basis.
What are some top concerns about data centers?
While the data center must provide the resources necessary for the end users and the enterprise's applications, the provisioning and operation of a data center is divided uncomfortably between IT, facilities and finance, each with its own unique perspective and responsibilities.
IT: It is the responsibility of the business's IT group to make decisions regarding what systems and applications are required to support the business's operations. IT will directly manage those aspects of the data center that relate to the IT systems while relying on facilities to provide for the data center's power, cooling, access and physical space.
Facilities: The facilities group is generally responsible for the physical space — for provisioning, operations and maintenance, along with other building assets owned by the company. The facilities group will generally have a good idea of overall data center efficiency and will have an understanding of and access to IT load information and total power consumption.
Finance: The finance group is responsible for aligning near-term and long-term CAPEX (to acquire or upgrade physical assets) and OPEX (to run them) with overall corporate financial operations (balance sheet and cash flow).
Perhaps the biggest challenge confronting these three groups is that, by its very nature, a data center will rarely be operating at or even close to its optimally defined range. With a typical life cycle of 10 years or longer, it is essential that the data center's design remains sufficiently flexible to support increasing power densities and varying degrees of occupancy over a long period of time. This built-in flexibility should apply to power, cooling, space and network connectivity. When a facility is approaching its limits of power, cooling and space, the organization will be confronted by the need to optimize its existing facilities, expand them or establish new ones.
What are some data center measurements and benchmarks and where can I find them?
PUE (Power Usage Effectiveness): Created by members of the Green Grid, PUE is a metric used to determine a data center's energy efficiency. A data center's PUE is arrived at by dividing the amount of power entering it by the power used to run the computer infrastructure within it. Expressed as a ratio, with efficiency improving as the ratio approaches 1, data center PUE typically ranges from about 1.3 (good) to 3.0 (bad), with an average of 2.5 (not so good).
DCiE (Data Center Infrastructure Efficiency): Created by members of the Green Grid, DCiE is another metric used to determine the energy efficiency of a data center, and it is the reciprocal of PUE. It is expressed as a percentage and is calculated by dividing IT equipment power by total facility power. Efficiency improves as the DCiE approaches 100%. A data center's DCiE typically ranges from about 33% (bad) to 77% (good), with an average DCiE of 40% (not so good).
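To make the arithmetic behind these two metrics concrete, here is a small Python sketch; the kW figures are purely illustrative.

    # Worked example of the two efficiency metrics. The kW figures are made up
    # purely for illustration.
    def pue(total_facility_kw, it_equipment_kw):
        """Power Usage Effectiveness: total facility power / IT equipment power."""
        return total_facility_kw / it_equipment_kw

    def dcie(total_facility_kw, it_equipment_kw):
        """Data Center Infrastructure Efficiency: IT power as a % of total power."""
        return 100.0 * it_equipment_kw / total_facility_kw

    total_kw, it_kw = 1000.0, 400.0   # 1,000 kW enters the site, 400 kW reaches the IT gear
    print(f"PUE  = {pue(total_kw, it_kw):.2f}")    # 2.50 -- roughly the industry average
    print(f"DCiE = {dcie(total_kw, it_kw):.0f}%")  # 40%  -- the reciprocal, as a percentage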
LEED Certified: Developed by the U.S. Green Building Council (USGBC), LEED is an internationally recognized green building certification system. It provides third-party verification that a building or community was designed and built using strategies aimed at improving performance across all the metrics that matter most: energy savings, water efficiency, CO2 emission reduction, the quality of the indoor environment, the stewardship of resources and the sensitivity to their impact on the general environment.
The Green Grid: A not-for-profit global consortium of companies, government agencies and educational institutions dedicated to advancing energy efficiency in data centers and business computing ecosystems. The Green Grid does not endorse vendor-specific products or solutions, and instead seeks to provide industry-wide recommendations on best practices, metrics and technologies that will improve overall data center energy efficiencies.
Telecommunications Industry Association (TIA): TIA is the leading trade association representing the global information and communications technology (ICT) industries. It helps develop standards, gives ICT a voice in government, provides market intelligence and certification, and promotes business opportunities and worldwide environmental regulatory compliance. With support from its 600 members, TIA enhances the business environment for companies involved in telecommunications, broadband, mobile wireless, information technology, networks, cable, satellite, unified communications, emergency communications and the greening of technology. TIA is accredited by ANSI.
TIA-942: Published in 2005, the Telecommunications Infrastructure Standards for Data Centers was the first standard to specifically address data center infrastructure and was intended to be used by data center designers early in the building development process. TIA-942 covers:
• Site space and layout
• Cabling infrastructure
• Tiered reliability
• Environmental considerations
Tiered Reliability — The TIA-942 standard for tiered reliability has been adopted by ANSI based on its usefulness in evaluating the general redundancy and availability of a data center design.
Tier 1 — Basic, no redundant components (N): 99.671% availability
• Susceptible to disruptions from planned and unplanned activity
• Single path for power and cooling
• Must be shut down completely to perform preventive maintenance
• Annual downtime of 28.8 hours
Tier 2 — Redundant Components (limited N+1): 99.741% availability
• Less susceptible to disruptions from planned and unplanned activity
• Single path for power and cooling includes redundant components (N+1)
• Includes raised floor, UPS and generator
• Annual downtime of 22.0 hours
Tier 3 — Concurrently Maintainable (N+1): 99.982% availability
• Enables planned activity (such as scheduled preventative maintenance) without disrupting computer hardware operation (unplanned events can still cause disruption)
• Multiple power and cooling paths (one active path), redundant components (N+1)
• Annual downtime of 1.6 hours
Tier 4 — Fault Tolerant (2N+1): 99.995% availability
• Planned activity will not disrupt critical operations and can sustain at least one worst-case unplanned event with no critical load impact
• Multiple active power and cooling paths
• Annual downtime of 0.4 hours
Due to the doubling of infrastructure (and space) over Tier 3 facilities, a Tier 4 facility will cost significantly more to build and operate. Consequently, many organizations prefer to operate at the more economical Tier 3 level as it strikes a reasonable balance between CAPEX, OPEX and availability.
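The availability percentages above translate directly into the quoted annual downtime figures: 8,760 hours in a year multiplied by the unavailable fraction. A quick Python sketch (any small differences from the published figures are just rounding):

    # Convert the tier availability percentages into expected hours of downtime
    # per year (8,760 hours). Small differences from the figures quoted above
    # are just rounding in the published numbers.
    HOURS_PER_YEAR = 365 * 24  # 8,760

    def annual_downtime_hours(availability_pct):
        """Expected downtime per year for a given availability percentage."""
        return HOURS_PER_YEAR * (1 - availability_pct / 100.0)

    for tier, availability in [("Tier 1", 99.671), ("Tier 2", 99.741),
                               ("Tier 3", 99.982), ("Tier 4", 99.995)]:
        print(f"{tier}: {availability}% available -> "
              f"{annual_downtime_hours(availability):.1f} hours of downtime/year")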
Uptime Institute: This is a for-profit organization formed to achieve consistency in the data center industry. The Uptime Institute provides education, publications, consulting and research, and stages conferences for the enterprise data center industry. The Uptime Institute is one example of a company that has adopted the TIA-942 tier rating standard as a framework for formal data center certification. However, it is important to remember that a data center does not need to be certified by the Uptime Institute in order to be compliant with TIA-942.
What should I consider when moving my data center?
When a facility can no longer be optimized to provide sufficient power and cooling — or it can't be modified to meet evolving space and reliability requirements — then you're going to have to move. Successful data center relocation requires careful end-to-end planning.
Site selection: A site suitability analysis should be conducted prior to leasing or building a new data center. There are many factors to consider when choosing a site. For example, the data center should be located far from anyplace where a natural disaster — floods, earthquakes and hurricanes — could occur. As part of risk mitigation, locations near major highways and aircraft flight corridors should be avoided. The site should be on high ground, and it should be protected. It should have multiple, fully diverse fiber connections to network service providers. There should be redundant, ample power for long term needs. The list can go on and on.
Moving: Substantial planning is required at both the old and the new facility before the actual data center relocation can begin. Rack planning, application dependency mapping, service provisioning, asset verification, transition plans, test plans and vendor coordination are just some of the factors that go into data center transition planning.
If you are moving several hundred servers, the relocation may be spread over many days. If this is the case, you will need to define logical move bundles so that interdependent applications and services move together and you can stay in operation up to the day on which the move is completed.
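One way to derive those move bundles, assuming you have already documented which applications depend on which, is to treat the dependencies as an undirected graph and take its connected components. A minimal Python sketch; the application names and links are placeholders.

    # Derive "move bundles" by grouping interdependent applications together:
    # treat each documented dependency as an undirected link and take the
    # connected components. Application names and links are placeholders.
    from collections import defaultdict

    DEPENDENCIES = [                 # (application, application it depends on)
        ("web-portal", "crm"),
        ("crm", "oracle-db"),
        ("reporting", "oracle-db"),
        ("payroll", "hr-db"),
    ]

    graph = defaultdict(set)
    for a, b in DEPENDENCIES:
        graph[a].add(b)
        graph[b].add(a)

    seen, bundles = set(), []
    for app in list(graph):
        if app in seen:
            continue
        stack, bundle = [app], set()         # depth-first walk of one component
        while stack:
            node = stack.pop()
            if node in bundle:
                continue
            bundle.add(node)
            stack.extend(graph[node] - bundle)
        seen |= bundle
        bundles.append(sorted(bundle))

    for i, bundle in enumerate(bundles, 1):
        print(f"Move bundle {i}: {', '.join(bundle)}")

In practice the dependency list would come from your application dependency mapping exercise rather than a hand-written table, but the grouping logic is the same.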
On move day, everything must go like clockwork to avoid downtime. Real-time visibility into move execution, through a war room or a web-based dashboard, will allow you to monitor the progress of the move and be alerted to potential delays that require immediate action or remediation.

Wednesday, April 16, 2014

Data Center Migration Do's & Don'ts

For any and every organization, data center migration can be a bit tricky. But that's especially true for the do-it-yourselfers at small-to-medium sized businesses (SMBs); they often don't have the resources to involve high-priced specialists. Instead they need to do the best with the resources they have.
The common challenge in all such moves and migrations is the unknown. "That's why it is so important to plan, plan, plan and then plan some more." In one worst case, my friend Manu said the bankruptcy of a co-location facility forced an unplanned shift of one of his data centers within a matter of weeks (three weeks, in fact). Thanks to good luck, his organization immediately identified an alternate location and the move succeeded. However, he noted, "It took three months before things were really back to normal." Most data center migrations, by contrast, are the result of more leisurely and thorough planning, as well as the ability to prepare a whole new facility and then simply move the resources, which minimizes the impact and reduces the possibility of outages.
As Manu suggested, the first step should always be to get clear in your mind about your bandwidth needs and then talk to your ISP / network suppliers. "Moving network devices might take as little as a few weeks, but for higher capacity systems you might need three or four months of lead time," he said.
By the way, do you know how a specialist works on a DC migration project? No guessing needed; I have prepared a data center migration checklist that will help you approach the project like a specialist. The list includes the following steps:
a. Review support contracts to make sure they will still be valid in the new location, after H/W migration
b. Determine who will do the physical shifting and reconfiguration: your own staff or a consultant. Collect and classify every bit of detail
c. Review service-level agreements (SLA) with outside customers (if any), particularly with regard to liability for any customer supplied or customer owned equipment
d. Document and label everything. If possible, consider simplifying your infrastructure well in advance of a move
e. Document all application/system interdependencies and plan the sequence of the move accordingly (see the sketch after this list)
f. If possible, set up a duplicate facility in the new location rather than moving and setting up the same equipment again
g. Be aware of and try to eliminate any single points of failure in both old and new locations
h. Develop and use redundancy. For example, a disaster recovery (DR) location where you can keep critical apps and provide failover
i. Consider moving to storage or server virtualization -- or both. One could set up a virtual platform including storage, virtualize the physical servers and move the virtual platform to the new location while keeping the physical setup at the old location until the virtual platform is up at the new site
j. Back up everything in preparation for the move. Hard drives are notorious for failing when they have been shut down, cooled, moved and restarted
k. You need to identify your most critical applications and have contingency plans in place; it's just like preparing a disaster recovery and business continuity (BC) plan
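For item (e), once the interdependencies are documented, the move sequence can be derived mechanically: a system is relocated only after everything it depends on is already up at the new site. Here is a minimal sketch using Python's standard-library topological sort (Python 3.9+); the system names and dependencies are placeholders.

    # Item (e): turn documented interdependencies into a move sequence, so a
    # system is relocated only after everything it depends on is already up
    # at the new site. Requires Python 3.9+; names are placeholders.
    from graphlib import TopologicalSorter

    deps = {                              # system -> systems it depends on
        "web-portal":    {"crm", "dns"},
        "crm":           {"oracle-db"},
        "reporting":     {"oracle-db"},
        "oracle-db":     {"storage-array"},
        "dns":           set(),
        "storage-array": set(),
    }

    for step, system in enumerate(TopologicalSorter(deps).static_order(), 1):
        print(f"Step {step}: relocate and verify {system}")

In a real project you would feed this from your asset inventory or CMDB export rather than a hand-written dictionary, but the ordering principle is the same.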
As per my understanding, the planning process should be divided into two phases. The first phase is to determine what is wanted from the new data center and to ensure that a chosen location can meet certain requirements (i.e., space, power, security, other facilities, growth capacity and location). The second phase is to execute the move.
Five years ago I moved our two racks' worth of kit between floors in the same data center at VSNL Mumbai, and even then it was very hard to get all the communications suppliers / colo server owners to turn up on the same Saturday to move everything at once.
Of course, migrations aren't always smooth, which raises the question of whether to do it yourself or hire someone else. As per my understanding, it depends on many factors. If you hire an outside consultant, be sure they have experience in this type of work and be sure you know how equipment will be transported and by whom.
For his part, Manu said hiring outside help may be a must if you have a small staff. However, he said, at a minimum, "You should make sure preparing for the move and coordinating it is the full-time responsibility of at least one person." Otherwise, you are looking at trouble. "Moving is a challenge but if you properly plan and prepare, it's manageable," he added.
For those embarking on a data center move, there are a few tools that may help. For example, data migration can be assisted by products such as:
a. Brocade's Data Migration Manager (DMM)
b. EMC Corp.'s Rainfinity (a NAS virtualization appliance that can be deployed for migrating data)
c. Hitachi Data Systems' Universal Replicator (as well as the open-source rsync)
d. Shunra Software's Shunra VE, a tool that helps predict the impact of moving or reconfiguring a data center on application performance, backups and other aspects of operations.
That's all from me. All the very best for your endeavor to migrate your data center.

Tuesday, March 11, 2014

2 Lessons on risk

2 Lessons for oil industry from aeronautic industry on risk management....
Risk is an unavoidable, elusive, and ultimately unconquerable opponent. And there's a strong argument to be made, looking at recent disasters (e.g., the Gulf oil leak), that perhaps we shouldn't be opening up the earth in places too remote to close it up again. But in any event, the oil industry would do well to take two lessons from the aeronautics community.
1st: anything mechanical can and will break -- and that includes back-up systems. Count on it, and
2nd: expect the unexpected. And plan accordingly.

Saturday, February 22, 2014

Ideas for arguing your point effectively in IT meetings

If you're in IT management, improving your communication skills is probably the best investment of time and effort you can make. Most business executives will tell you that even senior IT leaders tend to falter when it comes to putting their business strategy into a common business language. IT has a well-known tendency to drown its audience in a sea of technical jargon. And when no one challenges any of the statements, it can look like agreement. However, a lack of challenge is less a sign of agreement than of an apprehension to ask for clarification. It takes someone with a healthy and brave heart to ask for clarification in front of the group, as many see it as an admission of one's ignorance. It is no surprise, then, that an IT leader finds after a meeting that her recommendations, seemingly well received by the audience, have gone unheeded. Sometimes it's just a blow to the ego, but most in this field have a story of a magnificent and expensive failure that resulted from IT advice that had been ignored. The usual suspects that lead to such failures deserve a discussion of their own. Today, I want to explore the topic further and offer some advice on getting your point across that most people, irrespective of seniority, will find useful. This is not a complete list or an ultimate guide but a collection of ideas intended to provide a good return on the time invested to read it. So, here we go…
  1. Realize that a dialogue should not be about you. Accept the fact that you just might be wrong, and treat the opposition with respect.
  2. There are two parts to every argument: A position and a bunch of points that support it. Always separate them and be clear on them both.
  3. Never accept an argument that you don’t understand. Ask for clarification.
  4. To each decision, there are objectives (what we want to achieve) and alternatives (how we can achieve it).
  5. Choose the language both you and your opponent understand.
  6. When you make your point, nothing is as effective as the masterful command of the language and use of relevant examples.
  7. Your opponent might pass off his beliefs and opinions as unquestionable truth. Also, be ready for personal attacks (when your opponent targets your persona and not your argument).
  8. Watch out for arguments that say that something is right just because it is either new or old.
  9. Don’t fall for arguments that rely on wide acceptance and popularity. What’s right for many is not necessarily right for you, even if the others are in the same industry, market, or building.
  10. Beware of straw-man attacks, where your opposition objects not to your position but to a similar, much weaker, and sometimes ridiculous one.
  11. Watch for arguments with little to no connection to the issue at stake, which are introduced to misdirect the attention of you and the rest of the audience.
  12. Sometimes you may lose on the basis of unobtainable perfection. Your way may be the best available but not perfect. When you feel that the conversation has fallen into this rut, call a spade a spade.
You have probably noticed that in a number of points I advised you to "watch out", "beware of" or "be ready" against various kinds of attacks. The Golden Rule applies.






Saturday, January 4, 2014

Risk Management–Lessons Learned from the Titanic

(Taken from a PCSI review report.) The Titanic was one of the greatest maritime disasters of all time and still continues to fascinate us today even though almost 100 years have passed since the tragedy. One of the facts that makes the tragedy significant is the number of lives that were lost: over 1,500 souls perished in the disaster! The story of the Titanic continues to fascinate us to this day, as witnessed by the 1997 movie Titanic. The movie had one of the largest budgets of any movie to that date, and its success revived interest in the story. That level of interest can be gauged by the amount of information available on the internet. There are scores of YouTube posts, blogs, tweets, you name it. Add to this the books, magazine articles, specials, documentaries and movies and you can understand why the story of the Titanic is so widely known. I intend to add to the store of comment on this story with this article. I am especially interested in the business of project management, and particularly risk management, and I believe that the Titanic story provides some lessons learned which may improve our performance in the risk management area.
Background
For those not familiar with the story (is there anyone left who isn't?), let me set the background for you. It is important to view this story in the context of the times it took place in, otherwise learning becomes impossible. Viewing the events from a 2010 perspective gives us access to information that wasn't available at the time; 20/20 hindsight is easily obtainable but isn't very helpful in averting future risk events. The time is the turn of the century, around 1900. Travel from Europe to North America is still exclusively by ship, and travel back and forth is extensive. The rich travel via ocean liner from North America to vacation in Europe, rich Europeans travel to America for vacations, and hundreds of thousands of Europeans emigrate to North America, all by steamship. The business was as fiercely competitive for steamship lines as it is for airlines today. The White Star Line and the Cunard Line were two of the key competitors in the trade, both based in England. Competition for passengers drove passage prices down, so the number of passengers a liner could carry and the speed with which it could make the crossing were two crucial factors in making a profit for her owners. Crossing speed became a rivalry between the two companies, with the prestige of the record for fastest crossing being used as a marketing tool. The British government sought to encourage this rivalry by awarding the "Blue Riband" to the holder of the crossing record; they did this to enhance their prestige as a seafaring nation. The Cunard Line held the Blue Riband for 22 years, their ship the RMS Mauretania having made the fastest crossing in 1907. The RMS Mauretania was not only the fastest passenger ship on the Atlantic run, she was also one of the biggest and most luxurious. White Star were desperate to build a liner which could compete with the RMS Mauretania and claim the Blue Riband. They approached the shipbuilders Harland and Wolff in Belfast, Ireland to build them a ship which could take the record and at the same time be luxurious enough to attract the very richest and most demanding of passengers. They also wanted a large enough vessel to accommodate many second and third class passengers; these latter were where the profits lay because of their numbers. The RMS Titanic was commissioned in 1909 and construction started on March 31 of that year. The White Star Line not only wanted the ship to be the fastest, largest, and most luxurious on the Atlantic route, they also wanted her to be the safest. In fact, they wanted to be able to claim the ship was unsinkable.
Ship Building in 1900
Ships were built of steel in those days, but the steel they were built of and the way the steel was fabricated into a hull were very different. Steel then had a much higher sulphur content, which tends to make the steel more brittle. This brittleness increases as the ambient temperature drops, and the temperature of the water in the North Atlantic can be very cold, cold enough for icebergs. Steel plates are fashioned into hulls by welding them together today, but in 1900 the technology available to shipbuilders was riveting. Riveting can be effective at holding the plates together but is nowhere near as strong as welding. This is an important safety factor, as collisions are much more likely to cause "holing" of the hull. There were several competing factors to be considered when Harland and Wolff designed the Titanic. Speed was important; speed would improve the business case for building the RMS Titanic (and the two sister ships which were to follow). Speed is a result of several factors: the length of the hull, the width of the hull (or beam), the weight of the ship, and the power of the engines and propellers. The longer a ship's waterline, the faster she will go. A broad-beamed, heavy ship will be slower than a narrower, lighter one, and engines and propellers must be matched to the hull to achieve maximum speed. Propulsion was by means of coal-fired steam boilers on these ships, and bigger, more powerful engines required more room for boilers. Not only would the boilers have to be larger, more room was required to store the extra coal needed. On ships such as the Titanic, space is money. The White Star Line believed that their ship should have sufficient capacity to carry 3,500 passengers of all classes. They wanted to be able to carry 550 first class passengers, 450 second class passengers and 2,500 third class passengers, and they had to be accommodated in luxury. The White Star Line wanted their first class passengers to be treated to a level of luxury not found on any of the competition's vessels, the second class passengers to enjoy conditions equivalent to first class passengers on other liners, and third class passengers to enjoy second class passenger comfort. The primary means of providing luxury accommodations would be space. The more space each cabin had, the fewer total cabins the ship would be able to hold (space was finite; the total length overall of the ship would be 890 feet). Harland and Wolff had to design a ship that would maximize luxury and speed while minimizing weight. Another conflict White Star and Harland and Wolff contended with was the need for room aboard ship for lifeboats. The lifeboats and the davits which launch them must all be carried aboard ship, and these take up room; the more lifeboats carried, the less room for paying customers.
Safety Considerations
The White Star Line was safety conscious. Prior to the Titanic, safety was pretty much confined to avoiding collisions that would cause hull damage and, failing that, the safe extraction of passengers into seaworthy lifeboats. Harland and Wolff came up with several new safety features which would make the Titanic safer than her competition. The first of these features was the introduction of bulkheads in the hull, creating watertight compartments that could be closed by means of electric motors; the proposed design provided for a total of 16 bulkheads, and any 2 of the resulting compartments could be flooded without the ship sinking. Operation of the doors would be automatic or controlled manually from the bridge. The second was an inner hull which would protect the ship should the outer hull be breached. Harland and Wolff recommended using these as safety features, but their cost was considered extremely high, which is why the competition had not seen fit to incorporate them into their designs. A third safety feature was the use of a wireless telegraph which could be used to advise the ship of weather conditions and hazards to navigation. This safety feature could also be used as a benefit to first class passengers: the ship's radio operator could transmit messages from first class passengers to destinations in North America via the radio station White Star used to handle its radio communications. This was another luxury feature that could be used to promote the Titanic and attract first class passengers. The Titanic was meant to travel the North Atlantic from Southampton, England to New York City, USA. The shortest route is the northern route; however, the North Atlantic is prone to icebergs in the winter and into the early spring. The alternate route takes ships further south to avoid the icebergs; this approach avoids icebergs but adds time to the trip. Lifeboats were the last resort and would only be used if and when all other safety measures failed. Regulations at the time had not kept pace with shipbuilding technology, and passenger ships were required to carry only 16 lifeboats regardless of the number of passengers. Lifeboats weren't seen as a primary concern at this point, as any emphasis on them would detract from the company's marketing of the ship as the safest in existence.
The Disaster
The Titanic set out from Southampton, England on April 10th, 1912. Departure was delayed by half an hour when the SS New York, docked nearby, was torn from her moorings by the propeller wash from the Titanic. She made 2 stops before her final departure for New York, at Cherbourg, France, and Queenstown, Ireland, where she picked up more passengers. She left Queenstown for New York with 2,240 passengers and crew aboard on April 11th. She sailed the most direct route to New York, through the North Atlantic, and made good time until she reached a point about 400 miles south of the Grand Banks off the coast of Newfoundland. The Titanic was equipped with a telegraph, a relatively recent addition to ships, and had been advised that there were icebergs in the vicinity. One of the problems the ship's crew faced was the demand for the use of the ship's telegraph; keep in mind the Titanic carried some of the most influential business people (and socialites) in the world, so there was a constant demand to send and receive telegrams. At one point in the evening, the operator became so frustrated with the traffic that he told an operator aboard another vessel trying to warn him of the icebergs to "shut up". Because priority for sending and receiving telegrams was given to the passengers, the warnings about the icebergs in the area never reached the bridge. The Titanic maintained her speed of over 20 knots (approx. 23 mph) through that night. The captain took precautions against a collision with icebergs: he posted a watch in the "crow's nest", an observation platform above the bridge of the ship. Neither of the crew posted to the crow's nest was able to locate the binoculars which were supposed to increase their range. Under normal conditions the watch would have spotted any icebergs in plenty of time to avoid a collision, but on this night the lookout spotted an iceberg only when the Titanic was already very close to it. The iceberg the lookout spotted had "turtled", that is, it had turned over so that the white snowy surface one normally associates with an iceberg was replaced with a dark opaque surface which did not reflect light. Also contributing to poor visibility was the fact that there were no waves breaking on the iceberg. This combination of events caused the lookout to telephone the bridge with the warning "iceberg dead ahead" when the Titanic had very little time to react. One measure the skipper did not take was to slow the ship down; the Titanic was making 22 knots (approx. 25 mph) that night. There are conflicting accounts of the sequence of events in the next several seconds. Steering in those days still followed the convention of tiller steering: when the tiller is moved to the left, the boat steers to the right and vice versa. When an officer wanted to give a command to steer the boat sharply to the left he would give a "hard-a-starboard" order. Some accounts have the First Officer giving an order for "hard-a-starboard", by which he meant to turn the ship to the left (or port), and the helmsman panicking and steering the ship to the right instead. Others have the helmsman steering the boat to the left as directed. Regardless of whether the helmsman mistakenly steered the ship to the right, or starboard, initially, he did bring the wheel around to turn the ship to the left, or port. At the same time the order to change course was given, the order to reverse engines was given to slow the ship down. Reversing the engines will slow a ship, but it also creates turbulence around the rudder.
Remember that when the Titanic left her dock in Southampton, the turbulence her props created was sufficient to break another ship loose from her moorings. The effect on the Titanic in this instance would have been to make the rudder less efficient. A ship's rudder relies on the flow of water passing over it to create force, and reversing engines would have lessened this force. Whether it would have been possible for the Titanic to avoid that iceberg we can't know. What we do know is that the rudder did eventually "bite" and the ship slowly began to alter course to port. The course change was just enough that the Titanic scraped the iceberg with her right side at 11:40 pm on Sunday evening, April 14, 1912. The force of this collision was sufficient to open a long tear down the side that stretched to nearly a third of the ship's length. Experts have speculated that if the ship had not attempted to change course she might have survived the impact with the iceberg, because a head-on collision probably would not have flooded more than the most forward compartment. Expecting the officers and crew on the bridge to deliberately hit the iceberg is not realistic; it would be a bit like the advice to hit the deer that wandered onto the road rather than swerve into another car or off the road. It is great advice but a little difficult to follow in the heat of the moment. Think how difficult it would be to deliberately steer a ship carrying 2,200 passengers and crew into an iceberg! The ship began to flood, and the closure of the doors in order to seal off the flooding compartments was of no help, since those compartments were open to the sea. Once Captain Smith became aware of the collision he went up to the bridge to take command. He sent a fourth officer to inspect the damage, and this officer initially reported that there was no serious damage. Shortly after that, however, reports started coming in to the bridge that flooding was taking place on a massive scale and that the ship was sinking. Smith's first reaction was to have an SOS telegraphed to summon anyone in the area to his aid. Keep in mind that the Titanic was in the middle of an ice field which other vessels had warned her away from; he must have known that help would take a long time to arrive and that anyone immediately steaming to his aid would be taking a tremendous risk of colliding with an iceberg. By 12:30 am on Monday the Titanic was noticeably down at the bow. Now an additional problem evidenced itself. Although the Titanic was equipped with watertight doors to seal the individual compartments, the compartments were not sealed at the top, so when a compartment was completely flooded, water would begin to flow into the next compartment over the top of the bulkheads, much like the water in an ice cube tray when you tilt it. The Titanic's officers now began the process of getting passengers off the ship. Although the Titanic had developed a list, it was not sufficient to communicate a sense of urgency to the passengers. The officers had a difficult time filling the lifeboats with reluctant passengers who would rather go back to their beds, or to the smoking room or gymnasium, than leave the ship in a lifeboat. Reluctance was so prevalent that some of the first boats were launched less than half full. The lifeboats had a capacity of 65 adults; the first lifeboat to launch on the starboard side contained only 12 people. The Titanic was sinking.
There were insufficient lifeboats to accommodate all the passengers, so every lifeboat spot left vacant meant another passenger was going to die! At 12:45 am the crew fired the first distress rockets to attract the attention of any other vessels in the area (some thought they saw a ship's lights some 5 miles distant). This had the immediate effect of communicating the sense of emergency to the crowd which had been lacking up to this point. The Titanic was laid out with different areas for 1st, 2nd, and 3rd class passengers, with the 1st and 2nd classes having access to the deck where the lifeboats were being launched. These passengers began to fill the lifeboats now. Generally, the policy of "women and children first" was being followed, with some exceptions, but this did not apply to the 3rd class passengers. 3rd class passengers were prevented from making their way up to the boat deck by the crew of the Titanic. The ship continued to fill with water more and more rapidly. The bow began to disappear below the water, and as the bow began its descent to the bottom it brought the stern up out of the water as the ship began to assume a more vertical attitude. Out of the total complement of over 2,200 passengers and crew, fewer than 700 escaped in lifeboats. The last of the lifeboats pulled away from the Titanic at 2:00 am. The 1,500 passengers who remained, including the 3rd class passengers, made their way to the stern, or fantail, to await their fate. A wave began rolling up the ship as she sank, and this plucked some of the passengers off the fantail and drowned them. The rest were all in the water by 2:20 that morning, when the Titanic sank. The water in that part of the Atlantic at that time of year would be barely above freezing, and at that temperature survival is only possible for a matter of minutes. Those that did not drown on board the Titanic died of hypothermia shortly after going into the water. No lifeboats returned to help any of these passengers. The passengers and crew who had escaped in the lifeboats were rescued later that morning by the Carpathia and brought to New York City. The tragedy had claimed over 1,500 lives.
Causes
The commonly held view that the Titanic was "unsinkable" runs like a thread through all the events that took place that April. From the decision to provision insufficient lifeboats to accommodate all passengers and crew, to the reluctance of the passengers to take to the lifeboats after the iceberg had been struck, the belief that the Titanic was unsinkable influenced everyone's thinking in some way. Marketing the ship as unsinkable was undoubtedly good salesmanship but had disastrous consequences. The Titanic actually carried more lifeboats than she was obliged to carry by the maritime authorities. The rules governing the number of lifeboats a ship was obliged to carry had not been updated to keep pace with the increase in size and capacity of these ships. The result was that a need for opulence, the existing rules regarding lifeboat capacity, a desire to cut costs, and the belief that the Titanic was unsinkable persuaded her owners to under-equip her. She carried only 20 lifeboats: 14 regular lifeboats with a capacity of 65 people each, 2 emergency cutters with a capacity of 40 adults each, and 4 collapsible lifeboats with a capacity of 47 adults each. These last were never launched. There seem to have been several breakdowns of communication on board. There were a number of warnings from other ships that there were icebergs in the path of the Titanic. Some of these were taken to the bridge and posted on a bulletin board there. The last warning actually provided information about an area of icebergs identified by latitude and longitude; the Titanic had already proceeded into that area by the time of the transmission. This last transmission was never relayed to the bridge. The telegraph operator was so overwhelmed by messages to and from the ship's passengers that he lost his temper and told the other ship's operator to "shut up". Even though this last message was not communicated to the bridge, the bridge had to have been aware that icebergs were in the area from the previous messages. Smith would have been aware of the risk even without telegrams; this area of the Atlantic always has icebergs at that time of the year. The Titanic was travelling at over 25 mph in an area that contained icebergs, many of which could sink the Titanic. Icebergs come in all shapes and sizes, but where there are large numbers, some will always be large enough to cause major damage to the largest ship. This speed meant that there simply was not enough time, after an iceberg had been spotted, to avert a collision. The captain is ultimately responsible for the vessel under his command and all the passengers and crew, but Smith can be excused to some extent by the pressure applied to make a speedy passage. He was not pushed to set a record, but if he had avoided the ice field or slowed to avoid collision, it is likely that his passage would have been longer than the competition's. There was lifeboat capacity for at least half the people on board the Titanic, probably a lot more, considering the number of women and children and the calm conditions. Fewer than 700 people actually made it into lifeboats, and a number of these were men. The policy aboard that ship, and any other of the time, was that women and children were to be rescued before men. This policy was defeated aboard the Titanic when members of the crew prevented 3rd class passengers, including all 3rd class women and children, from getting to the lifeboats. The Titanic never had a shakedown cruise or lifeboat drills before setting out on her maiden voyage.
This lack of familiarity with the ship was what one of the lookouts blamed for his inability to locate the ship's binoculars.
Lessons Learned
Maritime regulations were changed to require a lifeboat seat for everyone aboard a ship, and regular lifeboat drills to ensure everyone aboard knows what is expected of them and that the crew is able to fill the lifeboats and launch them efficiently. Regulations were changed to prioritize distress calls, and later required an alarm system to be installed that would automatically be triggered by a distress call. The Titanic began sending out SOS's as soon as Smith realized she was sinking, but no one responded; most operators had gone to bed because of the hour. Maritime authorities established ice patrols to gather information about the ice in the North Atlantic. Up to that point, shipping relied on visual observation and warnings from other ships to advise them when there was ice in their area. The ice patrols spotted icebergs, plotted their drift and advised all shipping of ice conditions. Water flooded the entire length of the Titanic because her watertight compartments were not sealed at the top and allowed water to overflow into the next compartment. Titanic's sister ship, the Britannic (originally called the Gigantic), was equipped with these improved watertight compartments. She was hit with a torpedo in 1915 and sank in 90 minutes, approximately the time it took the Titanic to sink. These lessons were effective in preventing further accidents of this nature, due to this cause; there has never been a maritime disaster of this kind since that has claimed that many lives. There may be some lessons for the rest of us in the Titanic disaster. Let's start with the problem caused by the boast of the "unsinkable" Titanic and the desire to make the passage to New York as quickly as possible. Marketing and speed spoke directly to White Star's bottom line. There are plenty of organizations that are ready to pay lip service to safety and corporate social responsibility; the ones that are able to demonstrate success are those that put their money where their mouth is. A company, group, or individual that treats these issues seriously will be willing to allocate money to address them. The money should prove to be a sound investment in the long run, because disasters such as the Titanic are immensely expensive, as the folks at BP are finding out in the Gulf of Mexico. Fortunately, the level of class distinction that existed in 1912 is largely gone today; however, it has not completely disappeared. Air carriers still provide the option of business class or first class to passengers. Those aboard the Titanic unfortunate enough to find themselves in 3rd class stood no chance of rescue, thanks to crew who took it upon themselves to keep them from the boat deck so that 1st and 2nd class passengers could access the lifeboats. Communication failed the Titanic in several specific ways, most chillingly the lack of response to her SOS's. Communication must be two-way in order to be successful. We have to have an attentive audience and then verify that they have heard/read/seen the message and understand it. In the case of the SOS signals, the tragedy brought about a change in the way telegrams were prioritized and required a 24-hour watch for distress signals. The relative newness of the telegraph was a contributing factor to the lack of communication. Always remember that the promise of new technology can never be taken for granted until it has been demonstrated and the users of the technology have been fully trained.
The crew of the Titanic had to walk a fine line between communicating the urgency of the situation to the passengers and not panicking them. This they failed to do: they failed to communicate the danger to the passengers, and the result was that many died who could have been saved by lifeboats. The contingency plan that every vessel that carries people relies on is the lifeboat. The Titanic failed to carry adequate lifeboats for the number aboard her, but the crew also failed to take advantage of all the spots available on the lifeboats. This was due partly to a failure on the passengers' part to act and partly to the unfamiliarity of the crew with the handling of the lifeboats. Rollback strategies and contingency plans should always be tested whenever possible, to verify their effectiveness and to familiarize the team with them, so that when the time comes that a contingency plan must be implemented, implementation will go smoothly and there will be no surprises afterwards. The final lesson to learn is that no ship, airplane, vehicle, company, or system is unsinkable. Thinking that way inhibits our ability to spot risks and mitigate them. We should never make assumptions based on marketing hype (or our own confidence in our work) that prevent us from identifying possible risk events. Remember that no matter how unlikely the risk event, the impact could be such that spending on a mitigation strategy is the only rational decision.