Modular power protection in industrial applications – understanding the “ilities”
Modular power protection and conversion technology, particularly in the form of UPSs, has long been used in commercial applications, but take-up in industrial applications has, to date, been relatively slow.
This relatively slow uptake is due, in part, to a limited understanding of the “ilities” (“Availability”, “Reliability”, “Scalability”, “Flexibility” and “Maintainability”) commonly associated with modular technology and how the various “ilities” complement each other.
The second of five articles within our 'ility' series focuses on availability and reliability. We will review the technical and mathematical background and explain how both terms relate to the increasingly popular modular technology.
Availability v Reliability
Before discussing the similarities and differences between availability and reliability, we need to define them:
- Reliability is the probability that a system will not fail.
- Availability is the probability that a system is operating when required.
These two definitions appear very similar, and the two are related, but they are not the same, and it is the difference between them that creates most of the confusion surrounding their usage.
Although it may seem counter-intuitive, reliability is not the most important factor in power protection system design. Power protection systems must be available every second of every day, so maximising system availability is the overriding objective of any power protection system design. Which technology is used, and in what configuration, will dramatically affect system availability.
Every mechanical or electrical system ever invented will, if operated long enough, probably fail at some time. This probability is known as the system’s failure rate and in reliability engineering is shown as λ (Lambda).
If λ is the probability that a system will fail, the probability that a system will not fail is 1 − λ. Because every system will probably fail at some time, λ can never be 0% and a system’s reliability can never be 100%.
As percentage probabilities are more difficult (for most of us) to comprehend than time, it is more common to express a system’s reliability as the average number of hours it takes to fail. This measure of reliability is referred to as Mean Time Between Failures (MTBF), therefore:
MTBF = 1/λ
But using MTBF figures alone to estimate how long a system is likely to operate without failure can be misleading. For example, the 2013 actuarial “actual life tables” stated that a 30 year old male (i.e. system) had a 0.1467% probability of dying (i.e. failing) within 1 year. Expressing this “failure rate” (λ) as a fraction (0.001467 per year) and applying it to the above MTBF equation gives
MTBF = 1/λ = 1/0.001467 ≈ 682 years
The fact that no 30 year old male currently alive can expect to live (i.e. not fail) for 682 years shows how reliability statistics can be misleading when used in isolation: the calculation assumes that the failure rate stays constant, whereas in reality it rises with age.
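The arithmetic behind the life-table example can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and it assumes, as the MTBF equation does, that the failure rate stays constant over time.

```python
# Failure rate from the life-table example, quoted as a percentage per year.
failure_rate_percent = 0.1467

# Convert the percentage to a fraction before applying MTBF = 1 / lambda.
failure_rate = failure_rate_percent / 100.0

# Assumption: lambda is constant over the whole period, which is exactly
# what makes the result unrealistic for a human lifetime.
mtbf_years = 1.0 / failure_rate

print(f"lambda = {failure_rate:.6f} per year")
print(f"MTBF   = {mtbf_years:.1f} years")
```

Dividing by the percentage value (0.1467) instead of the fraction (0.001467) would give an answer 100 times too small, so the unit conversion is worth making explicit.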
If availability is the probability that a system is operating when required we must also consider how long it takes to return the system to full operation after it has failed i.e. how long it takes to fully repair the system.
This “repair time” is typically referred to as the Mean Time To Repair (MTTR) and gives us the following availability equation:
System Availability = MTBF/(MTBF + MTTR)
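As a minimal sketch, the availability equation can be written as a small Python function. The function name and the MTBF and MTTR figures used below are hypothetical, chosen only to show how the formula behaves.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """System availability = MTBF / (MTBF + MTTR), as a fraction of 1."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, for illustration only.
zero_repair = availability(100_000, 0)      # no repair time at all
slow_repair = availability(100_000, 10)     # same MTBF, 10 hour repair
fast_repair = availability(100_000, 5)      # same MTBF, 5 hour repair
more_reliable = availability(200_000, 10)   # double the MTBF, 10 hour repair

print(zero_repair, slow_repair, fast_repair, more_reliable)
```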
From this equation we can see that if a system’s MTTR is 0 hours the system’s availability will be 100% regardless of the system’s MTBF. It is clear, therefore, that in order to maximise a system’s availability it is necessary to minimise its MTTR. This is not to say that it is OK to completely disregard a system’s MTBF (reliability): clearly a system with a high MTBF will be more available than a system with a low MTBF if the systems’ MTTRs are the same. What it does say, however, is that a low MTTR increases the availability of reliable systems.
Some Availability v Reliability examples
You will (hopefully!) recall the three UPS topologies discussed in the first article in the “ilities” series, on Modularity: “traditional mono-block”, “modular block” and “rack-mounted modular”. We will now consider the respective MTBFs and MTTRs of these topologies to see what, if any, impact the various topologies have on system reliability and, more importantly, system availability.
In order to help maximise the level of critical load protection let us assume the following:
- all of the UPS “modules” are of a high quality, industrial design;
- the systems are properly maintained in line with manufacturer recommendations;
- in all 3 examples the critical load is 120 kVA;
- all 3 systems are parallel redundant (N+1).
Traditional mono-block
In this topology, the parallel redundant system comprises two separate UPS cabinets feeding the critical load (i.e. N+1 = 1+1) and the system component count is therefore double that of a single UPS solution. It follows that the greater the number of system components, the greater the probability of a component failure. However, because the system is parallel redundant, a component failure in one of the UPS cabinets will not expose the critical load to raw mains and will, therefore, not result in a system failure. We therefore have a highly reliable “system” and, for the purposes of this example, we will assume its MTBF is 800,000 hours.
However, because the system components (PCBs, IGBTs etc.) in this topology are separately housed in the UPS cabinets, they must be individually removed and replaced on site. This means that the MTTR of this topology is the highest of the 3 topologies and, for the purposes of this example, we will assume it is 8 hours. Therefore:
MTBF/(MTBF + MTTR)
= 800,000/(800,000 + 8)
= 99.999% (often referred to as “five nines” availability)
Modular block
As with the traditional mono-block topology, this parallel redundant system configuration comprises two separate UPS cabinets feeding the critical load (i.e. N+1 = 1+1), so the system component count is double that of a single UPS solution and we will assume the system MTBF is the same 800,000 hours. However, because the system components (PCBs, IGBTs etc.) in a modular-block system are grouped into sub-assemblies that can be replaced as units, rather than as individual components as in a mono-block system, the MTTR is lower. Let us assume it is 4 hours. Therefore:
MTBF/(MTBF + MTTR)
= 800,000/(800,000 + 4)
= 99.9995%
This demonstrates that a significant improvement in system availability is achieved when the MTTR is reduced.
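To make the difference more tangible, the two availability figures can be converted into average downtime per year. The sketch below uses the MTBF and MTTR assumptions from the two examples above; the function names are illustrative, and the downtime figure is a long-run average rather than a prediction for any single year.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours, ignoring leap years


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """System availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Long-run average downtime implied by an availability figure."""
    return (1.0 - avail) * HOURS_PER_YEAR * 60


# Assumed figures from the article's examples.
mono_block = availability(800_000, 8)      # MTTR of 8 hours
modular_block = availability(800_000, 4)   # MTTR of 4 hours

for name, avail in [("mono-block", mono_block), ("modular block", modular_block)]:
    print(f"{name}: {avail * 100:.4f}% available, "
          f"about {downtime_minutes_per_year(avail):.1f} min/year unavailable")
```

With the MTBF unchanged, halving the MTTR roughly halves the average downtime, from about 5.3 to about 2.6 minutes a year.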
The US Uptime Institute has introduced a tier-classification system as the global standard for rating computer centre availability: this is part of TIA-942 (Telecommunications Infrastructure Standard for Data Centres) and can be applied both to individual systems, like air-conditioning or UPS, and to computer centres as a whole. Tier I provides 99.67% availability with 28.8 hours of downtime a year, Tier II 99.75% availability with 22 hours' downtime, Tier III 99.982% availability with 1.6 hours' downtime, and Tier IV 99.995% availability with just 0.4 hours' downtime a year.
While Tiers I and II are sufficient for conventional PC workstations and office servers, Tier III availability is almost indispensable for process-critical applications in industry, with N+1 redundancy as a prerequisite.
And any company that provides services around the clock 365 days a year, such as computer centres in the financial sector, would rely on the top Tier IV. Here, all components and pathways are 2n+1 redundant, avoiding single points of failure completely.
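The relationship between a tier's availability percentage and its annual downtime is simple arithmetic, sketched below. The percentages are the ones quoted above; the function name is illustrative and a year is assumed to have 8,760 hours.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours, ignoring leap years

# Availability percentages as quoted for each tier.
tier_availability_percent = {
    "Tier I": 99.67,
    "Tier II": 99.75,
    "Tier III": 99.982,
    "Tier IV": 99.995,
}


def annual_downtime_hours(availability_percent: float) -> float:
    """Hours per year for which the system is, on average, unavailable."""
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


for tier, percent in tier_availability_percent.items():
    print(f"{tier}: {percent}% -> {annual_downtime_hours(percent):.1f} hours/year")
```

Because the percentages are rounded, the computed downtime can differ slightly from published figures that are derived from more decimal places.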
Modular power protection
To rate computer centre availability, the German Federal Office for Information Security (abbreviated as BSI) has defined six availability classes, or ACs. From AC 4 upwards, availability is at least 99.999%, the figure calculated above for the traditional mono-block system.
Systems in these classes have only minor outages, which are barely noticeable thanks to the parallel-redundant design found in many process-critical applications in industry.
They are especially recommended for financial groups whose computer centres must function perfectly 24/7, 365 days a year.
Rack-mounted modular
In this topology, for reasons that will be explained in the “Scalability”, “Flexibility” and “Maintainability” articles in this series, we have chosen to use four parallel redundant modules to feed the critical load (i.e. N+1 = 3+1). As the system component count is now quadruple that of a single UPS solution we must assume the system MTBF is lower than that of the mono-block and modular-block systems at, say, 500,000 hours.
However, because each UPS module in this topology is a fully functioning and complete UPS that can be “hot swapped” (see Modularity article) in less than 10 minutes the MTTR is a very impressive 0.17 hours. Therefore:
MTBF/(MTBF + MTTR)
= 500,000/(500,000 + 0.17)
= 99.99997% (often referred to as “six nines” availability)
This shows that a very significant improvement in system availability is achieved when the MTTR is minimised, despite the reduction in system MTBF.
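The three worked examples can be put side by side in a short Python sketch. The MTBF and MTTR values are the assumptions made in this article, not measured data, and the helper names (including the simple "count the nines" function) are illustrative.

```python
import math

# (MTBF in hours, MTTR in hours) as assumed in the three examples.
topologies = {
    "traditional mono-block": (800_000, 8.0),
    "modular block": (800_000, 4.0),
    "rack-mounted modular": (500_000, 0.17),
}


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """System availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def count_nines(avail: float) -> int:
    """Number of leading nines in the availability figure (0.99999 -> 5)."""
    return math.floor(-math.log10(1.0 - avail))


results = {name: availability(mtbf, mttr) for name, (mtbf, mttr) in topologies.items()}

for name, avail in results.items():
    print(f"{name}: {avail * 100:.5f}% ({count_nines(avail)} nines)")
```

Under these assumptions the rack-mounted modular system comes out ahead even though its MTBF is the lowest of the three, because its MTTR is smaller by a much larger factor.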
The most important design consideration for any power protection system is its availability.
Highly reliable modules are important component parts of a highly available power protection system, but how the modules are configured and their topology are more important. Parallel redundant module configuration will increase the reliability and availability of power to the critical load, so it should be used wherever possible. Rack-mounted modular topology maximises system availability.
The next article in the “ilities” series will discuss “scalability” and how modular topology can minimise the capex and opex of a power protection system.