A device's mean time between failures, or MTBF, is a statistical estimate of how long it should operate before a critical malfunction occurs. It is only a guideline: the actual lifespan of any electronic device is subject to numerous outside factors that a manufacturer couldn't possibly take into account. How you handle that uncertainty is what separates the professionals from the amateurs.
It's one thing to replace a critical component or piece of hardware according to its life expectancy – but it's an entirely different matter when you're trying to recover from an unexpected system crash. And it's even worse when you failed to act on the warning signs that were likely given ahead of time.
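To make the MTBF figure concrete, here is a minimal sketch of how an observed MTBF can be computed from your own fleet records. The function names and the sample numbers are hypothetical, and the yearly failure estimate assumes a constant failure rate, which real hardware only approximates.

```python
import math


def mtbf_hours(total_operating_hours: float, failures: int) -> float:
    """Observed MTBF: total operating time divided by the number of failures."""
    if failures <= 0:
        raise ValueError("need at least one observed failure to compute MTBF")
    return total_operating_hours / failures


def annual_failure_probability(mtbf: float, hours_per_year: float = 8760.0) -> float:
    """Chance that a unit fails within one year of continuous operation.

    Assumption: failures follow an exponential model (constant failure rate).
    """
    return 1.0 - math.exp(-hours_per_year / mtbf)


# Illustrative numbers only: 10 servers running all year, 4 failures observed.
fleet_hours = 10 * 8760
observed_mtbf = mtbf_hours(fleet_hours, failures=4)
print(f"Observed MTBF: {observed_mtbf:.0f} hours")
print(f"Chance of failure within a year: {annual_failure_probability(observed_mtbf):.1%}")
```

Even with an MTBF well above one year of runtime, the chance of a failure in any given year is far from negligible, which is why the figure is a guideline rather than a promise.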
Common Signs of Server Failure
Although this list is by no means all-inclusive, it covers some of the most common signs of server failure:
- Random system crashes: A hardware crash is never a good thing, but some can be recreated and even traced back to a particular activity or occurrence. It's when your system starts experiencing frequent, random crashes that you should really start to worry.
- File system errors: Many different aspects of a server's file system have the potential to malfunction. If a server constantly needs to perform system checks or if it frequently kicks into read-only mode, the server might be on its way to the graveyard.
- Storage issues: Servers are prone to numerous storage issues, especially in RAID setups. Hard drives are among the most common components to fail in any computer, and servers are no exception. Controllers and connectors can also fail.
- Command line freezes: Although the command line is generally considered more stable than current-gen GUIs, errors can still manifest there, too. They typically take the form of a brief hang-up or freeze, which usually resolves itself within a matter of seconds. If a freeze forces you to restart the entire system, your server is likely experiencing its final days.
- Physical problems: Most of your server's hardware is subject to physical damage, too. One of the biggest hazards is repeated rapid heating and cooling, like the kind that's seen with your server's motherboard. Over time, this cycle can cause microscopic cracks to form in the motherboard itself, which will ultimately result in a catastrophic failure.
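Several of the signs above leave traces in system logs. The sketch below counts log lines that match a few warning patterns. The patterns and category names are illustrative assumptions, not a complete or authoritative list, so adjust them to the messages your own systems actually emit.

```python
import re
from collections import Counter

# Hypothetical patterns, grouped by the warning signs listed above.
WARNING_PATTERNS = {
    "filesystem": re.compile(r"remount\w* .*read-only|EXT4-fs error", re.IGNORECASE),
    "storage": re.compile(r"I/O error|raid.*degraded", re.IGNORECASE),
    "crash": re.compile(r"kernel panic", re.IGNORECASE),
}


def count_warning_signs(lines):
    """Return a Counter mapping each category to its number of matching lines."""
    counts = Counter()
    for line in lines:
        for category, pattern in WARNING_PATTERNS.items():
            if pattern.search(line):
                counts[category] += 1
    return counts


# Made-up sample lines in the style of a Linux kernel log.
sample_log = [
    "kernel: EXT4-fs error (device sda1): ext4_find_entry: reading directory",
    "kernel: EXT4-fs (sda1): Remounting filesystem read-only",
    "kernel: blk_update_request: I/O error, dev sda, sector 12345",
    "sshd[812]: Accepted publickey for admin",
]
print(count_warning_signs(sample_log))
```

A count that climbs from week to week is a stronger signal than any single message, so it helps to run a check like this on a schedule and compare the results over time.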
Now that you have a better understanding of some of the most common points of failure on today's servers, you can take steps to mitigate your risks.
- Use an uninterruptible power supply. An uninterruptible power supply (UPS) will protect your system from sudden power surges and outages. Keep in mind that UPS devices have been known to fail, too.
- Keep a record of system crashes and issues. While a crash log will be automatically generated when an incident occurs, some IT experts keep their own records. This makes it easier to connect the dots later on and could provide a key to solving recurring issues.
- Update and maintain your software, too. A lot of emphasis is placed on server hardware, but don't forget about the software side of things. Update your software and clear out junk files on a regular basis.
While these three general tips can go a long way in ensuring the longevity of any server, how long a server lasts ultimately depends on the hardware's exact usage and workload. Remember: individual servers have their own nuances just like desktop and laptop systems.
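For the record-keeping tip, a personal incident log can be as simple as a CSV file. The following is a minimal sketch under assumed conventions: the three-column layout (timestamp, component, note), the function names, and the 30-day window are all hypothetical choices, not a standard.

```python
import csv
import io
from collections import Counter
from datetime import datetime, timedelta


def record_incident(log_file, timestamp, component, note):
    """Append one incident row (timestamp, component, note) to an open CSV file."""
    csv.writer(log_file).writerow([timestamp.isoformat(), component, note])


def recent_incidents_by_component(log_file, now, window_days=30):
    """Count incidents per component that fall within the last `window_days`."""
    cutoff = now - timedelta(days=window_days)
    counts = Counter()
    for timestamp, component, _note in csv.reader(log_file):
        if datetime.fromisoformat(timestamp) >= cutoff:
            counts[component] += 1
    return counts


# An in-memory buffer stands in for a real file opened in append mode.
log = io.StringIO()
record_incident(log, datetime(2024, 4, 1, 9, 0), "disk", "SMART warning, cleared after reboot")
record_incident(log, datetime(2024, 6, 10, 3, 15), "disk", "array degraded, drive reseated")
record_incident(log, datetime(2024, 6, 25, 22, 40), "disk", "file system remounted read-only")
record_incident(log, datetime(2024, 6, 28, 8, 5), "psu", "unexpected shutdown")

log.seek(0)
print(recent_incidents_by_component(log, now=datetime(2024, 6, 30, 12, 0)))
```

Grouping by component is what makes it easier to connect the dots: repeated entries for the same part within a short window suggest a recurring issue rather than a one-off.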
Don't Ignore These Early Warning Signs of a Server Failure