Server failure – how to deal with it

Have you ever woken up in the middle of the night, drenched in sweat, with the terrifying thought that your website or app just crashed? No? Then you must have a solid Disaster Recovery Plan in place and you’re sleeping like a baby. Or maybe you don’t have one, and now you’re wondering if you should start panicking?

Don’t worry: a server crash isn’t the apocalypse, even though it might feel like it at first. Instead of freaking out, it’s time to take action, and that’s exactly what we want to show you in this article. Who knows, maybe your Disaster Recovery Plan (DRP), or the lack of one, will suddenly become a hot topic at the lunch table.

From this article, you will learn:

  • What steps to take after detecting an outage
  • Which diagnostic tools and methods to use
  • How to gain access to the server or management systems
  • How to analyze and solve common problems
  • How to prevent future outages
  • About server monitoring and analysis

First steps after detecting a failure 

When your website or application suddenly stops working, it’s crucial to act quickly and calmly. The first steps you take after detecting an outage can determine how quickly you can restore proper operation of your services. Here’s what you should do: 

 

  1. Assess the scope of the problem – does the outage affect only one service or the entire infrastructure? Is the problem visible to all users or only to some? A quick check will help you understand the scale of the problem. For example, the problem may only be on the administrative panel side and not necessarily visible to end users. 
  2. Ensure communication – inform your team and your users about the problem using available communication channels such as social media, email, or a status page. Transparency in this matter builds trust and reduces user frustration. 
  3. Check recent changes – were any changes made to the configuration or software before the problem occurred? Often, the cause of an outage is recent modifications. 
  4. Consult your disaster recovery plan – if you have a Disaster Recovery Plan, now is the best time to use it. It will contain defined procedures that can significantly speed up the resolution of the problem. 
  5. Pause advertising campaigns – if you expect a longer outage and you run high-budget campaigns with large traffic or sales volumes, pause them. Users who click an ad and land on a non-functioning website, online store, or application get frustrated even faster, and you keep paying for traffic that cannot convert.
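
Step 1 can be partly automated. The Python sketch below probes a set of endpoints and summarizes whether the outage is total, partial, or absent. The helper names `probe` and `classify_scope` are hypothetical, and treating any HTTP status below 500 as "up" is an assumption you may want to tighten.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as err:
        # 4xx means the server is up and answering; 5xx counts as a failure.
        return err.code < 500
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return False


def classify_scope(results):
    """Summarize probe results ({name: bool}) as 'none', 'partial', or 'total'."""
    failed = [name for name, ok in results.items() if not ok]
    if not failed:
        return "none", failed
    if len(failed) == len(results):
        return "total", failed
    return "partial", failed
```

For example, `classify_scope({"storefront": True, "admin panel": False})` returns `("partial", ["admin panel"])`, which matches the case where only the administrative panel is affected. In practice you would build the dictionary by calling `probe` on your own URLs.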

 

Diagnostic tools and methods 

 When starting your diagnostics, it’s worth equipping yourself with the right tools to help you locate the source of the problem. Here are a few that may prove invaluable: 

  • System and application logs – the first place to look is the logs. They can provide valuable information about errors, warnings, and other events that occurred before the outage. 
  • System monitoring – monitoring tools such as Nagios, Zabbix, or Prometheus can provide real‑time information about the performance of your system and applications. They allow you to quickly spot anomalies, such as resource overload. 
  • Availability testing tools – use tools like ping, traceroute, or mtr to check whether the problem lies with the network. Run them from more than one location (for example, your office and an external host) to see whether the server is unreachable everywhere or only along certain routes.
  • SSH and remote access – access to the server via Secure Shell (SSH) is essential for performing many diagnostic and repair operations. Make sure you have access to your machines. 
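
As a starting point for the log review, the following Python sketch pulls the most recent lines that mention a severity keyword. The keyword list is an assumption (log formats differ between services), and `recent_problems` is a hypothetical helper name.

```python
import re
from collections import deque

# Assumed severity keywords; adjust to whatever your services actually log.
SEVERITY = re.compile(r"\b(error|critical|fatal|panic|warn(?:ing)?)\b", re.IGNORECASE)


def recent_problems(lines, limit=20):
    """Return the last `limit` lines that mention a severity keyword."""
    matches = deque(maxlen=limit)  # older matches fall off automatically
    for line in lines:
        if SEVERITY.search(line):
            matches.append(line.rstrip("\n"))
    return list(matches)
```

Because the function accepts any iterable of lines, you can pass it an open file handle for whichever log file your system uses, so large logs are read line by line rather than loaded into memory.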

Effective diagnostics combines technical knowledge with the right tools, and knowing your own system and applications is invaluable when troubleshooting. In such moments it’s worth having a dedicated team or subcontractor who can react quickly, for example a 24/7/365 server administration service. Don’t forget online forums, community groups (including the numerous Facebook groups), and documentation: often someone else has already described the solution to a similar problem.

Server and management systems access

Once a server failure has been identified, the next step is to gain access to the server itself and its management systems to perform a detailed diagnosis and take the necessary corrective actions. Server access is crucial for analyzing logs, checking the status of services and applications, and making any necessary configuration changes.

Login and remote access methods

There are several basic methods for logging in and gaining remote access to a server, which can be used depending on the configuration and preferences. Here are the most important ones:

  • SSH (Secure Shell): this is the standard method for accessing Linux/Unix servers, allowing for secure login and command execution in the remote server’s terminal. SSH access usually requires a username and password or an SSH key.
  • RDP (Remote Desktop Protocol): primarily used in Windows environments, RDP enables remote access to the server’s graphical user interface. This is particularly useful for managing applications that require interaction with the GUI.
  • Hosting control panels: for servers rented from hosting providers, web‑based control panels such as cPanel, Plesk, or DirectAdmin are often available. These panels offer easy access to many server management functions, including files, databases, email, and logs.
  • KVM (Keyboard, Video, Mouse) console or remote IPMI (Intelligent Platform Management Interface) console: these methods allow for hardware‑level access to the server, which is useful when other login methods fail, for example, in the event of a system crash or network problems. They enable full access to the BIOS, machine restart, and remote operating system installation.
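
Before trying each method by hand, you can check which remote access ports respond at all. The Python sketch below assumes the default ports (22 for SSH, 3389 for RDP); your servers may use different ones, and `port_open` and `reachable_methods` are hypothetical helper names.

```python
import socket

# Default ports for the access methods above; your servers may use others.
DEFAULT_PORTS = {"ssh": 22, "rdp": 3389}


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def reachable_methods(host, ports=DEFAULT_PORTS):
    """Map each access method name to whether its port accepts connections."""
    return {name: port_open(host, port) for name, port in ports.items()}
```

An open port only shows that something is listening. It does not prove that login will succeed, so treat a `False` result as a hint to fall back to the KVM or IPMI console rather than as a full diagnosis.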

Troubleshooting login issues

When you encounter problems logging into a server, there are several steps you can take to diagnose and resolve the issue. Here’s what you should do:

  • Check your network connection: make sure your computer has internet access and there are no network issues that could be blocking the connection to the server. Use tools like ping, traceroute, or mtr to check the connection to the server’s IP address.
  • Verify your login credentials: double‑check that you’re using the correct login information, such as your username, password, or SSH key. Make sure your keyboard isn’t set to a different layout (e.g., AZERTY instead of QWERTY), which could cause errors when typing your password.
  • Check SSH configuration: if you’re logging in via SSH, check the SSH server configuration file (/etc/ssh/sshd_config on the server, which you may need to reach through a console or another working session) to make sure no setting is blocking your connection, such as access restrictions for specific users or IP addresses, or a requirement for SSH key authentication.
  • Login attempt limit: some systems have security mechanisms that block an IP address after several failed login attempts. If you suspect this might be the cause, try logging in from a different IP address or contact the system administrator if you’re not the administrator yourself 😉
  • Locked accounts: in some cases, a user account may be locked due to suspected unauthorized access or other reasons. Contact the system administrator to check the status of your account.
  • SSH key issues: if you’re using an SSH key to log in, make sure it’s correctly installed on the server and that you’re using the correct private key. Also, check that the ~/.ssh/authorized_keys file on the server contains the correct public key.
  • Server logs: check the SSH server logs (/var/log/auth.log on Debian and Ubuntu, /var/log/secure on RHEL-based systems, or the systemd journal via journalctl) for errors related to login attempts. They may provide clues as to the cause of the problem.

Troubleshooting login problems often requires a step‑by‑step approach and elimination of potential causes. Remember that staying calm and taking a methodical approach are key to diagnosing and resolving server access issues.
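
To check whether a login attempt limit may have been triggered, you can count failed password attempts per source address in the SSH log. The Python sketch below assumes the usual OpenSSH "Failed password ... from ADDRESS port N" message format, and `failed_logins_by_ip` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches OpenSSH lines such as:
#   "Failed password for invalid user admin from 203.0.113.7 port 50122 ssh2"
FAILED_LOGIN = re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+) port \d+")


def failed_logins_by_ip(lines):
    """Count failed SSH password attempts per source address."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(2)] += 1  # group 2 is the source address
    return counts
```

Calling `.most_common(5)` on the result lists the five noisiest addresses. If your own address is among them, a blocking mechanism on the server is a plausible reason for the refused logins.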

Analysis and troubleshooting

Once you have access to the server, conduct a thorough analysis of the situation to understand the cause of the outage before taking corrective action. This means identifying the source of the problem, which often involves combining several diagnostic tools and methods.

Common failure scenarios and their solutions

While managing a server, you may encounter various outage scenarios, each requiring a different approach. Here are some of the most common problems and how to solve them:

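  • Web service outage (e.g., Apache, Nginx not starting): check the service logs for configuration errors or dependency issues, and validate the configuration syntax before restarting (for example with apachectl configtest or nginx -t). Make sure all required modules are installed and configured correctly.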
  • Database issues (e.g., MySQL, PostgreSQL not responding): verify that the database service is running and has access to its data files. Check the database logs for details about the failure.
  • Resource overload (CPU, RAM, disk): use tools like top, htop, or iotop to monitor resource usage. Find and terminate processes consuming excessive resources or consider scaling resources.
  • Network problems (e.g., the server is not accessible from outside): check the server’s network configuration, firewall rules, and routing. Use diagnostic tools like ping and traceroute to analyze connectivity issues.
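
For the resource overload scenario, a quick scripted check can complement top, htop, and iotop. The Python sketch below reports disk fullness and CPU load using only the standard library. The function names are hypothetical, and `os.getloadavg` is available on Unix-like systems only.

```python
import os
import shutil


def disk_usage_percent(path="/"):
    """Return how full the filesystem containing `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def load_per_cpu():
    """Return the 1-minute load average divided by the CPU count (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)
```

As a rough rule of thumb, a load per CPU that stays well above 1.0 suggests processes are queuing for CPU or I/O, and a nearly full disk is a common reason for databases and web servers refusing to start.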

 

Restoring services and systems

After identifying and resolving the root cause of the outage, the next step is to restore normal operation of services and systems. Here’s what to do:

  • Restart services: once the necessary changes have been made, restart the services that were affected by the issue. In many cases, this is enough to get them back up and running smoothly.
  • Restore configurations: if the outage was caused by configuration errors, restore the previous, working versions of the configuration files.
  • Monitor after resolving the issue: use monitoring tools to ensure that all services are functioning correctly and the problem doesn’t recur. Monitoring will also help detect potential future problems before they escalate.
  • Test and verify: perform tests to make sure the services are working as expected and the problem has been completely resolved.
  • Document changes and conclusions: record all changes made and conclusions drawn from the outage analysis. This documentation will be valuable in troubleshooting future problems and planning to prevent similar outages.

Remember, every outage is an opportunity for improvement and strengthening the system. Restoring services and systems isn’t just about returning to the pre‑outage state, but also about optimizing and securing the system for the future.
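
Restoring a previous configuration is only possible if a copy exists. The Python sketch below shows one way to take a timestamped backup before editing a file and to roll back afterwards. The helper names and the `.bak` naming scheme are assumptions, not an established convention.

```python
import shutil
import time
from pathlib import Path


def backup_config(path):
    """Copy a config file next to itself with a timestamp suffix; return the copy."""
    source = Path(path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = source.with_name(f"{source.name}.{stamp}.bak")
    shutil.copy2(source, backup)  # copy2 also preserves timestamps and permission bits
    return backup


def restore_config(path, backup):
    """Overwrite the live config with a previously saved backup."""
    shutil.copy2(backup, path)
```

After a rollback, the affected service still has to be restarted or reloaded before the restored configuration takes effect.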

Optimization and preventing future outages 

Every server outage provides valuable lessons that can be used to optimize and prevent similar problems in the future. Ensuring high availability and reliability doesn’t end with fixing current issues; it requires continuous monitoring, updates, and thoughtful planning.

 

Monitoring systems and applications

Systematic monitoring of systems and applications is key to early problem detection and outage prevention. Monitoring tools can track various aspects of server performance, including CPU usage, memory, disk space, service availability, and more. They also let you set alerts that notify administrators of potential issues before they turn into major outages. Popular tools include Nagios, Zabbix, Prometheus, and Grafana; each offers a wide range of functionality and can be tailored to the specific needs of the infrastructure. We may write separate articles about the tools Centuria admins use most often in their daily work.
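
The idea behind threshold alerts can be illustrated in a few lines of Python. This is a conceptual sketch only: the threshold values and the `triggered_alerts` helper are hypothetical, and real monitoring tools define alerts through their own configuration.

```python
# Hypothetical thresholds in percent; tune them to your own baseline.
THRESHOLDS = {"cpu": 90.0, "memory": 85.0, "disk": 80.0}


def triggered_alerts(metrics, thresholds=THRESHOLDS):
    """Return a sorted list of metric names whose value meets or exceeds its threshold."""
    return sorted(
        name
        for name, value in metrics.items()
        if name in thresholds and value >= thresholds[name]
    )
```

For example, `triggered_alerts({"cpu": 95.0, "memory": 40.0, "disk": 80.0})` returns `["cpu", "disk"]`. Metrics without a configured threshold are ignored instead of raising an error.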

Outsourcing monitoring 

For organizations that don’t have the internal resources for continuous monitoring and management of their server infrastructure, outsourcing these tasks to specialized companies can be an effective solution. Centuria, as an experienced server administrator, offers 24/7/365 server monitoring and supervision services for clients. This allows clients to focus on their core business, knowing that their infrastructure is constantly monitored by professionals. 

The benefits of outsourcing server monitoring and management include: 

  • 24/7/365 monitoring: Continuous observation of key server performance indicators and quick response to any anomalies. It also makes it possible to track the load for a specific URL, e.g. example.com/cart.
  • Expert management: Access to an experienced team of specialists who can effectively manage and optimize servers, ensuring their stability and security. An experienced admin can immediately determine what to do to get the environment back up and running as quickly as possible. 
  • Outage prevention: Proactive actions aimed at minimizing the risk of outages through regular audits, updates, and security configurations. 
  • Technical support: Quick access to technical assistance in case of problems, enabling their quick diagnosis and resolution. 

Delegating the responsibility for monitoring and managing servers to an external company allows organizations to make better use of their resources while increasing the security and reliability of their IT infrastructure. This makes it possible not only to react to current problems but also to anticipate and prevent potential outages in the future. 

Summary 

Managing server outages is an integral part of maintaining a stable and secure IT infrastructure. As this article has shown, it’s crucial not only to respond quickly when problems occur but also to continuously monitor, optimize, and prevent potential future outages. 

A Disaster Recovery Plan is the foundation for any organization, ensuring preparedness for various emergency scenarios and minimizing the impact of unforeseen events on business operations. Developing such a plan and regularly running disaster recovery tests makes it possible to restore key services quickly and protect data.

Furthermore, implementing monitoring systems and using the services of specialized companies like Centuria can significantly increase the security and reliability of the infrastructure. Professional server management and proactive preventive measures allow for maintaining the continuity of services and reducing the risk of downtime and losses. 

About the author

Patryk Szczepaniak

Marketing Manager at Centuria. A digital marketing enthusiast and self-taught practitioner. Working across different areas of digital lets him look at business holistically, combining many activities at once. In his free time, he runs the trails of Kraków.
