Lidando com interrupções inesperadas com SQL Server

Handling unexpected outages with SQL Server

May 28, 2024 Roberto Magalhães

How can your database system help you overcome a catastrophic failure and get your business back on track as quickly as possible?

A catastrophic failure occurs. There is a backup, but you have some delay in extracting it from remote storage or cloud provider. You also need an available and experienced DBA, no matter the time, to handle this in the most seamless way possible. How can your database system help you overcome this and get your business back on track as quickly as possible?

Where do you start from?

Your company may have many different levels of criticality and the first step is to ask yourself some questions to identify where a failure could put you at risk. The two most important are: What are your limits for acceptable data loss and downtime? When is enough time for your company to get back up and running?

There are two concepts you can use to answer these questions: Recovery Point Objective (RPO) and Recovery Time Objective (RTO ).

RPO is the time when you want your data to be available – at least 2 minutes behind healthy records before failure, for example.

RTO is the SLA time for your company to return to operating within the RPO you defined. For example, you want service to be restored in a maximum of 3 minutes. Take a look at where your databases currently stand. Based on these answers, you can start planning your Business Continuity Plan.

Interruption Types

Understanding what types of outages your SQL Server may experience can help you choose the right infrastructure and resources to overcome them.

When something fails that affects just one server, such as OS failure, patch installation failure, RAM/CPU failure, disk failure, you need a High Availability solution to move your resources to.

Sometimes your problems are widespread, like network switches or SAN failures, natural disasters like fire, flood, power outage, and so many others. So, you need a disaster recovery solution, like moving your resources to another room or location in the data center, for example.

Database corruption can occur from SAN failures, SQL bugs which also need processes and plans to fix this corruption. Human errors include erroneous drops of tables or databases, updates without a where clause, and so on.

Characteristics

How SQL Server can help you get rid of interruptions? Here are the resources that will leverage the high availability of your business in a secure and robust way.

Backup and restore – Yes! The most basic part of data recovery must be taken into consideration! Make sure your backups meet your RPO and RTO and test them periodically. There are different types of backups, full, differential, transactional log and also file groups, which can bring your database online faster and in parts.

Replication – Copy and distribute data between databases on local or remote instances at scheduled intervals, with multiple directions and types available. Only replicate data, ideal for specific subsets and for sending/receiving data to/from specific locations such as store branches, distribution centers, etc.

Log Shipping – Automated backup and restoration of the transactional log to another local or remote database, with read-only replicas available (Standby).

Always-ON FCI (Failover Clustered Instances) – A cluster of 2 or more servers with shared storage. When a node fails, all resources are moved to an available node. These nodes can be in the same datacenter or remote.

Always-ON Availability Groups – A cluster of standalone SQL instances with independent storage volumes where you can move groups of databases instead of the entire instance in case of failure. You can leverage a read-only secondary replica, backup offloading, and it can be asynchronous or synchronized. Ideal for mission-critical environments

Storage replication – Replication of volumes between servers or clusters, replicating disk blocks so that data can be moved between sites and SAN clusters

Virtual Machine Replication – Replication of virtual instances with the entire server as a snapshot, becoming fully functional VMs to bring your infrastructure back in a timely manner.

Investment

I started by saying that you must be aware of the criticality of your databases. The less RTO and RPO you need, the more you will have to invest in SQL resources and infrastructure to provide near-zero data loss. Well-defined criticalities will provide a scenario where you can invest more in what is most valuable.

Here you have 3 main options:

Enterprise Edition: The complete suite of SQL features provides cutting-edge, highly scalable solutions for compute and availability. All features are available. I would like to mention here distributed availability groups (availability between domains and network), online restoration of pages and files and high I/O scalability.

Standard Edition: A basic SQL feature pack that provides most Enterprise features with some limitations, such as basic availability groups for 2-server clusters, accelerated database recovery, and failover cluster instances.

Web and Express: Free editions with very limited features and functionality.

Make it happen

You can start with a very simple implementation like fixing your schedules and backup types, optimizing your parameters like I/O, number of files, and move on to more complex design like availability groups across different rooms, domains and data centers , ensuring that your business wins. Don't suffer major interruptions and you will be calm and at ease to focus on business strategies that will boost your profits.