The state of backups can be rather depressing at times. I've worked in an
environment where backups weren't a bottom-of-the-rung job; they were a critical part of the
organization, whose very existence relied on reliable, replicated, and timely backups.
Seeing how most other organizations manage backups is not very pleasant. The main reason,
from what I can tell, is a lack of understanding of topics ranging from reliability and
manageability to how backups work.
The following tips are designed for sysadmins that have to do backups and are in a resource
crunch. For efficiently laid out networks with a proper backup policy and infrastructure, well,
I'm sure you don't need this.
There are some interesting things that take very little of your time
and resources that you can do to improve the performance and reliability
of backups on a network. There are also some things you can do to make a
backup system's resources stretch further when you don't have the
budget to expand it.
I'm listing some of the stuff you can do here to help backup
administrators dealing with large SANs and many, varied clients across
the network, along with things that can be done on the backup server and
hardware.
These tips are provided with the hope they'd be useful, but there's
no guarantee at all. Just look at the topics covered and how they
are implemented, and adapt as needed to your environment.
Reducing time spent waiting on the backup file index
Many applications keep a central database of
data objects and elements so they can be referenced. In
many backup products, such as Legato NetWorker,
this database is locked so that only one update at a time can go to
the indices of a given client, while multiple writes can go
to the database at once as long as they are for different clients.
When doing backups from multiple streams, such as backing up several
filesystems on a client at once, the lock contention on the file
indices can severely impact performance.
One way to solve this problem is to ensure writes complete faster.
Most RAID setups improve read throughput but lower write throughput,
parity-based levels such as RAID 5 especially. To minimize the effects of
this write latency and of the lock contention, one can do several things:
- Back up one stream at a time per client. Have the backups
proceed sequentially for each filesystem/object on the client
that needs to be backed up. If the backup software will handle
parallel writes for multiple clients correctly, this is the ideal
solution. Simple, effective, and understandable.
- Employ hardware RAID. Some RAID controllers will simply
show a single large drive to the OS. The RAID is done in the hardware
with a large write cache in place so it can accept writes, report
success, and rebuild the stripe on its own. Please note,
IDE RAID systems such as the Promise controllers do not have true
hardware RAID.
- Get faster CPUs. Some software will improve in performance by simply
giving it more CPU time. Having multiple CPUs will usually
cause a greater bottleneck, as processes on all CPUs may have to wait on
the same lock. Instead, speeding up each CPU so it can relinquish its
timeslice faster will allow other tasks to complete much more
quickly, improving throughput overall.
- Get more memory. If the RAID stripe can be cached and not
evicted from the system memory, then there is no need to re-read the
disk and rebuild the stripe in the case of software RAID. Eliminating
the hard-drive I/O wait will improve the performance noticeably on
memory-starved systems.
- Back up fewer streams at once. Try to balance out the backup
tasks as much as possible over as wide a period as possible.
This policy keeps too many jobs from hitting the
file index at once, thus reducing the contention for the locks
(Hello, Legato NetWorker) that some backup software exhibits.
In fact, too many simultaneous streams (again, NetWorker) can trigger
warnings from performance.se, such as a possible attack from SATAN.
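The first and last tips above boil down to "one stream per client, a limited number of clients at once". Here's a minimal sketch of that idea in Python, not anything NetWorker-specific. The function `run_backups`, its `backup_fn` callback, and the `max_clients_in_parallel` knob are all made up for illustration; in real life `backup_fn` would wrap whatever command your backup software uses to save one filesystem.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_backups(jobs, backup_fn, max_clients_in_parallel=4):
    """Run backup jobs with at most one active stream per client.

    jobs: iterable of (client, filesystem) pairs.
    backup_fn: hypothetical callable(client, filesystem) doing one backup.
    Returns a dict mapping client -> list of backup_fn results, in order.
    """
    per_client = defaultdict(list)
    for client, filesystem in jobs:
        per_client[client].append(filesystem)

    def run_client(client):
        # Sequential on purpose: only one stream at a time updates
        # this client's file index, so there is no lock contention.
        return [backup_fn(client, fs) for fs in per_client[client]]

    # Different clients run in parallel, capped by the pool size.
    with ThreadPoolExecutor(max_workers=max_clients_in_parallel) as pool:
        clients = list(per_client)
        return dict(zip(clients, pool.map(run_client, clients)))
```

Lowering `max_clients_in_parallel` is the "back up fewer streams at once" tip; the per-client loop is the "one stream at a time per client" tip.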
Reducing tape load/unload latency
- Preemptively eject tapes and load them. Many backup products will
just use whatever tape with free space is in the drives when a
backup starts, which can interfere with cloning of tapes. In
order to replicate tapes and pull them out of the writable tape
pool, you want the tapes to fill up. So before a night's backup
sequence kicks off, eject all the tapes from the drives, then load all
the mostly full tapes into the drives. Keep a couple drives loaded
with tapes that aren't near capacity, as you don't want to suffer too
long a delay while the backup software rewinds the tapes, ejects them,
loads a new tape, and seeks to a free location (this is especially
painful on single-reel tapes, such as DLT and LTO Ultrium).
- If you have to babysit a server through the backups (I've had
to do that before, when we didn't have the budget to replace
old and under-powered servers), you might as well do the preemptive
tape ejection during the backup when necessary. If you notice
the backup server is rather idle, kick off a backup group
earlier than usual (assuming the backup software will lock the
run so two instances of the same schedule can't be run).
If you notice that it's overloaded, dequeue a scheduled group, make a
note of the group you dequeued, and kick it off later when the backup
server is not heavily loaded.
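Picking which tapes to load before the nightly run can be sketched as below. This is only an illustration under stated assumptions: the function name `plan_drive_loads`, the 80% "nearly full" threshold, and the default of two spare drives are all invented here, and getting the fill level of each tape out of your backup software is left to you.

```python
def plan_drive_loads(tapes, num_drives, spare_drives=2, full_threshold=0.8):
    """Choose which tapes to load before the night's backups start.

    tapes: dict mapping tape label -> fraction used (0.0 to 1.0).
    Returns (nearly_full, roomy): two lists of tape labels.
    The threshold and spare-drive count are illustrative assumptions.
    """
    # Completely full tapes are no longer writable, so skip them.
    writable = {label: used for label, used in tapes.items() if used < 1.0}
    spare_drives = min(spare_drives, num_drives)

    # Fullest tapes first, so they fill up and can be cloned and
    # pulled out of the writable pool.
    nearly_full = sorted(
        (label for label, used in writable.items() if used >= full_threshold),
        key=lambda label: writable[label],
        reverse=True,
    )[: num_drives - spare_drives]

    # Remaining drives get the emptiest tapes, so the backup does not
    # stall on a rewind/eject/load cycle as soon as a tape fills.
    roomy = sorted(
        (label for label, used in writable.items() if used < full_threshold),
        key=lambda label: writable[label],
    )[: num_drives - len(nearly_full)]

    return nearly_full, roomy
```

If there are fewer nearly full tapes than drives reserved for them, the leftover drives are simply handed to the emptiest tapes.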
Overall speed/performance improvements
- Force ALL interfaces to full-duplex on Ethernet if possible.
Instead of dealing with the sloooow speeds of a 10/half interface,
try to get everything set up with 100/full. Auto-negotiation
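works fine on some combinations of NICs and switches, and it fails
miserably on others. Instead of taking a chance, don't allow the
hardware to auto-negotiate if at all possible, and force both ends of
the link (NIC and switch port) to the same setting, since forcing only
one side causes a duplex mismatch. For those that can't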
be forced, keep a simple list that has all the odd-ball configurations
around, so when you notice a problem with the backup performance, you
can see if network bandwidth is the cause.
- Diagnose routing issues for all systems. Many large network infrastructures
have multiple VLANs with different customer networks behind them, with even finer segregation within
those networks. You want to make sure the backups aren't going over interfaces and routes
they're not supposed to. For one thing, that slows down the network. For another, if the backup
software is not running over an encrypted link, other customers can sniff the traffic and see
confidential data that they shouldn't have access to.
- Consider plotting the backup trends for the various systems so
you can determine how much data is backed up during a full and
incremental backup. Also look at the trend over time to see if the
system is gaining data, losing, or staying relatively constant.
Based on that, try to balance out the amount of data you back up
in a single day (so tape replication/cloning will complete), and
schedule the clients by backup groups marked by time, not server
class. Thus, you'd have a 02:00 group, instead of a "mail servers"
group. Unfortunately, this will require you to manually spread out the
full and incremental backups across the days of the week,
unless you want to spend the time writing a piece of software that
will automatically do this for you. The manual layout is a one-time
cost, and after that the maintenance can be done by hand with
very little added overhead if you do it on a regular schedule,
so writing that software is not really worth the effort.
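If you do want a starting point for the initial layout, the balancing step can be roughed out in a few lines. The sketch below is a simple greedy heuristic (biggest full backups placed first, each on the currently lightest day), not an optimal scheduler. The function name `spread_full_backups` and the per-client size estimates are assumptions; the numbers would come from the trend plots described above.

```python
def spread_full_backups(clients, days=("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")):
    """Assign each client's full backup to a day, balancing daily volume.

    clients: dict mapping name -> (full_gb, incremental_gb), both estimates.
    Returns (schedule, load): schedule maps client -> day of its full
    backup, load maps day -> estimated GB written on that day.
    """
    load = {day: 0.0 for day in days}
    schedule = {}

    # Greedy heuristic: place the largest fulls first, each on the
    # day that currently has the least data scheduled.
    by_size = sorted(clients.items(), key=lambda kv: kv[1][0], reverse=True)
    for name, (full_gb, incr_gb) in by_size:
        full_day = min(days, key=lambda d: load[d])
        schedule[name] = full_day
        for day in days:
            # A full on the chosen day, an incremental on every other day.
            load[day] += full_gb if day == full_day else incr_gb

    return schedule, load
```

Once you have the per-day totals, check the heaviest day against what your drives can write and clone in one night, then turn each day's clients into time-based groups by hand.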
Picking the right backup software
- This is the part where many admins make a mistake. Don't look at the feature
list or the supported systems or the cost or the ease of use right away. First, make a list
of the requirements: the systems you need supported, the budget you have for both
hardware and software, the amount of time you can afford to invest in the backup
infrastructure (don't skimp on this!), and the security requirements of your network.
- The last point is especially important. For systems where the internal network is
not trusted (such as financial institutions where they can't afford to make assumptions),
you can't have backup software that lets anyone talk to the backup server or clients
and kick off tasks.
- Once you have that, make a list of products that support your requirements. Then before you
even look at the extra features the products offer, see if you can find where their support forums
are. Also look for posts regarding that software on various archives like Google and Google News.
You want to keep an eye out for reports of corrupted data, corrupted backup server indices, and crashes.
Keep track of the versions they're reported in, and verify the same bugs don't keep recurring. If they do,
avoid that product like the plague.
- Many backup products brag about the features they have in terms of backing up systems. How many of them
brag about restoration? Who cares about backups? It's the successful restoration that counts.
Make sure any product you look at has a way to get a system up and running if it needs to be reimaged from
scratch after a complete system failure. This requirement also includes the backup servers. If all your
servers go down, you need a way to get up and running, fast. Any software that requires you to scan tapes
or media one at a time and rebuild the index before you can even start a restore should not be considered.
Look for something that will let you create a "backup server restoration tape".
- You also want to look for backup software that will let you restore, at the very least, the backup server
using standard tools. If you need to install the company's product to get the backup server up and running again,
you don't want that product -- a disaster requires you to get as far as you can on as limited a set of resources as
possible. Reinstalling an OS and putting the backup vendor's software on it to begin restoration
of the backup server is not an option. Remember, you'll probably end up doing this at 3:00 AM when
you've had less than an hour of sleep, pulling fried cables and smoking servers out of racks after some idiot decided
to hit the power transformer with a backhoe. Simplicity is the key.
- After all that's taken care of, look at the software, and if possible, give it a test run. You want
backup software that will let you selectively back up, and more importantly, selectively exclude from backups,
various data. It should not be limited to filesystem granularity. There is some data that companies are legally
required to back up, and other data that a company MUST destroy within a certain timeframe. You don't want to throw away
critical data because the tapes just happen to contain stuff that needs to be deleted, right?
- Now, you're ready to give it the ultimate test -- Try restoring the backup server from scratch after formatting
the drive. Document the steps you had to take. Try it again. Really. Twice. Then ask yourself, "In an
emergency, will this procedure be simple enough to do in a limited amount of time?" If the answer isn't a clear
"Yes", don't buy that product.
- You can now start looking at cost, time investments, and other issues. Have fun!