
The state of backups can be rather depressing at times. I've worked in an environment where backups weren't a bottom-of-the-rung job; they were a critical part of the organization, whose very existence relied on reliable, replicated, and timely backups.

Seeing how most other organizations manage backups is not very pleasant. The main reason, from what I can tell, is a lack of understanding of topics ranging from reliability and manageability to how backups work.

The following tips are designed for sysadmins that have to do backups and are in a resource crunch. For efficiently laid out networks with a proper backup policy and infrastructure, well, I'm sure you don't need this.

There are some interesting things that take very little of your time and resources that you can do to improve the performance and reliability of backups on a network. There are also some things you can do to make a backup system's resources stretch further when you don't have the budget to expand it.

I'm listing some of the stuff you can do here to help backup administrators dealing with large SANs and many, varied clients across the network, along with things that can be done on the backup server and hardware.

These tips are provided with the hope they'd be useful, but there's no guarantee at all. Just look at the topics covered and how they are implemented, and adapt as needed to your environment.


Reducing time spent waiting on the backup file index

Many applications keep a central database of data objects and elements so they can be referenced. In many backup products, such as Legato NetWorker, this database is supposed to be locked against multiple simultaneous updates to the indices of one client, but it can have multiple writes going to it as long as they are for different clients.

When doing backups from multiple streams, such as backing up several filesystems on a client at once, the lock contention on the file indices can severely impact performance.

One way to solve this problem is to ensure writes complete faster. Using most RAID solutions will improve read throughput, but will lower write throughput. In order to minimize the effects of this latency, one can do several things:

  • Back up one stream at a time per client. Have the backups proceed sequentially for each filesystem/object on the client that needs to be backed up. If the backup software will handle parallel writes for multiple clients correctly, this is the ideal solution. Simple, effective, and understandable.
  • Employ hardware RAID. Some RAID controllers will simply show a single large drive to the OS. The RAID is done in the hardware with a large write cache in place so it can accept writes, report success, and rebuild the stripe on its own. Please note that IDE RAID systems such as the Promise controllers do not provide true hardware RAID.
  • Get faster CPUs. Some software will improve in performance by simply giving it more CPU time. Having multiple CPUs will usually cause a greater bottleneck, as processes on all CPUs may have to wait on the same lock. Instead, speeding up each CPU so it can relinquish its timeslice faster will allow other tasks to complete much more quickly, improving throughput overall.
  • Get more memory. If the RAID stripe can be cached and not evicted from the system memory, then there is no need to re-read the disk and rebuild the stripe in the case of software RAID. Eliminating the hard-drive I/O wait will improve the performance noticeably on memory-starved systems.
  • Back up fewer streams at once. Try to balance out the backup tasks as much as possible over as wide a period as possible. This policy keeps too many jobs from hitting the file index at once, thus reducing the contention for the locks (Hello, Legato NetWorker) that some backup software exhibits. In fact, that contention (again, NetWorker) can trigger warnings from performance.se, such as a possible attack from SATAN.
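
The first and last tips above boil down to "one stream per client, a limited number of clients at once". Here's a minimal sketch of that scheduling idea in Python. It is not how NetWorker or any other product does it; `run_backups`, `back_up_client`, the `backup_one` callback, and the `max_clients` limit are all names I made up for illustration, and the callback stands in for whatever actually starts a backup stream in your environment.

```python
from concurrent.futures import ThreadPoolExecutor


def back_up_client(client, filesystems, backup_one):
    """Back up one client's filesystems strictly one after another."""
    results = []
    for fs in filesystems:
        # Only one stream per client is ever active, so this client's
        # file index sees a single writer at a time.
        results.append(backup_one(client, fs))
    return results


def run_backups(plan, backup_one, max_clients=4):
    """Run clients in parallel, but each client's streams sequentially.

    plan maps a (hypothetical) client name to its list of filesystems.
    max_clients caps how many clients hit the backup server at once.
    """
    with ThreadPoolExecutor(max_workers=max_clients) as pool:
        futures = {
            client: pool.submit(back_up_client, client, filesystems, backup_one)
            for client, filesystems in plan.items()
        }
        return {client: future.result() for client, future in futures.items()}
```

Lowering `max_clients` is the "fewer streams at once" knob; the per-client loop is the "one stream at a time per client" part.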

Reducing tape load/unload latency

  • Preemptively eject tapes and load them. Since many backup products will just use whatever tape is in the drives that has free space when they start a backup, this can interfere with cloning of tapes. In order to replicate tapes and pull them out of the writable tape pool, you want the tapes to fill up. So before a night's backup sequence kicks off, eject all the tapes from the drives, then load all the mostly full tapes into the drives. Keep a couple of drives loaded with tapes that aren't near capacity, as you don't want to suffer too long a delay while the backup software rewinds the tapes, ejects them, loads a new tape, and seeks to a free location (this is especially painful on single-reel tapes, such as DLT and LTO Ultrium).
  • If you have to babysit a server through the backups (I've had to do that before, when we didn't have the budget to replace old and under-powered servers), you might as well do the preemptive tape ejection during the backup when necessary. If you notice the backup server is rather idle, kick off a backup group earlier than usual (assuming the backup software will lock the run so two instances of the same schedule can't be run). If you notice that it's overloaded, dequeue a scheduled group, make a note of the group you dequeued, and kick it off later when the backup server is not heavily loaded.
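
Deciding which tapes go into which drives before the nightly run is easy to script once you have the fill level of each tape. The sketch below is only the selection logic, assuming you can get a "fraction used" figure per tape label from your backup software somehow; the function name `plan_drive_loads`, the 0.8 "nearly full" threshold, and the default of two spare drives are my own assumptions, not anything a product defines.

```python
def plan_drive_loads(tapes, drive_count, spare_drives=2, nearly_full=0.8):
    """Pick which tapes to load before the nightly backups start.

    tapes maps a tape label to the fraction of its capacity already used.
    Returns a list of labels, at most drive_count long.
    """
    # Completely full tapes are not writable, so they never get loaded.
    writable = {label: used for label, used in tapes.items() if used < 1.0}
    # Fullest first, so these tapes fill up and can be cloned and retired.
    nearly = sorted((l for l, u in writable.items() if u >= nearly_full),
                    key=lambda l: writable[l], reverse=True)
    # Emptiest first, so the spare drives absorb the most data without a swap.
    roomy = sorted((l for l, u in writable.items() if u < nearly_full),
                   key=lambda l: writable[l])
    spares = roomy[:min(spare_drives, drive_count)]
    fill = nearly[:drive_count - len(spares)]
    # Not enough nearly full tapes? Use the remaining roomy ones instead.
    leftover = drive_count - len(spares) - len(fill)
    extra = roomy[len(spares):len(spares) + leftover]
    return fill + spares + extra
```

With four drives and tapes at 95%, 90%, 85%, 10%, 30%, and 50% used, this loads the 95% and 90% tapes to be filled and keeps the 10% and 30% tapes in the two spare drives.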

Overall speed/performance improvements

  • Force ALL interfaces to full-duplex on Ethernet if possible. Instead of dealing with the sloooow speeds of a 10/half interface, try to get everything set up with 100/full. Auto-negotiation works fine on some combinations of NICs and switches, and it fails miserably on others. Instead of taking a chance, don't allow the hardware to auto-negotiate if at all possible. For those that can't be forced, keep a simple list of all the odd-ball configurations around, so when you notice a problem with the backup performance, you can see if network bandwidth is the cause.
  • Diagnose routing issues for all systems. Many large network infrastructures have multiple VLANs with different customer networks behind them, with even smaller segregation within those networks. You want to make sure the backups aren't going over interfaces and routes they're not supposed to. For one thing, that slows down the network. For another, if the backup software is not running over an encrypted link, other customers can sniff the traffic and see confidential data that they shouldn't have access to.
  • Consider plotting the backup trends for the various systems so you can determine how much data is backed up during a full and incremental backup. Also look at the trend over time to see if the system is gaining data, losing, or staying relatively constant. Based on that, try to balance out the amount of data you back up in a single day (so tape replication/cloning will complete), and schedule the clients by backup groups marked by time, not server class. Thus, you'd have a 02:00 group, instead of a "mail servers" group. Unfortunately, this will require you to manually spread out the full and incremental backups across the days of the week, unless you want to spend the time writing a piece of software that will automatically do this for you. Since the initial spread is a one-time cost, and the maintenance after that can be done manually with very little added overhead if you do it on a regular schedule, writing that software is not really worth the effort.
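
For the one-time spread of full backups across the week, you don't need a real piece of software; a rough first pass is a few lines. The sketch below uses a simple greedy approach (biggest backup first, always onto the lightest day so far), which is my choice for illustration and won't always find the perfect split. The function name `spread_full_backups` and the client names and sizes in the usage note are made up; plug in the full-backup sizes you got from your own trend plots and adjust the result by hand.

```python
def spread_full_backups(full_sizes_gb,
                        days=("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")):
    """Assign each client's full backup to a day, balancing data per day.

    full_sizes_gb maps a client name to the size of its full backup in GB.
    Returns (schedule, load): clients per day and total GB per day.
    """
    schedule = {day: [] for day in days}
    load = {day: 0.0 for day in days}
    # Place the largest backups first, each on the currently lightest day.
    by_size = sorted(full_sizes_gb.items(), key=lambda kv: kv[1], reverse=True)
    for client, size in by_size:
        lightest = min(days, key=lambda d: load[d])
        schedule[lightest].append(client)
        load[lightest] += size
    return schedule, load
```

For example, `spread_full_backups({"mail": 400, "db": 300, "web": 200, "dns": 100, "ftp": 100}, days=("Mon", "Tue"))` puts mail, dns, and ftp on Monday (600 GB) and db and web on Tuesday (500 GB).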

Picking the right backup software

  • This is the part where many admins make a mistake. Don't look at the feature list or the supported systems or the cost or the ease of use right away. First, make a list of the requirements: the systems you need supported, the budget you have for both hardware and software, the amount of time you can afford to invest in the backup infrastructure (don't skimp on this!), and the security requirements of your network.
  • The last point is especially important. For systems where the internal network is not trusted (such as financial institutions where they can't afford to make assumptions), you can't have backup software that lets anyone talk to the backup server or clients and kick off tasks.
  • Once you have that, make a list of products that support your requirements. Then before you even look at the extra features the products offer, see if you can find where their support forums are. Also look for posts regarding that software on various archives like Google and Google News. You want to keep an eye out for reports of corrupted data, corrupted backup server indices, and crashes. Keep track of the versions they're reported in, and verify the same bugs don't keep recurring. If they do, avoid that product like the plague.
  • Many backup products brag about the features they have in terms of backing up systems. How many of them brag about restoration? Who cares about backups? It's the successful restoration that counts. Make sure any product you look at has a way to get a system up and running if it needs to be reimaged from scratch after a complete system failure. This requirement also includes the backup servers. If all your servers go down, you need a way to get up and running, fast. Any software that requires you to scan tapes or media one at a time and rebuild the index before you can even start a restore should not be considered. Look for something that will let you create a "backup server restoration tape".
  • You also want to look for backup software that will let you restore, at the very least, the backup server using standard tools. If you need to install the company's product to get the backup server up and running again, you don't want that product -- a disaster requires you to get as far as you can on as limited a set of resources as possible. Reinstalling an OS and putting the backup software vendor's software on it to begin restoration of the backup server is not an option. Remember, you'll probably end up doing this at 3:00 AM when you've had less than an hour of sleep, pulling fried cables and smoking servers out of racks after some idiot decided to hit the power transformer with a backhoe. Simplicity is the key.
  • After all that's taken care of, look at the software, and if possible, give it a test run. You want to have backup software that will let you selectively back up, and more importantly, selectively exclude from backups, various data. It should not be limited to a filesystem granularity. There is some data that companies are legally required to back up, and other data that a company MUST destroy within a certain timeframe. You don't want to throw away critical data because the tapes just happen to contain stuff that needs to be deleted, right?
  • Now, you're ready to give it the ultimate test -- Try restoring the backup server from scratch after formatting the drive. Document the steps you had to take. Try it again. Really. Twice. Then ask yourself, "In an emergency, will this procedure be simple enough to do in a limited amount of time?" If the answer isn't a clear "Yes", don't buy that product.
  • You can now start looking at cost, time investments, and other issues. Have fun!