Review: The Duplicity Backup System

I needed a way to back up a single server to a remote hard disk. There are many scripts around, and I certainly could have written one myself, but I found Duplicity and now I highly recommend it:

http://duplicity.nongnu.org

Duplicity uses librsync to generate incremental backups that are very small. It generates the backups, encrypts them with GPG, and then sends them to another server by any of the major methods: scp, ftp, sftp, rsync, etc. You can back up starting at any directory, not just at mountpoints, and there is a full language for specifying files you want to exclude.

Installation: The most difficult part is probably setting up your GPG keys if you've never set them up before. (Note: you really, really need to protect the private key. It is required for restores. If you lose your machine in a fire and don't have a copy of the private key somewhere, you won't be able to do a restore. Really. I burned mine onto a few CDs and put them in various hidden places.)
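For anyone who has never done the GPG part, here is a minimal sketch of creating a key and saving an off-machine copy of the private half. The key ID "XXXXXXXX" and the output filename are placeholders, not my real values, and the script only prints the gpg command unless you set GPG=gpg.

```shell
#!/bin/sh
# Sketch only. KEYID and the output filename are placeholders.
GPG="${GPG:-echo gpg}"       # dry run by default; set GPG=gpg to really run it
KEYID="${KEYID:-XXXXXXXX}"   # ID of the key used for backups

# Create a key pair interactively (gpg prompts for name, email, passphrase).
make_key() {
    $GPG --gen-key
}

# Write an ASCII-armored copy of the private key to the file named in $1.
# This file is what you burn to CD and hide; anyone holding it (plus the
# passphrase) can decrypt your backups.
export_private_key() {
    $GPG --armor --output "$1" --export-secret-keys "$KEYID"
}

export_private_key duplicity-private-key.asc
```

After a disaster, "gpg --import duplicity-private-key.asc" on the replacement machine brings the key back so restores can be decrypted.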

The machine I'm backing up is a virtual machine in a colo. They don't offer backup services, so I had to take care of it myself. The machine runs FreeBSD 8.0-RELEASE-p4 and Duplicity works great on it. The code is very portable: Python, GPG, librsync, etc. Nothing digs into the kernel or raw devices or anything like that.

I wrote a simple script that loops through all the directories that I want backed up, and runs:

duplicity --full-if-older-than 5W --encrypt-key="${PGPKEYID}" $DIRECTORY scp://myarchives@mybackuphost/$BACKUPSET$dir

The "--full-if-older-than 5W" means that it does an incremental backup unless the last full backup is more than five weeks (35 days) old, in which case it does a full backup instead. I use 5W instead of 4W because I want to make sure no more than one full backup happens in any billing cycle. I'm charged for bandwidth and fear that two full dumps in the same month may put me over the limit.
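A minimal sketch of what such a loop can look like, under some assumptions: the directory list, key ID, host, and backup-set name are all placeholders rather than my real configuration, and directory names are assumed to contain no spaces. By default it only prints the duplicity commands; set DUPLICITY=duplicity to run them.

```shell
#!/bin/sh
# Sketch of a backup loop. Every value below is a placeholder.
PGPKEYID="${PGPKEYID:-XXXXXXXX}"          # GPG key ID used for encryption
BACKUPHOST="${BACKUPHOST:-mybackuphost}"  # SSH host (or alias) receiving archives
BACKUPSET="${BACKUPSET:-directoryname}"   # top-level directory on the backup host
DIRS="${DIRS:-/etc /home/tal}"            # directories to back up (no spaces)
DUPLICITY="${DUPLICITY:-echo duplicity}"  # dry run by default

backup_all() {
    for dir in $DIRS; do
        # Each directory becomes its own backup set at $BACKUPSET$dir.
        # Incremental normally; full once the last full is over 5 weeks old.
        $DUPLICITY --full-if-older-than 5W \
            --encrypt-key="$PGPKEYID" \
            "$dir" "scp://$BACKUPHOST/$BACKUPSET$dir" || return 1
    done
}

backup_all
```

Stopping at the first failure (the "|| return 1") is a design choice for the sketch; a cron job might prefer to continue and report all failures at the end.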

My configuration: I'm scp'ing the files to another machine, which has a cheap USB 2.0 1 TB hard disk. I set it up so that I can ssh from the source machine to the destination machine without needing a password ("PubkeyAuthentication yes"). In the example above "myarchives" is the username that I'm doing the backup to, and "mybackuphost" is the host. Actually I just specify the hostname and use a .ssh/config entry to set the default username to be "myarchives". That way I can specify "mybackuphost" in other shell scripts, etc. SSH aliases FTW!
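The .ssh/config entry is only a few lines. Here is a small sketch that prints a stanza of the kind described above; the function name is made up for illustration, and the host and user are the same placeholder names used in this post. Append the output to ~/.ssh/config on the source machine.

```shell
#!/bin/sh
# ssh_alias_stanza HOST USER (hypothetical helper, for illustration only)
# Prints an ssh config stanza that makes "ssh HOST" log in as USER by default.
ssh_alias_stanza() {
    cat <<EOF
Host $1
    User $2
    PubkeyAuthentication yes
EOF
}

ssh_alias_stanza mybackuphost myarchives
```

With that stanza in place, "scp://mybackuphost/..." in the duplicity command connects as "myarchives" without the username appearing in any script.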

Restores: Of course, I don't actually care about backups. I only care about restores. When restoring a file, duplicity figures out which full and incremental backups need to be retrieved and decrypted. You just specify the date you want (default "the latest") and it does all the work. I was impressed at how little thinking I needed to do.

After running the system for a few days it was time to do a restore to make sure it all worked.

The restore syntax is a little confusing because the documentation didn't have a lot of examples. In particular, the most common restore situation is not restoring the full backup set, but "I messed up a file, or think I messed it up, so I want to restore an old version (from a particular date) to /tmp to see what it used to look like."

What confused me: 1) You specify the path to the file (or directory) but you don't list the path leading up to the mountpoint (or directory) that was backed up. In hindsight that is obvious but it caught me. What saved me was that when I listed the files, they were displayed without the mountpoint. 2) You have to be very careful to specify where you put the backup set. You specify that on the command line as the source, and you specify the file to be restored in the "--file-to-restore" option. You can't specify the entire thing on the command line and expect duplicity to guess where to split it.

So that I don't have to re-learn the commands at a time when I'm panicking because I just deleted a critical file, I've made notes about how to do a restore. With some changes to protect the innocent, they look like:

Step 1. List all the files that are backed up to the "home/tal" area:

duplicity list-current-files scp://mybackuphost/directoryname/home/tal

To list what they were like on a particular date, add: --restore-time "2002-01-25"

Step 2. Restore a file from that list (not to the original place):

duplicity restore --encrypt-key=XXXXXXXX --file-to-restore=path/you/saw/in/listing scp://mybackuphost/directoryname/home/tal /tmp/restore

If the old file was "/home/tal/path/to/file" and the backup was done on "/home/tal", you need to specify --file-to-restore as "path/to/file", not "/home/tal/path/to/file". You can name a directory to get all the files in it. The /tmp/restore should be a directory that already exists.

To restore the files as of a particular date, add: --restore-time "2002-01-25"
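Since the path-stripping rule is the part that tripped me up, here is a sketch of a wrapper that does it for you. The function names are made up for illustration and the key ID is a placeholder. It takes the backup URL, the directory that was backed up, the absolute path of the file you want, a destination, and an optional date, then builds the same restore command shown in Step 2. By default it only prints the command; set DUPLICITY=duplicity to run it.

```shell
#!/bin/sh
# Sketch only. Function names and PGPKEYID are placeholders.
DUPLICITY="${DUPLICITY:-echo duplicity}"  # dry run by default
PGPKEYID="${PGPKEYID:-XXXXXXXX}"

# relative_path ROOT FILE
# Strip the backed-up root from an absolute path:
#   /home/tal + /home/tal/path/to/file gives path/to/file
relative_path() {
    root=${1%/}            # drop a trailing slash from the root, if any
    rel=${2#"$root"/}      # remove "ROOT/" from the front of FILE
    printf '%s\n' "$rel"
}

# restore_file URL ROOT FILE DEST [DATE]
restore_file() {
    url=$1 root=$2 file=$3 dest=$4 date=${5:-}
    rel=$(relative_path "$root" "$file")
    if [ -n "$date" ]; then
        # Restore the file as it was on DATE.
        $DUPLICITY restore --encrypt-key="$PGPKEYID" --restore-time "$date" \
            --file-to-restore="$rel" "$url" "$dest"
    else
        # No date given: restore the latest version.
        $DUPLICITY restore --encrypt-key="$PGPKEYID" \
            --file-to-restore="$rel" "$url" "$dest"
    fi
}

restore_file scp://mybackuphost/directoryname/home/tal \
    /home/tal /home/tal/path/to/file /tmp/restore 2002-01-25
```

Keeping the URL and the file path as separate arguments mirrors the point above: duplicity cannot guess where the backup set ends and the file path begins, so the wrapper never joins them.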

Conclusion: Duplicity is a great piece of engineering. It is very fast, both because it makes good use of librsync to keep the backups small and because it stores indexes of which files were backed up, so that the entire backup doesn't have to be read just to get a file list. The backups are split across many small files so that not a lot of temp space is required on the source machine. The tools are very easy to use: they handle all the machinations of full and incremental sets, so you can focus on what to back up and what to restore.

Caveats: As with any backup system, you should do a "fire drill" now and then and test your restore procedure. I recommend you encapsulate your backup process in a shell script so that you do it the same way every time.

I highly recommend Duplicity.

http://duplicity.nongnu.org

Posted by Tom Limoncelli in Technical Tips

7 Comments

Fabio Muzzi | October 12, 2010 5:17 AM

Why didn't you use Backuppc? I find it useful for remote backups like this, using rsync over ssh.

Mike Doyle | October 14, 2010 5:55 PM

Excellent! Duplicity has a great concept, but was vaporware a few years ago. I actually learned Python to revamp development on it. Fortunately someone who already knew the language did that for me while I was still learning.

I'm glad to read this recommendation. I'll probably start using it this weekend.

brent.chapman | November 7, 2010 1:53 PM

Looks like there's a key limitation to Duplicity: it doesn't handle hard links. That isn't a problem for file-oriented restores (where you want a particular file that you accidentally deleted or that got mangled or something), but isn't it a show-stopper for system-level bulk restores? If you want to restore a full system, you need to preserve hard links; many parts of the OS (libraries, for example, and timezone data) depend upon hard links, as do many applications.

Got any suggestions about that?

Tom Limoncelli | November 8, 2010 10:12 AM

Good point. It is definitely not a "restore from bare metal" kind of system. Luckily, the system I'm using Duplicity on doesn't require that. If it did, I might check out Bacula or Amanda to see what they offer.

Tom

Quentin | February 26, 2011 10:49 AM

Just a (belated) extra vote for duplicity - I've been using it for a variety of backups for a couple of years and it has been excellent.

In particular, it makes sensible use of bandwidth if backing up to Amazon S3.

DKSG | July 8, 2012 3:43 PM

@brent

The answer you need is here
http://wiki.zimbra.com/wiki/Hard_links

djeliba | January 10, 2013 4:11 PM

You could back up a directory containing hard links without using much extra space by first creating a hard-link copy of that directory with rsync --link-dest, then scanning for and purging hard links with the script that DKSG linked. The script needs to be modified to look for files with 4 or more links, since all files will have at least 2 links from rsync. Say you want to back up a directory a_dir which contains hard links:

# If the argument to --link-dest is a relative path, it is relative to the destination directory (a_dir.tmp in this case).
rsync -aH a_dir/ a_dir.tmp --link-dest='../a_dir'
hardlinks scan a_dir.tmp a_dir.tmp/.hardlinks
hardlinks purge a_dir.tmp a_dir.tmp/.hardlinks
duplicity a_dir.tmp remote_host_url

This requires more effort client side than is really necessary, but it uses no extra network resources. Unless a large amount of disk is used by a_dir to store the names of files and directories, the copy created by rsync won't use much extra space.
