Backing up servers to S3
Sat 19 February 2011

As part of our build automation here at Riot, we've been looking for solid options to back up our servers (configs, logs, data, etc.) to an off-site location. Our provider does daily backups of our servers and restores data on demand, which is certainly nice, but it left us wanting more fine-grained control of the process. Cost, simplicity and security were our top concerns, and our search led us to duplicity combined with Amazon's S3. Here's how we use it.

Setup

You will need to have librsync installed on your system. On Ubuntu:

apt-get install librsync-dev

Since duplicity is a Python app, we chose to install it in a virtualenv. It's pip-installable but not on PyPI, so you will have to point pip at the tarball. We install boto alongside it, since duplicity's S3 backend depends on it.

virtualenv duplicity
cd duplicity
source bin/activate
pip install http://code.launchpad.net/duplicity/0.6-series/0.6.11/+download/duplicity-0.6.11.tar.gz boto

Or, on Ubuntu, install the packaged version:

apt-get install duplicity

If you want to encrypt your backups, you will need to generate a GnuPG key, like so:

gpg --gen-key

You can accept the default options during key generation, but make sure you set a passphrase on the key, as duplicity will not work without it.

Backup

S3 is just one of the many backends duplicity supports. The duplicity docs have more info.

Here's our backup script:

export AWS_ACCESS_KEY_ID='xxxxxx'
export AWS_SECRET_ACCESS_KEY='xxxxxx'
export PASSPHRASE='xxxxxx'
export NOW=`date +"%Y-%m-%d-%H-%M"`

duplicity --exclude ".*" --include "**" --full-if-older-than 30D \
          --log-file /var/log/duplicity/s3-$NOW.log --verbosity 6 \
          --s3-use-rrs --s3-use-new-style --asynchronous-upload \
          /var/www/backups s3+http://riot.xxxx.xxxx

# Remove the secrets from the environment once the backup is done
unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY PASSPHRASE NOW
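
The shell script puts the secrets into the calling shell's environment and clears them afterwards. As an alternative sketch, a small Python wrapper can hand them to the duplicity process alone. The function names `build_backup_command` and `run_backup` are hypothetical; the flags and paths are the ones from the script above, and the code assumes Python 3.

```python
import os
import subprocess
from datetime import datetime


def build_backup_command(source, target, log_dir, now):
    """Build the same duplicity argument list as the shell script."""
    log_file = "%s/s3-%s.log" % (log_dir, now.strftime("%Y-%m-%d-%H-%M"))
    return [
        "duplicity",
        "--exclude", ".*", "--include", "**",
        "--full-if-older-than", "30D",
        "--log-file", log_file, "--verbosity", "6",
        "--s3-use-rrs", "--s3-use-new-style", "--asynchronous-upload",
        source, target,
    ]


def run_backup(source, target, secrets, log_dir="/var/log/duplicity"):
    """Run duplicity with the secrets visible only to the child process.

    `secrets` is a dict holding AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
    and PASSPHRASE; nothing is exported into the parent environment.
    """
    env = dict(os.environ)
    env.update(secrets)
    command = build_backup_command(source, target, log_dir, datetime.now())
    return subprocess.call(command, env=env)


# Show the command without running it (placeholder bucket, as in the script).
print(" ".join(build_backup_command(
    "/var/www/backups", "s3+http://riot.xxxx.xxxx",
    "/var/log/duplicity", datetime.now())))
```

Calling `run_backup("/var/www/backups", "s3+http://riot.xxxx.xxxx", secrets)` would then return duplicity's exit code, which a cron wrapper can check.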

Restore

Restoring is a snap too. Though we haven't had the need to restore yet, this is how you would:

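# AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and PASSPHRASE must be exported, as in the backup script.
# --file-to-restore takes a path relative to the backed-up directory (/var/www/backups).

# Restore a file
duplicity --file-to-restore code.tar s3+http://riot.xxxx.xxxx ~/tmp/restore

# Restore a directory
duplicity --file-to-restore db s3+http://riot.xxxx.xxxx ~/tmp/restore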

# Restore everything from a point in time
duplicity -t 2011-02-19T12:20:45 s3+http://riot.xxxx.xxxx ~/tmp/restore
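
Among the time formats listed in duplicity's man page is a datetime string with an explicit UTC offset (its example is 2002-01-25T07:00:00+02:00). If you script restores, a helper along these lines can produce that form for the -t option. This is a sketch assuming Python 3; `duplicity_time` is a hypothetical name, not part of duplicity.

```python
from datetime import datetime, timedelta, timezone


def duplicity_time(moment):
    """Format a timezone-aware datetime as YYYY-MM-DDTHH:MM:SS+HH:MM."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    # Drop microseconds so the string has whole seconds only.
    return moment.replace(microsecond=0).isoformat()


restore_point = datetime(2011, 2, 19, 12, 20, 45, tzinfo=timezone.utc)
print(duplicity_time(restore_point))  # 2011-02-19T12:20:45+00:00
```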

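The backup script runs hourly and does incremental backups to our S3 bucket. Because of --full-if-older-than 30D, duplicity starts a fresh full backup once the last full one is more than 30 days old.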

Timing Code Execution in Python
Sun 30 January 2011

I wrote an implementation of the Levenshtein algorithm in Python a few days back. Today, while noodling around, I came across another implementation of the same algorithm, written by Magnus Hetland, the author of Python Algorithms, and wanted to see which was the "faster" implementation.

So, enter Python's timeit module. Here's what I did:

>>> def levenshtein(a,b):
...     "Magnus's Code"
...
...     [ Code here ]
...
>>> def leven(a,b):
...     "Rohit's Code"
...
...     [ Code here ]
...
>>> import timeit
>>> t1 = timeit.Timer(setup='from __main__ import levenshtein', stmt='levenshtein("plumber","causes")').timeit()
>>> t1
50.655728101730347
>>> t2 = timeit.Timer(setup='from __main__ import leven', stmt='leven("plumber","causes")').timeit()
>>> t2
68.573153972625732

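Seems like Magnus has me beat :(. (timeit() runs the statement one million times by default, so these figures are total seconds for a million calls, not the time for a single call.)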
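
Both function bodies are elided above, so here is a self-contained stand-in you can run to try the same measurement. It is a textbook two-row dynamic programming version, neither Magnus's code nor mine, and `levenshtein_sketch` is a hypothetical name. The sketch assumes Python 3.5+ (for the `globals` argument) and passes an explicit `number` so the run stays short.

```python
import timeit


def levenshtein_sketch(a, b):
    """Edit distance between a and b, keeping only two rows of the table."""
    if len(a) < len(b):
        a, b = b, a  # make b the shorter string so the rows stay small
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


timer = timeit.Timer(
    stmt='levenshtein_sketch("plumber", "causes")',
    globals=globals(),  # lets the statement see the function defined above
)
# Take the best of three runs of 10,000 calls each to reduce noise.
best = min(timer.repeat(repeat=3, number=10000))
print("best of 3: %.4fs per 10,000 calls" % best)
```

Taking the minimum of several repeats is the usual way to compare two implementations, since slower runs mostly reflect other activity on the machine.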

One point to note here is that timeit() temporarily turns off garbage collection while timing, so if garbage collection is an important part of what you are measuring, you will need to re-enable it in the setup string:

>>> import gc
>>> setup = """\
... from __main__ import levenshtein
... gc.enable()
... """
>>> t2 = timeit.Timer(setup=setup, stmt='levenshtein("plumber","causes")').timeit()
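
To see whether the collector matters for a given piece of code, you can time the same statement both ways. This is a rough sketch assuming Python 3; the allocation-heavy statement is an arbitrary example, and the size of the difference (if any) will depend on the interpreter and workload.

```python
import gc
import timeit

# A statement that creates many container objects, which the collector tracks.
stmt = "[[] for _ in range(1000)]"

# Default behaviour: timeit disables the collector while timing.
gc_off = timeit.timeit(stmt, number=2000)

# Re-enable the collector in the setup code, as in the session above.
gc_on = timeit.timeit(stmt, setup="import gc; gc.enable()", number=2000)

print("GC disabled (timeit default): %.4fs" % gc_off)
print("GC re-enabled in setup:       %.4fs" % gc_on)
```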

There is also quite a nice collection of Python performance tips here.