Sunday, May 13, 2012

Rsync cyrus volumes from EMC to NetApp

A couple of weeks ago I was faced with the following problem: clean up the EMC filer that hosts one of our client's cyrus e-mail mailboxes. Putting it all on paper, I got to the following:
- there were more than 3 million files and folders
- the only available protocol was NFS
- the available bandwidth was 1 Gbit/s
- the source filer was an EMC, the destination a NetApp

The chosen option was a copy-paste operation from one filer to the other, with a service interruption to avoid synchronization problems. Two questions were raised: how long would it take, and what tool should we use?

After some parking lot meetings and discussions, the shortlist of tools included cp, rsync and cpio. First things first, speed tests. I generated a workload of 50000 files with the following composition: 40% @ 1 KB, 20% @ 2 KB, 20% @ 3 KB and 20% @ 10 KB, using for and dd:

### first 20000 files of 1 KB; adjust the seq range and bs for the remaining sizes
for i in $(seq 1 1 20000); do dd if=/dev/zero of=/tmp/FILER/vol99/$i count=1 bs=1024 2>/dev/null; done
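
For completeness, here is what the full 50000-file workload looks like when the remaining batches are generated the same way. This is a sketch following the composition above, not the exact commands used; paths and ranges mirror the example:

### 40% @ 1 KB, 20% @ 2 KB, 20% @ 3 KB, 20% @ 10 KB
for i in $(seq 1 20000);     do dd if=/dev/zero of=/tmp/FILER/vol99/$i count=1 bs=1024  2>/dev/null; done
for i in $(seq 20001 30000); do dd if=/dev/zero of=/tmp/FILER/vol99/$i count=1 bs=2048  2>/dev/null; done
for i in $(seq 30001 40000); do dd if=/dev/zero of=/tmp/FILER/vol99/$i count=1 bs=3072  2>/dev/null; done
for i in $(seq 40001 50000); do dd if=/dev/zero of=/tmp/FILER/vol99/$i count=1 bs=10240 2>/dev/null; done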


Next, the tests, using the time command:
### cp
time cp -a vol99/ /tmp/DEST/vol99

### rsync
time rsync -a vol99/ /tmp/DEST/vol99/ 

### cpio 
time find . -depth -print | cpio -padmuv /tmp/DEST/vol99/
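
One caveat when timing copies like this: the Linux page cache can make whichever tool runs second look faster, simply because the source files are already in memory. Dropping the cache between runs keeps the comparison fair (Linux-specific, needs root; a sketch):

### flush dirty pages, then drop page cache, dentries and inodes
sync
echo 3 > /proc/sys/vm/drop_caches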

cp -a does the same thing as cp -dpR: it preserves links and permissions and copies recursively. rsync -a is equivalent to rsync -rlptgoD; I was mainly interested in the recursive transfer and the preservation of permissions, modification times, groups and owners. cpio -padmu creates directories where needed and preserves mtimes.

The fastest was cp, and it was taken as the baseline. rsync was 7% slower, while cpio was 46% slower (because of the big find piped into it). Even though cp was the obvious choice on speed alone, we decided to use rsync, mostly because it can resume interrupted transfers and verify checksums (a feature we never actually used). The speed tests also let me estimate the transfer time of the real data at around 100 minutes, so I could go ahead with the formalities on the client side.
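
For reference, the checksum verification mentioned above could look like the dry run below: it compares source and destination by content rather than by size and mtime, without transferring anything (a sketch; flags as in stock rsync):

### -r recursive, -c compare by checksum, -n dry run, -i itemize what differs
rsync -rcni /SRC/vol1/ /DEST/vol1/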


When rsync was tested against real data (around 400K files and directories), it produced a bit of a surprise. It spent more than 15 minutes building its file list and then started a very slow transfer: after 1 hour it had not finished one volume, and we had 5. The main cause was the partition structure used by cyrus imap for the mail store, combined with the large number of files. To get an idea: vol1 has 20 imap partitions, cyrus1 to cyrus20, and each cyrusN partition holds the mailbox structure from A to Z. The first rsync test was:
nohup rsync -a /SRC/vol1/ /DEST/vol1/ &
and this got stuck building the file list and transferred very slowly. The idea was to have a fast transfer, so for the second test rsync was run in parallel, one instance per cyrus partition from 1 to 20:
nohup rsync -a /SRC/vol1/cyrus1/ /DEST/vol1/cyrus1/ &
nohup rsync -a /SRC/vol1/cyrus2/ /DEST/vol1/cyrus2/ &
...
nohup rsync -a /SRC/vol1/cyrus20/ /DEST/vol1/cyrus20/ &
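
The same thing can be expressed as a loop, which is less error-prone than typing 20 lines by hand (a sketch, assuming the partition layout above; wait just blocks until all instances finish):

### one rsync per cyrus partition, all running in parallel
for n in $(seq 1 20); do
    nohup rsync -a /SRC/vol1/cyrus$n/ /DEST/vol1/cyrus$n/ &
done
wait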

This time it worked just as expected. In fact, during the operation the parallel rsyncs managed to push the CPU load on the NetApp 6280 up by 10-15%.

One more thing worth mentioning: rsync's path syntax is a bit tricky. If the trailing slash on the source directory is omitted, rsync creates the directory itself inside the destination, and you end up with a shifted folder structure such as /DEST/vol99/cyrus20/cyrus20/.
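
A quick illustration of the difference (paths are hypothetical):

### with trailing slash: copies the *contents* of cyrus20 into the destination
rsync -a /SRC/vol99/cyrus20/ /DEST/vol99/cyrus20/

### without trailing slash: copies the directory itself, one level deeper
rsync -a /SRC/vol99/cyrus20 /DEST/vol99/cyrus20/
### result: /DEST/vol99/cyrus20/cyrus20/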
