In pursuit of the other kind of Big Data(tm)

[I suppose I should mention by way of disclaimer that I work for a really big company. If you can think of a data warehousing, analytics, business intelligence, or even any technology product that exists, my employer probably uses it somewhere, or has, or will. Others in the org have written and/or presented about our technology choices in the not so distant past. So no big secrets being revealed here. I’d also note that I’m writing this as a sysadmin who had to solve a problem, not as a representative or spokesman for my company. I’m the former, not the latter.]

Almost ten years ago I got into what would eventually be branded Big Data. Back then it was Business Intelligence, or in my environment, just "batch"… I haven't had a job since 2003 that didn't involve BI/analytics/big data, and the last three have focused heavily on Hadoop and the modern data warehousing tools intertwined with it.

But there's another kind of big data that predates MapReduce, and it's bitten me three times in the last couple of months while setting up a secondary site for an analytics unit.

Your mission, should you choose to accept it…

The task involved moving a couple hundred terabytes of data from different environments and applications (primarily Hadoop, MySQL, and Vertica, but some archival logs and such as well). Moving them across the country without a dedicated connection, mind you.

One of the early suggestions was something like this:

"Just dump 300TB of data onto a USB drive and ship it." Sounds like a great idea, but a few problems arose.

1) They don't really make 300TB USB drives. One provider has some 20TB USB arrays. We could have done Thunderbolt or USB3 with add-on cards, but we'd still be limited by the fact that…

2) It takes a while to load data off a server's disks onto some other sort of disk. I usually base my rough math on 20-40 MByte/sec to pull data off a disk array. Add the time it takes to unload it on the other end, and you realize that "we can send it by FedEx overnight" isn't exactly the speedy proposition it sounds like.

3) There were some policies that would have complicated the ship-overnight part. But those were the simplest to deal with, and not technological, so not really in my scope.

Treating the data in 50TB blocks, we had an estimate of 20 days to load a block onto external storage, whether that was external USB, NFS, or locally attached Fibre Channel. We could have sped it up a bit, but only at the cost of the production infrastructure's performance; we had to leave *some* of the disk throughput for the users and BI processes.
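To put numbers on that estimate, here's the back-of-envelope math, using 30 MByte/sec as the midpoint of my rough 20-40 range above (not a measurement from this particular environment):

    # Back-of-envelope staging time for one 50TB block at ~30 MByte/sec sustained reads
    BLOCK_TB=50
    RATE_MB_PER_SEC=30
    SECONDS_NEEDED=$(( BLOCK_TB * 1000 * 1000 / RATE_MB_PER_SEC ))   # 50,000,000 MB / 30
    echo "$(( SECONDS_NEEDED / 86400 )) days"                        # ~19 days, call it 20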

Somewhere out there…

So I set about plotting a path across the Internet.

For the Vertica platform, we had easily-made backups that ended up as a huge pile of files on a couple of big servers. In that form they were easy to split into buckets and transfer in parallel with rsync over ssh: something like 30 streams at a time, peaking at 250 MBytes/second, or roughly 900 GBytes/hour. That works out to roughly three days to transfer 50TB over the dirty Internet. Not too bad.
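As a rough sketch of how you might fan buckets out over parallel streams: the /backups/bucket_* layout, the remotehost name, and the stream count below are placeholders for illustration, not the actual paths I used.

    # Push each bucket directory in its own rsync stream, up to 30 at a time (GNU xargs)
    # Paths and host are hypothetical; run one of these per source server / client pair
    ls -d /backups/bucket_* | xargs -P 30 -I{} \
        rsync -a -e ssh {} remotehost:/backups/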

This method is reproducible, so incremental backups can be transferred with the same structure. I spread the work across a few client systems and three source servers, which limited the impact of the transfer on either end. It did wake up the network guys, though; I guess they weren't used to seeing 2 Gbit/sec coming out of a non-production network to the public Internet.

If you're doing this in your world, here are some recommendations that may help (a combined example command follows the list).

1) Use the newest rsync available to you. Ten years ago, when I used to have to move Postgres databases around, the difference between 2.4.6 and 2.5.4 was night and day, and getting onto a 3.x version will give you newer options and better performance.

1a) Make sure your OpenSSH packages are current as well. If you have to do this with huge data or frequent transfers, build your own packages.

2) At least for the first pass, consider [ --whole-file ] (-W). Since you will have no data on the other end yet, there is nothing to compare against, and this keeps rsync from trying to check for changes.

3) After the first time, if you need to resync incrementals (or you lose connections), check your file sizes before using [ --partial ] or -P. The partial-file handling is great if you're dealing with large files, but it does take longer to start up, and I've seen cases where it causes the connection to time out and the rsync to fail before it really gets started. If your files take less than a minute or two to transfer, you will lose more than you gain from this option.

4) Check your ssh cipher. You will probably find "arcfour" to be a reasonably fast option, and depending on your ssh version and compile options on both ends, you might get something even faster. Change the cipher by including the rsync option [ -e "ssh -c arcfour" ], or whatever cipher you choose. This change will probably make more difference than whether you use [ --compress ] or "-z".

5) Consider logging status to a text file so you can track your transfers. Using the options [ --stats --progress ] will give you the most detail. The "progress" option shows per-file progress and transfer rate as it goes; the "stats" option gives you a summary at the end of the transfer.

6) If you're going to do incrementals, you can run with -n (--dry-run) and --stats to see how much needs to be transferred before running the real transfer.
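Putting those recommendations together, a first-pass and incremental-pass invocation might look something like this. The paths, host name, and log file are hypothetical, and arcfour may not be available on every OpenSSH build, so substitute whatever cipher both ends support:

    # Sanity-check versions on both ends first
    rsync --version | head -1 ; ssh -V

    # First full pass: whole files, no delta checks, fast cipher, logged to a file
    rsync -aW -e "ssh -c arcfour" --stats --progress \
        /backups/vertica/ remotehost:/backups/vertica/ >> /var/tmp/rsync-vertica.log 2>&1

    # Incremental passes: dry-run first to see how much would move...
    rsync -a -n --stats /backups/vertica/ remotehost:/backups/vertica/
    # ...then the real transfer, adding --partial only if your individual files are large
    rsync -a --partial --stats /backups/vertica/ remotehost:/backups/vertica/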

I actually did not use compression, because I had lots of small files and a fast enough theoretical connection (and a rush on the proof of concept, which ended up being the production method). You should try it on a subset of your data, comparing with and without compression. If you have time, also consider trying different ssh ciphers. You can find other people's benchmarks (like the OpenVPN and OpenSSH links in the references below), but unless you have the same CPUs and connections they do, it's still just a slightly more educated guess.
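If you do have time to test, a loop like this gives a rough comparison on a representative sample file. The file, host, and cipher list are placeholders, and the cipher names your ssh actually offers will vary:

    # Rough cipher comparison on a sample file (names are hypothetical)
    for cipher in arcfour blowfish-cbc aes128-cbc aes128-ctr; do
        echo "== $cipher =="
        time rsync -W -e "ssh -c $cipher" /backups/sample.dat remotehost:/tmp/
        ssh remotehost rm -f /tmp/sample.dat    # remove it so every run does a full copy
    done
    # Repeat the winner with and without -z to see whether compression helps your data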

But wait, there's more… soon

Since this post is getting a bit long, I'll come back later and talk about the challenges of moving a single 10TB file, and an older-version HDFS cluster, in this same adventure.

If you have thoughts on data migration challenges, I read (and often respond to) comments below. Join the discussion!

Some references:

Completely unrelated writeup about Hadoop at Disney, from Hadoop World 2011: Slides and Video links from Cloudera (the links on the Hadoop World site are outdated, apparently)

OpenVPN’s Gigabit Networks Linux page, which talks about MTUs and Ciphers: http://community.openvpn.net/openvpn/wiki/Gigabit_Networks_Linux

An interesting blog entry by Ivan Zahariev on OpenSSH ciphers: http://blog.famzah.net/2010/06/11/openssh-ciphers-performance-benchmark/
