Saturday, April 13, 2019

Finding Candidate Open Source Distributed File Systems

In my ever-present quest to keep myself entertained at work I've decided that I should mess around with distributed file systems for a bit. The "comparison" page on Wikipedia lists a lot of systems, but it's hard to tell which of them are truly viable options for production use and which of them are research projects. This post records a little bit of research in the hopes that it'll be useful to posterity.

Eliminating all systems marked "proprietary" leaves us with:

  • Alluxio
  • BeeGFS
  • Ceph
  • GlusterFS
  • MooseFS
  • Quantcast File System
  • LizardFS
  • Lustre
  • OpenAFS
  • OpenIO
  • SeaweedFS
  • Tahoe-LAFS
  • HDFS
  • XtreemFS
  • Ori

If I'm going to be running something in production I'd prefer that it be actively maintained. Projects which don't appear to be under active development anymore:

  • Quantcast File System: Last release was in 2015.
  • XtreemFS: Last release was in 2015.
Interesting that this only knocked out two players; it looks like most of the file systems listed on the Wikipedia page are still under active development.

How about robustness of development? Which of these are someone's thesis project and which of them have a large development community? Presented in descending order of number of contributors:

  1. Alluxio: 943 contributors
  2. Ceph: 737 contributors
  3. GlusterFS: 191 contributors
  4. SeaweedFS: 59 contributors
  5. LizardFS: 35 contributors, a small-ish number.
  6. OpenIO: 23 contributors
  7. MooseFS: 9 contributors
  8. BeeGFS: Their source code repository doesn't appear to have any data on number of contributors.
  9. Lustre: Not at all obvious from their repository.
  10. OpenAFS: Not obvious from their repository.
  11. Tahoe-LAFS: Not obvious from their repository.
  12. HDFS: Not obvious from their repository.
Interesting that Alluxio, a system I don't think I've ever heard of before, has the largest number of contributors. Also, some of the systems that don't have data readily available (Lustre and HDFS in particular) are known contenders. So number of contributors should really only be interpreted as a soft marker of popularity and/or robustness, useful for prioritizing investigations but not necessarily qualifying out.
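
As a rough illustration of using contributor counts as a soft ranking signal rather than a hard filter, here is a minimal Python sketch. The numbers are the ones recorded above; the function name `rank_by_contributors` and the choice to sort unknowns last (instead of dropping them) are my own assumptions for the example.

```python
# Contributor counts as recorded in this post (April 2019).
# None means the repository did not make the number obvious.
contributors = {
    "Alluxio": 943,
    "Ceph": 737,
    "GlusterFS": 191,
    "SeaweedFS": 59,
    "LizardFS": 35,
    "OpenIO": 23,
    "MooseFS": 9,
    "BeeGFS": None,
    "Lustre": None,
    "OpenAFS": None,
    "Tahoe-LAFS": None,
    "HDFS": None,
}


def rank_by_contributors(counts):
    """Return names sorted by contributor count, descending.

    Systems with an unknown count are kept and placed last, so a missing
    number lowers priority without disqualifying the system.
    """
    known = sorted(
        (name for name, n in counts.items() if n is not None),
        key=lambda name: counts[name],
        reverse=True,
    )
    unknown = [name for name, n in counts.items() if n is None]
    return known + unknown


ranking = rank_by_contributors(contributors)
for position, name in enumerate(ranking, start=1):
    n = contributors[name]
    print(f"{position}. {name}: {n if n is not None else 'unknown'}")
```

Keeping the unknowns in the list matters here: Lustre and HDFS would otherwise vanish from consideration just because their repositories don't expose a contributor count.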

What flavors of storage do they support? I'm mostly interested in a good GFS, with object store a nice-to-have and block devices a distant third.

  • Alluxio: "Alluxio sits between computation and storage in the big-data analytics stack. It provides a data abstraction layer for computation frameworks, enabling applications to connect to numerous storage systems through a common interface". Not really what I want.
  • Ceph: File system via CephFS, object store, and block storage.
  • Gluster: File system via GlusterFS (and maybe object store via Swift?)
  • Moose: File system via MooseFS.
  • LizardFS: File system.
  • Lustre: File system.
  • OpenAFS: File system.
  • OpenIO: Object store (plus non-free/proprietary FUSE connector)
  • SeaweedFS: Object store with optional FS support.
  • Tahoe-LAFS: Cloud storage-ish model. Doesn't look like it fits the bill.
  • HDFS: Specifically designed for streaming access for large-scale computation; not a general-purpose DFS.
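
The preference stated above (file system first, object store nice-to-have, block devices a distant third) can be written down as a small scoring sketch in Python. The flavor sets encode my reading of the list above, and the weights in `WEIGHTS` are hypothetical numbers picked only to reflect that ordering, not measurements of anything.

```python
# Storage flavors as summarized in the list above. Entries the list marks as
# uncertain or not a fit are recorded conservatively.
flavors = {
    "Alluxio": set(),                 # data abstraction layer, not what I want
    "Ceph": {"file", "object", "block"},
    "GlusterFS": {"file"},            # object store via Swift is only a "maybe"
    "MooseFS": {"file"},
    "LizardFS": {"file"},
    "Lustre": {"file"},
    "OpenAFS": {"file"},
    "OpenIO": {"object"},             # file access needs a non-free connector
    "SeaweedFS": {"object", "file"},  # file system support is optional
    "Tahoe-LAFS": set(),              # cloud storage-ish model
    "HDFS": set(),                    # not a general-purpose DFS, so no match
}

# Hypothetical weights: file system dominates, object store is a bonus,
# block storage barely counts.
WEIGHTS = {"file": 100, "object": 10, "block": 1}


def score(supported):
    """Score a system; anything without file system support scores zero."""
    if "file" not in supported:
        return 0
    return sum(WEIGHTS[flavor] for flavor in supported)


# sorted() is stable, so ties keep the order in which systems were listed.
shortlist = sorted(
    (name for name in flavors if score(flavors[name]) > 0),
    key=lambda name: score(flavors[name]),
    reverse=True,
)

for name in shortlist:
    print(f"{name}: {score(flavors[name])}")
```

With these assumed weights Ceph comes out on top because it is the only system covering all three flavors.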

At this point Ceph looks like the front-runner in terms of features and robustness. Next question: Does it run on CentOS 7? According to the Ceph OS Recommendations it does, as long as you don't use btrfs. So Ceph seems like a good place to start dabbling for the time being.
