Scaling out Web applications is a serious challenge, especially when your system makes the inevitable shift from a monolithic design toward a distributed one. Databases and asset files are usually the first two pain points when building out a distributed architecture, especially when using cloud services.
10gen set out to ease both of these pain points with the document database MongoDB, and this article will specifically look at how MongoDB's GridFS file storage system can be used to solve problems with distributed asset storage.
On a traditional VPS or dedicated server, asset files uploaded to a Web app are usually stored on the filesystem. Initially this is a very simple approach: writing to a directory structure is straightforward, and Nginx can read from the filesystem with blazing speed (6600 requests per second). Asset file backups can be set up with tools like Rsync, and other third-party tools can provide failover support for a high-availability environment.
This configuration becomes much more complicated, however, as the app is scaled up. If the deployment environment has multiple servers that all need to access the same assets, it becomes necessary to use a network filesystem like NFS to mount the asset directories on each box. NFS can significantly degrade the read/write speed of the asset files, and the app can bottleneck on the ensuing I/O wait times.
The second complication arrives as your asset collection grows from gigabytes to terabytes. As the asset files fill up hard drives, accessing that data becomes slower and slower. Eventually it becomes necessary to shard the asset collection across many servers, which adds a new level of complexity to the application.
Finally, there's the growing possibility that you will be deploying to a cloud platform like Heroku that limits you to a READ-ONLY filesystem. This entirely removes the possibility of writing files to disk and mandates an alternative storage solution.
MongoDB provides the GridFS interface for storing files alongside your other data. It achieves this by breaking every file into 256KB chunks that are stored as a collection of documents, just like any other data in MongoDB. This provides some distinct benefits: files can exceed MongoDB's 16MB document size limit, and sections of a file can be read without loading the whole file into memory.
Additionally, GridFS inherits the features that make MongoDB such an attractive database in the first place: replication, sharding, and straightforward querying of file metadata.
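To make the chunking scheme concrete, here is a minimal pure-Python illustration of the idea (not the actual driver code): a file's bytes are split into numbered chunk documents, which GridFS stores in an fs.chunks collection alongside a metadata document in fs.files.

```python
# Illustration only: a simplified model of how GridFS splits a file
# into chunk documents and reassembles it.
CHUNK_SIZE = 256 * 1024  # the 256KB chunk size described above

def to_chunks(data, file_id):
    """Split a file's bytes into GridFS-style chunk documents."""
    return [
        {"files_id": file_id, "n": i, "data": data[start:start + CHUNK_SIZE]}
        for i, start in enumerate(range(0, len(data), CHUNK_SIZE))
    ]

def from_chunks(chunks):
    """Reassemble the original bytes from ordered chunk documents."""
    return b"".join(c["data"] for c in sorted(chunks, key=lambda c: c["n"]))

chunks = to_chunks(b"x" * (600 * 1024), file_id="logo-1")
print(len(chunks))  # a 600KB file becomes 3 chunk documents
```

Because each chunk is an ordinary document, the usual MongoDB machinery (replication, sharding) applies to file data with no special cases.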
A final benefit that I have enjoyed within my development environment is easy data/asset portability between developers on a team. When a new developer joins and I want to get them up and running as quickly as possible, I will usually dump MySQL and zip up all the assets to send over. This works, but it's a hassle to assemble, and the new dev needs a series of instructions to make sure all the files end up in the right directories.
Using MongoDB and GridFS, on the other hand, makes this process incredibly simple. One command with the mongodump utility packs up all my data AND all my GridFS asset files into a single dump file that can be restored by a one-liner with the mongorestore utility. Or even better, our two MongoDB instances can connect over the network to sync up without having to use any dump files at all.
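The whole exchange boils down to two commands (the database name "myapp" is a placeholder; both commands assume a running MongoDB server):

```shell
# Dump every collection -- including the fs.files and fs.chunks
# collections that hold GridFS data -- into the dump/ directory.
mongodump --db myapp

# On the new developer's machine, restore everything in one shot.
mongorestore --db myapp dump/myapp
```

There are no directories to reconstruct and no separate asset archive: the files ride along with the rest of the data.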
There are several methods you can use for accessing and serving files stored in GridFS. Here's a great article that explores and benchmarks these methods and more.
gridfs-fuse is a great tool that lets you mount your GridFS files on your filesystem, allowing applications like Nginx to transparently read them "from disk". Though nowhere near Nginx's regular filesystem read speed of 6600 req/sec, Nginx + gridfs-fuse still comes in around a very adequate 1000 req/sec.
Perhaps the trickiest part of this solution is getting gridfs-fuse installed and running successfully. Another consideration is that you must have clear logic for mapping URLs directly to GridFS ids. Since GridFS doesn't actually support directory structures, you may have to be clever with Nginx rewrites to build the kind of vanity URLs you intend to use for your assets.
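A rewrite along these lines is one way to bridge the gap; this is only a sketch, and the mount point /mnt/gridfs and the /assets/ URL prefix are assumptions for illustration:

```nginx
# Serve /assets/<filename> from a gridfs-fuse mount. GridFS has no
# directories, so the vanity URL is rewritten to the flat filename
# that appears in the mount point.
location /assets/ {
    rewrite ^/assets/(.+)$ /$1 break;
    root /mnt/gridfs;
}
```

Anything fancier (nested vanity paths, versioned URLs) means encoding that structure into the stored filenames and rewriting accordingly.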
All of the MongoDB drivers should support getting file data into and out of GridFS. At the application level, you can read GridFS data, manipulate it, and then send it down to the client. The upside is that this lets you do cool things like generating image thumbnails on the fly from a full-size image fetched from GridFS.
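With the PyMongo driver, for example, the round trip looks roughly like this. This is a sketch, not production code: it assumes a MongoDB server on localhost and a database named "myapp", both placeholders.

```python
from pymongo import MongoClient
import gridfs

db = MongoClient()["myapp"]
fs = gridfs.GridFS(db)

# Store an uploaded file; the driver chunks it automatically.
file_id = fs.put(b"...image bytes...", filename="photo.jpg",
                 content_type="image/jpeg")

# Later, at request time: fetch it back and hand the bytes to the
# response (or to an image library for thumbnailing first).
grid_out = fs.get(file_id)
image_bytes = grid_out.read()
```

The other drivers expose an equivalent put/get-style interface, so the same pattern translates across languages.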
The downside is that you take the speed and resource hits from handling this at the application level. These downsides can be offset significantly by leveraging caching techniques. Using a caching tool like Varnish will allow you to cache these slow, intensive operations and keep your load under control.
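Varnish caches at the HTTP layer, but the same principle works inside the app process too. A minimal sketch using Python's functools.lru_cache, where expensive_thumbnail is a hypothetical stand-in for a real GridFS fetch plus image resize:

```python
from functools import lru_cache

calls = {"count": 0}  # track how often the expensive work actually runs

@lru_cache(maxsize=128)
def expensive_thumbnail(file_id):
    """Stand-in for an expensive GridFS fetch + image resize."""
    calls["count"] += 1
    return b"thumb:" + file_id.encode()

expensive_thumbnail("logo-1")
expensive_thumbnail("logo-1")  # second call is served from the cache
print(calls["count"])  # the expensive work ran only once
```

Either way, the goal is the same: pay the GridFS read and transformation cost once, then serve the result cheaply.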
Amazon S3 is obviously a great tool, and it offers very similar benefits to using GridFS. Using S3, you get automatic scalability, backups and failover. Your files are highly available and can be accessed by your application or served directly to the end user. The price is usually very cost effective, and you don't have to worry about the infrastructure.
One downside to using S3 is that you have to keep track of your remote files somehow; usually this means recording metadata about your files in your database. This extra bit of complexity is unnecessary with GridFS, since your data is stored in your database to begin with. Otherwise, it's a great service to use.
So should you choose GridFS over S3? It comes down to your project's requirements. In some circumstances one may be more cost effective than the other, depending on what hardware resources you are already using. If you are already using MongoDB as a database, you can start leveraging GridFS without additional moving parts. While I personally enjoy the integrated nature of having all my data in one place, there is also something to be said for not having to worry about your own infrastructure. Ultimately, both are great solutions to the growing problem of scaling out asset file management in modern Web apps.