I have received a lot of responses to my article “What I’m Telling Business People About Why Relational Databases Are So Bad”. A lot of people have commented on that article asking: if relational databases are so bad, what are the alternatives? I totally agree that this is a valid question.
In this article I’m going to show a software pattern that, for most enterprise applications, is far superior to using a database at all. I will describe the pattern using Java-like code.
Basically this pattern says: put all the data in memory and treat persistence as just a way of rebuilding the memory state when the application restarts. The data of even seemingly large applications is small enough to fit in the memory of modern computers. When you have 256GB of memory on a computer it is hard to find applications whose data cannot fit in it. Certainly those kinds of applications exist, but they are outliers. So you can basically put all the data in memory and persist serialized objects to the disk in a long list, adding new ones to the end each time. If you have to reboot the application you read back all of that persisted data and rebuild the memory structures. Simple, but it works like a charm.
This article explores a simple and ultra reliable way of doing this. But first we should look at databases and see what they are doing for us.
The two major services provided by databases are persistence and data access.
Persistence is the ability to keep data even when the power goes off. As I mentioned in my article on why databases will eventually be a forgotten technology (see Databases The Future Buggy Whips of Software), not having a high-speed persistent memory has bedeviled computing since its very beginnings. So one thing that people use databases for is to guarantee that the data will survive a power interruption (persistence).
Data Access is the ability to add and extract data from your data set. In relational databases data access is normally done by executing an SQL (Structured Query Language) statement.
The whole database concept, including relational databases, was born in the 1970s and 1980s. In those days high speed main memory was incredibly expensive — an IBM 360 would ship with 4MB and cost over a million dollars. That’s 1/1,000 of the memory that a really cheap back-to-school special laptop would ship with today.
This meant that any remotely significant amount of data had to be stored on the disk, which was a magnetic technology and hence persistent. With the data on the disk, when you wanted to gather, for example, a list of employees that satisfied particular criteria, you had to read a series of disk records to extract the employee data and evaluate the criteria.
Doing data access using the disk produced some really complicated algorithms. When you are dealing with disk access times, especially slow ones like in the 1980s, you do everything you can to figure out exactly which disk sectors you want to read, because you don’t want to read any unnecessary ones, and whenever possible you try to read contiguous groups of sectors, preferably a whole track, because it is faster. This is very difficult given that a database table is spread over disk sectors in a complex pattern dictated by the file system software. One approach database companies took was to take over a large hunk of the disk and run their own sector management system inside it. That meant they were essentially writing a file system that was trying to access data with the minimum number of disk accesses. That is an immensely complicated problem and the solutions were immensely complicated programs. Of course, immensely complicated invariably means slow and buggy.
Another consequence of doing data access using disk records is that to get any reasonable performance you need a large number of indices. This causes the size of a relational database on disk to swell enormously. It is not uncommon to see relational databases whose disk size is 20 times the size of the table data. This means that the disk image of a relational database is often 95% overhead.
However this is the 21st century and we have memory sizes now that are large compared to most enterprise problem sizes. This makes a tremendous difference to how persistence and data access should be handled.
I’m going to show how you can do this by providing a software pattern that separates persistence from data access. I‘m not sure what the best way to describe this pattern is, but I’ve settled on showing how to implement an example application. This will allow me to discuss various aspects of the pattern and indicate how different applications could change the architecture. All the code is in Java, but it is more of a description than actual code. Some of the method calls are just names to indicate function rather than being exactly correct.
Data Access
I’ve decided to implement a programmer’s forum like Stack Overflow. This is a massive question and answer site that is well known to all programmers. We are interested in the data storage part, i.e. the part that would normally be handled by a database. In fact, we know that Stack Overflow is implemented using SQL Server, as Jeff Atwood has discussed this in his Coding Horror blog. I’m not trying to suggest that the approach described here makes implementing a real-life site like Stack Overflow easy, but it could make it easier. That said, let us never forget that in software the devil is in the details, and real-life applications might be easy to describe in some architectural overview but they are always immensely difficult to bring to a usable and useful state.
The normal approach to this kind of web forum is to have threads started for each connection request. The thread is given the input and output streams of the socket. A request comes in as GET data or POST data and it is up to the code running in the thread to decode all of this and figure out what the user wants to do. So our assumption here is that the request handler threads have figured out the operation required and have constructed a ForumOperation object.
public abstract class ForumOperation extends Persistable {

    public static ForumOperation factory(int type, byte[] buffer) {
        switch (type) {
            ...
            case typeA:
                TypeAForumOperation ta = new TypeAForumOperation();
                ta.deserialize(buffer);
                return ta;
            ...
        }
    }

    public abstract String perform(DataAccessService das);
}

public abstract class Persistable {
    public abstract byte[] serialize();
    public abstract void deserialize(byte[] buffer);
}
I have stayed away from the Java Serializable interface and Java serialization as it has issues and its future is in doubt (see article). So I have called this concept Persistable for this example. For the most part I prefer to do explicit serialization for each object as it provides the most compact encoding. I have also used byte[] as the encoding and decoding medium rather than input and output streams. This is just because I think it makes it clearer that the result is really a bunch of bytes that encode the object so that it can be later decoded. Of course, actual code would use streams and a ByteArrayOutputStream when converting to a byte buffer.
The factory method allows us to reconstitute an object after reading the byte buffer from a persistent store. I’m assuming here that the type is an int, but clearly it could be anything that identifies the operation uniquely. So when operations are persisted we write the type first, then the length of the serialized buffer, and then the buffer itself. This allows us to read the type and buffer length, and then read the correct number of bytes into a buffer. Passing the type and buffer to the factory method gives us the object back.
Finally, notice that each operation must implement a perform method that takes the DataAccessService as a parameter and returns HTML to be incorporated into the request output stream. For example, the most common operation is ReadForumOperation, which just reads a single post and displays it. Because the request was the result of clicking a link on an HTML page, the read operation object already has the unique post id. So the perform method has to use the data access service to locate that post object and get it to return an HTML representation of itself.
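To make that concrete, here is a rough sketch of what such an operation could look like. This is my own illustration rather than code from a real system: the postId field and the use of java.nio.ByteBuffer for the explicit serialization are assumptions, while getPost and getDisplayHTML are the methods of the classes shown below.

import java.nio.ByteBuffer;

public class ReadForumOperation extends ForumOperation {

    private long postId;   // taken from the link the user clicked

    @Override
    public String perform(DataAccessService das) {
        ForumPost post = das.getPost(postId);
        return post.getDisplayHTML();   // HTML for the post plus its replies
    }

    @Override
    public byte[] serialize() {
        // explicit, compact encoding: just the 8 bytes of the post id
        return ByteBuffer.allocate(8).putLong(postId).array();
    }

    @Override
    public void deserialize(byte[] buffer) {
        postId = ByteBuffer.wrap(buffer).getLong();
    }
}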
A forum post object looks something like:
public class ForumPost {
    private ArrayList<ForumPost> replies;
    private ArrayList<String> keywords;
    private LocalDateTime utcTimeStamp;
    private long uniqueID;
    private User author;
    private ForumPost postRepliedTo;   // null if top level
    private String htmlText;

    public String getDisplayHTML() {...}
    public String getSummaryHTML() {...}
}
Our strategy here is to put as much work as possible into the request handler thread and make the perform method as fast as possible. In practice these operations are really quick anyway. An add-post request will already have a ForumPost object constructed, and all the perform method has to do is get the data access service to add it to the various lists, e.g. the posts for a user, the keyword lists, the list of replies to another post, and so on.
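As a sketch of that add-post case (illustrative only: the AddPostOperation class name and its constructor are my own invention, and addPost is the data access method shown in the next code block), the perform method really is just a hand-off:

public class AddPostOperation extends ForumOperation {

    private ForumPost post;   // fully constructed by the request handler thread

    public AddPostOperation(ForumPost post) {
        this.post = post;
    }

    @Override
    public String perform(DataAccessService das) {
        // all the expensive work has already been done; this just updates the in-memory lists
        return das.addPost(post);
    }

    // serialize() and deserialize() would encode and decode the post's fields,
    // along the same lines as the ReadForumOperation example above
}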
So we can regard the data access service as a consumer of a series of operations. These operations update the data model (in this example basically a bunch of lists). As we shall see the advantage of this is that the state of the data model can always be rebuilt by replaying all of the operations in the order they were originally encountered.
public class DataAccessService {
    private HashMap<String, ArrayList<ForumPost>> keywordLists;
    private ArrayList<User> users;

    // methods that are used to process operations
    public ForumPost getPost(long postId);
    public String addPost(ForumPost post);
    public int getNumberPostsForUser(User user);
    public ArrayList<ForumPost> getUserPosts(User user, int start, int number);
    public int getNumberPostsForKeyword(String keyword);
    public ArrayList<ForumPost> getPostsForKeyword(String keyword, int start, int number);

    ... any other methods that make handling requests easier ...
}
We can see that the interface to the data access service is basically what you need to run this application. As requests come in, the request handler threads have to be able to display sections of lists (as a user moves down a list a page at a time) and to format individual posts with all of their replies trailing after. The methods in the DataAccessService make accessing the data for these purposes simple.
Most applications like this have a similar internal interface, even if they are using a relational database. In that case, when you want a section of the list of all the posts for a keyword, an SQL statement is constructed. Executing that SQL statement returns a result set, and from that result set you can construct a list of ForumPost objects. Think how much easier, and quicker, it is to just go to memory and get the list from a HashMap.
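For instance, the keyword lookup could be as small as this, a possible body for the getPostsForKeyword method declared above (my own sketch; the bounds handling is illustrative rather than production-ready):

public ArrayList<ForumPost> getPostsForKeyword(String keyword, int start, int number) {
    ArrayList<ForumPost> list = keywordLists.get(keyword);
    if (list == null) {
        return new ArrayList<>();   // no posts for this keyword yet
    }
    int from = Math.min(start, list.size());
    int to = Math.min(start + number, list.size());
    return new ArrayList<>(list.subList(from, to));   // a copy of one page of the list
}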
If this interface already exists in an application then it is not hard to change from a relational database to a simple memory based system. If you haven’t encapsulated your SQL queries and have them smeared throughout your code base — well what can I say.
One thing we should consider is the ACID criteria. This is an acronym for Atomicity, Consistency, Isolation (often described as independence) and Durability. These need to be satisfied if one is to do reliable information processing. The Durability criterion is just another name for persistence (I guess ACIP didn’t make the cut). The ACI criteria concern data access and making sure that multiple users don’t step on each other’s toes. This article is describing a pattern that separates access from persistence, so the D part of ACID (persistence) will be handled separately below.
Because the lists are in memory the operations are very fast. They should execute in a few microseconds at most. So this means that we can follow the KISS principle and adopt a really simple synchronization strategy. We add the following to the DataAccessService class:
private PersistentStore persistentStore;
private ReentrantLock dataAccessLock = new ReentrantLock(true);

public String performOperation(ForumOperation operation) {
    String html;
    dataAccessLock.lock();
    try {
        html = operation.perform(this);
    } finally {
        dataAccessLock.unlock();
    }
    if (operation.isPersistent()) {
        persistentStore.write(operation);   // appends the operation's type, length and serialized bytes
    }
    return html;
}
You should never use a more complicated synchronization strategy than necessary. Complicated ones are very tricky to debug and can surprise you with the ways conflicts occur. A scheme like the above is dead simple. Operations on the access service are performed in a single-threaded manner: only one person in the room at a time. This won’t work if the operations are lengthy, as it stalls every waiting operation, but in this case the operations are very lightweight so you can get away with it.
The beauty of a brain-dead simple synchronization strategy cannot be overemphasized. It lets you sleep nights.
So let us examine the ACI criteria and see how this checks out.
Atomicity — either all in or all out
This means that an operation cannot be left half done. We can see that in our application this is true. It completes before relinquishing the lock. There is also an implication here for persistence. The persistent store cannot be left with partial updates. We shall see below that this issue is handled completely.
Consistency — leave the place as neat and clean as you found it
This means that an operation cannot leave the data in an inconsistent state, e.g. a post that doesn’t use a keyword shouldn’t be on that keyword’s list. As long as all our perform methods are correctly written we should have no problem with this.
Independence — you are going to get the same result regardless of other users’ processing
This means that once you start, you are going to get a determined result even if the data state is being changed by others. Seeing that we have adopted a dead simple synchronization strategy, this isn’t an issue. Once a thread has seized the lock, the state of the data access memory model is fixed and any changes are made by this thread. This means that a post can’t suddenly be deleted while another operation is trying to build a list that includes it.
Now if you have an application that has very time-expensive operations then this simple strategy won’t work. However, it is easy to add an integer state key that can be used to get a consistent snapshot of the model despite other threads modifying it. This is not difficult to implement and I leave it as an exercise for the reader.
Building our data access model in this way means that all the posts are in memory. Stack Overflow at present has around 50,000,000 posts, so let’s double that and say 100 million posts. If you look at a typical Stack Overflow page you see that posts divide largely into long posts and some very short comments. I would say that the average length is probably 500 characters, but let’s say 1K per post. This allows space for all the list pointers that are going to get created; each one is 8 bytes (in today’s 64-bit world).
So that will take about 100GB of memory, which is reasonable in today’s hardware environment. You can easily get servers with 256GB and up for a few thousand dollars. I recently saw a laptop advertised with 128GB of memory.
Notice that in our example none of the arrays are large. The 100 million posts are spread across four million users. The keyword lists are probably the largest: in the actual Stack Overflow, one of the most popular keywords, ‘javascript’, has about 1.6 million posts tagged with it, so that is probably the longest list in the data model. There is nothing untoward here that should cause processing headaches. (Remember that in Java array indexing is done with a positive 32-bit int, so that limits the size to 2.1 billion items. If you have more you will have to divide your lists into chunks.)
Persistence
So let us look at the persistence part of the pattern, which is separate. You can see from the DataAccessService code that we can persist operations one by one after they have been used to change the memory state of the data access service. In a real application this should just put the operation into a queue so a background service can do the actual persistence. This stops the whole processing loop from waiting for the write call to finish. However, in simpler applications this is probably not a factor, as disk writes are so fast these days.
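Here is a minimal sketch of what that background writer could look like, as extra fields and a method on the DataAccessService (my own illustration, not part of the pattern as written above; it assumes a getType() accessor on ForumOperation and uses java.util.concurrent.BlockingQueue). In performOperation the direct persistentStore.write call would be replaced by writeQueue.add(operation).

private final BlockingQueue<ForumOperation> writeQueue = new LinkedBlockingQueue<>();

private void startPersistenceThread() {
    Thread writer = new Thread(() -> {
        try {
            while (true) {
                // blocks until an operation has been queued, then appends it to the store
                ForumOperation op = writeQueue.take();
                persistentStore.write(op.getType(), op.serialize());   // getType() is assumed
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // allow a clean shutdown
        } catch (IOException e) {
            e.printStackTrace();                  // a real system needs a recovery policy here
        }
    });
    writer.setDaemon(true);
    writer.start();
}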
We have a persistent store that is really just a random access file. We write to the file by appending each new record to the end. There is no concept of delete; that is handled by voiding, which is writing an instruction to delete the post into the store.
public class PersistentStore {
    private RandomAccessFile raf;
    private long nextPosition;      // end of data, i.e. where the next record will be written
    private long currentLocation;   // read cursor used when replaying the store

    public PersistentStore(String path) throws IOException {
        raf = new RandomAccessFile(path, "rw");
        if (raf.length() < 8) {
            // new store: the first 8 bytes hold the end-of-data pointer
            nextPosition = 8;
            raf.writeLong(8);
        } else {
            nextPosition = raf.readLong();
            raf.seek(nextPosition);
        }
    }

    public void write(int type, byte[] serializedOperation) throws IOException {
        raf.writeInt(type);
        raf.writeInt(serializedOperation.length);
        raf.write(serializedOperation);
        nextPosition += 8 + serializedOperation.length;   // type + length headers plus the data
        raf.seek(0);
        raf.writeLong(nextPosition);   // the atomic write that commits the record
        raf.seek(nextPosition);
    }

    public void prepareForReading() throws IOException {
        raf.seek(0);
        nextPosition = raf.readLong();
        currentLocation = 8L;
        raf.seek(currentLocation);
    }

    public ForumOperation readNext() throws IOException {
        if (currentLocation >= nextPosition) {
            return null;
        }
        int type = raf.readInt();
        int numberBytes = raf.readInt();
        byte[] buffer = new byte[numberBytes];
        raf.readFully(buffer);
        currentLocation += 8 + numberBytes;
        return ForumOperation.factory(type, buffer);
    }
}
The way it works is that Persistable objects are stored by writing them to a random access file. The first 8 bytes of that file are a long integer that points to the end of the data, i.e. the spot where the next Persistable object will be written.
That works like this. You convert the Persistable object to a byte buffer and write those bytes (preceded by the type and length) into the random access file, starting at the position held in the first 8 bytes. You then add the number of bytes just written to that starting position and write the result back into the first 8 bytes of the file.
Notice that this last write is atomic. Those 8 bytes either make it to the disk or they don’t, and in either case the store is perfectly valid. The pointer either points to the end of the new Persistable, in which case the store includes it, or it points to the beginning, in which case the new Persistable didn’t make it. This guarantees the atomicity of the persistent store.
I’ve been using this pattern for over 10 years. Despite all the nasty things I’ve done to it during debugging, and copying it while the application was still writing, the data store has never corrupted.
So that gives us a persistent store. It is write only and objects are written in a serial fashion.
Now notice that there is no data access being done here; in fact the only read of the store is done by the readNext() method. The only time the store is read is when the memory data model is being restored. This just requires an iteration through the Persistable objects, which is done by calling readNext() until a null is returned.
Now what about rebooting? That is simply done by:
persistentStore.prepareForReading();
ForumOperation operation;
while ((operation = persistentStore.readNext()) != null) {
    dataAccessService.performOperation(operation);
}
That is kind of cool, because the store is just a list of operations that, when played into the DataAccessService, restores the state to the current point, and then we are ready to start processing new requests.
There are a few details here. Each post is assigned a unique ID, which is done while the lock is held to guarantee that the IDs are unique and sequential. When the data is being restored from the persistent store the ID is already there, so the operation behaves slightly differently on a restore than when it is performed for the first time. These and date/time issues are minor and can be easily handled.
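As an illustration of how that might look in the add-post operation sketched earlier (the nextUniqueId() method and the getUniqueID()/setUniqueID() accessors are assumed, and 0 is used here to mean “not yet assigned”):

@Override
public String perform(DataAccessService das) {
    if (post.getUniqueID() == 0) {              // a live request: no ID assigned yet
        post.setUniqueID(das.nextUniqueId());   // safe, because performOperation holds the lock
    }
    // on a replay from the persistent store the ID is already set, so assignment is skipped
    return das.addPost(post);
}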
Now typically data on the disk is about 1/8 the size of data in memory. The increase of size when stored in memory I call the ‘puff’ factor. This is because data on disk is usually compressed to some extent and doesn’t have all the pointers that link things up in memory. However that doesn’t apply to data sets with large amounts of text. In a persistent store like the one for our Stack Overflow example there won’t be a lot of difference between the size in memory and the size on disk.
Separating the persistence from the access has huge advantages that pay off all over the place. For example:
Encryption Is Much Easier
Relational databases have massive problems with encryption. Think of searching the database for a particular row in a table. If that row is encrypted, how do you search for it? You have to decrypt it every time. How do you build indices to find it? It is an absolute algorithmic nightmare.
On the other hand when persistence is separated out it is trivial to add encryption.
Encryption is just a simple interface:
public class EncryptionService {
    public EncryptionService(String key) {...}
    public byte[] encrypt(byte[] clearData) {...}
    public byte[] decrypt(byte[] encryptedData) {...}
}
I always wrap encryption libraries in this kind of simple interface. They have all kinds of methods to do specialized things, but I just want to pass byte buffers in and out.
So all we have to do in the persistence pattern is make a one-line change to the PersistentStore’s write method and a one-line change to its read method to encrypt and decrypt. It’s that easy. This means that all the data on the disk is encrypted, and it is only decrypted when the application is being rebooted.
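As a sketch, assuming the PersistentStore is handed an EncryptionService (called crypto here) when it is constructed, the write method becomes:

public void write(int type, byte[] serializedOperation) throws IOException {
    byte[] encrypted = crypto.encrypt(serializedOperation);   // the one extra line on the write side
    raf.writeInt(type);
    raf.writeInt(encrypted.length);
    raf.write(encrypted);
    nextPosition += 8 + encrypted.length;
    raf.seek(0);
    raf.writeLong(nextPosition);
    raf.seek(nextPosition);
}

On the read side, readNext() decrypts the buffer before handing it to the factory: return ForumOperation.factory(type, crypto.decrypt(buffer));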
So if a bad actor gains access to the server, the data is encrypted and can’t be read. The only way to read it is to get the encryption key.
Now if the encryption key is stored in a file on the disk then you haven’t accomplished much. But you can specify the key on the command line when the application is started, or use a command to communicate it to the application while it is running. The key then never exists in the persistent store but only in memory, so it is not trivial to read the data. Memory disclosure is much, much harder than just reading unencrypted data from the disk, and it probably requires a state actor if it can be done at all.
Backup is Much Easier
The persistent store is written as a stream of sequential byte buffers. It is easy to write these buffers to more than one disk file. This makes real time backup trivial.
Also, it is provably impossible to copy an invalid version of the store. When you copy the sector containing the long pointer to the next available data slot, that pointer defines a valid file. So making a backup while the system is firing buffers into the store will still always get a valid store, one that captures some state in the middle of that stream of writes.
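A minimal sketch of that idea (my own, with the class name MirroredStore invented for illustration) simply appends every record to two stores, ideally on different physical disks:

import java.io.IOException;

public class MirroredStore {
    private final PersistentStore primary;
    private final PersistentStore backup;

    public MirroredStore(PersistentStore primary, PersistentStore backup) {
        this.primary = primary;
        this.backup = backup;
    }

    public void write(int type, byte[] serializedOperation) throws IOException {
        primary.write(type, serializedOperation);
        backup.write(type, serializedOperation);   // e.g. a file on a different disk or volume
    }
}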
You Can Delete And Redact if Necessary
The store is write-only, so deletion is done by a delete operation. If we wish to delete posts in our version of Stack Overflow, we just remove them from the in-memory lists and persist the delete operation. When the persistent store is replayed into the access service the post will appear and then be deleted before the current state is reached and the access service is ready for operation.
However, this is a public forum, so there may be objectionable material that has to be expunged. That can be done by traversing the random access file, finding the offending operation, and replacing the undesired data with an equal number of characters. This is redacting the data. Encryption complicates this algorithm, as there is no guarantee that when the data is re-encrypted the size will be the same; this can be handled in a number of ways, but it is something to pay attention to. Of course, the same redaction would have to be applied to any backup persistent stores as well.
Access is Fast and Can be Centralized
The access is really fast as it is all in memory and done with lists all set up for the tasks at hand. It could be done by a centralized computer servicing satellite computers over a fast network using UDP.
The normal load of Stack Overflow appears to be less than a post/second. But it could certainly be 100 requests/second as most requests are people looking at existing posts. That allows 10ms per request which is not really a problem for in-memory access.
One of the advantages of having a clear, consistent pattern is that it tends to pay off in other areas. If we have such a communications load that we have to do load balancing, then requests are handled by a group of computers, but we still want the data store to be centralized. We can connect all the communication computers to a centralized one via a high-speed optical fiber network and use UDP messages to send the operations to the centralized data access service. Our operations are persistable and hence can be sent over a network and reconstructed on the central service to be performed.
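A rough sketch of the sending side, under the assumptions that an operation fits in a single datagram and that the central server’s host and port are known (the class name OperationSender is my own):

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.ByteBuffer;

public class OperationSender {
    private final DatagramSocket socket;
    private final InetAddress server;
    private final int port;

    public OperationSender(String host, int port) throws Exception {
        this.socket = new DatagramSocket();
        this.server = InetAddress.getByName(host);
        this.port = port;
    }

    public void send(int type, byte[] serializedOperation) throws Exception {
        // frame the datagram the same way as the persistent store: type, length, then the bytes
        ByteBuffer buf = ByteBuffer.allocate(8 + serializedOperation.length);
        buf.putInt(type);
        buf.putInt(serializedOperation.length);
        buf.put(serializedOperation);
        socket.send(new DatagramPacket(buf.array(), buf.position(), server, port));
    }
}

The central service would read the same framing, call ForumOperation.factory, and feed the result into performOperation. UDP gives no delivery guarantee, so a real deployment would need acknowledgements or a reliable transport for operations that must not be lost.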
What About Huge Data Sets?
I can foresee one objection critics will make: they cannot fit their data set into memory. This may be true for some really massive systems, but for the huge majority of enterprise systems it is easy to fit the data model in the memory of a modern computer. You don’t want to take what the disk size says; you want to figure out what the ‘real-data’ size is. How many bytes does it take to store each table? Take the size, or average size, of each SQL data type, add them up for a row, and multiply by the number of rows. That is the size of the real data in a table. Add that up for all the tables and this is pretty much the real data you have. If you do that it might surprise you how much smaller it is than what you are carrying on disk.
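To give an illustrative example with made-up numbers: a table whose rows hold an 8-byte id, a 4-byte status code, an 8-byte timestamp and a text column averaging 80 characters is roughly 100 bytes of real data per row, so 50 million such rows is about 5GB, even if the database files for that table and its indices occupy many times that on disk.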
Now of course this shouldn’t include any static assets like videos, photos, documents, etc. They are not really part of the database; only their locations need to be stored so they can be accessed when needed.
If that real-data amount is more than a terabyte then you do indeed have a giant data issue that might require a different approach.
Relational Database Corruption
Using relational databases has a nasty downside: relational databases corrupt. While a relational database is operating, things can happen that cause it to enter an internally contradictory state. Because disk speeds are very low compared to memory, doing data access on a slow block storage device requires some pretty tricky algorithms. You have to store indices on the disk because you can’t spend all your time rebuilding them. Because of the complicated nature of keeping the data on disk, with complicated indices to enable faster access, it is possible for things to get out of step so that different parts of that disk structure no longer agree with each other. Then the database is corrupt, and when the database software tries to read it in to initialize, it crashes.
If you want proof of this, go and search for “relational database corruption.” You will find a host of results on the subject, including services that will, for a hefty fee, help you fix your corrupted database. This is one of the negative consequences of mixing persistence with data access.
In the pattern I specified here of separating access and persistence the persistent store cannot be corrupted. If you analyze the logic you realize that the store is updated with an algorithm that guarantees atomicity and always leaves it in a valid state.
Wrapping Up
This is how I think modern systems should be structured, that is, with a clear division between persistence and access. It is much more reliable and easier to use in all respects. It has the immense advantage that the data set can be encrypted on disk, so even if an intruder gains access to the file system, no private information is revealed unless they have the encryption keys.
I think this approach gives a much better alternative to relational databases. It takes advantage of modern hardware and is much more efficient.
So this is my answer to those of you who asked what I was suggesting in the place of relational databases. With a few lines of code you can replace thousands of lines that prepare SQL statements and take result sets apart. Smaller, simpler code is always better.