The Gatekeeper
It’s fairly obvious that whenever you work on a development team, everyone fills different roles. This can sometimes make it difficult to evaluate the individual contributions of particular team members, especially when their roles overlap to a large degree.
These roles vary from team to team, as do specific titles but most are essentially the same. You have the grunt, who does a lot of coding. You have the architect, who does mostly design work. There’s a team leader, who puts together a schedule, assigns tasks, and generally tries to make sure things get done on time. Most people fall somewhere between the grunt and the architect. Some grunts do design work, and some architects write code, so obviously there’s some overlap.
What I’ve noticed is that most teams have skillsets that overlap in so many different areas that it is pretty rare when one person truly owns something. There is one place where this becomes a serious problem: databases.
If your application uses a database, you need a database gatekeeper.
Every team that works with databases needs a gatekeeper for each application they are working on. It doesn’t need to be the same person for each application and arguably, is better if it is not. But you need someone to fill this role nonetheless. The gatekeeper I will focus on is the database gatekeeper. You can easily argue for other gatekeepers. Others that come immediately to mind are gatekeepers for the build, source code check-ins, branches, website applications, code versioning, bug tracking, etc. The fact is that any important part of a project could probably use a gatekeeper to keep things working smoothly.
Even on development teams which rely extremely heavily on databases, it has been my experience that there are perhaps only a small percentage of people on the team who truly know how to use a database effectively. Most developers view a database as an overglorified list of objects for which exists solely so they don’t need to worry about important details like saving the data or maintaining it between application starts.
This is not only inherently wrong, but it’s a dangerous line of thought. The fact is that a database can be an extremely complex part of your software and whomever designs it really needs to know what they are doing. Database design is very important for several different reasons. A poorly designed database is like a poorly designed algorithm. It still works, but not nearly as effectively or as efficiently as it could. Operations can take orders of magnitude longer if they aren’t done correctly.
If not used properly, a database can also end up filled with a lot of useless data. Consider this simple example. You’re building a nifty little forum for your website. You create a table called ‘User’ to store… err… well, the users. And you create a table called ‘Topic’, and ‘Post’ to store the topics and posts to that topic. Seems pretty simple.
But there are some basic concepts that must be followed. For example, what happens if you delete a user? Are all of the topics they created deleted? Are all of their posts deleted?
Yes, they are deleted. If someone comes in and spams your forum, perhaps you do want to get rid of everything they did, especially if it was a script that logged in and posted into every single topic.
The answer is also, No, they are not deleted. Perhaps you do routine maintenance and delete users who have been inactive for a period of time. You then implement a mechanism in the code to state that a post was made by a user who has been deleted.
People who don’t know how to use a database tend to fall into one of these two categories because they simply don’t know any better, and not because they made a conscious design decision. Their expectation is that the code they are writing will take care of every situation and will handle all of the business logic. These are the developers who have lots of extraneous logic in their code to handle referential integrity checks, and use a “select max(id) from table_x”, then increment the value by some constant value to determine what the id of the new element they are inserting should be.
This is not how it’s supposed to work. Every database I have ever used has some form of sequencing to help you determine the next identifier, be it a sequence in Oracle, or an Autonumber in SQL Server. The referential integrity should be enforced by the database with foreign keys. Unique data should be maintained unique with constraints. Every database comes with a set of functions that help you maintain the data.
“The purpose of a database is to maintain and manage your data”
I’m going to avoid getting into how a database really should be used. The fact is, that someone on your team should be assigned to be the Database Gatekeeper. It is his or her job to make sure that the database is designed correctly to begin with. If a developer needs to add a column to a table, it should be cleared by the gatekeeper. Not only should it be cleared by the gatekeeper, it should be him who adds the column to the database scripts.
What good can possibly come of this?
For starters, you get a solid naming convention. People experienced with databases tend to have a naming convention they use to help them figure out what the relationships between tables is by glancing at the database schema. Feel free to try using an established standard such as Hungarian Notation.
“The great thing about standards is that there are so many to choose from.”
Of course, this only gets you so far. What about the use of ‘s’ at the end of table names? Some people will name a table ‘User’, and others would name the same table ‘Users’. Mostly, it’s a matter of opinion. My personal preference is to avoid the use of ‘s’ at the end of a table name unless the table is used (solely for) relating two tables to one another. Regardless of your mechanism, at the end of the day, don’t you want all of your database tables to follow the same convention?
This is where a gatekeeper comes in handy. Since it’s his responsibility to make the changes, it is also his responsibility to make sure that those changes adhere to the established convention. One day when your database gatekeeper is long gone, the next one can look at the database design and at least have a chance of making changes without making things difficult for everyone else.
So far I’ve only addressed naming conventions. What else can a database gatekeeper can do for your team?
Prevent data redundancy.
Your database gatekeeper can tell you if the data you want to add to the database already exists. Having been the gatekeeper on a team before, I can’t count how many times I’ve prevented a developer from storing redundant information in the database. These redundancies are not only confusing, but waste space in the database.
Hard drive space is generally cheap these days, but what about access speed or the cost of storing that extra data? A pre-existing column might have been indexed, while the new one being created is not. Backup applications need to store that extra data as well. Both of these have hidden costs that a developer who is intent on implementing a feature might not think of.
Enforce database integrity.
Your database should enforce the integrity of your data, not the application code. When a database requires that relationships exist between tables, you can’t insert or delete data without satisfying that relationship. Using a sufficiently complex database merely as a storage center will cause you innumerable headaches trying to root out data inconsistencies. Most developers find these restrictions to be annoying to no end. “I want to delete this data, and the database won’t let me! Stupid database. I hate this thing!” That’s why they don’t use foreign keys and constraints. The fact is, if the database is preventing you from doing something, there’s probably a good reason.
Use the right data types for the job.
In SQL Server, it’s very easy to make a column of the type int. Too easy in fact. Enterprise Manager was/is a great tool, to be sure, but it certainly gives too much power to those who might not know what to do with it. Most people use an int without thinking twice about it, but is it really the right data type for the job? In SQL Server, anything that’s likely to go over 2 billion is probably not a good candidate to be an int. Similarly, any number which isn’t going to exceed 255 isn’t a good choice for an int either. Unicode, non-Unicode data, and variable length data can have serious impacts on the resulting data storage.
Using Unicode where ASCII text is being stored immediately doubles your storage requirements. The difference between 50 and 100 bytes doesn’t sound like a lot, but it can cost your customers money, nonetheless. Imagine using the wrong data type on every single column in a table because a developer simply didn’t know any better. Companies are pretty paranoid these days about backing up their data, and although hard drives are generally cheap, RAID 5 isn’t. Tape drives are terrible to work with, and every byte counts. An extra 100 bytes in a single row can result in 2GB of extra data in a 20 million record table. Multiply that by 100 tables and you run into some serious storage problems. Reloading a database in its entirely presents its own problems.
SQL Server 2000 ships with 27 built-in datatypes. Are you using the right ones?
Index important bits of data for speed.
Your gatekeeper can help speed up certain operations by tuning the indexes on the database. In laymans terms, it means building a cache to help find certain bits of data quicker. If a specific operation in your application is too slow, talk to your database gatekeeper and see what he can do. He might be able to speed it up by a factor of 100 without a single algorithm change in the application code.
Use Stored Procedures for complicated operations.
While Stored procedures are somewhat new to mySQL, they’ve been around for years in SQL Server and Oracle. Using a stored procedure moves some of the business logic out of the application and into the database. This is faster for a number or reasons, first and foremost is that the data gathering is done on the database server instead of over the network. Stored procedures also offer more security in applications by preventing SQL injection attacks.
Use scripted operations for all changes.
When I worked for Wegmans, it was departmental policy to refuse any database changes which were not scripted. While security certainly played a role in this, reproducibility did as well. If table changes were made via a gui, data is loaded, and then the database server crashes, how can you be sure that everything is restored exactly the way that it was before if you used the gui?
The short answer is that you can’t. Most tweaks are small in nature, but even minor tweaks can have huge performance impacts, especially when you have a database that is several terrabytes in size. A database that is entirely scripted is much easier to manage. It is easier to test, and easier to tweak for different releases.
Intelligently choose relationships.
I’ve seen people use descriptions to maintain relationships between database tables. This is bad practice at best, and assinine at worst. Descriptions have a tendancy to change. And if someone is using it to maintain relationships between tables, you will need to change it in every table it occurs. Without a gatekeeper, it will be difficult to track down those occurrences. If your application is using a description field as a table relationship, you probably have more serious problems.
What is the single worst way to choose a gatekeeper?
Use his resume. I recall interviewing someone who seemed to have pretty extensive SQL Server 2000 experience on his resume with the intent that he would be our database gatekeeper. Our team was really lacking in people who had database experience, so I was really looking forward to the interview. I asked him some database questions, and he seemed to know what he was talking about. Then I asked him about triggers, and what he had used them for.
He explained a very elaborate system of triggers that he had implemented to delete data from a very large database that was his current project. To simplify things, recall my example of nifty forum software above, imagining that he was attempting to delete a user and wanted to delete all of the topics the user had created, all of those topic posts, and delete the posts of that specific user. A trigger is a database mechanism for intercepting inserts, updates or deletes to the database and performing additional operations before the actual insert/update/delete takes place. In this case, he would delete the user, then the trigger would intercept the delete.
This trigger would in turn attempt to delete the Topic, thus firing a second trigger, which attempted to delete the Posts that the user had made. The original trigger would also delete all of this specific users’ posts. Traversing down the chain of triggers and then back up again results in the ability for the developer to delete a user and all associated data, while maintaining the database constraints, and referential integrity.
After patiently listening to his explanation of how he had used triggers, I asked him how well the resulting implementation worked. Apparently, not very well. Data inconsistencies were still being found as some of the table hierarchies traversed 10 levels deep. It was extremely difficult to track them down. He had spent the last 3 months trying, and still wasn’t finished. I asked why he didn’t use SQL Server’s built in ‘Cascade Delete’ and he had no idea what I was talking about.
After about five more minutes of questions, it quickly became clear to me that this developer had some serious holes in his knowledge of databases and wasn’t going to cut it as our database gatekeeper. He understood that referential integrity was important, but he lacked sufficient knowledge of the tools he was using to do it properly and efficiently.
How do you choose a good gatekeeper?
This is a lot harder to answer. Even the person I interviewed in the above example would probably not have been a terrible database gatekeeper if he weren’t by himself. I included the example to prove that a resume is not a tell all indicator of suitability for any given task. I think that so long as the person you have chosen has a decent knowledge of more than one type of database, and has a genuine interest in learning more about them, he is likely to be a good candidate. Someone who used to be a DBA would be an excellent choice, but ex-DBA’s turned developer are hard to come by.
Another possible solution is to use a couple of people who have database training to discuss any database issues. Two heads are better than one, and while a lot of their knowledge may overlap, some of it is likely to be distinct. People hate committees, but this can work well on some teams without overloading any one developer or trampling on their egos.
Conclusion:
Every development team using a database could gain some benefit from a full time DBA. Unfortunately, very few of us are going to have that luxury. Your best chance for avoiding some nasty database snags is to assign someone to the role of Database Gatekeeper. It is their responsibility to maintain the database design, make changes, and be the resident database expert for that application. It will save you time, money, and ultimately a ton of headaches.