Object-Oriented programming taught us that generalizing is a good thing and, whenever possible, we should do it. Complex class hierarchies are a good way of reusing code, hitting the specialized classes only when a special implementation is needed.
In the database world, the concept doesn’t play exactly well.
What they say in Heaven
In Heaven, there is a lookup table for each attribute, no matter how simple and no matter how small is the lookup table.
For instance, if your database is about sales, you probably have a Customers table and an Orders table, each with its own attributes resolved through a Foreign Key. The lookup tables are usually very small, with just a handful of rows in them:
Temptation comes from our own desires
Wouldn’t that be great if you could stop adding small, insignificant tables to your database schema? Wouldn’t it be a lot easier if you had ONE table to store all that lookup nonsense? “Less is more” after all, isn’t it?
If you had a “One True Lookup Table”, everything would be more elegant and simple. Look at this database schema:
Isn’t it elegant and clean?
INSERT INTO LookupTable (table_name, lookup_code, lookup_description) VALUES ('Order_Status', 'OP', 'Open'), ('Order_Status', 'CL', 'Closed'), ('Order_Status', 'SH', 'Shipped');
Devil’s in the details
You may have less tables to deal with now, but there’s a price to pay. A bigger price than you would have expected.
- No foreign keys: Did you notice that the foreign keys are gone? In order to create a foreign key, you would have to add the lookup table name to the Orders and Customers tables, for each attribute stored in the lookup table. I don’t think you would like it.
- Generic data type: In order to merge all lookup tables in one, you need to choose a “generic” data type that fits for all. The most generic data type is a character-based type, so you’ll probably end up with a huge nvarchar column. You probably don’t want the same huge column in the referencing tables and you could end up having different data types between the main tables and the lookups. One more not-so-good idea. Moreover, when you’re joining your tables with the lookup table, you will have implicit (or explicit) conversions happening, which is a performance nightmare.
- Single Hotspot: Instead of hitting multiple tables for lookups, everyone will hit the same table over and over. This will create a hotspot in the database, with locking and latching issues all over the place.
- Acrobatic constraints: Defining constraints on a generic table becomes very difficult. Not an impossible deal, but very difficult. For the schema in this example, you could define a CHECK constraint to enforce the use of the correct data type, but the syntax of the constraint will not be very straightforward:
CHECK( CASE WHEN lookup_code = 'states' AND lookup_code LIKE '[A-Z][A-Z]' THEN 1 WHEN lookup_code = 'priorities' AND lookup_code LIKE '[0-9]' THEN 1 WHEN lookup_code = 'countries' AND lookup_code LIKE '[0-9][0-9][0-9]' THEN 1 WHEN lookup_code = 'status' AND lookup_code LIKE '[A-Z][A-Z]' THEN 1 ELSE 0 END = 1 )
It could get even worse
As soon as you start to realize that trading multiple lookup tables for an OTLT is not a good deal, devil will raise the bid and offer the ultimate generalization: the Entity Attribute Value, also known as “EAV”.
If you come to think of it, who needs fixed attributes in a table when you can have as many attributes as you want in a general-purpose table? Why messing with ALTER TABLE statements when you can have a single table that can store an infinite number of attributes that you can bind to any row in any table?
A typical EAV schema looks like this:
This way, you can have any type of attribute bound to your main entities. For instance, to define a “ship_date” attribute in your Orders table, you just have to insert a couple of rows in your EAV schema:
INSERT INTO Entities (entity_id, entity_name) VALUES (1, 'Orders'); INSERT INTO AttributeNames (attribute_id, entity_id, attribute_name) VALUES (1, 1, 'ship_date'); INSERT INTO AttributeValues (attribute_id, entity_id, id, value) VALUES (1, 1, 123, '2015-06-24 22:10:00.000');
Looks like a great idea, doesn’t it? Unfortunately, it is not.
- Generic data types: again, what would prevent a date such as ‘2015-02-30 18:30:00.000’ from being assigned to the ship date? Uh-oh: nothing.
- No foreign keys: again, enforcing foreign key constraints would be impossible.
- A single hotspot in the database: every attribute for every table involved in this nonsense would have to be looked up in the same table.
- No constraints: how would you enforce a constraint as simple as “NOT NULL”? Good luck with that.
- Dreadful reporting queries: when you will be asked to create a report on a table that uses this paradigm (I said “when”, not “if”, because it will happen), you will have to OUTER JOIN to the EAV table for each and every attribute that you want to retrieve. In case you are wondering if this is good or bad, take into account that the optimizer starts to freak out when it finds too many JOINS in a query and will likely timeout looking for a decent execution plan, feeding you the best it could come up with (usually, a mess).
Some software solutions are entirely based on user-defined attributes and the ability to define them is a central feature. For instance, many CRM solutions are heavily dependent on user-defined attributes. However, there are many ways to achieve the same results without resorting to an EAV design. For instance, one could wonder why ALTERing the database schema seems to be a less desirable solution.
There are also many flavors of EAV, with different degrees of evil involved. Some implementations at least provide different columns for different data types, some others use XML or JSON.
The EAV design comes with the intent of solving a real world problem that doesn’t have a definitive answer in the relational model. In partial defense of the “generalizers”, it has to be said that this is a challenging problem. Nevertheless, like Dante put his political enemies to hell, I am the “poet” and I’m afraid that the generalizers will have to get accustomed to sulfur. It just takes a couple of thousand years, after all.
Who will be damned next?
In the next circle of the SQL Server hell we will meet the shaky typers – the poor souls that chose the wrong data types for their columns. Stay tuned for more!
There’s a special place in the SQL Server Hell for those who design their schema without following the Best Practices. In this first episode of SQL Server Infernals, we will explore together the Row of the Poor Schema Designers, also known as “undernormalizers”.
What they say in Heaven
In Heaven, where all Best Practices are followed and everything runs smoothly while angels sing, they design their databases following the rules of normalization. Once upon a time, there was a man who spent a considerable amount of his life working on defining the rules of the relational model. That man was Edgar Codd.
Mr. Codd laid down the rules of normalization, which are known as “normal forms”. The normal forms define the attributes of a well-designed database schema. While there are more normal forms, it is widely accepted that a schema is normalized when it follows the first three normal forms. Here is the simplest possible enunciations of each:
- 1NF – Every relation has a primary key, every relation contains only atomic attributes
- 2NF – 1NF + Every attribute in a relation depends on the whole key
- 3NF – 2NF + Every attribute in a relation depends only on the key
In a single line: “The key, the whole key, nothing but the key (so help me Codd)”.
Clues you’re doing it wrong
- Repeating data (redundancies): the same information has to be saved in multiple places
- Inconsistent data between tables (anomalies): the same information has different values in different tables
- Data separated by commas
- Structured data in “note” columns
- Columns with a numeric suffix (e.g. Zone1, Zone2, Zone3…)
What will put you to hell
- No primary key: did you notice that the normal forms talk about “relations” rather than “tables”? The relational model is a mathematical model, which, at some point has to be translated to a physical implementation. Tables are exactly this: the physical implementation of relations.
If your table has no primary key and relations must have a primary key, chances are that your table is the physical implementation of something else (a bin, a pile, a bag… whatever: not a relation anyway).
When tables have no primary key, any data can be stored inside them, even duplicate rows. Once duplicate data is inside the table, there is no way to tell which row is good and which one is the duplicate.
- Surrogate keys only: this is in fact a variation on the “no primary key” sin: if your table has a surrogate key (such as an identity or uniqueidentifier column), make sure that it is not the only unique key in the table, otherwise you will end up storing duplicates, with only the surrogate key as a difference. This is no different from having no primary key at all.
If you decide that your table is best implemented with a surrogate key (often because the natural key is composite or too wide), make sure that you create a UNIQUE constraint on the natural key.
- Non-atomic attributes: if your table has columns that contain multiple values, the likelihood of a design mistake goes to the roof. For instance, when you find data such as “firstname.lastname@example.org,email@example.com” in a “email” column, chances are that the designer of the database forgot to take into account that the entity (for instance a customer) might have multiple email addresses.
While many efficient split algorithms are available, storing the data in this format has many downsides:
- Indexing individual items is impossible
- Searching for individual items is hard
- Updating an item requires writing the whole comma separated value
- Locking a single item is impossible (reduced concurrency)
- CHECK constraints are hard to implement
Whenever you find non-atomic attributes, be prepared to refactor the database schema, because something is really wrong and there is no way to fix it without moving the attribute to a different table.
- Use of NULL when not necessary: NULL is a constraint on the data: if an attribute is mandatory, make it mandatory! Allowing NULLs on mandatory data will open the door to data that does not meet the business rules. What are you doing with rows that are missing mandatory attributes?
If your table has too many NULLs in it, you probably have designed it wrong and you are trying to fit too many attributes in a single table (an implicit dependency?): move them to a separate table.
- Use of dummy data: The other side of the coin is “no NULLs allowed anywhere”. Not all attributes are mandatory: if you pretend it is so, the users will start putting dummy data into your columns to work around the restriction. Typical examples are “.” or empty strings for character-based columns and “0” for numeric-based columns. Once those dummy values are in, can you tell the difference between “dummy” zeros and “real” zeros? Of course you can’t.
- Designing the database when specs are incomplete/unclear: This is the worst of all mistakes. Changing the database schema once it is in production is a bloodbath: everything built on top of that schema will have to change. Many of the design mistakes described above are the consequence of incomplete specifics or lack of analysis.
It is discouraging to note how some popular design patterns do not take into account the intrinsic complexity of refactoring a database schema and demand the implementation of the schema to automated tools, based on the object classes that represent the domain. Yes, I’m talking about you, Code First. In my book, “code first” is a synonym of “design someday”. Don’t let any automated tool design your database schema: you know better than that!
- Premature denormalization: some devil’s advocates will tell you that normalization slows down the database and that you should be denormalizing your schema from the start. Don’t believe what they say! Most normalized schemas can cope with sustained reads and SQL Server offers many features (such as indexed views) to deal with high numbers of joins, which is usually the point for denormalization. Of course, if you’re working on a BI project, denormalization is expected and desirable instead.
An old saying goes: “Normalize ‘til it hurts, then denormalize ‘til it works”, but there’s nothing preventing the database from working on a normalized schema. If reads on the database are slow, it is quite unlikely that the cause is over-normalization: it is much more presumable that your queries and/or your indexes
In the next episode of SQL Server Infernals I will discuss a particular database design sin: the dynamic schema. Stay tuned for more.