SQL Server Infernals – Circle 2: Generalizers
Object-Oriented programming taught us that generalizing is a good thing and, whenever possible, we should do it. Complex class hierarchies are a good way of reusing code, hitting the specialized classes only when a special implementation is needed.
In the database world, the concept doesn’t play exactly well.
What they say in Heaven
In Heaven, there is a lookup table for each attribute, no matter how simple and no matter how small is the lookup table.
For instance, if your database is about sales, you probably have a Customers table and an Orders table, each with its own attributes resolved through a Foreign Key. The lookup tables are usually very small, with just a handful of rows in them:
Temptation comes from our own desires
Wouldn’t that be great if you could stop adding small, insignificant tables to your database schema? Wouldn’t it be a lot easier if you had ONE table to store all that lookup nonsense? “Less is more” after all, isn’t it?
If you had a “One True Lookup Table”, everything would be more elegant and simple. Look at this database schema:
Isn’t it elegant and clean?
Whenever you need a new lookup table, you just have to add rows to your OTLT™ (thanks Phil Factor for the acronym):
INSERT INTO LookupTable (table_name, lookup_code, lookup_description) VALUES ('Order_Status', 'OP', 'Open'), ('Order_Status', 'CL', 'Closed'), ('Order_Status', 'SH', 'Shipped');
Devil’s in the details
You may have less tables to deal with now, but there’s a price to pay. A bigger price than you would have expected.
- No foreign keys: Did you notice that the foreign keys are gone? In order to create a foreign key, you would have to add the lookup table name to the Orders and Customers tables, for each attribute stored in the lookup table. I don’t think you would like it.
- Generic data type: In order to merge all lookup tables in one, you need to choose a “generic” data type that fits for all. The most generic data type is a character-based type, so you’ll probably end up with a huge nvarchar column. You probably don’t want the same huge column in the referencing tables and you could end up having different data types between the main tables and the lookups. One more not-so-good idea. Moreover, when you’re joining your tables with the lookup table, you will have implicit (or explicit) conversions happening, which is a performance nightmare.
- Single Hotspot: Instead of hitting multiple tables for lookups, everyone will hit the same table over and over. This will create a hotspot in the database, with locking and latching issues all over the place.
- Acrobatic constraints: Defining constraints on a generic table becomes very difficult. Not an impossible deal, but very difficult. For the schema in this example, you could define a CHECK constraint to enforce the use of the correct data type, but the syntax of the constraint will not be very straightforward:
CHECK( CASE WHEN lookup_code = 'states' AND lookup_code LIKE '[A-Z][A-Z]' THEN 1 WHEN lookup_code = 'priorities' AND lookup_code LIKE '[0-9]' THEN 1 WHEN lookup_code = 'countries' AND lookup_code LIKE '[0-9][0-9][0-9]' THEN 1 WHEN lookup_code = 'status' AND lookup_code LIKE '[A-Z][A-Z]' THEN 1 ELSE 0 END = 1 )
It could get even worse
As soon as you start to realize that trading multiple lookup tables for an OTLT is not a good deal, devil will raise the bid and offer the ultimate generalization: the Entity Attribute Value, also known as “EAV”.
If you come to think of it, who needs fixed attributes in a table when you can have as many attributes as you want in a general-purpose table? Why messing with ALTER TABLE statements when you can have a single table that can store an infinite number of attributes that you can bind to any row in any table?
A typical EAV schema looks like this:
This way, you can have any type of attribute bound to your main entities. For instance, to define a “ship_date” attribute in your Orders table, you just have to insert a couple of rows in your EAV schema:
INSERT INTO Entities (entity_id, entity_name) VALUES (1, 'Orders'); INSERT INTO AttributeNames (attribute_id, entity_id, attribute_name) VALUES (1, 1, 'ship_date'); INSERT INTO AttributeValues (attribute_id, entity_id, id, value) VALUES (1, 1, 123, '2015-06-24 22:10:00.000');
Looks like a great idea, doesn’t it? Unfortunately, it is not.
- Generic data types: again, what would prevent a date such as ‘2015-02-30 18:30:00.000’ from being assigned to the ship date? Uh-oh: nothing.
- No foreign keys: again, enforcing foreign key constraints would be impossible.
- A single hotspot in the database: every attribute for every table involved in this nonsense would have to be looked up in the same table.
- No constraints: how would you enforce a constraint as simple as “NOT NULL”? Good luck with that.
- Dreadful reporting queries: when you will be asked to create a report on a table that uses this paradigm (I said “when”, not “if”, because it will happen), you will have to OUTER JOIN to the EAV table for each and every attribute that you want to retrieve. In case you are wondering if this is good or bad, take into account that the optimizer starts to freak out when it finds too many JOINS in a query and will likely timeout looking for a decent execution plan, feeding you the best it could come up with (usually, a mess).
Some software solutions are entirely based on user-defined attributes and the ability to define them is a central feature. For instance, many CRM solutions are heavily dependent on user-defined attributes. However, there are many ways to achieve the same results without resorting to an EAV design. For instance, one could wonder why ALTERing the database schema seems to be a less desirable solution.
There are also many flavors of EAV, with different degrees of evil involved. Some implementations at least provide different columns for different data types, some others use XML or JSON.
The EAV design comes with the intent of solving a real world problem that doesn’t have a definitive answer in the relational model. In partial defense of the “generalizers”, it has to be said that this is a challenging problem. Nevertheless, like Dante put his political enemies to hell, I am the “poet” and I’m afraid that the generalizers will have to get accustomed to sulfur. It just takes a couple of thousand years, after all.
Who will be damned next?
In the next circle of the SQL Server hell we will meet the shaky typers – the poor souls that chose the wrong data types for their columns. Stay tuned for more!
Posted on June 24, 2015, in SQL Server, SQL Server Infernals and tagged Database Design, EAV, normalization, SQL Server, SQL Server Infernals, Worst Practices. Bookmark the permalink. 1 Comment.
Reblogged this on Alessandro Alpi's Blog.