SQL Server Infernals – Circle 1: Undernormalizers



There’s a special place in SQL Server Hell for those who design their schemas without following the Best Practices. In this first episode of SQL Server Infernals, we will explore together the Row of the Poor Schema Designers, also known as “undernormalizers”.

What they say in Heaven

In Heaven, where all Best Practices are followed and everything runs smoothly while angels sing, they design their databases following the rules of normalization. Once upon a time, there was a man who spent a considerable amount of his life working on defining the rules of the relational model. That man was Edgar Codd.

Mr. Codd laid down the rules of normalization, which are known as “normal forms”. The normal forms describe the characteristics of a well-designed database schema. While there are further normal forms, it is widely accepted that a schema is normalized when it satisfies the first three. Here is the simplest possible enunciation of each:

  • 1NF – Every relation has a primary key, every relation contains only atomic attributes
  • 2NF – 1NF + Every attribute in a relation depends on the whole key
  • 3NF – 2NF + Every attribute in a relation depends only on the key

In a single line: “The key, the whole key, nothing but the key (so help me Codd)”.
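
To make the three forms concrete, here is a minimal sketch of a schema that satisfies them (table and column names are invented for the example):

    -- Hypothetical order-tracking schema.
    -- 1NF: every table has a primary key and every column is atomic.
    CREATE TABLE dbo.Customers (
        CustomerCode varchar(20)   NOT NULL PRIMARY KEY,
        CustomerName nvarchar(100) NOT NULL
    );

    CREATE TABLE dbo.Orders (
        OrderNumber  int         NOT NULL PRIMARY KEY,
        CustomerCode varchar(20) NOT NULL
            REFERENCES dbo.Customers (CustomerCode),
        OrderDate    date        NOT NULL
        -- 2NF/3NF: no CustomerName here. It depends on CustomerCode,
        -- not on the key of Orders, so it lives in Customers only.
    );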

Clues you’re doing it wrong

  • Repeating data (redundancies): the same information has to be saved in multiple places
  • Inconsistent data between tables (anomalies): the same information has different values in different tables
  • Data separated by commas
  • Structured data in “note” columns
  • Columns with a numeric suffix (e.g. Zone1, Zone2, Zone3…)

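For illustration only, here is a hypothetical table that shows several of these warning signs at once:

    -- Invented example of an under-normalized table: comma-separated values,
    -- numbered columns, repeated customer data and no primary key at all.
    CREATE TABLE dbo.CustomerOrders (
        OrderNumber  int           NOT NULL,
        CustomerName nvarchar(100) NULL,  -- repeated on every order row
        Emails       varchar(4000) NULL,  -- 'sales@company.com,marketing@company.com'
        Zone1        varchar(50)   NULL,  -- repeating group hidden behind
        Zone2        varchar(50)   NULL,  -- a numeric suffix instead of
        Zone3        varchar(50)   NULL,  -- a separate child table
        Notes        nvarchar(max) NULL   -- structured data buried in free text
    );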

What will send you to Hell

  1. No primary key: did you notice that the normal forms talk about “relations” rather than “tables”? The relational model is a mathematical model which, at some point, has to be translated into a physical implementation. Tables are exactly that: the physical implementation of relations.
    If your table has no primary key and relations must have a primary key, chances are that your table is the physical implementation of something else (a bin, a pile, a bag… whatever: not a relation anyway).
    When tables have no primary key, any data can be stored inside them, even duplicate rows. Once duplicate data is inside the table, there is no way to tell which row is good and which one is the duplicate.
  2. Surrogate keys only: this is in fact a variation on the “no primary key” sin: if your table has a surrogate key (such as an identity or uniqueidentifier column), make sure that it is not the only unique key in the table; otherwise you will end up storing duplicates that differ only in the surrogate key. This is no different from having no primary key at all.
    If you decide that your table is best implemented with a surrogate key (often because the natural key is composite or too wide), make sure that you also create a UNIQUE constraint on the natural key (see the first sketch after this list).
  3. Non-atomic attributes: if your table has columns that contain multiple values, the likelihood of a design mistake goes through the roof. For instance, when you find data such as “sales@company.com,marketing@company.com” in an “email” column, chances are that the designer of the database forgot to take into account that the entity (for instance, a customer) might have multiple email addresses.
    While many efficient split algorithms are available, storing the data in this format has many downsides:

    • Indexing individual items is impossible
    • Searching for individual items is hard
    • Updating an item requires writing the whole comma separated value
    • Locking a single item is impossible (reduced concurrency)
    • CHECK constraints are hard to implement

    Whenever you find non-atomic attributes, be prepared to refactor the database schema, because something is really wrong and there is no way to fix it without moving the attribute to a different table (see the second sketch after this list).

  4. Use of NULL when not necessary: nullability is a constraint on the data: if an attribute is mandatory, make it mandatory! Allowing NULLs on mandatory data opens the door to data that does not meet the business rules. What are you going to do with rows that are missing mandatory attributes?
    If your table has too many NULLs in it, you probably have designed it wrong and you are trying to fit too many attributes in a single table (an implicit dependency?): move them to a separate table.
  5. Use of dummy data: the other side of the coin is “no NULLs allowed anywhere”. Not all attributes are mandatory: if you pretend they are, users will start putting dummy data into your columns to work around the restriction. Typical examples are “.” or empty strings for character-based columns and “0” for numeric columns. Once those dummy values are in, can you tell the difference between “dummy” zeros and “real” zeros? Of course you can’t.
  6. Designing the database when specs are incomplete/unclear: This is the worst of all mistakes. Changing the database schema once it is in production is a bloodbath: everything built on top of that schema will have to change. Many of the design mistakes described above are the consequence of incomplete specifics or lack of analysis.
    It is discouraging to note how some popular design patterns do not take into account the intrinsic complexity of refactoring a database schema and demand the implementation of the schema to automated tools, based on the object classes that represent the domain. Yes, I’m talking about you, Code First. In my book, “code first” is a synonym of “design someday”. Don’t let any automated tool design your database schema: you know better than that!
  7. Premature denormalization: some devil’s advocates will tell you that normalization slows down the database and that you should denormalize your schema from the start. Don’t believe what they say! Most normalized schemas can cope with sustained reads, and SQL Server offers many features (such as indexed views, sketched after this list) to deal with the high number of joins, which is usually the main argument for denormalization. Of course, if you’re working on a BI project, denormalization is expected and desirable.
    An old saying goes: “Normalize ’til it hurts, then denormalize ’til it works”, but there’s nothing preventing a normalized schema from performing well. If reads on the database are slow, it is quite unlikely that the cause is over-normalization: it is much more likely that your queries and/or your indexes are sub-optimal.
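
As a quick illustration of point 2, here is a minimal sketch (invented names) of a table with a surrogate primary key plus a UNIQUE constraint that protects the natural key:

    -- Surrogate key for convenience, UNIQUE constraint to keep duplicates out.
    CREATE TABLE dbo.Products (
        ProductId   int           NOT NULL IDENTITY(1,1),
        ProductCode varchar(20)   NOT NULL,   -- natural key
        ProductName nvarchar(100) NOT NULL,
        CONSTRAINT PK_Products PRIMARY KEY (ProductId),
        CONSTRAINT UQ_Products_ProductCode UNIQUE (ProductCode)
    );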
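
Point 3 can be illustrated the same way: instead of a comma-separated “email” column, a hypothetical child table (reusing the Customers table sketched earlier) stores one address per row, which makes indexing, searching, locking and CHECK constraints straightforward:

    CREATE TABLE dbo.CustomerEmails (
        CustomerCode varchar(20)  NOT NULL
            REFERENCES dbo.Customers (CustomerCode),
        EmailAddress varchar(320) NOT NULL,
        CONSTRAINT PK_CustomerEmails PRIMARY KEY (CustomerCode, EmailAddress),
        CONSTRAINT CK_CustomerEmails_Format CHECK (EmailAddress LIKE '%_@_%._%')
    );

    -- Searching for individual addresses becomes a plain, indexable query.
    SELECT CustomerCode
    FROM dbo.CustomerEmails
    WHERE EmailAddress LIKE 'marketing@%';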
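
And for point 7, this is roughly what an indexed view looks like (again on the invented Customers and Orders tables): the join and the aggregation are materialized once, so frequent readers don’t pay for them on every query:

    -- The view must be schema-bound and use two-part names;
    -- with GROUP BY, COUNT_BIG(*) is required.
    CREATE VIEW dbo.OrdersPerCustomer
    WITH SCHEMABINDING
    AS
    SELECT  c.CustomerCode,
            COUNT_BIG(*) AS OrderCount
    FROM    dbo.Orders AS o
    JOIN    dbo.Customers AS c ON c.CustomerCode = o.CustomerCode
    GROUP BY c.CustomerCode;
    GO

    -- The unique clustered index is what actually materializes the view.
    CREATE UNIQUE CLUSTERED INDEX IX_OrdersPerCustomer
        ON dbo.OrdersPerCustomer (CustomerCode);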

In the next episode of SQL Server Infernals I will discuss a particular database design sin: the dynamic schema. Stay tuned for more.


Posted on June 17, 2015, in SQL Server, SQL Server Infernals. 9 Comments.

  1. Nicely done, Gianluca! Keep ’em coming.

  2. Very well done Gianluca, nice idea using Dante’s Inferno! Also, some great reads about relational database theory can be found at http://www.dbdebunk.com/ by Fabian Pascal!

  3. Reblogged this on Alessandro Alpi's Blog and commented:
    Considerations on normalization, one of the most popular topics when talking about RDBMSs.

  4. I agree with everything except Number 6.

    “This is the worst of all mistakes. Changing the database schema once it is in production is a bloodbath: everything built on top of that schema will have to change. ”

    Of course everything needs to change, but why is that a bloodbath? It’s simple to search the source code for referenced objects, and then it is just a matter of refactoring. If you make judicious use of ORMs, views, and stored procedures, this should be a breeze.

    “Many of the design mistakes described above are the consequence of incomplete specifics or lack of analysis.”

    Welcome to agile. It’s impossible to know all requirements up-front in modern projects. Technology and business needs change too quickly. If design must change then we relational guys must be willing to change our designs. The alternative is more people using schema-less solutions to avoid curmudgeonly DBAs that refuse to refactor design. I will say that it is CRITICAL that all stakeholders understand that incomplete/changing requirements will add overhead to the data design process…but I don’t see why this is such a big deal.

    Frankly, 15 years ago when everything was still waterfall I/we still screwed up our schema designs and had to refactor anyway. Nothing will be perfect on the first try despite best efforts.

    “It is discouraging to note how some popular design patterns do not take into account the intrinsic complexity of refactoring a database schema …”

    What intrinsic complexity? I can refactor pretty much any design to any new design without making things complex or risky. The assumptions are good ORMs, stored procs, and views: that’s your abstraction/interface that needs to be refactored. The actual schema refactoring should just be a matter of getting the DDL and data migration scripts right. Challenging for sure, but not as terrible as you make it out to be. If you have to do more than this then frankly you have more wrong with your overall design than just the data persistence piece.

    The other piece is the data migration. That’s just an exercise in creative TSQL. Yes, it is tedious, but not that painful. I work on huge healthcare databases and we are totally “agile”. We are constantly refactoring designs based on new requirements. And we often have terabytes of data that needs to be massaged into the new structures. And we do all of this with SECONDS of downtime. And it’s always well tested (has to be or the FDA would freak out). It’s not something we are scared of.

    Seriously though, it’s comments like this that are leading more people to schema-less solutions…to avoid the “can’t do” attitude. Then that leads to a future where nothing is queryable or reportable.

    “…and demand the implementation of the schema to automated tools, based on the object classes that represent the domain. Yes, I’m talking about you, Code First. In my book, “code first” is a synonym of “design someday”. Don’t let any automated tool design your database schema: you know better than that!”

    The reality is you absolutely can do Code First with the data tier. You absolutely can refactor schema as new requirements come in. We can’t always assume requirements will be static. Those days are gone. In our current world we need to accept that change, even radical change, is inevitable. What we as data practitioners should do is ensure that our stakeholders understand that incomplete requirements WILL require extra effort on the backend to change later. And this is no different for schema-less solutions (if those folks were intellectually honest). Changing a schema will ALWAYS require data migration and that will ALWAYS be tedious.

    I agree with the rest of your post but you should really think about your attitudes towards fuzzy data requirements. Where I consult I find stakeholders are less tolerant of DBAs and data professionals that aren’t able to be “agile”.

    • Dave, you seem to have a better understanding and better tools for working with “agile” databases than the vast majority of DBAs and developers out there.
      I’m glad to hear that ever-changing requirements are not an issue for you. Just take into account that it’s not so for everyone. Projects do fail due to lack of analysis. They do it all the time.
      Yes, the waterfall days are gone, the days when you have to get your database design right from the start are not.
      BTW, whenever I hear the word “agile”, I go for my gun. All the agile buzz has done at least the same amount of harm to the software industry as waterfall has. But this is the subject for a different discussion.

      • I agree, when I hear “agile project” I also cringe. But “being agile” and “drinking the agile kool-aid” are totally distinct. I despise those who tout “agile” as the savior of our industry. Nothing is the savior except good people that can work with stakeholders and deliver profitable products.

        I just ask that you think about your hard-nosed approach to firm upfront requirements and getting the database design right from the start. Look at the trends in our industry, are they moving more toward or away from your views? Why? Then look at how you can fit into this world and be an influencer.

      • Dave, I don’t see the software industry moving from “do good analysis upfront” to “forget about analysis because change is cheap”. I have noticed a decreased quality in analysis documents over the last few years, but I highly doubt it’s intentional.
        Well, it’s just my opinion obviously. I don’t want to force anybody to agree with me. If my views don’t work for you, that’s ok: we’ll agree to disagree and move on. My way of being an influencer in this “modern” world is to give advice that will hopefully save everyone else all the pain I had to go through in my career. You seem to know better than the average target audience of my advice, so maybe this is just too basic for you.

  5. Pingback: SQL Server Infernals – Circle 4: Anarchic Designers | spaghettidba
