SQL Server Infernals – Circle 3: Shaky Typers


Infernals

Choosing the right data type for your columns is first of all a design decision that has tremendous impact on the correctness of the database schema. It is not just about performance or space usage: the data type is the first constraint on your data and it decides what can be persisted in your columns and what is not acceptable.

Choosing the wrong data type for your columns is a mistake that might make your life as a DBA look like hell.

What they say in Heaven

Guided by angelic spells, the hands that design databases in Heaven always choose the right data type. Database architects always look at the logical schema and ask the right questions about each attribute and they always manage to understand what the attribute is used for and what it will be used for in the future.

What will put you to hell

Choosing the wrong data type is like trying to fit a square peg in a round hole. The worst thing about it is that you end up damaging the peg… ahem… the data.

SquarePegRoundHole

  1. Using numeric data types for non-numeric attributes: Even if a telephone number contains only digits and it’s called telephone number, it is not a number at all. It does not allow mathematical operations and it has no order relation (saying that a telephone number is greater than another one makes no sense). In fact, a telephone number is a code you have to dial to contact a telephone extension. The same can be said for ZIP codes, which only allow numeric digits, but are nothing like a number. Storing this data in a numeric column is looking for trouble.
  1. Storing data as their human-readable representation: A Notable example is dates stored as (var)char. The string representation of a date is not a date at all: without the validation rules included in the date types, any invalid date could be saved in your column, including ‘2015-02-30’ or ‘2015-33-99’. Moreover, varchar columns do not allow date manipulation functions, such as DATEADD, DATEDIFF, YEAR, MONTH and so on. Another reason why this is a terrible idea is that dates have their own sorting rules, which you lose when you store them as strings. You also need more storage space to save a string representation of a date compared to the proper date type. If you really want to convert a date to a string, you can find many algorithms and functions to perform the conversion in this article I wrote for SQLServerCentral in 2012, but please do it in your presentation layer, not when storing the data in your tables.
    Another surprisingly common mistake in the AS/400 world is storing dates in three separate integer columns for year, month and day. I have no idea where this pattern comes from, but it definitely belongs to hell.
    While much more uncommon in the wild, the same applies to numbers: storing them as varchars is a terrible idea.
    Extra evil bonus: you get double evil points for storing dates and numbers as nvarchar: double the storage, double the pain.
  1. Using deprecated data types: (n)text and image are things of the past: get over it. The replacement (n)varchar(max) and varbinary(max) are much more powerful and flexible.
  1. Using “extended” data type just to “be safe”: This applies both to numeric and character columns: using a bigger data type just to play it safe can be a good idea at times, but not when the size of the column is well known upfront and is instead a vital constraint on the data itself. For instance, a ZIP code longer than 5 characters is obviously an error. A social security number longer than 9 digits is not valid.
    Along the same lines, storing years in a int column is only going to be a waste of storage space. The same can be said about small lookup tables with just a handful of rows in them, where the key column can be a smallint or even a tinyint: it won’t save much space in the lookup table itself, but it can save lots of space in the main tables (with many more rows) where the code is referenced.
  1. Storing fixed-size information in varchar columns: Similarly to the previous sin, when your attribute has a fixed character size, there is no point in using a varying character type. If your attribute has exactly 3 characters, why use varchar(3)?
    Extra evil bonus: varchar(1) will get you double points.
  1. Storing duration in time or datetime columns: Datetime and time represent points in time and they are not meant for storing durations. If you really want to store a duration, use a numeric column to store the number of seconds (it’s the ANSI standard unit measure for representing a duration). Even better, you could store the start/end date and time in two separate datetime columns. SQL Server 2016 also supports periods. 
  1. Getting Unicode wrong: Choosing nvarchar for attributes that will never contain Unicode data and choosing varchar for attributes that can contain Unicode data are equally evil and will get you instant damnation. For instance, a ZIP code will only contain numeric characters, so using Unicode data types will have the only outcome of wasting space. At the same time, storing customer business names or annotations in varchar columns means that you won’t be able to persist international characters. While it may appear quite unlikely that such characters will ever appear in your database, you will regret your decision when that happens (and it will). 
  1. Messing with XML: I’m not a big fan of XML in the database, but sometimes it can come handy. Storing XML data in a plain varchar column is a very bad idea. The XML data type provides validation rules that won’t allow in invalid or malformed XML and also provides functions to manipulate the XML data. Storing schema-less XML is another bad idea: if you have an XML schema use it, otherwise you will end up saving invalid data. On the other hand, using XML to go “beyond relational” and mimic Oracle’s nested tables will only get you damned. Fun times. 
  1. Using different data types in different tables for the same attribute: there’s only one thing worse than getting your data types wrong: getting them wrong in multiple places. Once you decided the data type to store an attribute, don’t change your mind when designing new tables. If it is a varchar(10), don’t use varchar(15) in your next table. Usually proper foreign key constraints help you avoid this issue, but it’s not always the case.
    If this query returns rows, chances are that you have schizophrenic columns in your database schema:

    WITH my_schema AS (
        SELECT OBJECT_NAME(c.object_id) AS table_name,
            c.name AS column_name,
            t.name AS type_name,
            c.max_length,
            c.precision,
            c.scale
        FROM sys.columns AS c
        INNER JOIN sys.types AS t
            ON c.system_type_id = t.system_type_id
    ),
    incarnations AS (
        SELECT *,
            DENSE_RANK() OVER (
                PARTITION BY column_name
                ORDER BY type_name, max_length, precision, scale
            ) AS incarnation_number
        FROM my_schema
    ),
    incarnation_count AS (
        SELECT *,
            MAX(incarnation_number) OVER (
                PARTITION BY column_name
            ) AS incarnation_count
        FROM incarnations
    )
    SELECT *
    FROM incarnation_count
    WHERE incarnation_count > 1
    ORDER BY incarnation_count DESC,
        column_name,
        type_name,
        max_length,
        precision,
        scale;
    

The lack of proper constraints will be the topic of the next post, when we will meet the anarchic designers. Stay tuned!

Posted on July 2, 2015, in SQL Server, SQL Server Infernals and tagged , , , , . Bookmark the permalink. 14 Comments.

  1. Reblogged this on The Blobfarm and commented:
    Gianluca’s post is spot-on if you are looking for advice when picking data types for your database design !!

  2. One strange datatype is nullable bit. I’ve not checked actual implementation but it looks like implemented in variable block (cost 3 bytes) + one bit in nullability mask.

    Rather to use 2 bits – one about data and other about knowledge about data

    crate table x
    (

    , gender bit null — 0 = F, 1 = M and NULL = Unknown

    )

    rather to include
    (

    , gender bit not null — 0 = F, 1 = M
    , gender_known bit not null — 0 = Unknown, 1 = known

    )

    in view (or query)

    , case when gender_known=1 then gender end as gender — this is enough to add

    Henn

  3. I think it’s short sighted to assume zipcodes are always 5 characters. What about zip+4 in the US? What about other countries? Sometimes, even when you believe a field is “exactly” a number of characters, it’s still good to pad because, as you said, we don’t always understand the logical data and its future uses. In your case, where you set zip code to char(5), zip+4 fails. In my case where I always use varchar(10), I’m good to go. This is why some of us add padding. If I had built my app in the 1980s when zipcodes were only 5 digits, and I set that as my length, when zip+4 came out, I would have had potentially many data layer changes. By adding padding, I just have application layer changes.

    • Thanks for your comment: it’s spot on and I learnt that ZIP codes are longer than 5 digits in the US 🙂
      Maybe ZIP codes are a poor example, but until the time they changed from 5 to 9 characters, the length was a constraint on the data itself: using varchar 10 from the start would have saved you the hassle of enlarging the column later (not a big deal in my opinion), but it would have let in invalid data.
      My point is that future uses of the data should not have precedence over current uses of the data, especially when the shape of the data is a constraint for its validity.

      • Bodhi Densmore

        Flexibility is better than raw efficiency. The postal code constraint normally includes the state or province column. It is a big pain to change a data type from char(9) to varchar(10), especially when foreign key constraints are involved. Better to start with varchar(16) and a constraint that limits the data as needed. Constraints can be changed in an instant but data type changes for length can take hours.

      • I’m sorry, but I disagree. Enlarging a column might be painful, but when the length is a constraint on the data, the length must be set exactly, without leaving room for eventual future changes. My bad for choosing ZIP code as an example, but think of SSN if it helps getting the point across better.

  4. Brian Feifarek

    Gianluca, Thank you for a great summary of type issues that will cause headaches later. One minor point on #6 is that the period link refers to SQL Server creating history tables, not a new datatype– the period it uses is two datetime2 columns that SQL Server manages automatically, so it does not add any more than the sentence before it for this use case.

    • Thank you Brian. I agree that the implementation is not complete yet, but you can still mark two columns as period. All the features that could be useful in other contexts are available only to temporal tables though. Thanks for pointing it out

  5. Henn Sarv:
    You should be using the ISO standard sex code 0 = unknown, 1 = male, 2 = female, 9 = lawful person (corporations, institutions, etc).

    The NULL-able BIT was the result of changing this proprietary data type to a NUMERIC. In SQL, all data types are NULL-able by definition. Originally, the BIT was what us computer geeks think of as a bit = {0,1} and was automatically non-NULL-able. So we never bothered with the explicit NOT NULL constraint. The switch screwed up code even worse than replacing the original Sybase *= outer joins. You got no error messages!

    Dman2306:
    I strongly disagree with padding. When you declared the original column, did you use this DDL?

    zip_code CHAR(5) NOT NULL
    CHECK (zip_code LIKE ‘[0-9][0-9][0-9][0-9][0-9]’)

    most of us did not do this, much less disallow impossible ZIP codes. My favorite was ‘99999’ which routed lots of junk mail to an Eskimo village in Alaska. The old COBOL convention had been to fill unknown values with 9’s (see my remarks on ISO sex codes above), so bulk mailers inherited that kind of data. Arrgh! But it gave the Eskimos winter fuel to burn.

    By allowing garbage data, you put an undue burden on the applications. When we did get ZIP+4, we needed an ALTER TABLE with a

    CHECK (zip_code LIKE ‘[0-9][0-9][0-9][0-9][0-9][ 0-9][ 0-9][ 0-9][ 0-9]’)

    so we could trim the trailing blanks if the application needed only the original ZIP. The +4 changes more than people think, so lots of sites do not use it unless they do bulk mailings. I still get ZIP+4 rejected in on-line forms.

  6. I’m not sure SSNs are any better than zip codes: there are already >300 million people in the US so no capacity for a check digit. And pretty soon there will have to be a format change to prevent reuse of existing numbers

    • Well, it looks like I’m not well informed about the situation in the US. I hope the point is clear anyway.

      • Actually SSN is awful. There is no check digit and we have so many illegals there is a 3-5% forgery rate. Now, to make it worse, the old pattern for allocation is void and the numbers are random!

  1. Pingback: SQL Server Infernals – Circle 4: Anarchic Designers | spaghettidba

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: