Most DBAs have faced the situation where one of their applications requires
a code that is used to identify products, accounts, or some other business
object, and all of the values for the code always must be numeric. But, for
reporting purposes, users or developers are asking that the codes be displayed
and printed out with leading zeroes. For example, if the code value is 34 it
should always display as a four-digit code, namely 0034.
So, the users request that the column be defined as a character data type
to ensure that leading zeroes are always shown. But what are the drawbacks, if
any, to doing this? Let’s examine the pros and cons of using character data
types to store numeric values.
Data Integrity
If the code is stored as a character data type, say CHAR(4), inserts and
updates could place invalid alphabetic characters into the product code.
Without proper edit checks, any valid character can be stored in the CHAR(4)
column. Non-valid data can be a very valid concern if ad hoc data
modifications are permitted. This is rare in production databases, because
most data modifications occur via pre-defined transactions, instead of by ad
hoc SQL. But even with pre-defined transactions data problems can still occur
if the proper edit checks are not coded into each and every program that can
modify the data.
Perhaps a better solution to ensure data integrity would be to code a CHECK
constraint on the CHAR(4) column that enforces numeric values. For example,
here is the SQL to create a CHECK constraint on a four byte character column
named PROD_CODE:
CREATE
TABLE T1
(PROD_CODE
CHAR(4) NOT NULL
CONSTRAINT CK_PROD_CD
(PROD_CODE LIKE '[0-9][0-9][0-9][0-9]'),
.
. .
The LIKE clause in the CHECK constraint will ensure that only numeric
values can be stored in each of the four positions of PROD_CODE. This is
accomplished using the mask [0-9] to indicate all values within the range from
the number 0 to the number 9.
From an integrity viewpoint the CHECK constraint solves the problem.
However, from a performance viewpoint the CHECK constraint will be a little
less efficient than simply using a numeric data type and letting SQL Server
perform the data integrity checking. The CHECK constraint, however, should
outperform similar checks coding into your application programs.
So let's assume that proper edit checks are coded either into the program
or using a CHECK constraint. If the edit checks are never bypassed this will
remove the data integrity question.
Performance
There is another problem that is related to the performance of using a CHAR
data type to store numeric values. This one revolves around the access plan
used by SQL Server when the data is accessed. Let’s assume we are choosing
between a data type of CHAR(4) and SMALLINT. The SMALLINT data type is the
smallest numeric data type that can be used to store numbers of up to 4 bytes
in length. The TINYINT data type ranges only from 0 to 255, and the INTEGER
data type ranges from -2,147,483,648 through 2,147,483,647.
Now consider the possible number of values that a CHAR(4) column can assume
versus a SMALLINT column. Even if edit checks are coded for the column, SQL
Server is not aware of these when determining the access plan for the
retrieval SQL. Instead, SQL Server will assume that all permissible characters
could be stored in the column. Assuming 26 alphabetic letters, 10 numeric
digits, and the space character can be specified, there is a possibility of 37
distinct characters. Therefore, for a four-byte character column there would
be 374 or 1,874,161 possible values.
By contrast, a SMALLINT column can range from -32,768 to 32,767 producing
65,536 possible small integer values. This produces a much smaller range of
possible values with more of the values actually valid values than the CHAR
solution. There are drawbacks to this solution, too. For example, negative
product codes could be entered. It is quite likely that negative values should
not be used for these type of codes. Additionally, five digit product codes
could be entered, even though we wish to limit the size to four digits.
However, if we adhere to our proper edit check assumption, the data integrity
problems will be avoided here, as well.
An alternate strategy would be to deploy the DECIMAL data type instead.
Specifying DECIMAL(4,0) as the data type removes the possibility of 5 digit
product codes, but negative numbers are still permissible. Once again, though,
with proper edit checking we can avoid this situation.
SQL Server will use the statistics populated by the UPDATE STATISTICS
command to determine the access plan for each SQL statement. For character
columns, the range of values between the highest and lowest in the database
will be larger than numeric columns because there are more total possible
values. Therefore SQL Server will have less accurate information for character
data types that store only numeric data. For this reason, favor a numeric data
type over the CHAR data type. The numeric data types include TINYINT, SMALLINT,
INTEGER, and DECIMAL.
(Note: the NUMERIC data type is equivalent to the DECIMAL data type and is
also a feasible choice.)
Reporting Considerations
The leading zeroes problem might be able to be solved using methods other
than fiddling with the data type in the table. When using a reporting or query
tool you usually can ensure that leading zeroes are shown by using built-in
edit codes (or other features of the reporting tool). Report programs can be
coded to display leading zeroes easily enough by moving the retrieved data to
appropriate display fields.
Summary
In general, it is wise to choose a data type that is closest to the domain
for the column. If the column is to store numeric data, favor choosing a
numeric data type: TINYINT, SMALLINT, INTEGER, or DECIMAL. If the data is a
date or time, use the DATE and TIME data types. For monetary amount use the
SMALLMONEY and MONEY data types. If the data is a single 0 or 1 bit, use the
BIT data type. And finally, if the data is alphanumeric, choose a CHAR,
VARCHAR, or TEXT data type. Simply stated, it is usually best to choose the
data type that is closest to the actual data that is to be stored in the
column. This will reduce the need to code edit checks and may improve overall
performance of those queries accessing the data.
And, when required, always be sure to code appropriate edit checks to
ensure data integrity.