SQL Server Update: "Creating SQL Results Sets with Aggregate and Detail Data"

Craig S. Mullins
Return to Home Page
September 2000
	Character Versus Numeric Data Types By Craig S. Mullins Most DBAs have faced the situation where one of their applications requires a code that is used to identify products, accounts, or some other business object, and all of the values for the code always must be numeric. But, for reporting purposes, users or developers are asking that the codes be displayed and printed out with leading zeroes. For example, if the code value is 34 it should always display as a four-digit code, namely 0034. So, the users request that the column be defined as a character data type to ensure that leading zeroes are always shown. But what are the drawbacks, if any, to doing this? Let’s examine the pros and cons of using character data types to store numeric values. Data Integrity If the code is stored as a character data type, say CHAR(4), inserts and updates could place invalid alphabetic characters into the product code. Without proper edit checks, any valid character can be stored in the CHAR(4) column. Non-valid data can be a very valid concern if ad hoc data modifications are permitted. This is rare in production databases, because most data modifications occur via pre-defined transactions, instead of by ad hoc SQL. But even with pre-defined transactions data problems can still occur if the proper edit checks are not coded into each and every program that can modify the data. Perhaps a better solution to ensure data integrity would be to code a CHECK constraint on the CHAR(4) column that enforces numeric values. For example, here is the SQL to create a CHECK constraint on a four byte character column named PROD_CODE: CREATE TABLE T1 (PROD_CODE CHAR(4) NOT NULL `CONSTRAINT CK_PROD_CD` `(PROD_CODE LIKE '[0-9][0-9][0-9][0-9]'), . . .` The LIKE clause in the CHECK constraint will ensure that only numeric values can be stored in each of the four positions of PROD_CODE. This is accomplished using the mask [0-9] to indicate all values within the range from the number 0 to the number 9. From an integrity viewpoint the CHECK constraint solves the problem. However, from a performance viewpoint the CHECK constraint will be a little less efficient than simply using a numeric data type and letting SQL Server perform the data integrity checking. The CHECK constraint, however, should outperform similar checks coding into your application programs. So let's assume that proper edit checks are coded either into the program or using a CHECK constraint. If the edit checks are never bypassed this will remove the data integrity question. Performance There is another problem that is related to the performance of using a CHAR data type to store numeric values. This one revolves around the access plan used by SQL Server when the data is accessed. Let’s assume we are choosing between a data type of CHAR(4) and SMALLINT. The SMALLINT data type is the smallest numeric data type that can be used to store numbers of up to 4 bytes in length. The TINYINT data type ranges only from 0 to 255, and the INTEGER data type ranges from -2,147,483,648 through 2,147,483,647. Now consider the possible number of values that a CHAR(4) column can assume versus a SMALLINT column. Even if edit checks are coded for the column, SQL Server is not aware of these when determining the access plan for the retrieval SQL. Instead, SQL Server will assume that all permissible characters could be stored in the column. Assuming 26 alphabetic letters, 10 numeric digits, and the space character can be specified, there is a possibility of 37 distinct characters. Therefore, for a four-byte character column there would be 37⁴ or 1,874,161 possible values. By contrast, a SMALLINT column can range from -32,768 to 32,767 producing 65,536 possible small integer values. This produces a much smaller range of possible values with more of the values actually valid values than the CHAR solution. There are drawbacks to this solution, too. For example, negative product codes could be entered. It is quite likely that negative values should not be used for these type of codes. Additionally, five digit product codes could be entered, even though we wish to limit the size to four digits. However, if we adhere to our proper edit check assumption, the data integrity problems will be avoided here, as well. An alternate strategy would be to deploy the DECIMAL data type instead. Specifying DECIMAL(4,0) as the data type removes the possibility of 5 digit product codes, but negative numbers are still permissible. Once again, though, with proper edit checking we can avoid this situation. SQL Server will use the statistics populated by the UPDATE STATISTICS command to determine the access plan for each SQL statement. For character columns, the range of values between the highest and lowest in the database will be larger than numeric columns because there are more total possible values. Therefore SQL Server will have less accurate information for character data types that store only numeric data. For this reason, favor a numeric data type over the CHAR data type. The numeric data types include TINYINT, SMALLINT, INTEGER, and DECIMAL. (Note: the NUMERIC data type is equivalent to the DECIMAL data type and is also a feasible choice.) Reporting Considerations The leading zeroes problem might be able to be solved using methods other than fiddling with the data type in the table. When using a reporting or query tool you usually can ensure that leading zeroes are shown by using built-in edit codes (or other features of the reporting tool). Report programs can be coded to display leading zeroes easily enough by moving the retrieved data to appropriate display fields. Summary In general, it is wise to choose a data type that is closest to the domain for the column. If the column is to store numeric data, favor choosing a numeric data type: TINYINT, SMALLINT, INTEGER, or DECIMAL. If the data is a date or time, use the DATE and TIME data types. For monetary amount use the SMALLMONEY and MONEY data types. If the data is a single 0 or 1 bit, use the BIT data type. And finally, if the data is alphanumeric, choose a CHAR, VARCHAR, or TEXT data type. Simply stated, it is usually best to choose the data type that is closest to the actual data that is to be stored in the column. This will reduce the need to code edit checks and may improve overall performance of those queries accessing the data. And, when required, always be sure to code appropriate edit checks to ensure data integrity. From SQL Server Update (Xephon) October 2000. © 2000 Craig S. Mullins, All rights reserved. Home.