Unicode vs Non-Unicode Data Types in SQL Server

This – and probably most of my future blog posts – is based on a conversation I had with a colleague recently. He mentioned that he keeps all of the string columns in his databases as Unicode, so in this post we’ll discuss the pros and cons of such an approach.

Advantages of Using Unicode for All Columns

Future Maintainability

As always, appropriate planning can alleviate a lot of the problems a database can suffer from later in its life.

If we know for certain that a column will never contain Unicode data then we should use a non-Unicode type. For a lot of columns, though, that certainty is hard to come by. Names, addresses and so on could all, at some point, contain Unicode data. They might not now, but as the organisation expands this may become a requirement. Changing data types on a database which has already grown is a big headache, and planning ahead reduces that burden. Having a business analyst on board helps significantly; you can never plan for every scenario, but you can help your future self!

Indexing

Whilst I have listed this as an advantage of always using Unicode, if you plan your databases and your development appropriately it is negated, as you will see below.

This depends on which collation your SQL Server columns use: a SQL collation or a Windows collation (for reference, SQL collations are prefixed with “SQL_”). It also depends on the data type your index is built on and the data type you are comparing it against, whether that comparison happens in a join or in a WHERE clause. If we can ensure that both sides of the comparison are Unicode, then we will have no problems at all.

I have prepared the table with the following code:

IF OBJECT_ID('tempdb..#Person', 'U') IS NOT NULL
	DROP TABLE #Person;

-- One column for each combination of data type (Unicode/non-Unicode)
-- and collation (SQL/Windows)
CREATE TABLE #Person
(
	FirstNameUnicodeSQL NVARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
	FirstNameNonUnicodeSQL VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
	FirstNameUnicode NVARCHAR(50) COLLATE Latin1_General_100_CI_AS NOT NULL,
	FirstNameNonUnicode VARCHAR(50) COLLATE Latin1_General_100_CI_AS NOT NULL
);

INSERT INTO #Person (FirstNameUnicodeSQL, FirstNameNonUnicodeSQL, FirstNameUnicode, FirstNameNonUnicode)
VALUES (N'Dave', 'Dave', N'Dave', 'Dave');

CREATE INDEX Idx_#Person_Unicode_Sql ON #Person (FirstNameUnicodeSQL);
CREATE INDEX Idx_#Person_NonUnicode_Sql ON #Person (FirstNameNonUnicodeSQL);
CREATE INDEX Idx_#Person_Unicode ON #Person (FirstNameUnicode);
CREATE INDEX Idx_#Person_NonUnicode ON #Person (FirstNameNonUnicode);

As you can see below, if the index is on a Unicode column it doesn’t matter whether the comparison side is Unicode or non-Unicode, no matter the collation.
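For reference, these are the kind of probe queries used throughout this section (a minimal sketch against the #Person table above; the literal values are just placeholders):

-- Index on a Unicode column: a seek either way
SELECT FirstNameUnicode FROM #Person WHERE FirstNameUnicode = N'Dave'; -- Unicode literal
SELECT FirstNameUnicode FROM #Person WHERE FirstNameUnicode = 'Dave';  -- non-Unicode literal, converted before the seek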

Index on the Non-Unicode Column – Windows Collation

In this instance we have a non-Unicode column with an index and we are comparing it to a Unicode value.

You will take a small performance hit here, as SQL Server must implicitly convert the non-Unicode side to Unicode. Both sides of a comparison need to have the same data type, so SQL Server calls the internal function CONVERT_IMPLICIT(). Because Windows collations sort VARCHAR and NVARCHAR data consistently, the optimizer can still work out a seek range, so the index seek survives the conversion.
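To reproduce it, a query along these lines (a sketch using the table above) shows the CONVERT_IMPLICIT() in the plan whilst keeping the seek:

-- Windows collation: implicit conversion, but the seek survives
SELECT FirstNameNonUnicode
FROM #Person
WHERE FirstNameNonUnicode = N'Dave';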

Index on the Non-Unicode Column – SQL Collation

You will see a significant degradation in performance with this approach. As above, an implicit conversion is needed and, as you can see below, we get a conversion warning in the plan. Additionally, though, we sacrifice the use of our index: the query plan now shows a scan instead of a seek. The index was built on the non-Unicode column, that’s not the data type we’re comparing with any more and, because a SQL collation sorts VARCHAR data differently from NVARCHAR data, SQL Server cannot translate the comparison into a seek on that index.
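The equivalent probe against the SQL collation column (again, a sketch) is where the scan appears:

-- SQL collation: the implicit conversion cannot be turned into a seek range,
-- so the plan falls back to an index scan, with a conversion warning
SELECT FirstNameNonUnicodeSQL
FROM #Person
WHERE FirstNameNonUnicodeSQL = N'Dave';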

This isn’t the focus of the article but there is a lot more interesting information on indexes and collations that’s definitely worth reading up on.

Disadvantages of Using Unicode for All Columns

Storage Requirements

One of the main disadvantages of using Unicode data types is that they have twice the storage requirement of their non-Unicode counterparts. If we know that a column will never contain a Unicode character, choosing the non-Unicode type can present a massive saving in storage, especially for those big tables with millions of rows.

As you can imagine, for big tables which hold millions of transactions, a transaction description column, for example, could require a large amount of storage without proper planning.
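You can check the doubling for yourself with DATALENGTH(), which returns the number of bytes a value occupies (a quick sketch; the value is arbitrary):

SELECT DATALENGTH(CAST('Dave' AS VARCHAR(50))) AS VarcharBytes,    -- 4 bytes
	DATALENGTH(CAST(N'Dave' AS NVARCHAR(50))) AS NvarcharBytes; -- 8 bytes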

Overflowing Pages

SQL Server stores data in different ways depending on what type of data it is:

  • In-row data: rows which do not exceed the 8KB limit
  • Row overflow data: data which would have exceeded the 8KB limit
  • LOB: Large Object data

A row’s in-row data must fit onto a single page (the limit is 8,060 bytes of the 8KB page); rows cannot span multiple pages, so variable-length values which push a row past the limit are moved into row overflow storage.

VARCHAR(8000) and NVARCHAR(4000) values are stored in-row where the row fits, spilling to row overflow pages when it doesn’t.

VARCHAR(MAX) and NVARCHAR(MAX) values larger than roughly 8,000 bytes are stored on LOB pages.

As you can see, a Unicode data type halves the number of characters you can store before spilling into row overflow or LOB storage, and storing data out-of-row, with the extra page reads to fetch it back, has consequences for performance.
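If you want to see row overflow in action, something along these lines will do it (a sketch; the table and values are hypothetical):

IF OBJECT_ID('tempdb..#Overflow', 'U') IS NOT NULL
	DROP TABLE #Overflow;

CREATE TABLE #Overflow
(
	Col1 VARCHAR(8000) NOT NULL,
	Col2 VARCHAR(8000) NOT NULL
);

-- Together these values exceed the in-row limit, so one of them
-- is pushed to a ROW_OVERFLOW_DATA allocation unit
INSERT INTO #Overflow (Col1, Col2)
VALUES (REPLICATE('a', 8000), REPLICATE('b', 8000));

-- Inspect the allocation units now backing the table
SELECT au.type_desc, au.total_pages
FROM tempdb.sys.partitions AS p
JOIN tempdb.sys.allocation_units AS au ON au.container_id = p.hobt_id
WHERE p.object_id = OBJECT_ID('tempdb..#Overflow');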

It’s worth noting that index key size is also limited (900 bytes for a clustered key; 1,700 bytes for a nonclustered key on recent versions), and the MAX types cannot be used as index key columns at all.
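One practical consequence: a MAX column can still travel with an index, just not as a key column. A sketch with a hypothetical table:

CREATE TABLE dbo.TransactionLog
(
	TransactionId INT NOT NULL,
	Description NVARCHAR(MAX) NULL
);

-- Fails: a MAX type is invalid as an index key column
-- CREATE INDEX Idx_TransactionLog_Description ON dbo.TransactionLog (Description);

-- Works: MAX columns are allowed as INCLUDEd (non-key) columns
CREATE INDEX Idx_TransactionLog_TransactionId
	ON dbo.TransactionLog (TransactionId)
	INCLUDE (Description);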

Transaction Logging

This is also related to size: simply put, the more data we have to move, the bigger the impact on our IO performance. Logs could take longer to write, and if we want to use transactional replication then this will also have to be taken into consideration. More data means more work.

The considerations here are not only IO performance but also networking capacity. If you’re moving data between servers, buildings or jurisdictions, then this becomes a bandwidth requirement.
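One rough way to put a number on it is to watch the log from inside an open transaction (a sketch; the table is hypothetical, and this is best run in a scratch user database rather than tempdb, which logs less):

CREATE TABLE dbo.LogTest
(
	NarrowVal VARCHAR(4000) NULL,
	WideVal NVARCHAR(4000) NULL
);

BEGIN TRAN;

-- 4,000 characters = 8,000 bytes as NVARCHAR
INSERT INTO dbo.LogTest (WideVal) VALUES (REPLICATE(N'x', 4000));

-- Log bytes generated so far by this transaction
SELECT database_transaction_log_bytes_used
FROM sys.dm_tran_database_transactions
WHERE database_id = DB_ID();

COMMIT;

-- Repeat with NarrowVal (4,000 bytes) and compare the two figures.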

Conclusion

Many of us don’t get to design new databases and new tables on a regular basis and, in most circumstances, we can’t just rewrite what we already have. If you do get the opportunity to create new databases and tables, make sure you spend a significant amount of time planning them.
