SSIS ETL: Handling Character Encodings and Conversions

The Character Encoding Issue

You’ve been happily loading data into your data warehouse for years, but suddenly your SSIS ETL processes have started to fail. You dig into the issue and find that some funny-looking text characters are now arriving in your data. So now what?

I often see this scenario occur for the first time in a company in the names of people or locations. Maybe your business has recently expanded and you now have international employees. When those new employees input data into your systems, they may be using characters and symbols native to their language that you’re not used to seeing. Or maybe the same thing is happening, but from an expanded international customer base.

It could be as simple as an upgrade to a system that your data warehouse sources its data from. The ramifications of the upgrade that made that software Unicode compliant have finally filtered down into your reporting environment. Whatever the reason, you are now getting data in character encodings that you’re not used to. You could try to push back on the people who maintain the source systems and applications to ‘keep the data clean’ on their side, but in many cases they may want the data the way it is. And with many of those applications likely being third-party behemoths, trying to overcome the issue from their side is often futile. Sooner rather than later you are going to have to figure out how to deal with these new character encoding issues yourself.

Handling Extended Character Encodings

Support Them

Supporting the new character encodings is one way to solve the problem. Change your data warehouse to support Unicode and you’ll be all set, right? Well, that could become quite a time-consuming and costly endeavor. Think about all of the tables, stored procedures, and functions in your database you’d have to convert. Think about the additional space requirements of your database if you had to convert all of your character- and string-based columns to be wide. Think about the Analysis Services cubes, the reports, the SSIS ETL processes, the external apps, etc. that access your DW data, all expecting one data type that you’re thinking about changing to another.

Now, that’s an oversimplification of the issues you could run into if you tried to modify your data warehouse to support extended character encodings. There would likely be only a subset of your tables with columns that could potentially receive these new characters that you’d have to worry about changing. But you get the point: the changes, and the effects of those changes, would not be trivial.

Convert Them

Converting those character encodings into something you can work with is another way to handle the problem. Using the right techniques to convert your Unicode data down to a lower encoding format, your SSIS ETL will still be able to load all of your data without errors, and the converted data will still be legible. You do run a slight risk of some characters not down-converting into anything meaningful, but where this happens the conversion produces the same output for those characters every time, so it shouldn’t cause problems in your ETL and database architecture. That risk is often a small price to pay for the benefit of your external apps being able to merrily plug along using your data as they always have, without any changes.

Converting Character Encodings with a SSIS Script Task

Let’s start with the main SSIS script task functions you will want to keep handy for your character conversion work.

	// These functions assume the script references the System.Globalization, System.IO, and System.Text
	//   namespaces (add the corresponding using directives at the top of the script if they are not already there).

	////////////////////////////////////////////////////////////////////////////////////////////////////////////////
	// Converts a file from one specified character encoding to another and stores the result in a new file.
	//
	static void ConvertFileEncoding(string inputFile, string inputEncoding, string outputFile, string outputEncoding) {
		using (StreamReader reader = new StreamReader(inputFile, Encoding.GetEncoding(inputEncoding))) {
			using (StreamWriter writer = new StreamWriter(outputFile, false, Encoding.GetEncoding(outputEncoding))) {
				char[] buffer = new char[8192];
				int len;
				while ((len = reader.Read(buffer, 0, buffer.Length)) != 0) {
					writer.Write(buffer, 0, len);
				}
			}
		}
	}

	////////////////////////////////////////////////////////////////////////////////////////////////////////////////
	// Converts a string from one specified character encoding to another and returns the result as a new string.
	//
	static string ConvertStringEncoding(string inputString, string inputEncoding, string outputEncoding) {
		char[] outputChars = new char[inputString.Length];

		Encoding inputCode = Encoding.GetEncoding(inputEncoding);
		Encoding outputCode = Encoding.GetEncoding(outputEncoding);

		byte[] inputBytes = inputCode.GetBytes(inputString);
		byte[] outputBytes = Encoding.Convert(inputCode, outputCode, inputBytes);
		outputCode.GetChars(outputBytes, 0, outputBytes.Length, outputChars, 0);

		return new string(outputChars);
	}

	////////////////////////////////////////////////////////////////////////////////////////////////////////////////
	// This function can be used to replace characters with diacritics (the tildes, umlauts, circumflexes, etc.
	//   that appear above or below letters) with their equivalent character with the diacritic removed.
	//
	static string RemoveDiacritics(string inputString) {
		string stFormD = inputString.Normalize(NormalizationForm.FormD);
		StringBuilder sb = new StringBuilder();

		// In a FormD normalized string, a character with diacritic marks is stored as a base character followed by
		//   one or more diacritic characters that modify the base character.
		//   This loop only keeps the base character information and drops the diacritic information.
		for (int i = 0; i < stFormD.Length; i++) {
			UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[i]);
			if (uc != UnicodeCategory.NonSpacingMark) {
				sb.Append(stFormD[i]);
			}
		}

		// Where FormD normalization returns a completely decomposed string at its lowest component level, FormC
		//   recomposes a string if single codepoints exist that can represent a character that would require
		//   multiple codepoints in FormD.  For most purposes and in most apps, FormC is the preferred form to use.
		return (sb.ToString().Normalize(NormalizationForm.FormC));
	}

If your SSIS data loads are file-based, in many cases the first function, ConvertFileEncoding, can fit your needs. It’s particularly important to make sure your conversions are done correctly when you are importing a file with fixed-width formatted data. A single multi-byte character you’re not expecting can shift all of your columns out of alignment. A function like this will make sure this kind of issue does not happen.

Maybe you have several file-based loads that you’ve been loading with SSIS’s default 1252 code page selected in your flat file connection manager, and just recently a few extended characters have been sneaking into them. Before you attempt to do anything with your flat file, run it through the ConvertFileEncoding function in a script task with a call similar to this:

ConvertFileEncoding(inputFile, "utf-8", inputFile + ".1252.txt", "windows-1252");

Given that you’ve first assigned the variable “inputFile” as the path to your file, this will create a 1252 code page converted copy of your input file with a “.1252.txt” extension at the same location as the original file. You would then change your SSIS file connection manager to point to the converted file instead of the original, and it should load successfully again like it always used to!
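If it helps to see how that call fits into a complete Script Task, here is a minimal sketch. The package variable names User::SourceFilePath and User::ConvertedFilePath are assumptions for illustration; use whatever variables your package already has, and remember to list them in the Script Task’s ReadOnlyVariables/ReadWriteVariables.

	public void Main() {
		// Read the path to the incoming file from a package variable (variable name assumed for this sketch).
		string inputFile = Dts.Variables["User::SourceFilePath"].Value.ToString();
		string outputFile = inputFile + ".1252.txt";

		// Create the windows-1252 copy of the incoming UTF-8 file.
		ConvertFileEncoding(inputFile, "utf-8", outputFile, "windows-1252");

		// Hand the converted path back to the package so a flat file connection manager with an
		//   expression on its ConnectionString can pick it up in the data flow that follows.
		Dts.Variables["User::ConvertedFilePath"].Value = outputFile;

		Dts.TaskResult = (int)ScriptResults.Success;
	}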

What about the other conversion functions? Well, ConvertStringEncoding does the same encoding conversion as the file converting function, but it does it on strings. This can be useful if, for example, you’ve already successfully pulled data into your data flow and you want to convert it before you write it out somewhere. In this case you could run the conversion on specific columns in your data flow to clean them up. And RemoveDiacritics is a function that can transform characters with diacritic marks into their base unmarked characters. In some scenarios this can help make sure your output string is still readable.
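If the conversion needs to happen inside the data flow itself, the same functions can be called from a Script Component transformation. Below is a minimal sketch; the column name CustomerName is an assumption for illustration, since SSIS generates the Row.&lt;ColumnName&gt; properties from whatever input columns you select in the Script Component editor (the column would need ReadWrite usage for this to compile).

	public override void Input0_ProcessInputRow(Input0Buffer Row) {
		if (!Row.CustomerName_IsNull) {
			// Strip the diacritics first, then force the result down into the 1252 code page.
			Row.CustomerName = ConvertStringEncoding(RemoveDiacritics(Row.CustomerName), "utf-8", "windows-1252");
		}
	}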

Conversion examples

Imagine you have a string in one of your data columns that is having problems loading in your SSIS package. For testing purposes, we will use as our string the phrase “I can eat glass” (borrowed from the UTF-8 SAMPLER) in a standard transcription of Sanskrit:

kācaṃ śaknomyattum; nopahinasti mām.

First, let’s look at why this phrase is useful for our testing purposes. As you should be able to see if you’re using a modern browser, the phrase contains interesting characters with diacritic marks. Let’s investigate the phrase with the following function:

	////////////////////////////////////////////////////////////////////////////////////////////////////////////////
	// This function returns a string where each position holds the number of UTF-8 bytes used by the
	//   corresponding character of the input string.
	//
	static string CountBytes(string inputString) {
		StringBuilder sb = new StringBuilder();

		char[] testChar = new char[1];
		for (int i = 0; i < inputString.Length; i++) {
			testChar[0] = inputString[i];
			sb.Append(Encoding.UTF8.GetByteCount(testChar));
		}

		return (sb.ToString());
	}

Running our test phrase through this function and displaying the results with the original string, we have:


	Input:          kācaṃ śaknomyattum; nopahinasti mām.
	Bytes per char: 221131211111111111111111111111111211

As you can see, some of our characters are stored as 2 bytes in the UTF-8 Unicode encoding, and one even requires 3. This test string would cause us problems if we tried to load it as part of a fixed-column-width file load in the 1252 format, as it would shift the position of all following columns. It would also fail in a data flow where you’re trying to save it into a non-wide database character column that expects single-byte characters.
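If you want to reproduce that check yourself, a quick sketch along these lines will do it (assuming the CountBytes function above is in scope; in a script task you could swap the Console calls for MessageBox.Show or an Information event):

	// Reproduce the byte-count check on the test phrase.
	string testPhrase = "kācaṃ śaknomyattum; nopahinasti mām.";
	Console.WriteLine("Input:          " + testPhrase);
	Console.WriteLine("Bytes per char: " + CountBytes(testPhrase));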

So let’s see what happens when we run this test string through our conversion functions in different ways.


	Input:                                      kācaṃ śaknomyattum; nopahinasti mām.

	utf8 to 1252 conversion:                    kaca? saknomyattum; nopahinasti mam.
	dia. removed:                               kacam saknomyattum; nopahinasti mam.
	dia. removed & utf8 -> 1252 conversion:     kacam saknomyattum; nopahinasti mam.
	utf8 to us-ascii conversion:                k?ca? ?aknomyattum; nopahinasti m?m.
	dia. removed & utf8 -> us-ascii conversion: kacam saknomyattum; nopahinasti mam.

Every one of these output strings contains a single byte per character, but as you can see some appear better than others. Here you get a better idea of how the function to remove diacritics can make a conversion’s output more readable. In practical use, though, you’ll likely find that the characters sneaking into your system are of a type the normal conversion function alone can handle.
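For reference, the rows in the table above come from calls along these lines, a sketch assuming the three functions shown earlier are in scope:

	// The conversion variants shown above, expressed as calls against the earlier functions.
	string input = "kācaṃ śaknomyattum; nopahinasti mām.";

	string to1252       = ConvertStringEncoding(input, "utf-8", "windows-1252");                    // utf8 to 1252
	string noDia        = RemoveDiacritics(input);                                                  // dia. removed
	string noDiaTo1252  = ConvertStringEncoding(RemoveDiacritics(input), "utf-8", "windows-1252");  // dia. removed & utf8 -> 1252
	string toAscii      = ConvertStringEncoding(input, "utf-8", "us-ascii");                        // utf8 to us-ascii
	string noDiaToAscii = ConvertStringEncoding(RemoveDiacritics(input), "utf-8", "us-ascii");      // dia. removed & utf8 -> us-ascii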

Here is a demo dtsx package you can grab to play with the conversion scripts yourself. The baseline tests from this article are provided in the package, taking the test string from a variable and running the different conversions against it. There is also an example of how to run a file through a complete conversion before using it in a data flow. And finally, there is an example of how Unicode data already in a data flow can be converted before you store it. Just disable or enable whichever element you want to test out and have at it.

Conclusion

Now you have a few tools in your belt for tackling those pesky Unicode extended character sets that inevitably appear in your SSIS ETL source data from time to time. Hopefully you find the examples and demos I have provided useful in your SSIS projects, and as always feel free to ask questions or share how these methods helped you out.
