Framework Tips IV: Check if character exists for given Encoding (CodePage)

In a project I’m currently working on, I needed to check whether a particular character is part of a given CodePage. The problem with .NET’s Encoding class is that, although it maintains a table mapping Unicode characters to codes in a particular CodePage, it keeps that table in a private field. Moreover, it does its best to replace characters it does not contain with some fallback character.

One might use this fact and compare the character received this way from the Encoding instance with the original character, assuming that if they differ, the character is not part of that CodePage. But this is not an elegant solution, and it involves a lot of overhead: first converting the char to a byte[], and then converting it back again.
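For reference, the round-trip comparison described above could be sketched roughly like this. The RoundTripCheck class and the SurvivesRoundTrip method are names made up for this illustration, and ISO-8859-1 is used only because it is available on every .NET runtime without extra setup; any other Encoding could be passed instead.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the framework.
public static class RoundTripCheck
{
    // Returns true when the character comes back unchanged
    // after char -> byte[] -> char with the given encoding.
    public static bool SurvivesRoundTrip(Encoding encoding, char character)
    {
        char[] input = { character };
        byte[] bytes = encoding.GetBytes(input);   // first allocation
        char[] output = encoding.GetChars(bytes);  // second allocation
        return output.Length == 1 && output[0] == character;
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        Console.WriteLine(SurvivesRoundTrip(latin1, 'a'));
        Console.WriteLine(SurvivesRoundTrip(latin1, '\u015f'));
    }
}
```

The two array allocations per call are exactly the overhead mentioned above.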

Another solution is to use an overload of Encoding’s static GetEncoding method, like this:

Encoding.GetEncoding(1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

This way, when you try to convert a character that is not part of the given Encoding’s character set, the encoder fallback throws an exception. So one might use try/catch and be happy with it, but this too is an awful solution. It is also limiting: you have to create the Encoding instance yourself, so you’re helpless when you receive an arbitrary encoding.
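A minimal sketch of that try/catch variant might look like the following. ExceptionFallbackCheck and CanEncode are hypothetical names, and the encoding is looked up by name ("iso-8859-1") purely so the sample runs on any .NET runtime; the code page overload shown above works the same way.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the framework.
public static class ExceptionFallbackCheck
{
    public static bool CanEncode(string encodingName, char character)
    {
        // The encoding has to be created here, with the exception fallbacks,
        // which is why this approach cannot handle an arbitrary instance.
        Encoding strict = Encoding.GetEncoding(
            encodingName,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetByteCount(new[] { character });
            return true;
        }
        catch (EncoderFallbackException)
        {
            // Thrown when the character has no mapping in this encoding.
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(CanEncode("iso-8859-1", 'a'));
        Console.WriteLine(CanEncode("iso-8859-1", '\u015f'));
    }
}
```

Using an exception for what is an expected outcome is the main reason this feels wrong, on top of the limitation that the caller must construct the Encoding.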

After a little bit of poking around I came up with yet another solution that seems to be better, faster and more elegant than those two. However, I didn’t test it thoroughly, so it may have flaws as well (or may not even work at all in some cases). First, let the code speak:

using System;
using System.Text;

namespace ConsoleApplication3
{
    public class Program
    {
        static void Main(string[] args)
        {
            char s = '\x015f';  // exists in 1250 but not in 1252
            char a = 'a';       // exists in both
            char t = '\x00fe';  // exists in 1252 but not in 1250
            Encoding ce = Encoding.GetEncoding(1250);
            Encoding we = Encoding.GetEncoding(1252);
            Print(ce, we, s);
            Print(ce, we, a);
            Print(ce, we, t);
            Console.ReadKey();
        }

        private static void Print(Encoding ce, Encoding we, char c)
        {
            Console.WriteLine("{0}: {3}: {1,-6} {4}: {2,-6}",
                c, ce.Contains(c), we.Contains(c), ce.WebName, we.WebName);
        }
    }

    public static class EncodingExtensions
    {
        public static bool Contains(this Encoding encoding, char character)
        {
            Encoding enc = encoding;
            if (!(enc.EncoderFallback is EncoderFallbackCheckExists))
            {
                // Always work on a clone: the original may be read-only, and even
                // when it is not, the caller's instance should stay untouched.
                // You might want to cache these clones, in order to avoid having
                // to clone the given encoding every time.
                enc = (Encoding)encoding.Clone();
                enc.EncoderFallback = new EncoderFallbackCheckExists();
            }

            // With the no-op fallback, an unmappable character yields 0 bytes.
            int result = enc.GetByteCount(new char[] { character }, 0, 1);
            return result > 0;
        }
    }

    // Encoder fallback that never supplies a replacement character.
    internal class EncoderFallbackCheckExists : EncoderFallback
    {
        public override int MaxCharCount { get { return 1; } }

        public override EncoderFallbackBuffer CreateFallbackBuffer()
        { return new FallbackBufferCheckExists(); }
    }

    // Fallback buffer that is always empty: returning false from Fallback
    // tells the encoder there is nothing to substitute.
    internal class FallbackBufferCheckExists : EncoderFallbackBuffer
    {
        public override int Remaining { get { return 0; } }

        public override bool Fallback(char charUnknown, int index)
        { return false; }

        public override bool Fallback(char charUnknownHigh, char charUnknownLow, int index)
        { return false; }

        public override char GetNextChar() { return '\0'; }

        public override bool MovePrevious() { return false; }
    }
}

I created two classes: one inheriting from EncoderFallback and one inheriting from EncoderFallbackBuffer. Basically, my idea was to provide the Encoding instance with a fake encoder fallback that does not try to supply any fallback character. That way the Encoding fails silently (and fast), and its GetBytes and GetByteCount methods return an empty array and 0, respectively.

The only problem I had was injecting the actual EncoderFallbackCheckExists instance into the Encoding’s EncoderFallback property. Although this property has a setter, trying to set it when IsReadOnly is true raises an exception. Encoding, however, implements ICloneable, and cloning it does not preserve its read-only state. So after it’s cloned, you can safely assign its EncoderFallback.
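The read-only behaviour and the clone workaround can be seen in isolation with a small sketch like this one. CloneDemo and WithEncoderFallback are hypothetical names, and EncoderFallback.ExceptionFallback stands in for the custom fallback just to keep the sample short; no expected output is listed because the exact exception message may differ between runtimes.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the framework.
public static class CloneDemo
{
    // Returns a writable copy of the encoding that uses the given encoder fallback.
    public static Encoding WithEncoderFallback(Encoding source, EncoderFallback fallback)
    {
        Encoding clone = (Encoding)source.Clone(); // the clone is not read-only
        clone.EncoderFallback = fallback;
        return clone;
    }

    public static void Main()
    {
        Encoding original = Encoding.GetEncoding("iso-8859-1");
        Console.WriteLine("original.IsReadOnly: " + original.IsReadOnly);

        if (original.IsReadOnly)
        {
            try
            {
                // Setting the fallback on a read-only instance is rejected.
                original.EncoderFallback = EncoderFallback.ExceptionFallback;
            }
            catch (InvalidOperationException ex)
            {
                Console.WriteLine("setter rejected: " + ex.Message);
            }
        }

        Encoding clone = WithEncoderFallback(original, EncoderFallback.ExceptionFallback);
        Console.WriteLine("clone.IsReadOnly: " + clone.IsReadOnly);
    }
}
```

Because the clone is a separate object, the instance the caller passed in keeps its own fallback.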

I also created a simple EncodingExtensions class with a single extension method to wrap the whole logic and attach it to the Encoding class, so that you can write:

Encoding encoding = Encoding.GetEncoding(1256);
bool b = encoding.Contains('ź');

Looks good to me, and as far as I’ve checked, it works. However, if you have a better idea of how to accomplish this, please leave a comment.


Comments

Pavel says:

Thanks!