Replace text elements with Regular Expressions

I love/hate regular expressions. I love them for their flexibility and for the time you save by using a regex instead of manipulating strings by hand. I hate them because writing them is such a pain in the… you get the point. Today I had to quickly put together a small tool that would replace certain elements in a text file. To be more precise, it had to read lots of small text files that were kind of bilingual (English/Chinese) and convert them to true Unicode bilingual files. I say "kind of" because the files were plain ASCII, with the English text written normally and the Chinese encoded like this: #$2536#$5231#$AFF. That is, #$, then one or two hex digits for the high-order ("older") byte of the character code, then two hex digits for the low-order ("younger") byte. Doing this by hand would be quite hard, especially since the real files were a bit more complicated than what I have shown here.

I used the Replace method of the Regex class, which is designed specifically for replacing elements in a string. It takes the string you want to modify and a MatchEvaluator delegate. The MatchEvaluator is called every time a match occurs in the input string; it receives a Match object representing that match and returns the string that replaces the matched text. It may sound complicated, but the actual code is plain and simple:

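// Replaces every encoded character in the input with the character it represents.
public string Decode(string encodedString, Regex pattern)
{
    // The Replace method below is passed as the MatchEvaluator delegate.
    return pattern.Replace(encodedString, Replace);
}
 
// Called once per match; the returned string replaces the matched text.
private string Replace(Match match)
{
    string older = match.Groups["Older"].Value;
    string younger = match.Groups["Younger"].Value;
    char character = Convert(older, younger);
    return character.ToString();
}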
 
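// Combines the two hex byte strings into one UTF-16 character.
private char Convert(string first, string second)
{
    // Pad a single-digit low byte so the combined hex string stays aligned.
    if (second.Length == 1)
        second = "0" + second;
    // Fully qualified: inside this class the bare name Convert refers to this method, not System.Convert.
    return (char)System.Convert.ToInt32(first + second, 16);
}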

The groups ‘Older’ and ‘Younger’ are named capture groups in the pattern; they mark the high-order and low-order bytes of the Unicode character code.
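The pattern itself is not shown in this post, so here is a minimal sketch of what one could look like. The expression, the class name PatternSketch and the sample input are all assumptions made for illustration: the sketch presumes every code is #$ followed by one or two hex digits for the high byte and exactly two hex digits for the low byte, captured in the named groups that the evaluator reads.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternSketch
{
    // Hypothetical pattern (not the one used in the original tool):
    // "#$", then 1-2 hex digits named Older, then 2 hex digits named Younger.
    public static readonly Regex CodePattern = new Regex(
        @"#\$(?<Older>[0-9A-Fa-f]{1,2})(?<Younger>[0-9A-Fa-f]{2})");

    public static void Main()
    {
        // With only three hex digits, the engine backtracks so that
        // Younger still receives two digits and Older gets the remaining one.
        Match match = CodePattern.Match("some text #$AFF more text");
        Console.WriteLine(match.Groups["Older"].Value);   // A
        Console.WriteLine(match.Groups["Younger"].Value); // FF
    }
}
```

Named groups keep the evaluator readable: it asks for "Older" and "Younger" by name instead of relying on group positions, which would silently shift if the pattern grew.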

The ‘Replace’ method simply takes the byte codes from the matched string and calls Convert, which returns the single character that the code represents; that character is then put in place of the matched string. With this simple approach I can easily substitute those codes with the actual characters they represent.
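To see the whole substitution run end to end, here is a small self-contained sketch. It is a variation and not the exact code from the tool: the pattern, the class name DecodeDemo and the sample input are assumptions, the evaluator is written as a lambda, and the two bytes are combined arithmetically (high byte shifted left by 8 bits, then OR-ed with the low byte) instead of by string concatenation.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class DecodeDemo
{
    // Assumed pattern; the real one is not shown in the post.
    private static readonly Regex CodePattern = new Regex(
        @"#\$(?<Older>[0-9A-Fa-f]{1,2})(?<Younger>[0-9A-Fa-f]{2})");

    public static string Decode(string encoded)
    {
        // The lambda is converted to a MatchEvaluator and runs once per match.
        return CodePattern.Replace(encoded, match =>
        {
            int high = Convert.ToInt32(match.Groups["Older"].Value, 16);
            int low = Convert.ToInt32(match.Groups["Younger"].Value, 16);
            // Example: 0x4F and 0x60 combine to 0x4F60.
            return ((char)((high << 8) | low)).ToString();
        });
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        // U+4F60 and U+597D are the two characters of the greeting "你好".
        Console.WriteLine(Decode("Hello #$4F60#$597D!")); // Hello 你好!
    }
}
```

Combining the bytes as numbers makes the zero-padding step unnecessary, because a one-digit low byte parses to the same value either way. Text outside the matches is left untouched, so the English parts pass through unchanged.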

July 11, 2007

In .NET, c#, code snippets