Bug in UTF8Encoding class?

As I work at a translation/localization company, I deal a lot with files in different code pages and strange formats. While working on a project recently I stumbled across a strange thing that I think is a bug in the UTF8Encoding class. I was reading a plain text file, encoded with the Windows-1252 code page, using FileStream and StreamReader. Everything was working fine, but I noticed that when reading one particular file, the program produced strange results. To be precise, when reading a file containing a string like this:

before

Results looked like this:

after

StreamReader ate the umlaut characters. When the file was saved as UTF-8 everything worked fine, and when I used StreamReader with Encoding.GetEncoding(1252) it worked as well. But hey! It ate my characters! It's UTF, it should be able to handle non-ASCII characters properly. It should throw an exception if it can't, not eat them silently. When I opened the file with different, non-.NET tools, like jEdit, where you can tell the tool "open the file using this or that encoding", it displayed properly. So I guess it's not me misunderstanding valid UTF-8 behavior, it's simply a bug. I'm going to report it to Microsoft and we'll see what they say about it.
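For anyone who wants the exception rather than silently altered text, here is a minimal sketch, assuming .NET 2.0 or later. The UTF8Encoding constructor takes a throwOnInvalidBytes flag, and a decoder created that way should raise a DecoderFallbackException when it meets bytes that are not valid UTF-8. The sample bytes and the class name EncodingDemo are made up for illustration; they are not taken from the original file.

```csharp
using System;
using System.IO;
using System.Text;

class EncodingDemo // hypothetical name, for illustration only
{
    static void Main()
    {
        // "für" as Windows-1252 bytes (illustrative sample data).
        // 0xFC is 'ü' in Windows-1252, but a lone 0xFC is not a valid UTF-8 sequence.
        byte[] bytes = { 0x66, 0xFC, 0x72 };

        // Strict UTF-8: first argument = no BOM on output,
        // second argument (throwOnInvalidBytes) = throw on invalid input.
        UTF8Encoding strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            using (StreamReader reader = new StreamReader(new MemoryStream(bytes), strictUtf8))
            {
                Console.WriteLine(reader.ReadToEnd());
            }
        }
        catch (DecoderFallbackException ex)
        {
            Console.WriteLine("Not valid UTF-8: " + ex.Message);
        }

        // Reading the same bytes with the code page they were written in.
        // Assumption: this runs on .NET Framework. On .NET Core / .NET 5+,
        // code page 1252 is only available after registering the provider
        // from the System.Text.Encoding.CodePages package.
        using (StreamReader reader = new StreamReader(new MemoryStream(bytes), Encoding.GetEncoding(1252)))
        {
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}
```

The same two encodings can be passed to a StreamReader opened over a FileStream, so the strict variant can act as a check that a file really is UTF-8 before it is processed.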


June 23, 2007
