So recently I ended up needing to convert an array of characters in Windows-1252 (aka CP-1252), pointed to by a char *, to UTF-8. As always, my first solution: Google. My first hit was this: https://codereview.stackexchange.com/a/40857
(And as I soon learned, CP-1252 is not actually exactly the same as ISO 8859-1: the characters from 128 to 159 differ. For example, 0x80 is the euro sign € (U+20AC) in CP-1252 but a control character in ISO 8859-1. For my purposes, this didn't actually turn out to matter. CP-1252 is also erroneously known as ANSI in the Windows community.)
So I plugged it in and... Nope! So, my trusted second solution: time to actually learn something new!
You can surely find better resources elsewhere, so I won't go too deeply into the UTF-8 specification. But basically it's a variable-width character encoding, meaning that the number of bytes a character takes depends on the character. For standard ASCII, there's full compatibility: characters from 0 to 127 are represented by the exact same bit sequences in both ASCII and UTF-8. For ISO 8859-1, the mapping has kindly been kept the same as well; that is, the 256 characters of ISO 8859-1 map to the first 256 Unicode code points. The byte encoding, however, is not exactly the same.
If you look at a Unicode character table for the first 256 characters, you can see that past 0x7F (character 127), UTF-8 starts using two bytes to store each character. The mapping is pretty straightforward, as you can see.
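To make the mapping concrete, here's a tiny sketch of mine (not from any of the references above) that encodes one ISO 8859-1 character, Ø (0xD8), by hand using the generic two-byte pattern 110xxxxx 10xxxxxx and prints the resulting UTF-8 bytes:

#include <cstdio>

int main()
{
    unsigned char latin1 = 0xD8;                   // Ø in ISO 8859-1 (and CP-1252)
    // Two-byte UTF-8: 110xxxxx 10xxxxxx, where the x bits hold the code point.
    unsigned char first  = 0xC0 | (latin1 >> 6);   // top two bits of the character
    unsigned char second = 0x80 | (latin1 & 0x3F); // low six bits of the character
    std::printf("%02x %02x\n", (unsigned) first, (unsigned) second); // prints: c3 98
    return 0;
}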
Here's my function in C:
#include <stdlib.h>
#include <string.h>

char *ISO88591ToUTF8(const char *str)
{
    /* Worst case every input byte expands to two output bytes, plus the terminator. */
    char *utf8 = (char *) malloc(1 + (2 * strlen(str)));
    if (!utf8)
        return NULL;

    size_t len = 0;
    char *c = utf8;
    for (; *str; ++str) {
        if (!(*str & 0x80)) {
            /* Plain ASCII (0x00..0x7F): copy the byte as-is. */
            *c++ = *str;
            len++;
        } else {
            /* 0x80..0xFF: lead byte 0xC2 or 0xC3, then one continuation byte. */
            *c++ = (char) (0xc2 | ((unsigned char)(*str) >> 6));
            *c++ = (char) (0xbf & *str);
            len += 2;
        }
    }
    *c = '\0';

    /* Shrink to the size actually used; keep the original block if realloc fails. */
    char *shrunk = (char *) realloc(utf8, len + 1);
    return shrunk ? shrunk : utf8;
}
(You can remove the realloc call if you don't need it; it only shrinks the allocation down to the bytes actually used. Just remember that realloc may move the block, which is why its return value is used.)
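If it helps, here's roughly how I call it (my own test snippet, assuming the function above is in scope; the input string is just an example):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* "yx\xd8" is "yxØ" in ISO 8859-1. */
    char *utf8 = ISO88591ToUTF8("yx\xd8");
    if (utf8) {
        printf("%s\n", utf8); /* prints yxØ on a UTF-8 terminal */
        free(utf8);
    }
    return 0;
}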
Here's the C++ version:
#include <cstring>
#include <string>

std::string ISO88591ToUTF8(const char *str)
{
    std::string utf8;
    utf8.reserve(2 * strlen(str) + 1);

    for (; *str; ++str)
    {
        if (!(*str & 0x80))
        {
            // Plain ASCII (0x00..0x7F): copy the byte as-is.
            utf8.push_back(*str);
        }
        else
        {
            // 0x80..0xFF: lead byte 0xC2 or 0xC3, then one continuation byte.
            utf8.push_back((char) (0xc2 | ((unsigned char)(*str) >> 6)));
            utf8.push_back((char) (0xbf & *str));
        }
    }
    return utf8;
}
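Usage with the C++ version is the same idea, just without the manual memory management (again, my own snippet, assuming the function above is in the same file):

#include <iostream>

int main()
{
    // "yx\xd8" is "yxØ" in ISO 8859-1.
    std::cout << ISO88591ToUTF8("yx\xd8") << "\n"; // prints yxØ on a UTF-8 terminal
    return 0;
}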
The basic idea is pretty simple.
If the character code is above 127, we need two bytes.
The first byte should be either 0xc2 or 0xc3, depending on whether the 2nd highest bit is set (we cast to unsigned char before the bit shift; if you didn't, a signed char would be sign-extended and you'd end up with something like 11111110 rather than 00000010).
The second byte ranges from 0x80 to 0xbf, regardless of whether the first byte is 0xc2 or 0xc3.
!(*str & 0x80) checks whether the highest bit is set, using the bitwise AND operator: *str AND 10000000 is non-zero exactly when the highest bit is set, and the leading ! negates that, so this branch is taken for plain ASCII. The non-negated test could be replaced with ((unsigned char)*str > 127) (the cast matters, since a plain char may be signed).
0xc2 | ((unsigned char)(*str) >> 6) shifts the character six bits to the right and then does a bitwise OR with 0xc2, which is 11000010. After the shift only the top two bits remain, so the result is 0xc2 for character codes from 128 to 191 (2nd highest bit not set) and 0xc3 for codes above that (2nd highest bit set). It could be rewritten as ((unsigned char)*str <= 191 ? 0xc2 : 0xc3) for this case.
Finally, 0xbf & *str clears the second highest bit. 0xbf is 10111111, so ANDing with it always leaves the 2nd highest bit unset; that is, we get a value from 0x80 to 0xbf (as long as the high bit is set, which it will be, since we only end up in this branch when it is). This could also be written as 0x80 + ((unsigned char)*str % 64).
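If you want to convince yourself that these bit tricks really match the textbook two-byte encoding, here's a quick check I put together (my own sketch, not part of the original function); it compares the expressions above against the generic 0xC0 | (c >> 6) and 0x80 | (c & 0x3F) form for every byte from 0x80 to 0xFF:

#include <cstdio>

int main()
{
    int mismatches = 0;
    for (unsigned c = 0x80; c <= 0xFF; ++c) {
        unsigned char b = (unsigned char) c;
        // The expressions used in the function above.
        unsigned char first  = (unsigned char) (0xc2 | (b >> 6));
        unsigned char second = (unsigned char) (0xbf & b);
        // The generic two-byte UTF-8 encoding of code points U+0080..U+00FF.
        unsigned char expectedFirst  = (unsigned char) (0xc0 | (b >> 6));
        unsigned char expectedSecond = (unsigned char) (0x80 | (b & 0x3f));
        if (first != expectedFirst || second != expectedSecond)
            ++mismatches;
    }
    std::printf("%d mismatches\n", mismatches); // prints: 0 mismatches
    return 0;
}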
Cool thing is that as I was writing this, I actually stumbled on a bug! Instead of 0xbf & *str I had 0x3f & *str, so I was masking with 00111111, which also clears the highest bit. It happened to produce correct output on the compiler I had originally written this for; I would presume that came down to how that compiler happened to handle the char conversions involved, with the highest bit ending up set again anyway. I tested this on an online compiler (https://onlinegdb.com/BJ-ucfRaf - that should print yxØ) and my original code didn't work on that one.
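For what it's worth, here's the difference in isolation for the Ø example (again my own snippet): with the buggy mask the continuation byte loses its high bit, so the output is no longer valid UTF-8:

#include <cstdio>

int main()
{
    unsigned char c = 0xD8; // Ø in ISO 8859-1
    std::printf("%02x %02x\n", (unsigned) (0xbf & c), (unsigned) (0x3f & c)); // prints: 98 18
    return 0;
}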
Making the writeup was clearly worth it!
Thanks for reading, and hopefully this could be useful to someone one day.