
Ah, I forgot 10xxxxxx was not usable, but I also forgot 0xxxxxxx was. What about 11111111? If that's valid then it's 8, if I'm thinking straight.



11111111 is technically possible to use, but it would cause some problems. Sending it over the wire would break telnet, for example, since 0xFF is telnet's IAC (Interpret as Command) byte. Also, since we already introduced 11111110 for 7-byte encodings, we're getting dangerously close to letting the UTF-16 BOM bytes (11111111 11111110) accidentally show up in UTF-8 (this is also why 11111110 wasn't in the original maximum-6-byte UTF-8 spec). I still don't think the UTF-16 BOM could actually show up in our hypothetical extended UTF-8, though, since 11111111 could never be immediately followed by 11111110 (or vice versa) in a well-formed UTF-8 stream: both are head octets, and a head octet must be followed by a 10xxxxxx continuation octet.
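To make the "can't be adjacent" argument concrete, here's a rough structural checker for the hypothetical extended scheme discussed in this thread (11111110 starts a 7-octet sequence, 11111111 an 8-octet one). The names `leading_ones` and `is_well_formed` are made up for illustration, and it only checks head/continuation structure, not overlong forms or code point ranges.

```python
def leading_ones(octet: int) -> int:
    """Count the 1 bits at the top of an octet before the first 0 bit."""
    count = 0
    mask = 0x80
    while mask and octet & mask:
        count += 1
        mask >>= 1
    return count


def is_well_formed(data: bytes) -> bool:
    """Structural check only, under the hypothetical 7- and 8-octet extension."""
    i = 0
    while i < len(data):
        ones = leading_ones(data[i])
        if ones == 1:
            return False  # 10xxxxxx is a continuation octet, not a head
        length = 1 if ones == 0 else ones  # 0xxxxxxx is a single octet
        tail = data[i + 1 : i + length]
        if len(tail) != length - 1:
            return False  # sequence is truncated
        if any((b & 0xC0) != 0x80 for b in tail):
            return False  # every octet after the head must be 10xxxxxx
        i += length
    return True


print(is_well_formed(b"\xff\xfe"))             # BOM-like pair
print(is_well_formed(b"\xff" + b"\x80" * 7))   # 8-octet sequence
```

Both 0xFF and 0xFE are heads here, so whichever one comes first demands a continuation octet next, and the pair is rejected in either order.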

Also note that if you did add 11111111 as a valid head octet representing an 8-octet-long encoding, you'd still only have 42 usable bits (7 continuation octets × 6 bits each, since the first byte is still entirely consumed by the length indicator).
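The bit counts are easy to double-check with a small sketch. `payload_bits` is a hypothetical helper, and it assumes the extended scheme from this thread where 11111110 means 7 octets and 11111111 means 8.

```python
def payload_bits(head: int) -> int:
    """Usable code point bits for a sequence that starts with this head octet."""
    if not 0 <= head <= 0xFF:
        raise ValueError("head must be a single octet")
    ones = 0
    mask = 0x80
    while mask and head & mask:
        ones += 1
        mask >>= 1
    if ones == 0:
        return 7  # 0xxxxxxx: plain single-octet form
    if ones == 1:
        raise ValueError("10xxxxxx is a continuation octet, not a head")
    # The head spends `ones` bits on the length prefix plus one 0 terminator
    # (there is no room for the terminator in 11111111, hence the max).
    head_bits = max(0, 7 - ones)
    # Each of the (ones - 1) continuation octets carries 6 payload bits.
    return head_bits + 6 * (ones - 1)


for head in (0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC, 0xFE, 0xFF):
    print(f"{head:08b} -> {payload_bits(head)} bits")
```

For 11111110 that gives 36 bits and for 11111111 it gives 42, so the eighth octet buys exactly one more continuation octet's worth of payload.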



