Unicode, Perl 6, and You

SwellJoe · on Dec 7, 2015

Perl has had decent Unicode support longer than most similar languages (years before Ruby and Python, for instance), but Perl 6 is just ridiculously good at it, and I hope other languages follow suit. I'm unaware of any other language that handles Unicode this well...am I missing any languages that do? I guess JavaScript is coming along on this front and ES6 includes support for Unicode regexps, which is progress, so maybe that's the closest mainstream language.

bitserf · on Dec 7, 2015

Swift does a pretty good job as well for a first attempt.

https://www.mikeash.com/pyblog/friday-qa-2015-11-06-why-is-s...

lelf · on Dec 7, 2015

> ES6 includes support for Unicode regexps

Will it provide \X for example? (\X matches extended grapheme cluster.)

> am I missing any languages that do?

Swift is one notable example. It has built-in and simple enough grapheme handling.

SwellJoe · on Dec 7, 2015

I haven't looked at Swift, at all. I don't buy Apple products, so have no familiarity with their ecosystem. But, now that it's been opened up, I'll have a look at it, though it seems likely to remain predominantly a language for Apple products for the foreseeable future (I think?), so not something I'd find myself using in production any time soon. But, I guess we'll see how that shakes out over time now that it is open.

Given the rate at which JavaScript is converging on a really nice set of modern features, shedding warts, and getting faster, I wonder if any other language is as relevant long-term.

saurik · on Dec 7, 2015

Swift's Unicode support is sufficiently awesome that my web browser was having issues rendering the documentation for their string class due to the epic working examples ;P.

OopsCriticality · on Dec 7, 2015

> ...am I missing any languages that do?

Tcl supported Unicode in 1999 with version 8.1. Much like Zathras, Tcl is the beast of burden that is easy to overlook.

ploxiln · on Dec 7, 2015

On the downside, even today, Tcl can't handle characters outside the Basic Multilingual Plane. It only does UCS-2 and can't handle UTF-16 surrogate pairs. If you convert an astral-plane codepoint, such as some popular emoji, from UTF-8, Tcl will convert each UTF-8 byte into a separate Unicode codepoint. There are similar catches all over; it's just not practical to deal with non-BMP codepoints in Tcl, even just to round-trip them.
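
To make the surrogate-pair point concrete, here is a rough sketch in Python (not Tcl). The emoji U+1F602 is an arbitrary pick for illustration, and the last step only imitates the byte-per-codepoint symptom described above by decoding the bytes as Latin-1; it is not Tcl's actual code path.

```python
# Illustrative assumption: U+1F602 is just one example of an astral-plane emoji.
emoji = "\U0001F602"

codepoint = ord(emoji)                    # a single code point above U+FFFF
utf8_bytes = emoji.encode("utf-8")        # four bytes in UTF-8
utf16_bytes = emoji.encode("utf-16-be")   # two 16-bit code units in UTF-16

# Split the UTF-16 encoding into its high and low surrogates.
high = int.from_bytes(utf16_bytes[:2], "big")
low = int.from_bytes(utf16_bytes[2:], "big")

# Imitation of the failure mode: treat every UTF-8 byte as its own code point.
mangled = utf8_bytes.decode("latin-1")

print(hex(codepoint))        # 0x1f602
print(len(utf8_bytes))       # 4
print(hex(high), hex(low))   # 0xd83d 0xde02
print(len(mangled))          # 4
```

A UCS-2-only implementation has no way to represent the 0xd83d/0xde02 pair as one character, which is why round-tripping breaks.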

OopsCriticality · on Dec 7, 2015

Rereading OP's comment, I misread it a bit and responded to the wrong part. You're absolutely right, progress on Unicode in Tcl stalled out after the low-hanging fruit of UCS-2 was achieved.

kbenson · on Dec 7, 2015

Well, it's not about "supports Unicode" as much as it's about the level of support. IIRC, I've heard that Tcl had fairly good Unicode support (at least for the time), but I have no idea how it compares to some contemporary versions of languages, or Perl 6 specifically.

kbenson · on Dec 7, 2015

> Don’t worry though, standard Perl 6 does not demand that you be able to type Unicode. If you can’t, there are so-called “Texas” variants:

I've always loved the "everything's bigger in Texas" joke implicit in the "Texas" variant on some operators.

> If you’re interested in working within a particular normalization, there’s the self-explanatory types of NFC, NFD, NFKC, and NFKD.

That would probably be better with a "Well, it's self-explanatory at the point you know you want to work in a particular normalization", since I only vaguely know what those are, and I've been hearing about some of them for years. ;)
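
For anyone else who only vaguely knows them, a minimal sketch of the four forms, using Python's standard `unicodedata` module rather than Perl 6's NFC/NFD/NFKC/NFKD types. The sample characters (é and a superscript two) are assumptions chosen for illustration, not examples from the article.

```python
import unicodedata

composed = "\u00e9"      # é as one precomposed code point
decomposed = "e\u0301"   # e followed by a combining acute accent
superscript = "\u00b2"   # superscript two, a "compatibility" character

# NFC composes where it can; NFD decomposes into base + combining marks.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# The K forms additionally replace compatibility characters with plain ones.
nfc_super = unicodedata.normalize("NFC", superscript)
nfkc_super = unicodedata.normalize("NFKC", superscript)
nfkd_super = unicodedata.normalize("NFKD", superscript)

print(nfc == composed, len(nfc))     # True 1
print(nfd == decomposed, len(nfd))   # True 2
print(nfc_super == superscript)      # True
print(nfkc_super, nfkd_super)        # 2 2
```

Roughly: C vs. D is "composed vs. decomposed", and the K means "also fold compatibility variants", which loses information such as the superscript.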

Great post though!

Grue3 · on Dec 7, 2015

>say "नि".codes; # returns 3

How is "नि" 3 codepoints? There are only two: न and ि . Could this be a bug?

cygx · on Dec 7, 2015

Perl 6 agrees that it consists of only 2 codepoints:

    say "नि".NFC>>.uniname.perl;
    #=> ("DEVANAGARI LETTER NA", "DEVANAGARI VOWEL SIGN I")

Nothing more serious than a typo in the article would be my guess.

cursork · on Dec 7, 2015

Seems to be a typo?

    $ perl6 -e 'say "नि".codes'
    2

EdiX · on Dec 7, 2015

Hexdumped the HTML of the webpage; there are only two codepoints here.
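
The same check can be reproduced outside Perl 6. This is a small Python sketch, assuming the string is exactly U+0928 followed by U+093F (the two names cygx's output lists); it counts code points, looks up their names, and dumps the UTF-8 bytes.

```python
import unicodedata

# Assumption: the string is U+0928 followed by U+093F, written as escapes.
text = "\u0928\u093f"

code_points = [hex(ord(ch)) for ch in text]
names = [unicodedata.name(ch) for ch in text]
utf8_hex = text.encode("utf-8").hex(" ")

print(len(text))      # 2
print(code_points)    # ['0x928', '0x93f']
print(names)          # ['DEVANAGARI LETTER NA', 'DEVANAGARI VOWEL SIGN I']
print(utf8_hex)       # e0 a4 a8 e0 a4 bf
```

Six bytes in UTF-8, two code points, one visible character.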

kamaal · on Dec 7, 2015

Not sure how to interpret this. Are you suggesting the Hindi नि should be two code points because when transliterated to English it is 'Ni' (two letters)?

Grue3 · on Dec 7, 2015

I'm suggesting this string has two Unicode codepoints. I don't know anything about Hindi language or Devanagari script.

hotkeys · on Dec 7, 2015

They were probably using an outdated version of Rakudo; the latest version returns 2, as you would expect.

wtetzner · on Dec 7, 2015

I haven't finished reading through the whole post yet, but if Perl 6 works on graphemes, are ligatures considered to be only one character?

lelf · on Dec 7, 2015

Yes. Bear in mind, however, that ligatures are in Unicode only for backward compatibility.

  > 'ﬄ'.chars
  1
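
As a cross-check in another language, here is a hedged Python sketch. Note that Python's `len` counts code points rather than graphemes, but for this character the two coincide. The compatibility (NFKC) step is an addition for illustration, to show what "backward compatibility" means in practice.

```python
import unicodedata

ligature = "\ufb04"   # the ffl ligature character from the snippet above

name = unicodedata.name(ligature)
expanded = unicodedata.normalize("NFKC", ligature)   # compatibility folding

print(len(ligature))   # 1
print(name)            # LATIN SMALL LIGATURE FFL
print(expanded)        # ffl
print(len(expanded))   # 3
```

So it is one character until you ask for a compatibility normalization, at which point it becomes three ordinary letters.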

logicallee · on Dec 7, 2015

that's sure to ruﬄe some feathers.

(see what I did there)

evmar · on Dec 7, 2015

Try highlighting substrings of that text -- my browser doesn't even know it's multiple characters. (Separate is the idea of a browser displaying ligatures, which already works. But that's because ligatures are a display issue and the source text must have non-ligature text in it.)