Unicode, Perl 6, and You (perl6advent.wordpress.com)
91 points by kamaal on Dec 7, 2015 | hide | past | favorite | 21 comments



Perl has had decent Unicode support longer than most similar languages (years before Ruby and Python, for instance), but Perl 6 is just ridiculously good at it, and I hope other languages follow suit. I'm unaware of any other language that handles Unicode this well...am I missing any languages that do? I guess JavaScript is coming along on this front and ES6 includes support for Unicode regexps, which is progress, so maybe that's the closest mainstream language.


Swift does a pretty good job as well for a first attempt.

https://www.mikeash.com/pyblog/friday-qa-2015-11-06-why-is-s...


> ES6 includes support for Unicode regexps

Will it provide \X for example? (\X matches extended grapheme cluster.)

> am I missing any languages that do?

Swift is one notable example. It has built-in and simple enough grapheme handling.


I haven't looked at Swift, at all. I don't buy Apple products, so have no familiarity with their ecosystem. But, now that it's been opened up, I'll have a look at it, though it seems likely to remain predominantly a language for Apple products for the foreseeable future (I think?), so not something I'd find myself using in production any time soon. But, I guess we'll see how that shakes out over time now that it is open.

Given the rate at which JavaScript is converging on a really nice set of modern features and is having warts removed and performance is accelerating, I wonder if any other language is as relevant long-term.


Swift's Unicode support is sufficiently awesome that my web browser was having issues rendering the documentation for their string class due to the epic working examples ;P.


> ...am I missing any languages that do?

Tcl supported Unicode in 1999 with version 8.1. Much like Zathras, Tcl is the beast of burden that is easy to overlook.


On the downside, even today, Tcl can't handle characters outside the Basic Multilingual Plane. It only does UCS-2; it can't handle UTF-16 surrogate pairs. If you convert an astral-plane codepoint, such as some popular emoji, from UTF-8, Tcl will convert each UTF-8 byte into a separate Unicode codepoint. There are similar catches all over; it's just not practical to deal with non-BMP codepoints in Tcl, even just to round-trip them.
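
To make the surrogate-pair point concrete, here is a rough sketch in Python (not Tcl; the emoji chosen and the variable names are just for illustration). It shows why one astral-plane codepoint needs two UTF-16 code units, and what the "each UTF-8 byte becomes its own codepoint" failure looks like when simulated by decoding the bytes as Latin-1.

```python
# Illustration only: one astral-plane code point seen through different encodings.
emoji = "\U0001F600"  # GRINNING FACE, outside the Basic Multilingual Plane

utf8_bytes = emoji.encode("utf-8")
utf16_bytes = emoji.encode("utf-16-be")

print(len(emoji))             # 1 code point
print(len(utf8_bytes))        # 4 UTF-8 bytes
print(len(utf16_bytes) // 2)  # 2 UTF-16 code units (a surrogate pair)

# Split the UTF-16 encoding into its two 16-bit units.
high = int.from_bytes(utf16_bytes[:2], "big")
low = int.from_bytes(utf16_bytes[2:], "big")
print(hex(high), hex(low))    # 0xd83d 0xde00

# Simulate the failure mode: treat every UTF-8 byte as a separate code point.
mangled = utf8_bytes.decode("latin-1")
print(len(mangled))           # 4 unrelated code points instead of 1
```

A UCS-2-only implementation has no way to represent the pair as a single character, which is why round-tripping breaks.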


Rereading OP's comment, I misread it a bit and responded to the wrong part. You're absolutely right, progress on Unicode in Tcl stalled out after the low-hanging fruit of UCS-2 was achieved.


Well, it's not about "supports unicode" as much as it's about the level of support. IIRC, I've heard that Tcl had fairly good unicode support (at least for the time), but I have no idea how it compares to some contemporary versions of languages, or Perl 6 specifically.


> Don’t worry though, standard Perl 6 does not demand that you be able to type Unicode. If you can’t, there are so-called “Texas” variants:

I've always loved the "everything's bigger in Texas" joke implicit in the "Texas" variant on some operators.

> If you’re interested in working within a particular normalization, there’s the self-explanatory types of NFC, NFD, NFKC, and NFKD.

That would probably be better with a "Well, it's self-explanatory at the point you know you want to work in a particular normalization", since I only vaguely know what those are, and I've been hearing about some of them for years. ;)

Great post though!
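
For anyone else who only vaguely knows the four forms, a minimal sketch in Python rather than Perl 6 (using the standard `unicodedata` module; the sample characters are my own choice, not from the article). Roughly: NFC composes, NFD decomposes, and the K variants additionally replace "compatibility" characters with their plain equivalents.

```python
import unicodedata

composed = "\u00e9"      # é as one code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
superscript = "\u00b2"   # ² SUPERSCRIPT TWO, a compatibility character

# NFC composes, NFD decomposes; both keep compatibility characters as they are.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, len(nfc))    # True 1
print(nfd == decomposed, len(nfd))  # True 2
print(unicodedata.normalize("NFC", superscript) == superscript)  # True

# NFKC/NFKD also fold compatibility characters into plain equivalents.
nfkc = unicodedata.normalize("NFKC", superscript)
nfkd = unicodedata.normalize("NFKD", composed + superscript)
print(nfkc)       # 2
print(len(nfkd))  # 3: e, combining accent, 2
```

The practical upshot is that two strings that look identical may only compare equal after both are put into the same form.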


> say "नि".codes; # returns 3

How is "नि" 3 codepoints? There are only two: न and ि . Could this be a bug?


Perl6 agrees that it consists of only 2 codepoints:

    say "नि".NFC>>.uniname.perl;
    #=> ("DEVANAGARI LETTER NA", "DEVANAGARI VOWEL SIGN I")

Nothing more serious than a typo in the article would be my guess.


Seems to be a typo?

    $ perl6 -e 'say "नि".codes'
    2


Hexdumped the html of the webpage, there's only two codepoints here.


Not sure how to interpret this, are you suggesting the hindi नि should be two code points because when translated to English it is 'Ni'(two letters)?


I'm suggesting this string has two Unicode codepoints. I don't know anything about Hindi language or Devanagari script.


They were probably using an outdated version of rakudo, the latest version returns 2 as you would expect.
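
As an independent cross-check outside Perl 6, a quick sketch in Python (the string is spelled with escapes so that nothing depends on how this page is encoded; the variable names are arbitrary). It agrees with the count of two and with the names quoted above.

```python
import unicodedata

s = "\u0928\u093f"  # the two code points that render as the single glyph "नि"

print(len(s))  # 2 code points
names = [unicodedata.name(c) for c in s]
print(names)   # ['DEVANAGARI LETTER NA', 'DEVANAGARI VOWEL SIGN I']

# Each of these code points takes 3 bytes in UTF-8, so the byte count is 6.
print(len(s.encode("utf-8")))  # 6

# NFC does not merge them; there is no precomposed character for this pair.
print(len(unicodedata.normalize("NFC", s)))  # 2
```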


I haven't finished reading through the whole post yet, but if Perl 6 works on graphemes, are ligatures considered to be only one character?


Yes. Bear in mind, however, that ligatures are in Unicode only for backward compatibility.

  > 'ﬄ'.chars
  1
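
The same thing can be seen outside Perl 6. A small sketch in Python, assuming the character in question is U+FB04 (the precomposed "ffl" ligature): it is a single code point, and only compatibility normalization (NFKC) expands it into three letters.

```python
import unicodedata

lig = "\ufb04"  # ﬄ, assumed to be the ligature character discussed above

print(len(lig))               # 1
print(unicodedata.name(lig))  # LATIN SMALL LIGATURE FFL

# Compatibility normalization expands the ligature into plain letters.
expanded = unicodedata.normalize("NFKC", lig)
print(expanded, len(expanded))  # ffl 3

# A word spelled with the ligature is not equal to the plain spelling
# until both are normalized.
word = "ru\ufb04e"
print(word == "ruffle")                                 # False
print(unicodedata.normalize("NFKC", word) == "ruffle")  # True
```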


that's sure to ruﬄe some feathers.

(see what I did there)


Try highlighting substrings of that text -- my browser doesn't even know it's multiple characters. (Separate is the idea of a browser displaying ligatures, which already works. But that's because ligatures are a display issue and the source text must have non-ligature text in it.)



