From PyPy's Twitter feed:
"unicode-utf8 just got merged! Unicode strings are now internally represented as utf-8 in PyPy, with an optional extra index data structure to make indexing O(1). We'll write a blog post about it eventually."
Python strings are random-access indexable. UTF-8 has variable-length characters, so it is not directly randomly accessible. So the Python 3 representation inflates them to 1, 2, or 4 bytes per character, depending on the longest character in the string.
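That inflation is visible from CPython itself (this is PEP 393 behavior): a quick sketch comparing the footprint of an all-ASCII string with one that contains a single astral-plane character, which forces the whole string to 4 bytes per character.

```python
import sys

ascii_s = "a" * 100                 # every char fits in 1 byte of storage
wide_s = "a" * 99 + "\U0001F600"    # one emoji forces 4 bytes per char

# The wide string holds the same number of characters but takes
# roughly 4x the payload.
print(sys.getsizeof(ascii_s))
print(sys.getsizeof(wide_s))
```

Exact byte counts vary by CPython version, but the roughly 4x payload gap is stable.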
In practice, most strings are not randomly accessed. Most subscripts are ±1 from a previous subscript, so most actions can be optimized into uses of "advance one UTF-8 char" or "back up one UTF-8 char". Hard cases require generating an index array for the entire string, which takes a full string scan. Apparently that's what PyPy is doing.
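The "advance one UTF-8 char" / "back up one UTF-8 char" operations fall directly out of the encoding. A minimal Python sketch (assuming the starting position is already at a valid character boundary in well-formed UTF-8):

```python
def advance(buf: bytes, pos: int) -> int:
    """Return the byte offset just past the codepoint starting at pos."""
    b = buf[pos]
    if b < 0x80:       # 0xxxxxxx: 1-byte (ASCII)
        return pos + 1
    elif b < 0xE0:     # 110xxxxx: 2-byte sequence
        return pos + 2
    elif b < 0xF0:     # 1110xxxx: 3-byte sequence
        return pos + 3
    else:              # 11110xxx: 4-byte sequence
        return pos + 4

def retreat(buf: bytes, pos: int) -> int:
    """Return the byte offset of the codepoint preceding position pos."""
    pos -= 1
    while buf[pos] & 0xC0 == 0x80:   # skip continuation bytes (10xxxxxx)
        pos -= 1
    return pos

s = "a\u00c9\U0001F600".encode("utf-8")  # 1-byte, 2-byte, 4-byte chars
assert advance(s, 0) == 1
assert advance(s, 1) == 3
assert retreat(s, 3) == 1
```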
This raises the cost of functions like "find()", which search the string for something and return an integer which is a position in the string suitable for random access use. I'd once suggested that "find", etc. return an opaque object which was really a byte index into the string. If you used that in a subscript, the right thing would happen. If you converted it to an integer, the index array would have to be computed. Is PyPy doing that, or does the use of "find" force index generation?
In various languages, there are generally two cases where I'll use the result of 'find' (or similar).
* Is there ANY occurrence of the byte sequence within the 'string'?
(I only care if it's 'invalid', right at the start or end, etc)
* I'm about to use it to create a splice / as part of parsing.
I could be convinced to use strings.Split() if I knew the resulting objects were more like Go's slices:
Thin array bounds wrappers indirectly pointing to the same backing array.
In Python, `"substring" in "string"` computes a boolean result suitable for use in an if statement. find() is for the second case. Python strings are immutable, so substring and split can be very cheap (pointing to same buffer without copying), but explicitly copying a substring to enable the GC to collect the large string is not very nice.
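Concretely, the two Python APIs look like this: membership testing never needs an index, while find() hands back a position for slicing.

```python
s = "hello, world"

# Case 1: is there ANY occurrence? A boolean, no index involved.
assert "world" in s
assert "xyzzy" not in s

# Case 2: find() returns a character index (or -1), usable for slicing.
i = s.find("world")
assert i == 7
assert s[i:] == "world"
assert s.find("xyzzy") == -1
```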
You might think that, but Java had the same problem with substring and decided that copying WAS actually better than the alternative of sharing the original buffer. I might be wrong, but I think Java 6 or 7 changed the behavior to always copy, after significant analysis of the trade-off space. Pathological cases do occur, and often enough to render a process entirely dead.
Go has functions like strings.Index(s, substr) and strings.LastIndex(s, substr), which return a byte index into a UTF-8 string. These just replicate similar C library functions.
The C library is perhaps not the place to look for good string semantics.
Go could have returned a slice covering the rest of the string instead. Or maybe slices of "everything up to the match", "the matching part", and "everything after the match". That would avoid dealing with direct indices into a UTF-8 byte string. If the inputs are valid UTF-8, the outputs will be valid UTF-8. With indices, it's so easy to mess it up. Or worse, create a string-parsing vulnerability.
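For what it's worth, Python's built-in str.partition already returns exactly that three-slice shape instead of an index:

```python
# partition splits on the FIRST occurrence of the separator and always
# returns a 3-tuple: (before, match, after).
head, sep, tail = "key=value".partition("=")
assert (head, sep, tail) == ("key", "=", "value")

# No match: the whole string comes back as head, with empty sep/tail,
# so callers never see a sentinel index like -1.
head, sep, tail = "no-equals-here".partition("=")
assert (head, sep, tail) == ("no-equals-here", "", "")
```

Because valid UTF-8 in means valid UTF-8 out for each piece, there's no index to misuse.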
> This raises the cost of functions like "find()", which search the string for something and return an integer which is a position in the string suitable for random access use. I'd once suggested that "find", etc. return an opaque object which was really a byte index into the string.
Good idea, but it can lead to some hard-to-handle special cases. Consider, for example, running find() on a number of large strings and storing the opaque results in a list. Since the opaque objects would need their strings in order to be evaluated, Python wouldn't be able to garbage-collect those strings, potentially leading to hard-to-track-down and unexpected out-of-memory errors. One could mitigate the problem by using a buffer to limit the size of the strings kept alive this way, but eh, it gets complicated. :)
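A toy demonstration of that lifetime hazard, using a hypothetical OpaqueIndex class (not a real PyPy or CPython type) and a str subclass so a weakref can observe collection:

```python
import gc
import weakref

class BigString(str):
    """str subclass only so we can attach a weakref (plain str can't be)."""

class OpaqueIndex:
    """Hypothetical opaque find() result: a position plus a strong
    reference to the indexed string -- the reference is the problem."""
    def __init__(self, s, pos):
        self.s = s
        self.pos = pos

big = BigString("x" * 1_000_000 + "needle")
alive = weakref.ref(big)

results = [OpaqueIndex(big, big.find("needle"))]  # store the opaque result
del big
gc.collect()

# The 1 MB string is still pinned in memory by the stored result.
assert alive() is not None

results.clear()
gc.collect()
assert alive() is None  # only now can it be collected
```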
You wouldn't have to use an opaque object - you could use an object that contains both the character index and the byte index, which the indexing operation would understand how to use.
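A sketch of such a dual-index object (all names here are hypothetical; defining __index__ lets Python's normal subscripting accept it directly):

```python
class StrIndex:
    """Carries both views of one position: the codepoint index (what the
    user sees) and the byte offset into the UTF-8 buffer (what the
    implementation wants for O(1) access)."""
    __slots__ = ("char_pos", "byte_pos")

    def __init__(self, char_pos, byte_pos):
        self.char_pos = char_pos
        self.byte_pos = byte_pos

    def __index__(self):
        # Used automatically when this object appears as a subscript.
        return self.char_pos

def find_with_byte_pos(s, sub):
    """Sketch of a find() that computes both indices up front."""
    char_pos = s.find(sub)
    if char_pos == -1:
        return None
    byte_pos = len(s[:char_pos].encode("utf-8"))
    return StrIndex(char_pos, byte_pos)

idx = find_with_byte_pos("h\u00e9llo w\u00f6rld", "w\u00f6rld")
assert idx.char_pos == 6
assert idx.byte_pos == 7                       # the accented char takes 2 bytes
assert "h\u00e9llo w\u00f6rld"[idx:] == "w\u00f6rld"  # __index__ at work
```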
Sorry if I am totally wrong, but could you optimize UTF-8 by using some kind of marker byte to say "this is a multi-byte UTF-8 character"?
For example:
AÉBéC would give
0x41FF42FF43
where FF would mean "refer to another table with the index of the UTF-8 char", like the following:
|   0    |   1    |
|--------|--------|
| 0xC389 | 0xC3A9 |
This would make random access fast, but it would also add overhead elsewhere.
Also IIRC, UTF8 doesn't use values 0x80 <= unused_utf8_values <= 0xFF. Values between 0x80 and 0xFF could be used to refer to an index of the table, e.g. 0x80 = index 0, 0x81 = index 1, ..., 0xFE = index 126, and 0xFF = refer to the other table.
Regardless of the values over 0x80, an index into the special table would still have to be maintained when iterating over the string for find and such.
You can recognize a "UTF-8 character" in a UTF-8 string easily. You can get the next character, and you can back up to a previous character, all unambiguously. The encoding is kind of neat. See Wikipedia.
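Concretely, a byte starts a character exactly when it is not a continuation byte (bit pattern 0b10xxxxxx), which is why backing up is unambiguous:

```python
def is_char_start(buf, i):
    """A byte starts a codepoint iff it isn't a continuation byte
    (continuation bytes all have the top two bits set to 10)."""
    return (buf[i] & 0xC0) != 0x80

s = "a\u00c9\U0001F600".encode("utf-8")  # 1-byte, 2-byte, 4-byte sequences
starts = [i for i in range(len(s)) if is_char_start(s, i)]
assert starts == [0, 1, 3]   # the three codepoints begin at these offsets
```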
The "0" and "1" on top of your table make me curious how I'd translate a 0xFF octet into a position in the table. As it's written, it seems like I need to know whether the 0xFF I'm looking at is the first, second, etc., which would require scanning the string to figure that out, which would defeat the lookup table entirely. Perhaps we could fix this by storing the index of the 0xFF, mapped to the codepoint:
index => codepoint
1 => 0xc389
3 => 0xc3a9
This would require a binary search; we could change that to a hash table, to make it O(1) w/o too much trouble. However…
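The binary-search variant is straightforward with Python's bisect (the table contents here are hypothetical, matching the AÉBéC example above):

```python
import bisect

# Byte offsets of the 0xFF placeholders in the compressed string,
# paired (in the same order) with the codepoints they stand for.
offsets = [1, 3]
codepoints = ["\u00c9", "\u00e9"]   # É and é

def lookup(pos):
    """Binary-search a placeholder's byte position to get its codepoint."""
    i = bisect.bisect_left(offsets, pos)
    if i == len(offsets) or offsets[i] != pos:
        raise KeyError(f"no placeholder at byte offset {pos}")
    return codepoints[i]

assert lookup(1) == "\u00c9"
assert lookup(3) == "\u00e9"
```

Swapping the sorted list for a dict keyed on offset gives the O(1) hash-table version.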
This degrades pretty badly for strings that don't contain any/much ASCII (Greek, Russian, any Asian language, Indian, …); each of these table indexes is also presumably sizeof(size_t) (for the index) + 4 (for the codepoint), so 12B, probably 16 bytes/slot (after padding). That's 16B per codepoint. Perhaps some more could be squeezed out w/ some sort of variable length deal, but we're already in over our heads, I think.
> UTF8 doesn't use values 0x80 <= unused_utf8_values <= 0xFF.
UTF-8 uses all octet values up to and including 0xF4, except 0xC0 and 0xC1 (see the table[1]), so there are a few unused values, but not many.
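This is easy to check from Python: the overlong lead bytes 0xC0/0xC1 and everything from 0xF5 up are rejected by the decoder, while 0xF4 can still start a valid sequence.

```python
# 0xC0 and 0xC1 would only produce "overlong" encodings, and anything
# at or above 0xF5 would encode past U+10FFFF, so all are invalid.
for bad in (0xC0, 0xC1, 0xF5, 0xFF):
    try:
        bytes([bad, 0x80]).decode("utf-8")
    except UnicodeDecodeError:
        pass
    else:
        raise AssertionError(f"{bad:#x} unexpectedly decoded")

# 0xF4 is the highest legal lead byte: F4 8F BF BF encodes U+10FFFF.
assert bytes([0xF4, 0x8F, 0xBF, 0xBF]).decode("utf-8") == "\U0010FFFF"
```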
Curious, anyone know why 64-bit Windows isn't supported? Is it because a long is only 32 bits?
I'd potentially be willing to help with the development to get that working. My current day job is at a 64-bit Windows shop in finance, and we have several teams expanding their use of Python with very large datasets. A 64-bit PyPy on Windows could be big for us.
Yeah, about that. Speed.pypy.org is having problems updating; the seven-year-old fork of codespeed needs refreshing. The benchmarks there are Python 2, so we would need to rework them for Python 3.
In the meantime, the python perf and performance packages are driving speed.python.org, and I started trying to get pypy3 runs reported there, but we need to provide a warmup parameter for each benchmark so that too will take some time.
The best benchmark, of course, is your use case: try it out and let us know where we are slow.
Not really. The UnicodeObject has utf8 bytes, a length in codepoints, and an optional index map that is only calculated if needed to traverse the string. Most strings won’t need the map. String matching is utf8 aware, and indexes by codepoints.
The main speedup is in converting unicode to internal representation once, then passing around the object with no further conversions.
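A toy model of the representation described above (a sketch, not PyPy's actual class): UTF-8 bytes, a codepoint length, and an index built lazily on first random access.

```python
class UnicodeObject:
    """Toy model: UTF-8 buffer + codepoint count + lazy byte-offset index."""

    def __init__(self, text):
        self._utf8 = text.encode("utf-8")
        self._length = len(text)   # length in codepoints, known up front
        self._index = None         # built only if random access happens

    def __len__(self):
        return self._length

    def _build_index(self):
        # One full scan: record the byte offset where each codepoint starts.
        offs, i = [], 0
        while i < len(self._utf8):
            offs.append(i)
            b = self._utf8[i]
            i += 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
        self._index = offs

    def __getitem__(self, k):
        if self._index is None:
            self._build_index()    # first random access pays the scan cost
        start = self._index[k]
        end = self._index[k + 1] if k + 1 < self._length else len(self._utf8)
        return self._utf8[start:end].decode("utf-8")

u = UnicodeObject("a\u00c9\U0001F600b")
assert len(u) == 4
assert u._index is None            # sequential-only code never builds this
assert u[2] == "\U0001F600"        # index map is built here, on demand
assert u._index == [0, 1, 3, 7]
```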
Why is PyPy not just skipping the current Python 3.6 effort and going directly to 3.7, or at least doing that once in a while (skipping in-between versions)?
No. RPython still supports unicode strings. It now also supports UTF-8 string classes for things like regular expressions and FFI calls. PyPy just uses the non-unicode RPython APIs.
UTF-8 is one way to encode Unicode code points. There's also UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE, and others like UTF-7. This change makes PyPy use UTF-8 internally for the memory and speed savings, at the cost of some time later on when you use certain features that require the string index to be built.
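The size trade-off is easy to see by encoding the same mostly-ASCII string several ways:

```python
s = "h\u00e9llo w\u00f6rld"   # 11 codepoints, two of them non-ASCII

for enc in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le"):
    print(enc, len(s.encode(enc)))

# For mostly-ASCII text, UTF-8 is the most compact Unicode encoding:
# 13 bytes here, vs 22 for UTF-16 and 44 for UTF-32.
assert len(s.encode("utf-8")) < len(s.encode("utf-16-le"))
assert len(s.encode("utf-16-le")) < len(s.encode("utf-32-le"))
```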
In RPython (a restricted Python 2 language), "unicode" refers to a specific Python object. "UTF-8 classes" refers to RPython classes that hold bytes, not unicode, but are aware of the encoding of those bytes.