String Deduplication – A new feature in Java 8 Update 20 (codecentric.de)
101 points by thescrewdriver on Sept 1, 2014 | 39 comments



Related: I'm still frustrated by Java randomly switching to copy-on-substring (in a minor release, no less!). If I, a random nobody, had a program whose running time increased by a factor of ~50 (a simple recursive-descent-ish parser for coercing tabulated data from one format to another - it just called substring repeatedly to trim off the first token), how many dev hours were required overall to fix the results of the change? And there's no simple alternative or way to preserve the old behavior either - the simplest one, rolling your own String class or wrapper, ends up being relatively slow and annoying.
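
For illustration, the wrapper I mean looks roughly like this minimal sketch (hypothetical; a real one also needs equals/hashCode and interop with APIs that insist on String, which is where it gets slow and annoying):

    final class Slice implements CharSequence {
        private final String base;
        private final int offset, length;

        Slice(String base, int offset, int length) {
            this.base = base;
            this.offset = offset;
            this.length = length;
        }

        public int length() { return length; }
        public char charAt(int i) { return base.charAt(offset + i); }
        public CharSequence subSequence(int start, int end) {
            return new Slice(base, offset + start, end - start); // O(1), no copy
        }
        public String toString() { return base.substring(offset, offset + length); }
    }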

And "all" of this would be solved by having a proper way of doing array slicing - for things like substring's previous worst case (something like a single character being referenced in a substring holding up a gigabyte+-sized string) Java's garbage collector could recognize that the array was only referenced through slices of part of the array, copy that section into another array (updating references to it), and free the large one.

Also, Java's lack of a way to specify that a class is immutable (and that all children classes thereof must also be) is frustrating. Because optimizations like this can and should apply to more than just strings!


Agreed. I am not sure how big a change this is, but it seems to me that after Oracle took over Java they have been getting into the bad habit of putting fairly big changes in an "update", only ticking up the number after the silly underscore (e.g. update 20 is version 1.8.0_20-b26). For example, the changes in applet security policies killed thousands of perfectly legit applications that had been running for years - all in an automatically rolled-out update.


I've largely started avoiding Java applets altogether. Why? Because it's become really really annoying to try to get them to run at all.


I believe this was done as part of the unification of JavaSE and JavaME VMs. AFAIK saving three instance variables per String instance is a big deal on JavaME.

ByteBuffer and CharBuffer are a way of doing array slicing. CharBuffer even implements Appendable and CharSequence, so it's halfway polymorphic with String.
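
For example (a quick sketch):

    import java.nio.CharBuffer;

    class SliceDemo {
        public static void main(String[] args) {
            CharBuffer buf = CharBuffer.wrap("some string");
            buf.position(5);
            CharBuffer suffix = buf.slice();     // zero-copy view of "string"
            System.out.println(suffix);          // prints: string
            System.out.println(suffix.length()); // CharSequence method: 6
        }
    }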

I sometimes wish there were multiple String classes tailored for specific purposes, like there are Collection classes. Especially a String class for ASCII strings that uses only eight bits per character.


My biggest gripe with String is that it's not an immutable Iterable<Character>. They could have at least implemented the interface.


You must be pretty happy with String then, if that's your biggest gripe :-)

Do you really mean Iterable<Character> and not Iterable<Integer>? Because a Unicode code point can span two chars.


Personally, I wish there was a GraphemeCluster class, with a way to coerce a String into an Iterable<GraphemeCluster>, Collection<GraphemeCluster>, etc. Because most of the time people don't want to iterate over code points - they want to iterate over user-perceived characters instead. It makes more sense that by default "école".reverse() (that's the two-code-point version of "é") would be "elocé" rather than "eloće".
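
The closest thing in the JDK is java.text.BreakIterator, which at least keeps base characters and their combining marks together (a sketch, not a full grapheme-cluster implementation):

    import java.text.BreakIterator;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    class GraphemeReverse {
        public static void main(String[] args) {
            String s = "e\u0301cole"; // "école" with a combining acute accent
            BreakIterator it = BreakIterator.getCharacterInstance();
            it.setText(s);
            List<String> clusters = new ArrayList<>();
            for (int start = it.first(), end = it.next();
                 end != BreakIterator.DONE;
                 start = end, end = it.next()) {
                clusters.add(s.substring(start, end)); // one user-perceived character
            }
            Collections.reverse(clusters);
            System.out.println(String.join("", clusters)); // elocé, accent stays put
        }
    }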

But I've said a lot on the subject of strings already:

https://news.ycombinator.com/item?id=8198811

https://news.ycombinator.com/item?id=8066158

https://news.ycombinator.com/item?id=7606084

https://news.ycombinator.com/item?id=7574665

https://news.ycombinator.com/item?id=7075858


Those changes are specific to Oracle's implementation; there are lots of others to choose from.

Most of those changes don't have any relation whatsoever with the Java Language Specification or Java Virtual Machine Specification.


I find it... odd that, considering how overspecified Java is, maximum big-O bounds are not specified for things like this.

You do have a point, although it does not refute mine. However, Oracle's implementation is the most widespread - the chances are high that at some point most non-specialized Java code will be run on Oracle's implementation.


In theory yes. In practice 99.99% of mainstream Java developers will run their code on Oracle's JVM.


I doubt that only 0.01% accounts for:

- Real time deployment scenarios

- J2EE containers, which tend to work best with the respective vendor JVM

- factory control JVMs

- car infotainment JVMs

- smart card JVMs

- Compilation to native code

- Android

- Embedded deployments like MicroEJ

- Commercial UNIX systems

- Mainframes

- ...


Here's a counterargument for this, just to be a fuddy-duddy:

One of the key activities in programming is reasoning about time and space cost, and this is a space optimization that's nondeterministic. It kicks in sometimes, or sometimes not at all, and happens behind the scenes at garbage collection time, when it's nearly invisible. If you're sloppy, your program may have a huge asymptotic space usage, and this may paper over it. But the impl relies on heuristics, so it may not work all the time -- even their example program needed Thread.sleep() calls! Unpredictable semantics help nobody. So I always liked explicit string interning (whatup, Lisp!).

All the same, faster is better and I'm sure this makes things faster.

Oh, and can we talk about how broken it was that Java 6 and under had a fixed-size pool for .intern()'d strings?


Your applications already sit upon a mountain of non-determinism.

A typical web app runs against a database with a genetic query optimizer, on top of a VM with a concurrent generational GC, sharing virtual memory with a dozen other processes, arranged in a massive pyramid of caches, which are competing for CPU time from a multi-core monstrosity of a data-flow engine.

The sooner we as programmers embrace stochastic methods, the better.


It's controlled with a switch. Don't turn it on.
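
For reference, per JEP 192 it's opt-in and requires the G1 collector, so something like:

    java -XX:+UseG1GC -XX:+UseStringDeduplication MyApp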


This sounds extremely valuable, if only because it requires far less tuning than interning strings.

Not so long ago I had to do maintenance on a pretty large Swing application, still stuck on Java 6, that was built around displaying huge amounts of data, holding it all in memory after retrieving it from some rather slow web services. The poor thing easily ended up using a couple of gigs, mostly due to the many Strings it held, many of them pretty repetitive. While I tried to reduce the memory footprint, I couldn't just intern everything coming from the service: in Java 6, interned strings come from PermGen.

Using a hashmap to deduplicate everything would have been better than nothing, but the dataset had a whole lot of strings that didn't repeat themselves, so the hashmap would have been far bigger than we needed for the application.

What I ended up having to do was figure out where the data was most repetitive, and then intern only those strings.

I cut memory use by over 30%, but it took days of profiling, evaluating the data, and making simple code changes to get there, as opposed to just a runtime flag. Now, I wonder how much better, or worse, it performs than the hashmap solution in cases like the one I faced: a few hundred strings repeated tens of thousands of times, and hundreds of thousands of strings with almost no repetition.
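
For the curious, the hashmap approach I'm comparing against is roughly this (simplified sketch; note that the pool strongly holds every unique string it has ever seen, which is exactly the overhead mentioned above):

    import java.util.HashMap;
    import java.util.Map;

    class Deduper {
        private final Map<String, String> pool = new HashMap<>();

        String dedup(String s) {
            String canonical = pool.get(s);
            if (canonical == null) {
                pool.put(s, s); // first occurrence becomes the canonical copy
                return s;
            }
            return canonical;
        }
    }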


Wow, that sounds very familiar, including the Swing app part! Except in my case the data came from a database and the code used StringBuffers, leading to massive amounts of duplication. I had to build a COW wrapper and profile the app to find that 95% of the data was only ever read in normal use cases!

Thankfully Java tooling is fairly mature - things like MAT and OQL were of great use in finding the memory hogs and leaks.


For those interested in the same feature in plain old C, have a look at the DSO Howto, section 2.4.2 "Forever const", and the ELF section flags SHF_MERGE and SHF_STRINGS.

With a little bit of magic, C compilers and linkers are even allowed to turn (simplified example)

    const char s1[] = "some string";
    const char s2[] = "string";
into

    const char s1[] = "some string";
    const char *s2 = s1 + 5;   /* shares the tail of s1 */
and place these constants in a read-only section shared between multiple loaded instances of the same library or program.


I think it's standard Java behaviour to .intern() string constants appearing in code (i.e. since they are immutable, just share them). The new thing is that the JVM is going to do this automatically for strings that _don't_ exist at compile time, IIUC.
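
E.g. (sketch):

    class InternDemo {
        public static void main(String[] args) {
            String a = "hello";      // literal, interned when the class loads
            String b = "hel" + "lo"; // constant-folded at compile time
            String c = new StringBuilder("hel").append("lo").toString(); // runtime

            System.out.println(a == b);          // true: same interned instance
            System.out.println(a == c);          // false: distinct object
            System.out.println(a == c.intern()); // true: intern() finds the literal
        }
    }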


From http://openjdk.java.net/jeps/192

> Taking the above into account, the actual expected benefit ends up at around 10% heap reduction. Note that this number is a calculated average based on a wide range of applications. The heap reduction for a specific application could vary significantly both up and down.


I don't think an average is useful in any way here. It will reduce heap use by some amount, since you reduce the number of char arrays backing N strings to some number less than or equal to N. There won't be any new arrays allocated.

But in my team alone, we have services which deal with many equal but transient strings, some with long-living, mostly equal strings, and some with long-living, radically different strings. In some of them I expect rather massive reductions; in others, interning has probably done most of the work already; and in others there's nothing to gain, or the GC handles the issue already.


The garbage collector needs to store its internal arrays.


Boost.Flyweight is pretty useful for doing this in C++:

http://www.boost.org/doc/libs/1_56_0/libs/flyweight/doc/inde...


We once maintained a hashmap whose key and value were the same String instance, to avoid duplication in a search application. Wouldn't that be more beneficial than leaving it to the GC, if the application uses more strings?

Edit: changed avoid deduplication to avoid duplication


> We once maintained a hashmap whose key and value were the same String instance

No love for hashset?


Why did you want to avoid deduplication? You can't even tell it's happened, as it only works on the char[], which is internal to the string. Did you find it didn't work as expected?


It was in a typeahead search application built on 20GB of names. These names share common first and last names, which were stored as distinct strings. With deduplication, string memory was reduced to 20%.

Will benchmark that application with -XX:+UseStringDeduplication.


So what was the downside of deduplication? Why did you want to avoid it?


Sorry, "deduplication" was a typo. Corrected.


Ah right. The reason they're doing it in the GC rather than in the mutator threads is that it only has an impact on strings long-lived enough to be evacuated. Short-lived strings don't get deduplicated, and probably don't need to be. Without the GC, I don't know how you'd automatically determine that it was a good idea to deduplicate.


As a curiosity, a global string cache is an old feature of R -- however, strings are matched upon creation rather than detected by the GC.


The problem with that is every time you create a string you have to do the work to look it up in the cache. The benefit of the JVM's approach here is that it only bothers to deduplicate it if it is long-lived enough to be evacuated.


Sure; in R strings are copied way more frequently than created, so it pays off.


It's also possible to do this in Python with the intern() builtin.


Does this mean "" == ""?


No, but that should be true anyway as they are string literals and automatically interned.


No, string deduplication takes place on the String's internal char array. Each String will still have its own object.


Not according to the article:

> In fact the String Deduplication is almost like interning with the exception that interning reuses the whole String instance, not just the char array.


I think that might mean the opposite of what you think it does.

Interning reuses whole String instances. Deduplication is like interning with the exception that it does not reuse the whole String instance, it just reuses the char array. Therefore surely deduplication does not reuse String instances.
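
Concretely (a sketch; whether deduplication actually runs is up to the GC):

    class DedupDemo {
        public static void main(String[] args) {
            // Two distinct String objects with equal contents.
            String a = new StringBuilder("ab").append("cd").toString();
            String b = new StringBuilder("ab").append("cd").toString();

            System.out.println(a.equals(b)); // always true
            System.out.println(a == b);      // false before and after deduplication:
                                             // the GC may make a and b share one
                                             // char[], but they stay two objects
        }
    }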


This is only specific to the Oracle JVM.



