
Doesn't interning usually refer to when you only consider identical copies, as opposed to dictionary compression where you allow for concatenations? E.g.

  Interning:
  1: "foo"
  2: "bar"

  my_string = "foo" // stored as ref->1
  my_other_string = "foobarbaz" // not found & too long to get interned, stored as "foobarbaz"

  Dictionary compression:
  1: "foo"
  2: "bar"
  
  my_string = "foo" // stored as ref->1
  my_other_string = "foobarbaz" // stored as ref->1,ref->2,"baz" (or ref->1,ref->2,ref->3 and "baz" is added to the dict)




  my_string = "foo" // stored as ref->1
Only if you explicitly intern the string. Interning can be expensive because

- the runtime has to check whether the string already is interned.

- you’re keeping the string around permanently, so if it isn’t reused later, your memory usage needlessly goes up.

Both get more expensive the more strings you intern.

I think interning is a hack that should very rarely, if ever, be used. It likely won’t work well for large strings, as these tend to be unique, and use cases where it helps for short strings are often better handled by using enums or symbols, or by using a custom set of strings. If you do the latter, you have more control over memory usage; you can do things such as removing the least recently used strings, or ditching the entire cache when you’re done needing it (prime example: parsing a large XML file with many repeated nodes).
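
As a rough sketch of that last idea (the names and structure here are my own, not from any library): a small pool you create for one parse and throw away afterwards.

  class LocalStringPool:
      # Scoped de-duplication: share repeated strings during one parse,
      # then drop the whole pool (and its memory) when you're done.
      def __init__(self):
          self._pool = {}

      def get(self, s):
          # Return the pooled copy if we've seen this string before,
          # otherwise store and return this one.
          return self._pool.setdefault(s, s)

  pool = LocalStringPool()
  tag1 = pool.get("item")   # first occurrence gets stored
  tag2 = pool.get("item")   # later occurrences reuse the same object
  assert tag1 is tag2
  del pool                  # finished parsing: the whole cache goes away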


> you’re keeping the string around permanently, so if it isn’t reused later, your memory usage needlessly goes up.

Nowadays lots of runtimes will GC interned strings to avoid this; it can mean churn in the interning table, but it avoids mis-estimations bloating the process too much.
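
One way a runtime can do this is to hold the canonical copies only weakly. A rough Python sketch of the idea (a plain str can't be weakly referenced in CPython, hence the subclass; the names here are made up):

  import weakref

  class _InternedStr(str):
      pass  # subclass so instances support weak references

  _table = weakref.WeakValueDictionary()

  def intern_gc(s):
      # Return the canonical copy, creating it if needed. Once nobody
      # holds a strong reference to it, the GC drops it and the table
      # entry disappears with it.
      canonical = _table.get(s)
      if canonical is None:
          canonical = _InternedStr(s)
          _table[s] = canonical
      return canonical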


Makes me think of an iolist in Erlang/Elixir.
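
Loosely the same idea in Python terms (an analogy only, not Erlang's actual iolist type): keep the pieces in a nested structure and only join them when the data is finally written out.

  def flatten_iolist(iolist, out):
      # Walk a nested list of strings, collecting the pieces without
      # building any intermediate concatenations.
      for item in iolist:
          if isinstance(item, str):
              out.append(item)
          else:
              flatten_iolist(item, out)

  chunks = []
  flatten_iolist(["foo", ["bar", ["baz"]]], chunks)
  print("".join(chunks))    # the full string is only materialized here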


