I have a hard time getting past it being mostly unusable on 32 bit Linux [1]. Th...

enneff · on July 4, 2012

There's a patch to make the current conservative GC a precise GC, which will totally fix this problem. It's in the process of being reviewed and integrated into the core.

http://codereview.appspot.com/6114046/

FWIW, the panelists on the recent "Go in production" panel at I/O responded that garbage collection wasn't an issue for them on 64-bit systems. http://www.youtube.com/watch?v=kKQLhGZVN4A

pcwalton · on July 4, 2012

That's awesome, kudos. How do you deal with maintaining the distinction between pointers and values on the stack and in registers? I didn't see any code to do that, skimming through that patch. Since it's a precise GC, you need to maintain the distinction.

In Rust we have a very large patchset to LLVM to allow us to do this; it involves modifying nearly every stage of the LLVM pipeline. Curious how you did that in gccgo.

Edit: I just looked at the source and it still seems conservative for the stack and registers. So it's not a precise GC yet and will not eliminate false positives. (This is of course totally fine, I just wanted to clarify for myself.)

enneff · on July 4, 2012

This isn't my patch, so I can't speak to the specifics of it. Also, it isn't for gccgo, but the gc compiler.

atombender · on July 4, 2012

That's too bad. Hopefully someone will improve `gccgo` as well. I actually started out experimenting with Go using the `gc` compiler, and found performance (basically manipulating large arrays sequentially) atrocious. Moving to `gccgo` improved performance four-fold, making my unoptimized Go test code in some cases faster than my unoptimized C test code.

luriel · on July 5, 2012

How long ago was that? The gc compilers have seen lots of improvement recently, and more will be merged soon.

Also, if the problem is so pathological, it might be worth filling an issue.

atombender · on July 5, 2012

Go 1.0.2, which is the newest stable release. I will see if I have some extra time to file a report.

Slightly off topic (but relevant to my comment above), do you know why this:

    func TraverseInline() {
      var ints := make([]int, N)
      for i := range ints {
        sum += i
      }
    }

should be slower than:

    func TraverseArray(ints []int) int {
      sum := 0
      for i := range ints {
        sum += i
      }
      return sum
    }

    func TraverseWithFunc() {
      var ints := make([]int, N)
      sum := TraverseArray(ints)
    }

Seeing a huge difference here. With an array of 300000000 ints, TraverseInline() takes about 750ms, whereas TraverseWithFunc() takes 394ms. With gccgo (GCC 4.7.1), the different is much slighter, but there's still a difference: 196ms compared to 172ms. (These are best-case figures, I'm doing lots of loops.) Go is compiled, not JIT, so I don't see why a function should offer better performance.

genbattle · on July 5, 2012

I don't have any immediate insight for the difference in performance you're seeing (my hunch is that it's to do with where and how memory is being allocated), but I would suggest you ask about it on the Go-Nuts mailing list: https://groups.google.com/forum/?fromgroups#!forum/golang-nu...

The Go authors should be very receptive and informative on such an issue.

dchest · on July 5, 2012

Unrelated to your question, but your code doesn't traverse array. All it does is:

      for i := 0; i < len(ints); i++ {
        sum += i
      }

atombender · on July 5, 2012

Sorry, that was a typo. The actual test program I use is correct. It was obviously supposed to be:

    sum += ints[i]

Djehngo · on July 9, 2012

Not relevant to the performance test but you can do this

for index, value := range mySlice { }

and you get the index and the value, or you can do

for _, value := range mySlice { }

If you only need the value.

agentS · on July 5, 2012

Not sure why the difference in performance between function and non-function, particularly considering this looks inline-able.

2 theories on why gcc is doing better. Maybe its unrolling the loop, and avoids a bunch of jumps. Alternatively, it could be using SSE, but this theory is a much less likely, because I would expect it to be 4 times faster not twice as fast. gc doesn't use SSE instructions yet.

dmit · on July 5, 2012

I'm not seeing any difference between the two functions using the 1.0.2 x64 gc compiler. Here's the code I used (slightly modified to compile): http://play.golang.org/p/gcf0BOGncQ

For more complex functions, you could try profiling using the runtime/pprof package. More info here: http://blog.golang.org/2011/06/profiling-go-programs.html

If you want to dig deeper and are not averse to assembly, you can pass -gcflags -S to 'go build' and see the compiler output. Here's what I got for the above code:

  --- prog list "TraverseInline" ---
  0000 (/tmp/tmp.rkM3kSJ6Hd/trav.go:10) TEXT    TraverseInline+0(SB),$40-8
  0001 (/tmp/tmp.rkM3kSJ6Hd/trav.go:11) MOVQ    $type.[]int+0(SB),(SP)
  0002 (/tmp/tmp.rkM3kSJ6Hd/trav.go:11) MOVQ    $300000000,8(SP)
  0003 (/tmp/tmp.rkM3kSJ6Hd/trav.go:11) MOVQ    $300000000,16(SP)
  0004 (/tmp/tmp.rkM3kSJ6Hd/trav.go:11) CALL    ,runtime.makeslice+0(SB)
  0005 (/tmp/tmp.rkM3kSJ6Hd/trav.go:11) MOVQ    24(SP),BX
  0006 (/tmp/tmp.rkM3kSJ6Hd/trav.go:11) MOVL    32(SP),SI
  0007 (/tmp/tmp.rkM3kSJ6Hd/trav.go:11) MOVL    36(SP),BX
  0008 (/tmp/tmp.rkM3kSJ6Hd/trav.go:12) MOVL    $0,DX
  0009 (/tmp/tmp.rkM3kSJ6Hd/trav.go:13) MOVL    $0,AX
  0010 (/tmp/tmp.rkM3kSJ6Hd/trav.go:13) JMP     ,12
  0011 (/tmp/tmp.rkM3kSJ6Hd/trav.go:13) INCL    ,AX
  0012 (/tmp/tmp.rkM3kSJ6Hd/trav.go:13) CMPL    AX,SI
  0013 (/tmp/tmp.rkM3kSJ6Hd/trav.go:13) JGE     ,16
  0014 (/tmp/tmp.rkM3kSJ6Hd/trav.go:14) ADDL    AX,DX
  0015 (/tmp/tmp.rkM3kSJ6Hd/trav.go:13) JMP     ,11
  0016 (/tmp/tmp.rkM3kSJ6Hd/trav.go:16) MOVL    DX,.noname+0(FP)
  0017 (/tmp/tmp.rkM3kSJ6Hd/trav.go:16) RET     ,

  --- prog list "TraverseArray" ---
  0018 (/tmp/tmp.rkM3kSJ6Hd/trav.go:19) TEXT    TraverseArray+0(SB),$0-24
  0019 (/tmp/tmp.rkM3kSJ6Hd/trav.go:20) MOVL    $0,DX
  0020 (/tmp/tmp.rkM3kSJ6Hd/trav.go:21) MOVL    $0,AX
  0021 (/tmp/tmp.rkM3kSJ6Hd/trav.go:21) MOVL    ints+8(FP),SI
  0022 (/tmp/tmp.rkM3kSJ6Hd/trav.go:21) JMP     ,24
  0023 (/tmp/tmp.rkM3kSJ6Hd/trav.go:21) INCL    ,AX
  0024 (/tmp/tmp.rkM3kSJ6Hd/trav.go:21) CMPL    AX,SI
  0025 (/tmp/tmp.rkM3kSJ6Hd/trav.go:21) JGE     ,28
  0026 (/tmp/tmp.rkM3kSJ6Hd/trav.go:22) ADDL    AX,DX
  0027 (/tmp/tmp.rkM3kSJ6Hd/trav.go:21) JMP     ,23
  0028 (/tmp/tmp.rkM3kSJ6Hd/trav.go:24) MOVL    DX,.noname+16(FP)
  0029 (/tmp/tmp.rkM3kSJ6Hd/trav.go:24) RET     ,

  --- prog list "TraverseWithFunc" ---
  0030 (/tmp/tmp.rkM3kSJ6Hd/trav.go:27) TEXT    TraverseWithFunc+0(SB),$40-8
  0031 (/tmp/tmp.rkM3kSJ6Hd/trav.go:28) MOVQ    $type.[]int+0(SB),(SP)
  0032 (/tmp/tmp.rkM3kSJ6Hd/trav.go:28) MOVQ    $300000000,8(SP)
  0033 (/tmp/tmp.rkM3kSJ6Hd/trav.go:28) MOVQ    $300000000,16(SP)
  0034 (/tmp/tmp.rkM3kSJ6Hd/trav.go:28) CALL    ,runtime.makeslice+0(SB)
  0035 (/tmp/tmp.rkM3kSJ6Hd/trav.go:28) MOVQ    24(SP),DX
  0036 (/tmp/tmp.rkM3kSJ6Hd/trav.go:28) MOVL    32(SP),CX
  0037 (/tmp/tmp.rkM3kSJ6Hd/trav.go:28) MOVL    36(SP),AX
  0038 (/tmp/tmp.rkM3kSJ6Hd/trav.go:29) LEAQ    (SP),BX
  0039 (/tmp/tmp.rkM3kSJ6Hd/trav.go:29) MOVQ    DX,(BX)
  0040 (/tmp/tmp.rkM3kSJ6Hd/trav.go:29) MOVL    CX,8(BX)
  0041 (/tmp/tmp.rkM3kSJ6Hd/trav.go:29) MOVL    AX,12(BX)
  0042 (/tmp/tmp.rkM3kSJ6Hd/trav.go:29) CALL    ,TraverseArray+0(SB)
  0043 (/tmp/tmp.rkM3kSJ6Hd/trav.go:29) MOVL    16(SP),AX
  0044 (/tmp/tmp.rkM3kSJ6Hd/trav.go:30) MOVL    AX,.noname+0(FP)
  0045 (/tmp/tmp.rkM3kSJ6Hd/trav.go:30) RET     ,

atombender · on July 5, 2012

Here is my test, with test output for gc and gccgo:

https://gist.github.com/3053930

The numbers are from Go 1.0 as that's what is available on Ubuntu Precise, my test box. (Getting similar numbers with 1.0.2 on a different box where I don't have gccgo.)

I realized that the first loop was using "i < len(ints)" as a loop test, which turns out to be more expensive than I thought, and the compiler doesn't optimize it into a constant (which would be expecting too much, I guess). After rewriting the test, the function call case is only slightly faster, although it is still significantly faster with gccgo.

sausagefeet · on July 5, 2012

Is this patch going to be part of LLVM? Is one out of luck if they want to implement a precise collector in LLVM today?

pcwalton · on July 5, 2012

I've talked with Chris Lattner about upstreaming it and the LLVM developers seem open to it when it's ready. In the meantime you can find the most recent work here:

https://github.com/elliottslaughter/llvm/tree/noteroots-ir

Currently, you are mostly out of luck if you try to use mainline, unless you're okay with the llvm.gcroot intrinsics, which pin things to the stack so they can never live in registers (causing a corresponding performance loss). Probably your best bet is to do what Go will do once the patch enneff was referring to is merged: scan conservatively on the stack and precisely on the heap. You can still get false positives and leaks, and you can't do moving GC that way, but it's better than fully conservative GC.

By far, the hardest part of GC is precisely marking the stack and registers. Yet you have to be able to do it if you want to prevent leaks. Our goal is to put in the hard work so that new languages will be able to have proper GC without having to roll a custom code generator or target the JVM or CLR.

sausagefeet · on July 5, 2012

> Our goal is to put in the hard work so that new languages will be able to have proper GC without having to roll a custom code generator or target the JVM or CLR.

As someone who has ambitions of writing a fun little language this is exactly what I want out of LLVM. Can't wait for your patches to hopefully get to mainline.

thebluesky · on July 5, 2012

32bit memleak is by far the main reason I stopped looking at go. Many people still have plenty of 32 bit hosts.

When will this be in the official release??

enneff · on July 5, 2012

It'll be in Go 1.1, around the end of the year.

luriel · on July 4, 2012

In practice very few people have issues on 32bit systems, and for those that do there are workarounds.

Also, Go works much better on 64bit systems for many other reasons, if nothing else because that is what most of the core team uses. 64bit certainly does not double your memory footprint, and it increases performance dramatically (in part because the extra registers, etc., and in part because the Go 64bit compilers are much better).

And as Andrew mentioned, there is a patch already to fully solve things for the few people stuck on 32bits that have any issues.

Finally, all this is an implementation detail, and has very little to do with the qualities of the language. which is what the article is really about.

enneff · on July 4, 2012

> In practice very few people have issues on 32bit systems,

I'm not sure what the proportion of people who experience issues is (whether it's "some", "most", or "very few"), but in my anecdotal experience running several long-lived Go processes on a 32-bit VPS it hasn't been a problem.

I think it just comes down to the allocation patterns in your program. Some programs trigger the pathological behavior and others don't.

Regardless of the numbers of affected programs, I'll be glad when the issue is behind us for good.

davidw · on July 5, 2012

> 64bit certainly does not double your memory footprint

Not quite, but it does use a lot of extra memory:

http://journal.dedasys.com/2008/11/24/slicehost-vs-linode

I'm sure Ruby's memory usage characteristics are different than Go's, but still, it's worth paying attention to.

reitzensteinm · on July 4, 2012

To be fair, it's kicking the can 60 years down the street, to when we're fully populating a 64 bit address space with 16 exabytes of ram. Considering the change will be backwards compatible, 64 bit is for all intents and purposes a permanent solution.

You do have a point about 32 bit, though (I use 32 bit Redis for instance, for the same reason).

naner · on July 5, 2012

Yep, they fucked up a bit. It happens. Fix coming in Go 1.1.

jesstaa · on July 5, 2012

It's not really a fuck up, it's a "we didn't have time to do that yet because instead we were focusing on make an awesome language and standard library".

Lots of issues with the runtime had to wait because they can wait because it doesn't matter if you wait 5 years to fix the GC because you won't break any code when you do it.