Notes on concurrency bugs

bluejekyll · on Aug 6, 2016

> all of the programs studied were written in C or C++, and that this study predates C++11. Moving to C++11 and using atomics and scoped locks would probably change the numbers substantially

I keep seeing this claim, "C++ is better now!" Does anyone have any experience that really defends this claim?

One of the reasons that I've become so enthralled with Rust is that it's adopted a memory access system which aligns with everything that I've learned over my career in distributed systems:

- all data const/final by default
- no nulls, fully initialized structs
- semantics that require the developer to adopt good practices in accessing memory across threads

I can go on, but it's already obvious to me that Rust is a huge leap forward in terms of threading and the guarantees it makes.

I left C++ a long time ago, and it happened to coincide with an atomic increment bug/memory leak that took me 2 weeks to track down in the STL String library. This was back in 2000, early days of multiprocessor x86. This exact thing is solvable with the new C++ atomic support, but does it guarantee that you'll use it across threads like Rust?
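For reference, a minimal sketch of what a shared counter looks like with C++11 atomics. `RefCount` and `hammer` are hypothetical names made up for illustration, not anything from an actual STL string implementation. Nothing in the language stops other code from sharing a plain `int` across threads instead, which is the crux of the question.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical reference counter, loosely modelled on a shared string buffer.
struct RefCount {
    std::atomic<int> count{0};

    // Taking a reference only needs atomicity, not ordering.
    void acquire() { count.fetch_add(1, std::memory_order_relaxed); }

    // Returns true when the caller dropped the last reference.
    bool release() {
        int previous = count.fetch_sub(1, std::memory_order_acq_rel);
        assert(previous > 0 && "release() without matching acquire()");
        return previous == 1;
    }
};

// Spawns `threads` workers that each call acquire() `per_thread` times,
// then returns the final count after all workers have joined.
int hammer(int threads, int per_thread) {
    RefCount rc;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&rc, per_thread] {
            for (int i = 0; i < per_thread; ++i) rc.acquire();
        });
    }
    for (auto& w : workers) w.join();
    return rc.count.load();
}
```

With a plain `int` in place of `std::atomic<int>`, the same loop would be a data race and could lose increments.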

MaulingMonkey · on Aug 7, 2016

> This exact thing is solvable with the new C++ atomic support, but does it guarantee that you'll use it across threads like Rust?

No.

That said, stuff like Clang's thread safety annotations can help: http://clang.llvm.org/docs/ThreadSafetyAnalysis.html
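A rough sketch of what those annotations look like in practice, assuming Clang with `-Wthread-safety`. The macro names follow the pattern in the linked documentation; `Mutex`, `MutexLock`, and `Account` are hypothetical wrappers written for this example. On other compilers the macros expand to nothing, so the code still builds but is not checked.

```cpp
#include <cassert>
#include <mutex>

// Thread safety attributes are a Clang extension; expand to nothing elsewhere.
#if defined(__clang__)
#define THREAD_ANNOTATION(x) __attribute__((x))
#else
#define THREAD_ANNOTATION(x)
#endif

#define CAPABILITY(x) THREAD_ANNOTATION(capability(x))
#define SCOPED_CAPABILITY THREAD_ANNOTATION(scoped_lockable)
#define GUARDED_BY(x) THREAD_ANNOTATION(guarded_by(x))
#define ACQUIRE(...) THREAD_ANNOTATION(acquire_capability(__VA_ARGS__))
#define RELEASE(...) THREAD_ANNOTATION(release_capability(__VA_ARGS__))

// Thin annotated wrapper, since std::mutex is not necessarily annotated.
class CAPABILITY("mutex") Mutex {
public:
    void lock() ACQUIRE() { m_.lock(); }
    void unlock() RELEASE() { m_.unlock(); }

private:
    std::mutex m_;
};

// RAII guard that the analysis understands as holding `mu` for its scope.
class SCOPED_CAPABILITY MutexLock {
public:
    explicit MutexLock(Mutex& mu) ACQUIRE(mu) : mu_(mu) { mu_.lock(); }
    ~MutexLock() RELEASE() { mu_.unlock(); }

private:
    Mutex& mu_;
};

class Account {
public:
    Account() : balance_(0) {}

    void deposit(int amount) {
        assert(amount >= 0);
        MutexLock hold(mu_);
        balance_ += amount;  // OK: mu_ is held here.
    }

    int balance() {
        MutexLock hold(mu_);
        return balance_;
    }

    // Uncommenting this should make Clang warn under -Wthread-safety,
    // because balance_ is written without holding mu_:
    // void racy_deposit(int amount) { balance_ += amount; }

private:
    Mutex mu_;
    int balance_ GUARDED_BY(mu_);
};
```

The check is opt-in and per-member: anything not marked `GUARDED_BY` is not analyzed, so this helps rather than guarantees.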

> I left C++ a long time ago

Lucky ;).

C++, and all the supporting tooling around it, is better than it used to be, make no mistake. But C++ is still no Rust - make no mistake on that either.

nickpsecurity · on Aug 6, 2016

I only know one form of safe C++ outside MISRA rules: Ironclad C++.

https://www.cis.upenn.edu/~stevez/papers/DENO+13.pdf

Its only drawback is it can't catch all use-after-free. So, they add a heap-precise, conservative GC for that. There's quite a few techniques in the literature for detecting use-after-free. Add one of those to eliminate the GC and... what's left at that point?

helmut_hed · on Aug 7, 2016

It is probably easier to write correct concurrent code now, thanks to the new features he mentions. However, as always plenty of rope is provided, and the old programs are still valid (and have the same semantics). I couldn't say how it compares to Rust, but "better now" seems pretty accurate.

bratfro · on Aug 6, 2016

The footnotes held quite a gem. It's common knowledge that the patent system is broken, but I cannot believe for the life of me someone was able to patent how to swing on a swing. My word. https://www.google.com/patents/US6368227

rumcajz · on Aug 6, 2016

Note that it's sideways swinging, not the normal one that's patented! :)

DHMO · on Aug 6, 2016

The rocket-propelled over-the-top one isn't patented, though.

https://www.youtube.com/watch?v=HrrorPT8jsM&feature=youtu.be...

amelius · on Aug 6, 2016

Well if the original patent was defined in terms of an abstract force, like it should have been, then the rocket-propelled one is patented too. All advice I've heard so far about filing patents is to keep things as abstract as possible.

sitkack · on Aug 6, 2016

> 70% of bugs had simple fixes

> 30% were fixed by ignoring the badly timed message and 40% were fixed by delaying or ignoring the message.

Makes me think our message passing based concurrency frameworks should do this automatically. This is made even simpler if vast portions of the application are pure and generate simpler transactions to be applied to a state store.
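One way a framework might do this automatically, as a hedged sketch: give every message a sequence number, ignore anything older than what the state store expects, and park anything that arrives early until its predecessors show up. `Message`, `OrderedInbox`, and the integer state are all assumptions invented for this example, not taken from the paper or from any real framework.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical message: a pure description of a state change.
struct Message {
    std::uint64_t seq;  // position in the sender's stream, starting at 0
    int delta;          // amount to add to the state
};

class OrderedInbox {
public:
    enum class Outcome { Applied, Delayed, Ignored };

    Outcome deliver(const Message& msg) {
        if (msg.seq < next_seq_) return Outcome::Ignored;  // stale or duplicate
        if (msg.seq > next_seq_) {
            // Too early: park it until the gap is filled.
            bool inserted = pending_.emplace(msg.seq, msg).second;
            return inserted ? Outcome::Delayed : Outcome::Ignored;
        }
        apply(msg);
        // Drain any parked messages that are now in order.
        auto it = pending_.find(next_seq_);
        while (it != pending_.end()) {
            Message next = it->second;
            pending_.erase(it);
            apply(next);
            it = pending_.find(next_seq_);
        }
        return Outcome::Applied;
    }

    int state() const { return state_; }

private:
    void apply(const Message& msg) {
        assert(msg.seq == next_seq_);
        state_ += msg.delta;
        ++next_seq_;
    }

    std::uint64_t next_seq_ = 0;
    int state_ = 0;
    std::map<std::uint64_t, Message> pending_;
};
```

Delivering `seq` 1 before `seq` 0 parks the first message and applies both once 0 arrives; a repeated `seq` 0 afterwards is dropped.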

sitkack · on Aug 7, 2016

we need a type system over the allowable time constraints that a process will accept.

jeffreyrogers · on Aug 6, 2016

Did Dan change the styling on his site? I remember it being much easier to read before. Now there is almost no styling at all.

joeyespo · on Aug 6, 2016

I've been using My Style [1] for blogs like this. On Dan's, I added the following tweak:

    body { max-width: 768px; margin: 32px auto; font-family: Helvetica; font-size: 20px; }

It really helps my focus on long articles that are otherwise hard on the eyes. I know there's Readability and all, but this extension works without any additional steps once set up.

[1]: https://chrome.google.com/webstore/detail/my-style/ljdhjpmbn...

AstralStorm · on Aug 7, 2016

Pixels, really now. Use em and pt units for future compatibility with high PPI screens (and blame Windows for the 96 PPI standard).

100k · on Aug 6, 2016

Yeah, I think he did. Ironically, it reads great on mobile -- the text is full width (which on my phone is a great line length) and there's no design getting in the way.

jeffreyrogers · on Aug 6, 2016

Yeah, I think it would look fine with a max-width style added. I get the allure of lightweight design but it doesn't take much effort to make it easier on the eyes. I know I can override the default styles, but it isn't hard to get right in the first place.

nkurz · on Aug 6, 2016

Perhaps an editor could mark this (2016) despite this being the opposite of the usual custom? I've learned not to get excited when someone posts a Dan Luu article, since it's usually something old that I've already seen. But despite the lack of a date at the top, and despite starting with references to 2010 and 2008 papers, this one is actually new!

On further thought, maybe it would be better to change the title to claim it's from (2010), wait for enough people to complain, then "something", then use that momentum to convince Dan to finally put dates on his articles. Just need to figure out what that "something" should be...

--

I looked into Thread Sanitizer (libtsan) recently, and was happy to see that it's supported on recent GCC as well. Documentation is a little strange, as it's split between a Google Wiki on Github and Clang, while the source is in LLVM:

https://github.com/google/sanitizers/wiki/ThreadSanitizerCpp...

http://clang.llvm.org/docs/ThreadSanitizer.html

https://llvm.org/svn/llvm-project/llvm/trunk/lib/Transforms/...

I was spooked by this FAQ on the Google Wiki page, though:

  Q: My code with C++ exceptions does not work with tsan. 
  A: Tsan does not support C++ exceptions.

Does this mean that it does not work at all on code that is written with exceptions, or that it might have false-positives or false-negatives when exceptions actually happen at runtime?

--

For other tools, Intel offers their "Parallel Inspector": https://software.intel.com/en-us/intel-inspector-xe. I haven't tried it, but it sounds like it would be useful for these issues: https://software.intel.com/en-us/get-started-with-inspector. Does anyone know how it compares with TSan?

--

  An example of an atomicity violation is this bug from MySQL:
  Thread 1:
    if (thd->proc_info)
      fputs(thd->proc_info, ...)
  Thread 2:
    thd->proc_info = NULL;

While definitely a concurrency bug, I'm surprised that this would happen frequently enough to create numerous bug reports unless there is also an undesired compiler optimization that's removing the "guard" in Thread 1. That is, the window of opportunity seems very small if the code is being executed as written. I didn't look at the details of the linked bug reports, but I suspect the compiler is able to reason based on something earlier that thd->proc_info must be non-null at this point, and thus has omitted the check.

If this is the case, it's possible that "Stack" would have caught this bug as well, or at least highlighted it as a place where the generated code was different than the programmer's intent. Stack is painful to install, and seems abandoned, but does flag some bugs that other tools miss: https://github.com/xiw/stack/
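For concreteness, the atomicity violation quoted above can be closed by putting the check and the use under one lock. This is only a sketch: `Thd`, `describe`, and the other names are hypothetical stand-ins, and this is not a claim about how MySQL actually fixed it. It returns a copy instead of calling `fputs` so the result is easy to inspect.

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical stand-in for the thread descriptor in the quoted bug.
struct Thd {
    std::mutex lock;
    const char* proc_info = nullptr;  // only touched while holding `lock`
};

// Thread 1: the null check and the read happen under the same lock, so
// Thread 2 cannot clear proc_info between them.
std::string describe(Thd& thd) {
    std::lock_guard<std::mutex> hold(thd.lock);
    if (thd.proc_info) return thd.proc_info;
    return "(none)";
}

void set_proc_info(Thd& thd, const char* info) {
    assert(info != nullptr);
    std::lock_guard<std::mutex> hold(thd.lock);
    thd.proc_info = info;
}

// Thread 2: clearing takes the same lock as the reader.
void clear_proc_info(Thd& thd) {
    std::lock_guard<std::mutex> hold(thd.lock);
    thd.proc_info = nullptr;
}
```

Without the lock, the two unsynchronized accesses are a data race under the C++11 memory model, so the compiler is not obliged to perform the check and the use as two separate loads in the order written.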

--

Does anyone know of other tools in this space? I'm still hoping there's a "silver bullet" I haven't found yet.

vardump · on Aug 6, 2016

> While definitely a concurrency bug, I'm surprised that this would happen frequently enough to create numerous bug reports unless there is also an undesired compiler optimization that's removing the "guard" in Thread 1.

Sometimes there's some external factor that causes two threads executing on two different CPU cores to be in lockstep. I've seen a case that had over 50% chance of happening, even though the chances should have been less than 1 per million.

For example logging can cause unintended synchronization. Also kernel device drivers can cause surprising synchronization. And probably a lot of other non-obvious things.

sitkack · on Aug 6, 2016

The only effective tool I have found is Rust, a correctness checker for race conditions.

nickpsecurity · on Aug 6, 2016

The first, practical one was Concurrent Pascal (1975) used in a number of OS's:

http://brinch-hansen.net/papers/

Later, Eiffel's SCOOP model in 90's was immune to races for a long time with researchers doing mods for better speed, deadlock detection, livelock detection, etc. It was ported to Java at one point. The research page in the link below shows they're probably still the top players in this given steady stream of results.

https://en.wikipedia.org/wiki/SCOOP_(software)

Works in combination with Eiffel's Design-by-Contract which can knock out semantic errors he mentions:

https://www.eiffel.com/values/design-by-contract/introductio...

Ada's Ravenscar also did safe concurrency. Ada 2012 and SPARK have Design-by-Contract with SPARK also proving absence of common errors in code automatically. Cyclone was a C variant that used region-based memory management and analysis to show absence of dangling pointers, etc. Rust improved on that with a better language, dynamic safety, and race-free concurrency.

So, there's been stuff resistant to concurrency problems for quite a while among people using safer languages. Rust is just the latest and most open.

sitkack · on Aug 6, 2016

Much of Cyclone was an inspiration for Rust. Digging through the Hansen papers. Thank you.

FreeFull · on Aug 6, 2016

Note that Rust protects you from data races, but you can still run into problems like deadlocks.
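The classic deadlock is two threads taking the same two locks in opposite order. A minimal C++11 sketch of the usual workaround, using `std::lock` to acquire both at once; `Wallet`, `transfer`, and `run_opposing_transfers` are hypothetical names for illustration only.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

struct Wallet {
    std::mutex lock;
    int balance = 0;  // only touched while holding `lock`
};

// Acquires both locks with std::lock, which uses a deadlock-avoidance
// algorithm, instead of locking `from` and then `to` in call order.
void transfer(Wallet& from, Wallet& to, int amount) {
    assert(amount >= 0);
    if (&from == &to) return;  // locking the same mutex twice would be an error
    std::lock(from.lock, to.lock);
    std::lock_guard<std::mutex> hold_from(from.lock, std::adopt_lock);
    std::lock_guard<std::mutex> hold_to(to.lock, std::adopt_lock);
    from.balance -= amount;
    to.balance += amount;
}

// Runs opposing transfers on two threads and returns the combined balance.
// With naive nested locking, this pattern could deadlock.
int run_opposing_transfers(int rounds) {
    Wallet a, b;
    a.balance = 100;
    b.balance = 100;
    std::thread t1([&] { for (int i = 0; i < rounds; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < rounds; ++i) transfer(b, a, 1); });
    t1.join();
    t2.join();
    return a.balance + b.balance;
}
```

Nothing enforces that every call site uses `std::lock`; one function that takes the locks one at a time in the wrong order brings the hazard back.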

Ono-Sendai · on Aug 6, 2016

I've used thread sanitiser on a code-base with exceptions (a while ago). It worked.

spudlyo · on Aug 6, 2016

Why are these academic types analyzing bugs in MySQL!? Is there a reason they didn't choose PostgreSQL, or is it just because these studies were all done in 2009 and 2010? Surely if the analysis would have been done in 2016 they would have picked a database with higher quality concurrency bugs to study.

v4tk · on Aug 6, 2016

PostgreSQL does not have bugs - there is even no bugtracker https://lwn.net/Articles/660468/ . It is hard to analyze bugs if there is no central place for them.

jacquesm · on Aug 6, 2016

> PostgreSQL does not have bugs

What nonsense. Postgres has over 14K issues listed against it, they just don't use a bug tracker, they use a mailing list instead. It's a holdover from the old days.

https://www.postgresql.org/list/pgsql-bugs/2016-07/

nl · on Aug 6, 2016

You mean like the bit where he says "That’s interesting, but it’s hard to tell from that if the results generalize to projects that aren’t databases, or even projects that aren’t MySQL." and then points to a study which finds the same kind of problems across a number of different projects?