Performance++

Saturday, January 14, 2012

Under the Hood: Java volatile

Java keyword volatile is one that often confuses students and junior developers, making them think it allows to deal with concurrency problems, namely a race condition. Partially this is due to some sources that thoughtlessly explain volatile as a shortcut for wrapping all usages with synchronized (this) block, which is misleading to say the least. In reality the volatile modifier suits only for specific range of tasks and it's important to understand what it does and what it doesn't. Let's have a look at a simple class.

public class TestVolatile implements Runnable {
  private int value = 0;

  public static void main(String[] args) throws InterruptedException {
    TestVolatile test = new TestVolatile();
    for (int i = 0; i < 2; ++i) {
      new Thread(test).start();
    }
    Thread.sleep(200);  // I'd like to keep the code very simple here
    System.out.println(test.value);
  }

  @Override
  public void run() {
    for (int i = 0; i < 100000; i++) {
      value++;
    }
  }
}

The instance test is accessed and modified by two threads and final result is printed to output stream. Thread.sleep(200) of course doesn't ensure both threads are done, but I don't want to focus on barrier issues in this post. Just believe me, 200ms is more than enough for 100000 iterations on any modern hardware. The point is there is a clear race condition on value field: several threads are accessing and modifying it without due synchronization. As a result any number between 100000 and 200000 can be printed out.

The underlying cause for the race is that ++ operator is not atomic (by the way C/C++ increment is not atomic either by default; there is x86 instruction lock xadd, but none of compilers I tried produced it, unless it is explicitly specified with __asm__ block). But let's not digress, here's the bytecode of the run() method:

public run()V
   L0
    LINENUMBER 15 L0
    ICONST_0
    ISTORE 1
   L1
   FRAME APPEND [I]
    ILOAD 1
    LDC 100000
    IF_ICMPGE L2
   L3
    LINENUMBER 16 L3
    ALOAD 0
    DUP
    GETFIELD TestVolatile.value : I
    ICONST_1
    IADD
    PUTFIELD TestVolatile.value : I
   L4
    LINENUMBER 15 L4
    IINC 1 1
    GOTO L1
   L2
    LINENUMBER 18 L2
   FRAME CHOP 1
    RETURN
   L5
    LOCALVARIABLE i I L1 L2 1
    LOCALVARIABLE this LTestVolatile; L0 L5 0
    MAXSTACK = 3
    MAXLOCALS = 2

The interesting part is in the L3 section (line 16 is value++;), which effectively pushes this instance onto the stack twice (see ALOAD 0 and DUP), loads the value field of this and pushes it too, then pushes the constant 1, adds these two and puts the sum back to this.value. So there are two bytecode instructions between GETFIELD and PUTFIELD and if the thread is interrupted anywhere between them, put operation might overwrite the correct value in memory. I won't go into details, but JIT compiled code isn't safer.

Well, now the issue is clear, how can we fix it? What naive programmers are tempted to do is make the value volatile and expect the code to be thread-safe.

public class TestVolatile implements Runnable {
  private volatile int value = 0;
  // ...
}

Run this code yourself and see that it's not. More to that: the run() bytecode didn't change, not a single instruction! Whatever the JVM is doing to a volatile field, it's still incremented in 4 instructions and there aren't any locks around it. There's no way it can be atomic!

It's time to read some spec (Java Language Specification, aka JLS) and find out what volatile actually does: A field may be declared volatile, in which case the Java memory model (§17) ensures that all threads see a consistent value for the variable. Threads see a consistent value? What the heck this means?

Well, it turns out that our first example is more buggy than it appeared: GETFIELD instruction didn't guarantee to return the very last value of the field, because it could take it from L1 or L2 cache which are local to a CPU core. Imagine another thread running on a separate core. What it is likely to see is the value from its own caches. The issue is in fact deeper, because within each thread instructions can be reordered, either by a compiler or at runtime by the CPU, and volatile in addition provides sequential consistency, but it's a different topic. What matters here is that the initial code had two concurrency problems, not one. Yes, volatile modifier solved one of them: it forced all reads and writes to go straight to main memory, so both threads began to see the true most recent value (you may think about it this way, though strictly speaking JVM doesn't guarantee writes to main memory). Another name for this guarantee is visibility of changes, or happens-before in JVM.

But we hardly noticed the improvement, because the issue we tried to achieve is atomicity. And this is what volatile doesn't guarantee, as we saw from reading the bytecode. Take a moment to appreciate the difference between visibility of changes and atomicity.

Synchronized block

Let's try something else: get rid of volatile and wrap the increment with synchronized. What do you think will happen?

public class TestVolatile implements Runnable {
  private int value = 0;

  public static void main(String[] args) throws InterruptedException {
    TestVolatile test = new TestVolatile();
    for (int i = 0; i < 2; ++i) {
      new Thread(test).start();
    }
    Thread.sleep(200);  // I'd like to keep the code very simple here
    System.out.println(test.value);
  }

  @Override
  public void run() {
    for (int i = 0; i < 100000; i++) {
      synchronized (this) {
        value++;
      }
    }
  }
}

The program constantly prints 200000, i.e. race condition is resolved. Below you can examine the bytecode of the new run() method and notice MONITORENTER and MONITOREXIT instructions around same GETFIELD-PUTFIELD stuff. Here's where atomicity comes from, what about visibility of changes?

public run()V
    TRYCATCHBLOCK L0 L1 L2 null
    TRYCATCHBLOCK L2 L3 L2 null
   L4
    LINENUMBER 15 L4
    ICONST_0
    ISTORE 1
   L5
   FRAME APPEND [I]
    ILOAD 1
    LDC 100000
    IF_ICMPGE L6
   L7
    LINENUMBER 16 L7
    ALOAD 0
    DUP
    ASTORE 2
    MONITORENTER
   L0
    LINENUMBER 17 L0
    ALOAD 0
    DUP
    GETFIELD TestVolatile.value : I
    ICONST_1
    IADD
    PUTFIELD TestVolatile.value : I
   L8
    LINENUMBER 18 L8
    ALOAD 2
    MONITOREXIT
   L1
    GOTO L9
   L2
   FRAME FULL [TestVolatile I java/lang/Object] [java/lang/Throwable]
    ASTORE 3
    ALOAD 2
    MONITOREXIT
   L3
    ALOAD 3
    ATHROW
   L9
    LINENUMBER 15 L9
   FRAME CHOP 1
    IINC 1 1
    GOTO L5
   L6
    LINENUMBER 20 L6
   FRAME CHOP 1
    RETURN
   L10
    LOCALVARIABLE i I L5 L6 1
    LOCALVARIABLE this LTestVolatile; L4 L10 0
    MAXSTACK = 3
    MAXLOCALS = 4

JLS again provides a rather sophisticated description in section 17.4.5: If an action x synchronizes-with a following action y, then we also have hb(x, y). But in simple terms it says that visibility of changes is guaranteed between two synchronized sections, which is exactly what we wanted.

It should be noted that synchronized is not the only solution. Another way would be to use AtomicInteger as follows.

import java.util.concurrent.atomic.AtomicInteger;

public class TestVolatile implements Runnable {
  private final AtomicInteger value = new AtomicInteger();

  public static void main(String[] args) throws InterruptedException {
    TestVolatile test = new TestVolatile();
    for (int i = 0; i < 2; ++i) {
      new Thread(test).start();
    }
    Thread.sleep(200);  // I'd like to keep the code very simple here
    System.out.println(test.value);
  }

  @Override
  public void run() {
    for (int i = 0; i < 100000; i++) {
      value.incrementAndGet();
    }
  }
}

And the bytecode. Note INVOKEVIRTUAL call.

public run()V
   L0
    LINENUMBER 17 L0
    ICONST_0
    ISTORE 1
   L1
   FRAME APPEND [I]
    ILOAD 1
    LDC 100000
    IF_ICMPGE L2
   L3
    LINENUMBER 18 L3
    ALOAD 0
    GETFIELD TestVolatile.value : Ljava/util/concurrent/atomic/AtomicInteger;
    INVOKEVIRTUAL java/util/concurrent/atomic/AtomicInteger.incrementAndGet ()I
    POP
   L4
    LINENUMBER 17 L4
    IINC 1 1
    GOTO L1
   L2
    LINENUMBER 20 L2
   FRAME CHOP 1
    RETURN
   L5
    LOCALVARIABLE i I L1 L2 1
    LOCALVARIABLE this LTestVolatile; L0 L5 0
    MAXSTACK = 2
    MAXLOCALS = 2

Benchmark

For the benchmark I used the following template class, whose run method performs both lots of reads and writes. The last increment by (temp > 0 ? 0 : 0) is there only to make sure temp is not optimized away by JIT.

public class Benchmark implements Runnable {
  private static final int LIMIT = 10000000;
  private int value = 0;

  public static void main(String[] args) throws InterruptedException {
    Benchmark benchmark = new Benchmark();

    Thread thread1 = new Thread(benchmark);
    Thread thread2 = new Thread(benchmark);
    Thread thread3 = new Thread(benchmark);

    long start = System.currentTimeMillis();
    thread1.start();
    thread2.start();
    thread3.start();

    thread1.join();
    thread2.join();
    thread3.join();
    long finish = System.currentTimeMillis();

    System.out.println("Time: " + (finish - start) + "ms");
    System.out.println("Value: " + benchmark.value);
  }

  @Override
  public void run() {
    for (int i = 0; i < LIMIT; i++) {
      value++;
    }

    int temp = 0;
    for (int i = 0; i < LIMIT; i++) {
      temp += i * value + 1;
    }

    value += (temp > 0 ? 0 : 0);
  }
}

I made all four modifications discussed above, two thread-safe and two unsafe and measured the time on my machine (java 1.6.0_45, Intel core i7-2600 CPU with 8 cores). Absolute numbers are not very important, but the relative difference is more or less clear.

int: 7-10ms
volatile int: 480-680ms
synchronized (this): 650-700ms
AtomicInteger: 1200-1400ms

I'd like to stress that the idea is to measure the overhead itself in a situation when there is a lot of parallelism. Performance can vary drastically in a situation when there are hundreds of threads.

Update for Java 1.8.0_45 on the same machine. Quite remarkable change:

int: 20ms
volatile int: 500-650ms
synchronized (this): 2700-2800ms
AtomicInteger: 660-690ms

Thursday, December 29, 2011

C tricks: compile time assert in pure C

In my last post about compile time assert, I concluded that with new C++11 static_assert the issue is finally resolved and everyone is encouraged to use it now. Sadly that was too quick to close the discussion, because one case is still actual, namely compile time assert in pure C, where static_assert is not available.

If you recall the C++ solution, it used template specialization, which forced the compiler to evaluate boolean expression at compile time. There are no templates in C, so it won't work as well. But there are typedefs and array size still must be computed at compile time and it can't be negative, so we can start for that.

The following implementation is credited to RBerteig and his answer on StackOverflow. The code I came up with is similar, but ignores the line number. So I'll just copy his version and comment it a little bit.

/** A compile time assertion check.
 *
 *  Validate at compile time that the predicate is true without
 *  generating code. This can be used at any point in a source file
 *  where typedef is legal.
 *
 *  On success, compilation proceeds normally.
 *
 *  On failure, attempts to typedef an array type of negative size. The
 *  offending line will look like
 *      typedef assertion_failed_file_h_42[-1]
 *  where file is the content of the second parameter which should
 *  typically be related in some obvious way to the containing file
 *  name, 42 is the line number in the file on which the assertion
 *  appears, and -1 is the result of a calculation based on the
 *  predicate failing.
 *
 *  \param predicate The predicate to test. It must evaluate to
 *  something that can be coerced to a normal C boolean.
 *
 *  \param file A sequence of legal identifier characters that should
 *  uniquely identify the source file in which this condition appears.
 */
#define CASSERT(predicate, file) _impl_CASSERT_LINE(predicate,__LINE__,file)

#define _impl_PASTE(a,b) a##b
#define _impl_CASSERT_LINE(predicate, line, file) \
    typedef char _impl_PASTE(assertion_failed_##file##_,line)[2*!!(predicate)-1];

CASSERT macro takes two parameters: the predicate to be tested and the file name identifier. The latter is optional and aims to clear the error message, the client can pass any error message there. _impl_PASTE is a just helper macro to concatenate the identifier name from file and line number.

Here's how the CASSERT is used in a client code (file demo.c):

struct foo {
    ...  /* 76 bytes of members */
};
CASSERT(sizeof(struct foo) == 76, demo_c);

And that's it: the problem is solved for both C++ and C. C++11 assertion is obviously prettier, but the code above provides a nice fallback when one has to code pure C.

Sunday, December 25, 2011

C++ tricks: compile time assert

Intro. I've been long thinking to start my own blog, at least a couple of years now. A common objection everyone always has is "OK, I feel enthusiastic right now. But can I keep up in the long-term?" The bad news is I still don't know the answer, though I'll do my best to allocate at least some time on it. But luckily this question does not stop me anymore. "If you feel like you'd love to write, right now, don't think, just write!" That's what I said to myself. And here it is: my first blog post! The topic I've chosen to discuss for the very first post is rather simple, yet nevertheless interesting: compile time assertion in C++.

Motivation: evaluate a static expression and make sure certain contract is always met.

static int PACKAGE_SIZE = 1024;

void send_bytes(char* data) {
  // Assert that PACKAGE_SIZE is less than 4Kb so be TCP/IP friendly?
}

/* Another example */

struct Data {
  char buffer[64];
};

void write_data(const Data& data) {
  // Assert that sizeof(data) is exactly 64 bytes?
}

First of all, one can add a runtime check of course. Second, it's better to document the restrictions anyway to avoid accidental bugs. But: explicit assert is better than implicit and the sooner the problem is found, the better. If we have a static expression, can't we use some magic to assert that at compilation? So that if for some reason the contract is not met, the compilation would fail and report exactly what happened. Yes, we can.

Solution: among plenty of features, C++11 finally comes along with a static assert and allows to use static_assert function like this:

#include <type_traits>

void send_bytes(char* data) {
    static_assert(PACKAGE_SIZE < 4096, "Package size is too big");
}

void write_data(const Data& data) {
    static_assert(sizeof(data) == 64, "Data must be 64 bytes");
}

And these solutions work just fine, supported by all major compilers. But I'd like to draw attention to the alternatives some of which I used long before C++0x and it's nice to know they exist.

Boost solution: BOOST_STATIC_ASSERT from Boost library solves the problem, though it has a limitation: one can't pass a human readable message to the compiler. And everyone knows compiler errors can be very cryptic.

// Boost example.
namespace my_conditions {
   BOOST_STATIC_ASSERT(std::numeric_limits<int>::digits >= 32);
   BOOST_STATIC_ASSERT(WCHAR_MIN >= 0);
}

When the assertion is false, the programmer would see "Illegal use of STATIC_ASSERTION_FAILURE <false>" error. OK, seems good enough, but what if one does not want a dependency on Boost?

Windows solution: Use C_ASSERT macro from Windows.h. A major drawback is quite obvious, this solution is platform dependent. But the implementation is so simple, let's try to port it and in addition make a couple of improvements.

Google solution. My favourite one. I call this a Google solution, because I learned it when I worked for Google back in 2006, though I'm pretty sure it was used even before that. Here's the code:

// Implementation details of COMPILE_ASSERT:
//
// - COMPILE_ASSERT works by defining an array type that has -1
//   elements (and thus is invalid) when the expression is false.
//
// - The simpler definition
//
//     #define COMPILE_ASSERT(expr, msg) typedef char msg[(expr) ? 1 : -1]
//
//   does not work, as gcc supports variable-length arrays whose sizes
//   are determined at run-time (this is gcc's extension and not part
//   of the C++ standard).  As a result, gcc fails to reject the
//   following code with the simple definition:
//
//     int foo;
//     COMPILE_ASSERT(foo, msg); // not supposed to compile as foo is
//                               // not a compile-time constant.
//
// - By using the type CompileAssert<(bool(expr))>, we ensure that
//   expr is a compile-time constant.  (Template arguments must be
//   determined at compile-time.)
//
// - The outer parentheses in CompileAssert<(bool(expr))> are necessary
//   to work around a bug in gcc 3.4.4 and 4.0.1.  If we had written
//
//     CompileAssert<bool(expr)>
//
//   instead, these compilers will refuse to compile
//
//     COMPILE_ASSERT(5 > 0, some_message);
//
//   (They seem to think the ">" in "5 > 0" marks the end of the
//   template argument list.)
//
// - The array size is (bool(expr) ? 1 : -1), instead of simply
//
//     ((expr) ? 1 : -1).
//
//   This is to avoid running into a bug in MS VC 7.1, which
//   causes ((0.0) ? 1 : -1) to incorrectly evaluate to 1.
template <bool>
struct CompileAssert {
};

#define COMPILE_ASSERT(expr, msg) \
typedef CompileAssert<(bool(expr))> msg[bool(expr) ? 1 : -1]

// and the use of it as follows

void write_data(const Data& data) {
  COMPILE_ASSERT(sizeof(data) == 64, data_should_be_exactly_64_bytes);
}

The compiler now triggers a "error: size of array ‘data_should_be_exactly_64_bytes’ is negative". It's still not ideal, but it points to the exact line of the assert, and provides at least some reason. Also note that under the hood the assert is just a typedef, so it's not compiled into assembly instructions and doesn't consume memory.

Real-life usages. Let's repeat that dedicated compiler support (static_assert is not even a macro, but a special directive) is always better and everyone is encouraged to use it now. Here are a few examples when it might be handy:

template <class T>
void swap(T& a, T& b) {
    static_assert(std::is_copy_constructible<t>::value, "Swap requires copying");
    static_assert(std::is_nothrow_move_constructible<t>::value &&
                  std::is_nothrow_move_assignable<t>::value, "Swap may throw");
    auto c = b;
    b = a;
    a = c;
}

template <class T>
struct data_structure {
    static_assert(std::is_default_constructible<t>::value,
                  "Data Structure requires default-constructible elements");
};