Sstrhsrtj

Question

The following Java program takes on average between 0.50s and 0.55s to run:

public static void main(String args) {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        n += 2 * (i * i);

    }

    System.out.println((double) (System.nanoTime() - startTime) / 1000000000 + " s");

    System.out.println("n = " + n);

}

If I replace 2 * (i * i) with 2 * i * i, it takes between 0.60 and 0.65s to run. How come?

Edit: I ran each version of the program 15 times, alternating between the two. Here are the results:

2 * (i * i): 0.5183738, 0.5298337, 0.5308647, 0.5133458, 0.5003011, 0.5366181, 0.515149, 0.5237389, 0.5249942, 0.5641624, 0.538412, 0.5466744, 0.531159, 0.5048032, 0.5232789

2 * i * i: 0.6246434, 0.6049722, 0.6603363, 0.6243328, 0.6541802, 0.6312638, 0.6241105, 0.627815, 0.6114252, 0.6781033, 0.6393969, 0.6608845, 0.6201077, 0.6511559, 0.6544526

The fastest run of 2 * i * i took longer than the slowest run of 2 * (i * i). If they were both as efficient, the probability of this happening would be less than 1/2^15 = 0.00305%.

How sure are you that your measurements are done correctly ? How many times did you run each version and how static was your test environment? — yesterday
I ran each version around 10 times, and the 2 * i * i version always takes slightly longer. — yesterday
@nullpointer To find out for real why one is faster than the other, we'd have to get the disassembly or Ideal graphs for those methods. The assembler is very annoying to try and figure out, so I'm trying to get an OpenJDK debug build which can output nice graphs. — yesterday

score 104 · Accepted Answer · 2018-11-24 13:48:02Z

There is a slight difference in the ordering of the bytecode

2 * (i * i):

     iconst_2

     iload0

     iload0

     imul

     imul

     iadd

vs 2 * i * i:

     iconst_2

     iload0

     imul

     iload0

     imul

     iadd

At first sight this should not make a difference; if anything the second version is more optimal since it uses one slot less.

So we need to dig deeper into the lower level (JIT).

Remember that JIT tends to unroll small loops very aggressively. Indeed we observe a 16x unrolling:

030   B2: # B2 B3 <- B1 B2  Loop: B2-B2 inner main of N18 Freq: 1e+006

030     addl    R11, RBP    # int

033     movl    RBP, R13    # spill

036     addl    RBP, #14    # int

039     imull   RBP, RBP    # int

03c     movl    R9, R13 # spill

03f     addl    R9, #13 # int

043     imull   R9, R9  # int

047     sall    RBP, #1

049     sall    R9, #1

04c     movl    R8, R13 # spill

04f     addl    R8, #15 # int

053     movl    R10, R8 # spill

056     movdl   XMM1, R8    # spill

05b     imull   R10, R8 # int

05f     movl    R8, R13 # spill

062     addl    R8, #12 # int

066     imull   R8, R8  # int

06a     sall    R10, #1

06d     movl    [rsp + #32], R10    # spill

072     sall    R8, #1

075     movl    RBX, R13    # spill

078     addl    RBX, #11    # int

07b     imull   RBX, RBX    # int

07e     movl    RCX, R13    # spill

081     addl    RCX, #10    # int

084     imull   RCX, RCX    # int

087     sall    RBX, #1

089     sall    RCX, #1

08b     movl    RDX, R13    # spill

08e     addl    RDX, #8 # int

091     imull   RDX, RDX    # int

094     movl    RDI, R13    # spill

097     addl    RDI, #7 # int

09a     imull   RDI, RDI    # int

09d     sall    RDX, #1

09f     sall    RDI, #1

0a1     movl    RAX, R13    # spill

0a4     addl    RAX, #6 # int

0a7     imull   RAX, RAX    # int

0aa     movl    RSI, R13    # spill

0ad     addl    RSI, #4 # int

0b0     imull   RSI, RSI    # int

0b3     sall    RAX, #1

0b5     sall    RSI, #1

0b7     movl    R10, R13    # spill

0ba     addl    R10, #2 # int

0be     imull   R10, R10    # int

0c2     movl    R14, R13    # spill

0c5     incl    R14 # int

0c8     imull   R14, R14    # int

0cc     sall    R10, #1

0cf     sall    R14, #1

0d2     addl    R14, R11    # int

0d5     addl    R14, R10    # int

0d8     movl    R10, R13    # spill

0db     addl    R10, #3 # int

0df     imull   R10, R10    # int

0e3     movl    R11, R13    # spill

0e6     addl    R11, #5 # int

0ea     imull   R11, R11    # int

0ee     sall    R10, #1

0f1     addl    R10, R14    # int

0f4     addl    R10, RSI    # int

0f7     sall    R11, #1

0fa     addl    R11, R10    # int

0fd     addl    R11, RAX    # int

100     addl    R11, RDI    # int

103     addl    R11, RDX    # int

106     movl    R10, R13    # spill

109     addl    R10, #9 # int

10d     imull   R10, R10    # int

111     sall    R10, #1

114     addl    R10, R11    # int

117     addl    R10, RCX    # int

11a     addl    R10, RBX    # int

11d     addl    R10, R8 # int

120     addl    R9, R10 # int

123     addl    RBP, R9 # int

126     addl    RBP, [RSP + #32 (32-bit)]   # int

12a     addl    R13, #16    # int

12e     movl    R11, R13    # spill

131     imull   R11, R13    # int

135     sall    R11, #1

138     cmpl    R13, #999999985

13f     jl     B2   # loop end  P=1.000000 C=6554623.000000

We observe already that there is 1 register that is "spilled" onto the stack.

And for the 2 * i * i version:

05a   B3: # B2 B4 <- B1 B2  Loop: B3-B2 inner main of N18 Freq: 1e+006

05a     addl    RBX, R11    # int

05d     movl    [rsp + #32], RBX    # spill

061     movl    R11, R8 # spill

064     addl    R11, #15    # int

068     movl    [rsp + #36], R11    # spill

06d     movl    R11, R8 # spill

070     addl    R11, #14    # int

074     movl    R10, R9 # spill

077     addl    R10, #16    # int

07b     movdl   XMM2, R10   # spill

080     movl    RCX, R9 # spill

083     addl    RCX, #14    # int

086     movdl   XMM1, RCX   # spill

08a     movl    R10, R9 # spill

08d     addl    R10, #12    # int

091     movdl   XMM4, R10   # spill

096     movl    RCX, R9 # spill

099     addl    RCX, #10    # int

09c     movdl   XMM6, RCX   # spill

0a0     movl    RBX, R9 # spill

0a3     addl    RBX, #8 # int

0a6     movl    RCX, R9 # spill

0a9     addl    RCX, #6 # int

0ac     movl    RDX, R9 # spill

0af     addl    RDX, #4 # int

0b2     addl    R9, #2  # int

0b6     movl    R10, R14    # spill

0b9     addl    R10, #22    # int

0bd     movdl   XMM3, R10   # spill

0c2     movl    RDI, R14    # spill

0c5     addl    RDI, #20    # int

0c8     movl    RAX, R14    # spill

0cb     addl    RAX, #32    # int

0ce     movl    RSI, R14    # spill

0d1     addl    RSI, #18    # int

0d4     movl    R13, R14    # spill

0d7     addl    R13, #24    # int

0db     movl    R10, R14    # spill

0de     addl    R10, #26    # int

0e2     movl    [rsp + #40], R10    # spill

0e7     movl    RBP, R14    # spill

0ea     addl    RBP, #28    # int

0ed     imull   RBP, R11    # int

0f1     addl    R14, #30    # int

0f5     imull   R14, [RSP + #36 (32-bit)]   # int

0fb     movl    R10, R8 # spill

0fe     addl    R10, #11    # int

102     movdl   R11, XMM3   # spill

107     imull   R11, R10    # int

10b     movl    [rsp + #44], R11    # spill

110     movl    R10, R8 # spill

113     addl    R10, #10    # int

117     imull   RDI, R10    # int

11b     movl    R11, R8 # spill

11e     addl    R11, #8 # int

122     movdl   R10, XMM2   # spill

127     imull   R10, R11    # int

12b     movl    [rsp + #48], R10    # spill

130     movl    R10, R8 # spill

133     addl    R10, #7 # int

137     movdl   R11, XMM1   # spill

13c     imull   R11, R10    # int

140     movl    [rsp + #52], R11    # spill

145     movl    R11, R8 # spill

148     addl    R11, #6 # int

14c     movdl   R10, XMM4   # spill

151     imull   R10, R11    # int

155     movl    [rsp + #56], R10    # spill

15a     movl    R10, R8 # spill

15d     addl    R10, #5 # int

161     movdl   R11, XMM6   # spill

166     imull   R11, R10    # int

16a     movl    [rsp + #60], R11    # spill

16f     movl    R11, R8 # spill

172     addl    R11, #4 # int

176     imull   RBX, R11    # int

17a     movl    R11, R8 # spill

17d     addl    R11, #3 # int

181     imull   RCX, R11    # int

185     movl    R10, R8 # spill

188     addl    R10, #2 # int

18c     imull   RDX, R10    # int

190     movl    R11, R8 # spill

193     incl    R11 # int

196     imull   R9, R11 # int

19a     addl    R9, [RSP + #32 (32-bit)]    # int

19f     addl    R9, RDX # int

1a2     addl    R9, RCX # int

1a5     addl    R9, RBX # int

1a8     addl    R9, [RSP + #60 (32-bit)]    # int

1ad     addl    R9, [RSP + #56 (32-bit)]    # int

1b2     addl    R9, [RSP + #52 (32-bit)]    # int

1b7     addl    R9, [RSP + #48 (32-bit)]    # int

1bc     movl    R10, R8 # spill

1bf     addl    R10, #9 # int

1c3     imull   R10, RSI    # int

1c7     addl    R10, R9 # int

1ca     addl    R10, RDI    # int

1cd     addl    R10, [RSP + #44 (32-bit)]   # int

1d2     movl    R11, R8 # spill

1d5     addl    R11, #12    # int

1d9     imull   R13, R11    # int

1dd     addl    R13, R10    # int

1e0     movl    R10, R8 # spill

1e3     addl    R10, #13    # int

1e7     imull   R10, [RSP + #40 (32-bit)]   # int

1ed     addl    R10, R13    # int

1f0     addl    RBP, R10    # int

1f3     addl    R14, RBP    # int

1f6     movl    R10, R8 # spill

1f9     addl    R10, #16    # int

1fd     cmpl    R10, #999999985

204     jl     B2   # loop end  P=1.000000 C=7419903.000000

Here we observe much more "spilling" and more accesses to the stack [RSP + ...], due to more intermediate results that need to be preserved. But of course it is obvious that neither the first nor the second version is any good; the loop could literally be coded in 6 instructions.

So it's an issue of the optimizer; as is often the case, it unrolls too aggressively and shoots itself in the foot.

In fact, modern x86 CPU's break down these instructions further into micro-ops (µops) and with features like register renaming, µop caches and loop buffers the loop unrolling idea is now most probably counter-productive. According to Agner Fog's optimization guide:

The gain in performance due to the µop cache can be quite
considerable if the average instruction length is more than 4 bytes.
The following methods of optimizing the use of the µop cache may
be considered:

Make sure that critical loops are small enough to fit into the µop cache.

Align the most critical loop entries and function entries by 32.

Avoid unnecessary loop unrolling.

Avoid instructions that have extra load time

. . .

Regarding those load times - even the fastest L1D hit costs 4 cycles, so yes, even a few accesses to memory will hurt performance in tight loops.

To see how fast it can be, we can compile a similar C application with GCC, which produces simply:

  xor edx, edx

  xor eax, eax

.L2:

  mov ecx, edx

  imul ecx, edx

  add edx, 1

  lea eax, [rax+rcx*2]

  cmp edx, 1000000000

  jne .L2

With a run time of... 0.08 s, or 5 times faster.

Runemoro 1,63611237 · Answer 2 · 2018-11-23 21:44:55Z

When the multiplication is 2 * (i * i), the JVM is able to factor out the multiplication by 2 from the loop, resulting in this equivalent but more efficient code:

int n = 0;

for (int i = 0; i < 1000000000; i++) {

    n += i * i;

}

n *= 2;

but when the multiplication is (2 * i) * i, the JVM doesn't optimize it since the multiplication by a constant is no longer right before the addition.

Here are a few reasons why I think this is the case:

Adding an if (n == 0) n = 1 statement at the start of the loop results in both versions being as efficient, since factoring out the multiplication no longer guarantees that the result will be the same

The optimized version (by factoring out the multiplication by 2) is exactly as fast as the 2 * (i * i) version

Here is the test code that I used to draw these conclusions:

public static void main(String args) {

    long fastVersion = 0;

    long slowVersion = 0;

    long optimizedVersion = 0;

    long modifiedFastVersion = 0;

    long modifiedSlowVersion = 0;



    for (int i = 0; i < 10; i++) {

        fastVersion += fastVersion();

        slowVersion += slowVersion();

        optimizedVersion += optimizedVersion();

        modifiedFastVersion += modifiedFastVersion();

        modifiedSlowVersion += modifiedSlowVersion();

    }



    System.out.println("Fast version: " + (double) fastVersion / 1000000000 + " s");

    System.out.println("Slow version: " + (double) slowVersion / 1000000000 + " s");

    System.out.println("Optimized version: " + (double) optimizedVersion / 1000000000 + " s");

    System.out.println("Modified fast version: " + (double) modifiedFastVersion / 1000000000 + " s");

    System.out.println("Modified slow version: " + (double) modifiedSlowVersion / 1000000000 + " s");

}



private static long fastVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        n += 2 * (i * i);

    }

    return System.nanoTime() - startTime;

}



private static long slowVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        n += 2 * i * i;

    }

    return System.nanoTime() - startTime;

}



private static long optimizedVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        n += i * i;

    }

    n *= 2;

    return System.nanoTime() - startTime;

}



private static long modifiedFastVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        if (n == 0) n = 1;

        n += 2 * (i * i);

    }

    return System.nanoTime() - startTime;

}



private static long modifiedSlowVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        if (n == 0) n = 1;

        n += 2 * i * i;

    }

    return System.nanoTime() - startTime;

}

And here are the results:

Fast version: 5.7274411 s

Slow version: 7.6190804 s

Optimized version: 5.1348007 s

Modified fast version: 7.1492705 s

Modified slow version: 7.2952668 s

here is a benchmark: github.com/jawb-software/stackoverflow-53452713 — yesterday
I think on the optimizedVersion, it should be n *= 2000000000; — yesterday
@StefansArya Incorrect.. 2(1) + 2(4) + 2(9) + 2(16) + ... = 2(1 + 4 + 9 + 16 + ...) — 16 hours ago

score 9 · Answer 3 · 2018-11-23 21:30:11Z

ByteCodes: https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html

ByteCodes Viewer: https://github.com/Konloch/bytecode-viewer

On my JDK (Win10 64 1.8.0_65-b17) I can reproduce and explain:

public static void main(String args) {

    int repeat = 10;

    long A = 0;

    long B = 0;

    for (int i = 0; i < repeat; i++) {

        A += test();

        B += testB();

    }



    System.out.println(A / repeat + " ms");

    System.out.println(B / repeat + " ms");

}





private static long test() {

    int n = 0;

    for (int i = 0; i < 1000; i++) {

        n += multi(i);

    }

    long startTime = System.currentTimeMillis();

    for (int i = 0; i < 1000000000; i++) {

        n += multi(i);

    }

    long ms = (System.currentTimeMillis() - startTime);

    System.out.println(ms + " ms A " + n);

    return ms;

}





private static long testB() {

    int n = 0;

    for (int i = 0; i < 1000; i++) {

        n += multiB(i);

    }

    long startTime = System.currentTimeMillis();

    for (int i = 0; i < 1000000000; i++) {

        n += multiB(i);

    }

    long ms = (System.currentTimeMillis() - startTime);

    System.out.println(ms + " ms B " + n);

    return ms;

}



private static int multiB(int i) {

    return 2 * (i * i);

}



private static int multi(int i) {

    return 2 * i * i;

}

Output:

...

405 ms A 785527736

327 ms B 785527736

404 ms A 785527736

329 ms B 785527736

404 ms A 785527736

328 ms B 785527736

404 ms A 785527736

328 ms B 785527736

410 ms

333 ms

So why?
The Byte code is this:

 private static multiB(int arg0) { // 2 * (i * i)

     <localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>



     L1 {

         iconst_2

         iload0

         iload0

         imul

         imul

         ireturn

     }

     L2 {

     }

 }



 private static multi(int arg0) { // 2 * i * i

     <localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>



     L1 {

         iconst_2

         iload0

         imul

         iload0

         imul

         ireturn

     }

     L2 {

     }

 }

The difference being:

With brackets (2 * (i * i)):

push const stack

push local on stack

push local on stack

multiply top of stack

multiply top of stack

Without brackets (2 * i * i):

push const stack

push local on stack

multiply top of stack

push local on stack

multiply top of stack

Loading all on stack and then working back down is faster than switching between putting on stack and operating on it.

NoDataFound 5,3541739 · Answer 4 · 2018-11-23 22:10:50Z

I tried a JMH using the default archetype: I also added optimized version based Runemoro' explanation .

@State(Scope.Benchmark)

@Warmup(iterations = 2)

@Fork(1)

@Measurement(iterations = 10)

@OutputTimeUnit(TimeUnit.NANOSECONDS)

//@BenchmarkMode({ Mode.All })

@BenchmarkMode(Mode.AverageTime)

public class MyBenchmark {

  @Param({ "100", "1000", "1000000000" })

  private int size;



  @Benchmark

  public int two_square_i() {

    int n = 0;

    for (int i = 0; i < size; i++) {

      n += 2 * (i * i);

    }

    return n;

  }



  @Benchmark

  public int square_i_two() {

    int n = 0;

    for (int i = 0; i < size; i++) {

      n += i * i;

    }

    return 2*n;

  }  



  @Benchmark

  public int two_i_() {

    int n = 0;

    for (int i = 0; i < size; i++) {

      n += 2 * i * i;

    }

    return n;

  }

}

The result are here:

Benchmark                           (size)  Mode  Samples          Score   Score error  Units

o.s.MyBenchmark.square_i_two           100  avgt       10         58,062         1,410  ns/op

o.s.MyBenchmark.square_i_two          1000  avgt       10        547,393        12,851  ns/op

o.s.MyBenchmark.square_i_two    1000000000  avgt       10  540343681,267  16795210,324  ns/op

o.s.MyBenchmark.two_i_                 100  avgt       10         87,491         2,004  ns/op

o.s.MyBenchmark.two_i_                1000  avgt       10       1015,388        30,313  ns/op

o.s.MyBenchmark.two_i_          1000000000  avgt       10  967100076,600  24929570,556  ns/op

o.s.MyBenchmark.two_square_i           100  avgt       10         70,715         2,107  ns/op

o.s.MyBenchmark.two_square_i          1000  avgt       10        686,977        24,613  ns/op

o.s.MyBenchmark.two_square_i    1000000000  avgt       10  652736811,450  27015580,488  ns/op

On my PC (Core i7 860, doing nothing much apart reading on my smartphone):

n += i*i then n*2 is first

2 * (i * i) is second.

The JVM is clearly not optimizing the same way than a human does (based on Runemoro answer).

Now then, reading bytecode: javap -c -v ./target/classes/org/sample/MyBenchmark.class

Differences between 2*(i*i) (left) and 2*i*i (right) here: https://www.diffchecker.com/cvSFppWI

Differences between 2*(i*i) and the optimized version here: https://www.diffchecker.com/I1XFu5dP

I am not expert on bytecode but we iload_2 before we imul: that's probably where you get the difference: I can suppose that the JVM optimize reading i twice (i is already here, there is no need to load it again) whilst in the 2*i*i it can't.

paulsm4 75.7k898122 · Answer 5 · 2018-11-23 21:10:06Z

I got similar results:

2 * (i * i): 0.458765943 s, n=119860736

2 * i * i: 0.580255126 s, n=119860736

I got the SAME results if both loops were in the same program, or each was in a separate .java file/.class, executed on a separate run.

Finally, here is a javap -c -v <.java> decompile of each:

     3: ldc           #3                  // String 2 * (i * i):

     5: invokevirtual #4                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V

     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J

     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J

    11: lstore_1

    12: iconst_0

    13: istore_3

    14: iconst_0

    15: istore        4

    17: iload         4

    19: ldc           #6                  // int 1000000000

    21: if_icmpge     40

    24: iload_3

    25: iconst_2

    26: iload         4

    28: iload         4

    30: imul

    31: imul

    32: iadd

    33: istore_3

    34: iinc          4, 1

    37: goto          17

vs.

     3: ldc           #3                  // String 2 * i * i:

     5: invokevirtual #4                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V

     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J

    11: lstore_1

    12: iconst_0

    13: istore_3

    14: iconst_0

    15: istore        4

    17: iload         4

    19: ldc           #6                  // int 1000000000

    21: if_icmpge     40

    24: iload_3

    25: iconst_2

    26: iload         4

    28: imul

    29: iload         4

    31: imul

    32: iadd

    33: istore_3

    34: iinc          4, 1

    37: goto          17

FYI -

java -version

java version "1.8.0_121"

Java(TM) SE Runtime Environment (build 1.8.0_121-b13)

Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

A better answer and maybe you can vote to undelete - stackoverflow.com/a/53452836/1746118 ... Side note - I am not the downvoter anyway. — yesterday
@nullpointer - I agree. I'd definitely vote to undelete, if I could. I'd also like to "double upvote" stefan for giving a quantitative definition of "significant" — yesterday
That one was self-deleted since it measured the wrong thing - see that author's comment on the question above — yesterday
Get a debug jre and run with -XX:+PrintOptoAssembly. Or just use vtune or alike. — yesterday
@ rustyx - If the problem is the JIT implementation ... then "getting a debug version" OF A COMPLETELY DIFFERENT JRE isn't necessarily going to help. Nevertheless: it sounds like what you found above with your JIT disassembly on your JRE also explains the behavior on the OP's JRE and mine. And also explains why other JRE's behave "differently". +1: thank you for the excellent detective work! — 22 hours ago

score 104 · Accepted Answer · 2018-11-24 13:48:02Z

There is a slight difference in the ordering of the bytecode

2 * (i * i):

     iconst_2

     iload0

     iload0

     imul

     imul

     iadd

vs 2 * i * i:

     iconst_2

     iload0

     imul

     iload0

     imul

     iadd

At first sight this should not make a difference; if anything the second version is more optimal since it uses one slot less.

So we need to dig deeper into the lower level (JIT).

Remember that JIT tends to unroll small loops very aggressively. Indeed we observe a 16x unrolling:

030   B2: # B2 B3 <- B1 B2  Loop: B2-B2 inner main of N18 Freq: 1e+006

030     addl    R11, RBP    # int

033     movl    RBP, R13    # spill

036     addl    RBP, #14    # int

039     imull   RBP, RBP    # int

03c     movl    R9, R13 # spill

03f     addl    R9, #13 # int

043     imull   R9, R9  # int

047     sall    RBP, #1

049     sall    R9, #1

04c     movl    R8, R13 # spill

04f     addl    R8, #15 # int

053     movl    R10, R8 # spill

056     movdl   XMM1, R8    # spill

05b     imull   R10, R8 # int

05f     movl    R8, R13 # spill

062     addl    R8, #12 # int

066     imull   R8, R8  # int

06a     sall    R10, #1

06d     movl    [rsp + #32], R10    # spill

072     sall    R8, #1

075     movl    RBX, R13    # spill

078     addl    RBX, #11    # int

07b     imull   RBX, RBX    # int

07e     movl    RCX, R13    # spill

081     addl    RCX, #10    # int

084     imull   RCX, RCX    # int

087     sall    RBX, #1

089     sall    RCX, #1

08b     movl    RDX, R13    # spill

08e     addl    RDX, #8 # int

091     imull   RDX, RDX    # int

094     movl    RDI, R13    # spill

097     addl    RDI, #7 # int

09a     imull   RDI, RDI    # int

09d     sall    RDX, #1

09f     sall    RDI, #1

0a1     movl    RAX, R13    # spill

0a4     addl    RAX, #6 # int

0a7     imull   RAX, RAX    # int

0aa     movl    RSI, R13    # spill

0ad     addl    RSI, #4 # int

0b0     imull   RSI, RSI    # int

0b3     sall    RAX, #1

0b5     sall    RSI, #1

0b7     movl    R10, R13    # spill

0ba     addl    R10, #2 # int

0be     imull   R10, R10    # int

0c2     movl    R14, R13    # spill

0c5     incl    R14 # int

0c8     imull   R14, R14    # int

0cc     sall    R10, #1

0cf     sall    R14, #1

0d2     addl    R14, R11    # int

0d5     addl    R14, R10    # int

0d8     movl    R10, R13    # spill

0db     addl    R10, #3 # int

0df     imull   R10, R10    # int

0e3     movl    R11, R13    # spill

0e6     addl    R11, #5 # int

0ea     imull   R11, R11    # int

0ee     sall    R10, #1

0f1     addl    R10, R14    # int

0f4     addl    R10, RSI    # int

0f7     sall    R11, #1

0fa     addl    R11, R10    # int

0fd     addl    R11, RAX    # int

100     addl    R11, RDI    # int

103     addl    R11, RDX    # int

106     movl    R10, R13    # spill

109     addl    R10, #9 # int

10d     imull   R10, R10    # int

111     sall    R10, #1

114     addl    R10, R11    # int

117     addl    R10, RCX    # int

11a     addl    R10, RBX    # int

11d     addl    R10, R8 # int

120     addl    R9, R10 # int

123     addl    RBP, R9 # int

126     addl    RBP, [RSP + #32 (32-bit)]   # int

12a     addl    R13, #16    # int

12e     movl    R11, R13    # spill

131     imull   R11, R13    # int

135     sall    R11, #1

138     cmpl    R13, #999999985

13f     jl     B2   # loop end  P=1.000000 C=6554623.000000

We observe already that there is 1 register that is "spilled" onto the stack.

And for the 2 * i * i version:

05a   B3: # B2 B4 <- B1 B2  Loop: B3-B2 inner main of N18 Freq: 1e+006

05a     addl    RBX, R11    # int

05d     movl    [rsp + #32], RBX    # spill

061     movl    R11, R8 # spill

064     addl    R11, #15    # int

068     movl    [rsp + #36], R11    # spill

06d     movl    R11, R8 # spill

070     addl    R11, #14    # int

074     movl    R10, R9 # spill

077     addl    R10, #16    # int

07b     movdl   XMM2, R10   # spill

080     movl    RCX, R9 # spill

083     addl    RCX, #14    # int

086     movdl   XMM1, RCX   # spill

08a     movl    R10, R9 # spill

08d     addl    R10, #12    # int

091     movdl   XMM4, R10   # spill

096     movl    RCX, R9 # spill

099     addl    RCX, #10    # int

09c     movdl   XMM6, RCX   # spill

0a0     movl    RBX, R9 # spill

0a3     addl    RBX, #8 # int

0a6     movl    RCX, R9 # spill

0a9     addl    RCX, #6 # int

0ac     movl    RDX, R9 # spill

0af     addl    RDX, #4 # int

0b2     addl    R9, #2  # int

0b6     movl    R10, R14    # spill

0b9     addl    R10, #22    # int

0bd     movdl   XMM3, R10   # spill

0c2     movl    RDI, R14    # spill

0c5     addl    RDI, #20    # int

0c8     movl    RAX, R14    # spill

0cb     addl    RAX, #32    # int

0ce     movl    RSI, R14    # spill

0d1     addl    RSI, #18    # int

0d4     movl    R13, R14    # spill

0d7     addl    R13, #24    # int

0db     movl    R10, R14    # spill

0de     addl    R10, #26    # int

0e2     movl    [rsp + #40], R10    # spill

0e7     movl    RBP, R14    # spill

0ea     addl    RBP, #28    # int

0ed     imull   RBP, R11    # int

0f1     addl    R14, #30    # int

0f5     imull   R14, [RSP + #36 (32-bit)]   # int

0fb     movl    R10, R8 # spill

0fe     addl    R10, #11    # int

102     movdl   R11, XMM3   # spill

107     imull   R11, R10    # int

10b     movl    [rsp + #44], R11    # spill

110     movl    R10, R8 # spill

113     addl    R10, #10    # int

117     imull   RDI, R10    # int

11b     movl    R11, R8 # spill

11e     addl    R11, #8 # int

122     movdl   R10, XMM2   # spill

127     imull   R10, R11    # int

12b     movl    [rsp + #48], R10    # spill

130     movl    R10, R8 # spill

133     addl    R10, #7 # int

137     movdl   R11, XMM1   # spill

13c     imull   R11, R10    # int

140     movl    [rsp + #52], R11    # spill

145     movl    R11, R8 # spill

148     addl    R11, #6 # int

14c     movdl   R10, XMM4   # spill

151     imull   R10, R11    # int

155     movl    [rsp + #56], R10    # spill

15a     movl    R10, R8 # spill

15d     addl    R10, #5 # int

161     movdl   R11, XMM6   # spill

166     imull   R11, R10    # int

16a     movl    [rsp + #60], R11    # spill

16f     movl    R11, R8 # spill

172     addl    R11, #4 # int

176     imull   RBX, R11    # int

17a     movl    R11, R8 # spill

17d     addl    R11, #3 # int

181     imull   RCX, R11    # int

185     movl    R10, R8 # spill

188     addl    R10, #2 # int

18c     imull   RDX, R10    # int

190     movl    R11, R8 # spill

193     incl    R11 # int

196     imull   R9, R11 # int

19a     addl    R9, [RSP + #32 (32-bit)]    # int

19f     addl    R9, RDX # int

1a2     addl    R9, RCX # int

1a5     addl    R9, RBX # int

1a8     addl    R9, [RSP + #60 (32-bit)]    # int

1ad     addl    R9, [RSP + #56 (32-bit)]    # int

1b2     addl    R9, [RSP + #52 (32-bit)]    # int

1b7     addl    R9, [RSP + #48 (32-bit)]    # int

1bc     movl    R10, R8 # spill

1bf     addl    R10, #9 # int

1c3     imull   R10, RSI    # int

1c7     addl    R10, R9 # int

1ca     addl    R10, RDI    # int

1cd     addl    R10, [RSP + #44 (32-bit)]   # int

1d2     movl    R11, R8 # spill

1d5     addl    R11, #12    # int

1d9     imull   R13, R11    # int

1dd     addl    R13, R10    # int

1e0     movl    R10, R8 # spill

1e3     addl    R10, #13    # int

1e7     imull   R10, [RSP + #40 (32-bit)]   # int

1ed     addl    R10, R13    # int

1f0     addl    RBP, R10    # int

1f3     addl    R14, RBP    # int

1f6     movl    R10, R8 # spill

1f9     addl    R10, #16    # int

1fd     cmpl    R10, #999999985

204     jl     B2   # loop end  P=1.000000 C=7419903.000000

Here we observe much more "spilling" and more accesses to the stack [RSP + ...], due to more intermediate results that need to be preserved. But of course it is obvious that neither the first nor the second version is any good; the loop could literally be coded in 6 instructions.

So it's an issue of the optimizer; as is often the case, it unrolls too aggressively and shoots itself in the foot.

In fact, modern x86 CPU's break down these instructions further into micro-ops (µops) and with features like register renaming, µop caches and loop buffers the loop unrolling idea is now most probably counter-productive. According to Agner Fog's optimization guide:

The gain in performance due to the µop cache can be quite
considerable if the average instruction length is more than 4 bytes.
The following methods of optimizing the use of the µop cache may
be considered:

Make sure that critical loops are small enough to fit into the µop cache.

Align the most critical loop entries and function entries by 32.

Avoid unnecessary loop unrolling.

Avoid instructions that have extra load time

. . .

Regarding those load times - even the fastest L1D hit costs 4 cycles, so yes, even a few accesses to memory will hurt performance in tight loops.

To see how fast it can be, we can compile a similar C application with GCC, which produces simply:

  xor edx, edx

  xor eax, eax

.L2:

  mov ecx, edx

  imul ecx, edx

  add edx, 1

  lea eax, [rax+rcx*2]

  cmp edx, 1000000000

  jne .L2

With a run time of... 0.08 s, or 5 times faster.

Runemoro 1,63611237 · Answer 7 · 2018-11-23 21:44:55Z

When the multiplication is 2 * (i * i), the JVM is able to factor out the multiplication by 2 from the loop, resulting in this equivalent but more efficient code:

int n = 0;

for (int i = 0; i < 1000000000; i++) {

    n += i * i;

}

n *= 2;

but when the multiplication is (2 * i) * i, the JVM doesn't optimize it since the multiplication by a constant is no longer right before the addition.

Here are a few reasons why I think this is the case:

Adding an if (n == 0) n = 1 statement at the start of the loop results in both versions being as efficient, since factoring out the multiplication no longer guarantees that the result will be the same

The optimized version (by factoring out the multiplication by 2) is exactly as fast as the 2 * (i * i) version

Here is the test code that I used to draw these conclusions:

public static void main(String args) {

    long fastVersion = 0;

    long slowVersion = 0;

    long optimizedVersion = 0;

    long modifiedFastVersion = 0;

    long modifiedSlowVersion = 0;



    for (int i = 0; i < 10; i++) {

        fastVersion += fastVersion();

        slowVersion += slowVersion();

        optimizedVersion += optimizedVersion();

        modifiedFastVersion += modifiedFastVersion();

        modifiedSlowVersion += modifiedSlowVersion();

    }



    System.out.println("Fast version: " + (double) fastVersion / 1000000000 + " s");

    System.out.println("Slow version: " + (double) slowVersion / 1000000000 + " s");

    System.out.println("Optimized version: " + (double) optimizedVersion / 1000000000 + " s");

    System.out.println("Modified fast version: " + (double) modifiedFastVersion / 1000000000 + " s");

    System.out.println("Modified slow version: " + (double) modifiedSlowVersion / 1000000000 + " s");

}



private static long fastVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        n += 2 * (i * i);

    }

    return System.nanoTime() - startTime;

}



private static long slowVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        n += 2 * i * i;

    }

    return System.nanoTime() - startTime;

}



private static long optimizedVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        n += i * i;

    }

    n *= 2;

    return System.nanoTime() - startTime;

}



private static long modifiedFastVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        if (n == 0) n = 1;

        n += 2 * (i * i);

    }

    return System.nanoTime() - startTime;

}



private static long modifiedSlowVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        if (n == 0) n = 1;

        n += 2 * i * i;

    }

    return System.nanoTime() - startTime;

}

And here are the results:

Fast version: 5.7274411 s

Slow version: 7.6190804 s

Optimized version: 5.1348007 s

Modified fast version: 7.1492705 s

Modified slow version: 7.2952668 s

here is a benchmark: github.com/jawb-software/stackoverflow-53452713 — yesterday
I think on the optimizedVersion, it should be n *= 2000000000; — yesterday
@StefansArya Incorrect.. 2(1) + 2(4) + 2(9) + 2(16) + ... = 2(1 + 4 + 9 + 16 + ...) — 16 hours ago

score 9 · Answer 8 · 2018-11-23 21:30:11Z

ByteCodes: https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html

ByteCodes Viewer: https://github.com/Konloch/bytecode-viewer

On my JDK (Win10 64 1.8.0_65-b17) I can reproduce and explain:

public static void main(String args) {

    int repeat = 10;

    long A = 0;

    long B = 0;

    for (int i = 0; i < repeat; i++) {

        A += test();

        B += testB();

    }



    System.out.println(A / repeat + " ms");

    System.out.println(B / repeat + " ms");

}





private static long test() {

    int n = 0;

    for (int i = 0; i < 1000; i++) {

        n += multi(i);

    }

    long startTime = System.currentTimeMillis();

    for (int i = 0; i < 1000000000; i++) {

        n += multi(i);

    }

    long ms = (System.currentTimeMillis() - startTime);

    System.out.println(ms + " ms A " + n);

    return ms;

}





private static long testB() {

    int n = 0;

    for (int i = 0; i < 1000; i++) {

        n += multiB(i);

    }

    long startTime = System.currentTimeMillis();

    for (int i = 0; i < 1000000000; i++) {

        n += multiB(i);

    }

    long ms = (System.currentTimeMillis() - startTime);

    System.out.println(ms + " ms B " + n);

    return ms;

}



private static int multiB(int i) {

    return 2 * (i * i);

}



private static int multi(int i) {

    return 2 * i * i;

}

Output:

...

405 ms A 785527736

327 ms B 785527736

404 ms A 785527736

329 ms B 785527736

404 ms A 785527736

328 ms B 785527736

404 ms A 785527736

328 ms B 785527736

410 ms

333 ms

So why?
The Byte code is this:

 private static multiB(int arg0) { // 2 * (i * i)

     <localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>



     L1 {

         iconst_2

         iload0

         iload0

         imul

         imul

         ireturn

     }

     L2 {

     }

 }



 private static multi(int arg0) { // 2 * i * i

     <localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>



     L1 {

         iconst_2

         iload0

         imul

         iload0

         imul

         ireturn

     }

     L2 {

     }

 }

The difference being:

With brackets (2 * (i * i)):

push const stack

push local on stack

push local on stack

multiply top of stack

multiply top of stack

Without brackets (2 * i * i):

push const stack

push local on stack

multiply top of stack

push local on stack

multiply top of stack

Loading all on stack and then working back down is faster than switching between putting on stack and operating on it.

NoDataFound 5,3541739 · Answer 9 · 2018-11-23 22:10:50Z

I tried a JMH using the default archetype: I also added optimized version based Runemoro' explanation .

@State(Scope.Benchmark)

@Warmup(iterations = 2)

@Fork(1)

@Measurement(iterations = 10)

@OutputTimeUnit(TimeUnit.NANOSECONDS)

//@BenchmarkMode({ Mode.All })

@BenchmarkMode(Mode.AverageTime)

public class MyBenchmark {

  @Param({ "100", "1000", "1000000000" })

  private int size;



  @Benchmark

  public int two_square_i() {

    int n = 0;

    for (int i = 0; i < size; i++) {

      n += 2 * (i * i);

    }

    return n;

  }



  @Benchmark

  public int square_i_two() {

    int n = 0;

    for (int i = 0; i < size; i++) {

      n += i * i;

    }

    return 2*n;

  }  



  @Benchmark

  public int two_i_() {

    int n = 0;

    for (int i = 0; i < size; i++) {

      n += 2 * i * i;

    }

    return n;

  }

}

The result are here:

Benchmark                           (size)  Mode  Samples          Score   Score error  Units

o.s.MyBenchmark.square_i_two           100  avgt       10         58,062         1,410  ns/op

o.s.MyBenchmark.square_i_two          1000  avgt       10        547,393        12,851  ns/op

o.s.MyBenchmark.square_i_two    1000000000  avgt       10  540343681,267  16795210,324  ns/op

o.s.MyBenchmark.two_i_                 100  avgt       10         87,491         2,004  ns/op

o.s.MyBenchmark.two_i_                1000  avgt       10       1015,388        30,313  ns/op

o.s.MyBenchmark.two_i_          1000000000  avgt       10  967100076,600  24929570,556  ns/op

o.s.MyBenchmark.two_square_i           100  avgt       10         70,715         2,107  ns/op

o.s.MyBenchmark.two_square_i          1000  avgt       10        686,977        24,613  ns/op

o.s.MyBenchmark.two_square_i    1000000000  avgt       10  652736811,450  27015580,488  ns/op

On my PC (Core i7 860, doing nothing much apart reading on my smartphone):

n += i*i then n*2 is first

2 * (i * i) is second.

The JVM is clearly not optimizing the same way than a human does (based on Runemoro answer).

Now then, reading bytecode: javap -c -v ./target/classes/org/sample/MyBenchmark.class

Differences between 2*(i*i) (left) and 2*i*i (right) here: https://www.diffchecker.com/cvSFppWI

Differences between 2*(i*i) and the optimized version here: https://www.diffchecker.com/I1XFu5dP

I am not expert on bytecode but we iload_2 before we imul: that's probably where you get the difference: I can suppose that the JVM optimize reading i twice (i is already here, there is no need to load it again) whilst in the 2*i*i it can't.

paulsm4 75.7k898122 · Answer 10 · 2018-11-23 21:10:06Z

I got similar results:

2 * (i * i): 0.458765943 s, n=119860736

2 * i * i: 0.580255126 s, n=119860736

I got the SAME results if both loops were in the same program, or each was in a separate .java file/.class, executed on a separate run.

Finally, here is a javap -c -v <.java> decompile of each:

     3: ldc           #3                  // String 2 * (i * i):

     5: invokevirtual #4                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V

     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J

     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J

    11: lstore_1

    12: iconst_0

    13: istore_3

    14: iconst_0

    15: istore        4

    17: iload         4

    19: ldc           #6                  // int 1000000000

    21: if_icmpge     40

    24: iload_3

    25: iconst_2

    26: iload         4

    28: iload         4

    30: imul

    31: imul

    32: iadd

    33: istore_3

    34: iinc          4, 1

    37: goto          17

vs.

     3: ldc           #3                  // String 2 * i * i:

     5: invokevirtual #4                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V

     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J

    11: lstore_1

    12: iconst_0

    13: istore_3

    14: iconst_0

    15: istore        4

    17: iload         4

    19: ldc           #6                  // int 1000000000

    21: if_icmpge     40

    24: iload_3

    25: iconst_2

    26: iload         4

    28: imul

    29: iload         4

    31: imul

    32: iadd

    33: istore_3

    34: iinc          4, 1

    37: goto          17

FYI -

java -version

java version "1.8.0_121"

Java(TM) SE Runtime Environment (build 1.8.0_121-b13)

Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

A better answer and maybe you can vote to undelete - stackoverflow.com/a/53452836/1746118 ... Side note - I am not the downvoter anyway. — yesterday
@nullpointer - I agree. I'd definitely vote to undelete, if I could. I'd also like to "double upvote" stefan for giving a quantitative definition of "significant" — yesterday
That one was self-deleted since it measured the wrong thing - see that author's comment on the question above — yesterday
Get a debug jre and run with -XX:+PrintOptoAssembly. Or just use vtune or alike. — yesterday
@ rustyx - If the problem is the JIT implementation ... then "getting a debug version" OF A COMPLETELY DIFFERENT JRE isn't necessarily going to help. Nevertheless: it sounds like what you found above with your JIT disassembly on your JRE also explains the behavior on the OP's JRE and mine. And also explains why other JRE's behave "differently". +1: thank you for the excellent detective work! — 22 hours ago

搜尋此網誌

Sstrhsrtj

Why is “2 * (i * i)” faster than “2 * i * i”?

5 Answers
5

Your Answer

Post as a guest

5 Answers
5

5 Answers
5

Post as a guest

Popular posts from this blog

サソリ

広島県道265号伴広島線

Setup Asymptote in Texstudio

Why is “2 * (i * i)” faster than “2 * i * i”?

5 Answers 5

Your Answer

Sign up or log in

Post as a guest

Post as a guest

5 Answers 5

5 Answers 5

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Popular posts from this blog

サソリ

広島県道265号伴広島線

Setup Asymptote in Texstudio

5 Answers
5

5 Answers
5

5 Answers
5