Why is “2 * (i * i)” faster than “2 * i * i”?











up vote
84
down vote

favorite
18












The following Java program takes on average between 0.50s and 0.55s to run:



public static void main(String args) {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += 2 * (i * i);
}
System.out.println((double) (System.nanoTime() - startTime) / 1000000000 + " s");
System.out.println("n = " + n);
}


If I replace 2 * (i * i) with 2 * i * i, it takes between 0.60 and 0.65s to run. How come?



Edit: I ran each version of the program 15 times, alternating between the two. Here are the results:



2 * (i * i): 0.5183738, 0.5298337, 0.5308647, 0.5133458, 0.5003011, 0.5366181, 0.515149, 0.5237389, 0.5249942, 0.5641624, 0.538412, 0.5466744, 0.531159, 0.5048032, 0.5232789



2 * i * i: 0.6246434, 0.6049722, 0.6603363, 0.6243328, 0.6541802, 0.6312638, 0.6241105, 0.627815, 0.6114252, 0.6781033, 0.6393969, 0.6608845, 0.6201077, 0.6511559, 0.6544526



The fastest run of 2 * i * i took longer than the slowest run of 2 * (i * i). If they were both as efficient, the probability of this happening would be less than 1/2^15 = 0.00305%.









share




















  • 10




    How sure are you that your measurements are done correctly ? How many times did you run each version and how static was your test environment?
    – Joakim Danielson
    yesterday








  • 3




    I ran each version around 10 times, and the 2 * i * i version always takes slightly longer.
    – Stefan
    yesterday






  • 6




    look at the bytecode using a dissassmber.
    – OldProgrammer
    yesterday






  • 5




    Also please see: stackoverflow.com/questions/504103/…
    – lexicore
    yesterday








  • 4




    @nullpointer To find out for real why one is faster than the other, we'd have to get the disassembly or Ideal graphs for those methods. The assembler is very annoying to try and figure out, so I'm trying to get an OpenJDK debug build which can output nice graphs.
    – Jorn Vernee
    yesterday















up vote
84
down vote

favorite
18












The following Java program takes on average between 0.50s and 0.55s to run:



public static void main(String args) {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += 2 * (i * i);
}
System.out.println((double) (System.nanoTime() - startTime) / 1000000000 + " s");
System.out.println("n = " + n);
}


If I replace 2 * (i * i) with 2 * i * i, it takes between 0.60 and 0.65s to run. How come?



Edit: I ran each version of the program 15 times, alternating between the two. Here are the results:



2 * (i * i): 0.5183738, 0.5298337, 0.5308647, 0.5133458, 0.5003011, 0.5366181, 0.515149, 0.5237389, 0.5249942, 0.5641624, 0.538412, 0.5466744, 0.531159, 0.5048032, 0.5232789



2 * i * i: 0.6246434, 0.6049722, 0.6603363, 0.6243328, 0.6541802, 0.6312638, 0.6241105, 0.627815, 0.6114252, 0.6781033, 0.6393969, 0.6608845, 0.6201077, 0.6511559, 0.6544526



The fastest run of 2 * i * i took longer than the slowest run of 2 * (i * i). If they were both as efficient, the probability of this happening would be less than 1/2^15 = 0.00305%.









share




















  • 10




    How sure are you that your measurements are done correctly ? How many times did you run each version and how static was your test environment?
    – Joakim Danielson
    yesterday








  • 3




    I ran each version around 10 times, and the 2 * i * i version always takes slightly longer.
    – Stefan
    yesterday






  • 6




    look at the bytecode using a dissassmber.
    – OldProgrammer
    yesterday






  • 5




    Also please see: stackoverflow.com/questions/504103/…
    – lexicore
    yesterday








  • 4




    @nullpointer To find out for real why one is faster than the other, we'd have to get the disassembly or Ideal graphs for those methods. The assembler is very annoying to try and figure out, so I'm trying to get an OpenJDK debug build which can output nice graphs.
    – Jorn Vernee
    yesterday













up vote
84
down vote

favorite
18









up vote
84
down vote

favorite
18






18





The following Java program takes on average between 0.50s and 0.55s to run:



public static void main(String args) {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += 2 * (i * i);
}
System.out.println((double) (System.nanoTime() - startTime) / 1000000000 + " s");
System.out.println("n = " + n);
}


If I replace 2 * (i * i) with 2 * i * i, it takes between 0.60 and 0.65s to run. How come?



Edit: I ran each version of the program 15 times, alternating between the two. Here are the results:



2 * (i * i): 0.5183738, 0.5298337, 0.5308647, 0.5133458, 0.5003011, 0.5366181, 0.515149, 0.5237389, 0.5249942, 0.5641624, 0.538412, 0.5466744, 0.531159, 0.5048032, 0.5232789



2 * i * i: 0.6246434, 0.6049722, 0.6603363, 0.6243328, 0.6541802, 0.6312638, 0.6241105, 0.627815, 0.6114252, 0.6781033, 0.6393969, 0.6608845, 0.6201077, 0.6511559, 0.6544526



The fastest run of 2 * i * i took longer than the slowest run of 2 * (i * i). If they were both as efficient, the probability of this happening would be less than 1/2^15 = 0.00305%.









share















The following Java program takes on average between 0.50s and 0.55s to run:



public static void main(String args) {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += 2 * (i * i);
}
System.out.println((double) (System.nanoTime() - startTime) / 1000000000 + " s");
System.out.println("n = " + n);
}


If I replace 2 * (i * i) with 2 * i * i, it takes between 0.60 and 0.65s to run. How come?



Edit: I ran each version of the program 15 times, alternating between the two. Here are the results:



2 * (i * i): 0.5183738, 0.5298337, 0.5308647, 0.5133458, 0.5003011, 0.5366181, 0.515149, 0.5237389, 0.5249942, 0.5641624, 0.538412, 0.5466744, 0.531159, 0.5048032, 0.5232789



2 * i * i: 0.6246434, 0.6049722, 0.6603363, 0.6243328, 0.6541802, 0.6312638, 0.6241105, 0.627815, 0.6114252, 0.6781033, 0.6393969, 0.6608845, 0.6201077, 0.6511559, 0.6544526



The fastest run of 2 * i * i took longer than the slowest run of 2 * (i * i). If they were both as efficient, the probability of this happening would be less than 1/2^15 = 0.00305%.







java optimization





share














share












share



share








edited yesterday

























asked yesterday









Stefan

52727




52727








  • 10




    How sure are you that your measurements are done correctly ? How many times did you run each version and how static was your test environment?
    – Joakim Danielson
    yesterday








  • 3




    I ran each version around 10 times, and the 2 * i * i version always takes slightly longer.
    – Stefan
    yesterday






  • 6




    look at the bytecode using a dissassmber.
    – OldProgrammer
    yesterday






  • 5




    Also please see: stackoverflow.com/questions/504103/…
    – lexicore
    yesterday








  • 4




    @nullpointer To find out for real why one is faster than the other, we'd have to get the disassembly or Ideal graphs for those methods. The assembler is very annoying to try and figure out, so I'm trying to get an OpenJDK debug build which can output nice graphs.
    – Jorn Vernee
    yesterday














  • 10




    How sure are you that your measurements are done correctly ? How many times did you run each version and how static was your test environment?
    – Joakim Danielson
    yesterday








  • 3




    I ran each version around 10 times, and the 2 * i * i version always takes slightly longer.
    – Stefan
    yesterday






  • 6




    look at the bytecode using a dissassmber.
    – OldProgrammer
    yesterday






  • 5




    Also please see: stackoverflow.com/questions/504103/…
    – lexicore
    yesterday








  • 4




    @nullpointer To find out for real why one is faster than the other, we'd have to get the disassembly or Ideal graphs for those methods. The assembler is very annoying to try and figure out, so I'm trying to get an OpenJDK debug build which can output nice graphs.
    – Jorn Vernee
    yesterday








10




10




How sure are you that your measurements are done correctly ? How many times did you run each version and how static was your test environment?
– Joakim Danielson
yesterday






How sure are you that your measurements are done correctly ? How many times did you run each version and how static was your test environment?
– Joakim Danielson
yesterday






3




3




I ran each version around 10 times, and the 2 * i * i version always takes slightly longer.
– Stefan
yesterday




I ran each version around 10 times, and the 2 * i * i version always takes slightly longer.
– Stefan
yesterday




6




6




look at the bytecode using a dissassmber.
– OldProgrammer
yesterday




look at the bytecode using a dissassmber.
– OldProgrammer
yesterday




5




5




Also please see: stackoverflow.com/questions/504103/…
– lexicore
yesterday






Also please see: stackoverflow.com/questions/504103/…
– lexicore
yesterday






4




4




@nullpointer To find out for real why one is faster than the other, we'd have to get the disassembly or Ideal graphs for those methods. The assembler is very annoying to try and figure out, so I'm trying to get an OpenJDK debug build which can output nice graphs.
– Jorn Vernee
yesterday




@nullpointer To find out for real why one is faster than the other, we'd have to get the disassembly or Ideal graphs for those methods. The assembler is very annoying to try and figure out, so I'm trying to get an OpenJDK debug build which can output nice graphs.
– Jorn Vernee
yesterday












5 Answers
5






active

oldest

votes

















up vote
104
down vote



accepted










There is a slight difference in the ordering of the bytecode



2 * (i * i):



     iconst_2
iload0
iload0
imul
imul
iadd


vs 2 * i * i:



     iconst_2
iload0
imul
iload0
imul
iadd


At first sight this should not make a difference; if anything the second version is more optimal since it uses one slot less.



So we need to dig deeper into the lower level (JIT).



Remember that JIT tends to unroll small loops very aggressively. Indeed we observe a 16x unrolling:



030   B2: # B2 B3 <- B1 B2  Loop: B2-B2 inner main of N18 Freq: 1e+006
030 addl R11, RBP # int
033 movl RBP, R13 # spill
036 addl RBP, #14 # int
039 imull RBP, RBP # int
03c movl R9, R13 # spill
03f addl R9, #13 # int
043 imull R9, R9 # int
047 sall RBP, #1
049 sall R9, #1
04c movl R8, R13 # spill
04f addl R8, #15 # int
053 movl R10, R8 # spill
056 movdl XMM1, R8 # spill
05b imull R10, R8 # int
05f movl R8, R13 # spill
062 addl R8, #12 # int
066 imull R8, R8 # int
06a sall R10, #1
06d movl [rsp + #32], R10 # spill
072 sall R8, #1
075 movl RBX, R13 # spill
078 addl RBX, #11 # int
07b imull RBX, RBX # int
07e movl RCX, R13 # spill
081 addl RCX, #10 # int
084 imull RCX, RCX # int
087 sall RBX, #1
089 sall RCX, #1
08b movl RDX, R13 # spill
08e addl RDX, #8 # int
091 imull RDX, RDX # int
094 movl RDI, R13 # spill
097 addl RDI, #7 # int
09a imull RDI, RDI # int
09d sall RDX, #1
09f sall RDI, #1
0a1 movl RAX, R13 # spill
0a4 addl RAX, #6 # int
0a7 imull RAX, RAX # int
0aa movl RSI, R13 # spill
0ad addl RSI, #4 # int
0b0 imull RSI, RSI # int
0b3 sall RAX, #1
0b5 sall RSI, #1
0b7 movl R10, R13 # spill
0ba addl R10, #2 # int
0be imull R10, R10 # int
0c2 movl R14, R13 # spill
0c5 incl R14 # int
0c8 imull R14, R14 # int
0cc sall R10, #1
0cf sall R14, #1
0d2 addl R14, R11 # int
0d5 addl R14, R10 # int
0d8 movl R10, R13 # spill
0db addl R10, #3 # int
0df imull R10, R10 # int
0e3 movl R11, R13 # spill
0e6 addl R11, #5 # int
0ea imull R11, R11 # int
0ee sall R10, #1
0f1 addl R10, R14 # int
0f4 addl R10, RSI # int
0f7 sall R11, #1
0fa addl R11, R10 # int
0fd addl R11, RAX # int
100 addl R11, RDI # int
103 addl R11, RDX # int
106 movl R10, R13 # spill
109 addl R10, #9 # int
10d imull R10, R10 # int
111 sall R10, #1
114 addl R10, R11 # int
117 addl R10, RCX # int
11a addl R10, RBX # int
11d addl R10, R8 # int
120 addl R9, R10 # int
123 addl RBP, R9 # int
126 addl RBP, [RSP + #32 (32-bit)] # int
12a addl R13, #16 # int
12e movl R11, R13 # spill
131 imull R11, R13 # int
135 sall R11, #1
138 cmpl R13, #999999985
13f jl B2 # loop end P=1.000000 C=6554623.000000


We observe already that there is 1 register that is "spilled" onto the stack.



And for the 2 * i * i version:



05a   B3: # B2 B4 <- B1 B2  Loop: B3-B2 inner main of N18 Freq: 1e+006
05a addl RBX, R11 # int
05d movl [rsp + #32], RBX # spill
061 movl R11, R8 # spill
064 addl R11, #15 # int
068 movl [rsp + #36], R11 # spill
06d movl R11, R8 # spill
070 addl R11, #14 # int
074 movl R10, R9 # spill
077 addl R10, #16 # int
07b movdl XMM2, R10 # spill
080 movl RCX, R9 # spill
083 addl RCX, #14 # int
086 movdl XMM1, RCX # spill
08a movl R10, R9 # spill
08d addl R10, #12 # int
091 movdl XMM4, R10 # spill
096 movl RCX, R9 # spill
099 addl RCX, #10 # int
09c movdl XMM6, RCX # spill
0a0 movl RBX, R9 # spill
0a3 addl RBX, #8 # int
0a6 movl RCX, R9 # spill
0a9 addl RCX, #6 # int
0ac movl RDX, R9 # spill
0af addl RDX, #4 # int
0b2 addl R9, #2 # int
0b6 movl R10, R14 # spill
0b9 addl R10, #22 # int
0bd movdl XMM3, R10 # spill
0c2 movl RDI, R14 # spill
0c5 addl RDI, #20 # int
0c8 movl RAX, R14 # spill
0cb addl RAX, #32 # int
0ce movl RSI, R14 # spill
0d1 addl RSI, #18 # int
0d4 movl R13, R14 # spill
0d7 addl R13, #24 # int
0db movl R10, R14 # spill
0de addl R10, #26 # int
0e2 movl [rsp + #40], R10 # spill
0e7 movl RBP, R14 # spill
0ea addl RBP, #28 # int
0ed imull RBP, R11 # int
0f1 addl R14, #30 # int
0f5 imull R14, [RSP + #36 (32-bit)] # int
0fb movl R10, R8 # spill
0fe addl R10, #11 # int
102 movdl R11, XMM3 # spill
107 imull R11, R10 # int
10b movl [rsp + #44], R11 # spill
110 movl R10, R8 # spill
113 addl R10, #10 # int
117 imull RDI, R10 # int
11b movl R11, R8 # spill
11e addl R11, #8 # int
122 movdl R10, XMM2 # spill
127 imull R10, R11 # int
12b movl [rsp + #48], R10 # spill
130 movl R10, R8 # spill
133 addl R10, #7 # int
137 movdl R11, XMM1 # spill
13c imull R11, R10 # int
140 movl [rsp + #52], R11 # spill
145 movl R11, R8 # spill
148 addl R11, #6 # int
14c movdl R10, XMM4 # spill
151 imull R10, R11 # int
155 movl [rsp + #56], R10 # spill
15a movl R10, R8 # spill
15d addl R10, #5 # int
161 movdl R11, XMM6 # spill
166 imull R11, R10 # int
16a movl [rsp + #60], R11 # spill
16f movl R11, R8 # spill
172 addl R11, #4 # int
176 imull RBX, R11 # int
17a movl R11, R8 # spill
17d addl R11, #3 # int
181 imull RCX, R11 # int
185 movl R10, R8 # spill
188 addl R10, #2 # int
18c imull RDX, R10 # int
190 movl R11, R8 # spill
193 incl R11 # int
196 imull R9, R11 # int
19a addl R9, [RSP + #32 (32-bit)] # int
19f addl R9, RDX # int
1a2 addl R9, RCX # int
1a5 addl R9, RBX # int
1a8 addl R9, [RSP + #60 (32-bit)] # int
1ad addl R9, [RSP + #56 (32-bit)] # int
1b2 addl R9, [RSP + #52 (32-bit)] # int
1b7 addl R9, [RSP + #48 (32-bit)] # int
1bc movl R10, R8 # spill
1bf addl R10, #9 # int
1c3 imull R10, RSI # int
1c7 addl R10, R9 # int
1ca addl R10, RDI # int
1cd addl R10, [RSP + #44 (32-bit)] # int
1d2 movl R11, R8 # spill
1d5 addl R11, #12 # int
1d9 imull R13, R11 # int
1dd addl R13, R10 # int
1e0 movl R10, R8 # spill
1e3 addl R10, #13 # int
1e7 imull R10, [RSP + #40 (32-bit)] # int
1ed addl R10, R13 # int
1f0 addl RBP, R10 # int
1f3 addl R14, RBP # int
1f6 movl R10, R8 # spill
1f9 addl R10, #16 # int
1fd cmpl R10, #999999985
204 jl B2 # loop end P=1.000000 C=7419903.000000


Here we observe much more "spilling" and more accesses to the stack [RSP + ...], due to more intermediate results that need to be preserved. But of course it is obvious that neither the first nor the second version is any good; the loop could literally be coded in 6 instructions.



So it's an issue of the optimizer; as is often the case, it unrolls too aggressively and shoots itself in the foot.



In fact, modern x86 CPU's break down these instructions further into micro-ops (µops) and with features like register renaming, µop caches and loop buffers the loop unrolling idea is now most probably counter-productive. According to Agner Fog's optimization guide:




The gain in performance due to the µop cache can be quite
considerable if the average instruction length is more than 4 bytes.
The following methods of optimizing the use of the µop cache may
be considered:




  • Make sure that critical loops are small enough to fit into the µop cache.

  • Align the most critical loop entries and function entries by 32.

  • Avoid unnecessary loop unrolling.

  • Avoid instructions that have extra load time

    . . .




Regarding those load times - even the fastest L1D hit costs 4 cycles, so yes, even a few accesses to memory will hurt performance in tight loops.



To see how fast it can be, we can compile a similar C application with GCC, which produces simply:



  xor edx, edx
xor eax, eax
.L2:
mov ecx, edx
imul ecx, edx
add edx, 1
lea eax, [rax+rcx*2]
cmp edx, 1000000000
jne .L2


With a run time of... 0.08 s, or 5 times faster.






share|improve this answer






























    up vote
    30
    down vote













    When the multiplication is 2 * (i * i), the JVM is able to factor out the multiplication by 2 from the loop, resulting in this equivalent but more efficient code:



    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
    n += i * i;
    }
    n *= 2;


    but when the multiplication is (2 * i) * i, the JVM doesn't optimize it since the multiplication by a constant is no longer right before the addition.



    Here are a few reasons why I think this is the case:




    • Adding an if (n == 0) n = 1 statement at the start of the loop results in both versions being as efficient, since factoring out the multiplication no longer guarantees that the result will be the same

    • The optimized version (by factoring out the multiplication by 2) is exactly as fast as the 2 * (i * i) version


    Here is the test code that I used to draw these conclusions:



    public static void main(String args) {
    long fastVersion = 0;
    long slowVersion = 0;
    long optimizedVersion = 0;
    long modifiedFastVersion = 0;
    long modifiedSlowVersion = 0;

    for (int i = 0; i < 10; i++) {
    fastVersion += fastVersion();
    slowVersion += slowVersion();
    optimizedVersion += optimizedVersion();
    modifiedFastVersion += modifiedFastVersion();
    modifiedSlowVersion += modifiedSlowVersion();
    }

    System.out.println("Fast version: " + (double) fastVersion / 1000000000 + " s");
    System.out.println("Slow version: " + (double) slowVersion / 1000000000 + " s");
    System.out.println("Optimized version: " + (double) optimizedVersion / 1000000000 + " s");
    System.out.println("Modified fast version: " + (double) modifiedFastVersion / 1000000000 + " s");
    System.out.println("Modified slow version: " + (double) modifiedSlowVersion / 1000000000 + " s");
    }

    private static long fastVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
    n += 2 * (i * i);
    }
    return System.nanoTime() - startTime;
    }

    private static long slowVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
    n += 2 * i * i;
    }
    return System.nanoTime() - startTime;
    }

    private static long optimizedVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
    n += i * i;
    }
    n *= 2;
    return System.nanoTime() - startTime;
    }

    private static long modifiedFastVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
    if (n == 0) n = 1;
    n += 2 * (i * i);
    }
    return System.nanoTime() - startTime;
    }

    private static long modifiedSlowVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
    if (n == 0) n = 1;
    n += 2 * i * i;
    }
    return System.nanoTime() - startTime;
    }


    And here are the results:



    Fast version: 5.7274411 s
    Slow version: 7.6190804 s
    Optimized version: 5.1348007 s
    Modified fast version: 7.1492705 s
    Modified slow version: 7.2952668 s





    share|improve this answer





















    • here is a benchmark: github.com/jawb-software/stackoverflow-53452713
      – dit
      yesterday










    • I think on the optimizedVersion, it should be n *= 2000000000;
      – StefansArya
      yesterday






    • 4




      @StefansArya Incorrect.. 2(1) + 2(4) + 2(9) + 2(16) + ... = 2(1 + 4 + 9 + 16 + ...)
      – Arth
      16 hours ago


















    up vote
    9
    down vote













    ByteCodes: https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html

    ByteCodes Viewer: https://github.com/Konloch/bytecode-viewer



    On my JDK (Win10 64 1.8.0_65-b17) I can reproduce and explain:



    public static void main(String args) {
    int repeat = 10;
    long A = 0;
    long B = 0;
    for (int i = 0; i < repeat; i++) {
    A += test();
    B += testB();
    }

    System.out.println(A / repeat + " ms");
    System.out.println(B / repeat + " ms");
    }


    private static long test() {
    int n = 0;
    for (int i = 0; i < 1000; i++) {
    n += multi(i);
    }
    long startTime = System.currentTimeMillis();
    for (int i = 0; i < 1000000000; i++) {
    n += multi(i);
    }
    long ms = (System.currentTimeMillis() - startTime);
    System.out.println(ms + " ms A " + n);
    return ms;
    }


    private static long testB() {
    int n = 0;
    for (int i = 0; i < 1000; i++) {
    n += multiB(i);
    }
    long startTime = System.currentTimeMillis();
    for (int i = 0; i < 1000000000; i++) {
    n += multiB(i);
    }
    long ms = (System.currentTimeMillis() - startTime);
    System.out.println(ms + " ms B " + n);
    return ms;
    }

    private static int multiB(int i) {
    return 2 * (i * i);
    }

    private static int multi(int i) {
    return 2 * i * i;
    }


    Output:



    ...
    405 ms A 785527736
    327 ms B 785527736
    404 ms A 785527736
    329 ms B 785527736
    404 ms A 785527736
    328 ms B 785527736
    404 ms A 785527736
    328 ms B 785527736
    410 ms
    333 ms


    So why?
    The Byte code is this:



     private static multiB(int arg0) { // 2 * (i * i)
    <localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>

    L1 {
    iconst_2
    iload0
    iload0
    imul
    imul
    ireturn
    }
    L2 {
    }
    }

    private static multi(int arg0) { // 2 * i * i
    <localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>

    L1 {
    iconst_2
    iload0
    imul
    iload0
    imul
    ireturn
    }
    L2 {
    }
    }


    The difference being:

    With brackets (2 * (i * i)):




    • push const stack

    • push local on stack

    • push local on stack

    • multiply top of stack

    • multiply top of stack


    Without brackets (2 * i * i):




    • push const stack

    • push local on stack

    • multiply top of stack

    • push local on stack

    • multiply top of stack


    Loading all on stack and then working back down is faster than switching between putting on stack and operating on it.






    share|improve this answer






























      up vote
      4
      down vote













      I tried a JMH using the default archetype: I also added optimized version based Runemoro' explanation .



      @State(Scope.Benchmark)
      @Warmup(iterations = 2)
      @Fork(1)
      @Measurement(iterations = 10)
      @OutputTimeUnit(TimeUnit.NANOSECONDS)
      //@BenchmarkMode({ Mode.All })
      @BenchmarkMode(Mode.AverageTime)
      public class MyBenchmark {
      @Param({ "100", "1000", "1000000000" })
      private int size;

      @Benchmark
      public int two_square_i() {
      int n = 0;
      for (int i = 0; i < size; i++) {
      n += 2 * (i * i);
      }
      return n;
      }

      @Benchmark
      public int square_i_two() {
      int n = 0;
      for (int i = 0; i < size; i++) {
      n += i * i;
      }
      return 2*n;
      }

      @Benchmark
      public int two_i_() {
      int n = 0;
      for (int i = 0; i < size; i++) {
      n += 2 * i * i;
      }
      return n;
      }
      }


      The result are here:



      Benchmark                           (size)  Mode  Samples          Score   Score error  Units
      o.s.MyBenchmark.square_i_two 100 avgt 10 58,062 1,410 ns/op
      o.s.MyBenchmark.square_i_two 1000 avgt 10 547,393 12,851 ns/op
      o.s.MyBenchmark.square_i_two 1000000000 avgt 10 540343681,267 16795210,324 ns/op
      o.s.MyBenchmark.two_i_ 100 avgt 10 87,491 2,004 ns/op
      o.s.MyBenchmark.two_i_ 1000 avgt 10 1015,388 30,313 ns/op
      o.s.MyBenchmark.two_i_ 1000000000 avgt 10 967100076,600 24929570,556 ns/op
      o.s.MyBenchmark.two_square_i 100 avgt 10 70,715 2,107 ns/op
      o.s.MyBenchmark.two_square_i 1000 avgt 10 686,977 24,613 ns/op
      o.s.MyBenchmark.two_square_i 1000000000 avgt 10 652736811,450 27015580,488 ns/op


      On my PC (Core i7 860, doing nothing much apart reading on my smartphone):





      • n += i*i then n*2 is first


      • 2 * (i * i) is second.


      The JVM is clearly not optimizing the same way than a human does (based on Runemoro answer).



      Now then, reading bytecode: javap -c -v ./target/classes/org/sample/MyBenchmark.class




      • Differences between 2*(i*i) (left) and 2*i*i (right) here: https://www.diffchecker.com/cvSFppWI

      • Differences between 2*(i*i) and the optimized version here: https://www.diffchecker.com/I1XFu5dP


      I am not expert on bytecode but we iload_2 before we imul: that's probably where you get the difference: I can suppose that the JVM optimize reading i twice (i is already here, there is no need to load it again) whilst in the 2*i*i it can't.






      share|improve this answer




























        up vote
        2
        down vote













        I got similar results:



        2 * (i * i): 0.458765943 s, n=119860736
        2 * i * i: 0.580255126 s, n=119860736


        I got the SAME results if both loops were in the same program, or each was in a separate .java file/.class, executed on a separate run.



        Finally, here is a javap -c -v <.java> decompile of each:



             3: ldc           #3                  // String 2 * (i * i):
        5: invokevirtual #4 // Method java/io/PrintStream.print:(Ljava/lang/String;)V
        8: invokestatic #5 // Method java/lang/System.nanoTime:()J
        8: invokestatic #5 // Method java/lang/System.nanoTime:()J
        11: lstore_1
        12: iconst_0
        13: istore_3
        14: iconst_0
        15: istore 4
        17: iload 4
        19: ldc #6 // int 1000000000
        21: if_icmpge 40
        24: iload_3
        25: iconst_2
        26: iload 4
        28: iload 4
        30: imul
        31: imul
        32: iadd
        33: istore_3
        34: iinc 4, 1
        37: goto 17


        vs.



             3: ldc           #3                  // String 2 * i * i:
        5: invokevirtual #4 // Method java/io/PrintStream.print:(Ljava/lang/String;)V
        8: invokestatic #5 // Method java/lang/System.nanoTime:()J
        11: lstore_1
        12: iconst_0
        13: istore_3
        14: iconst_0
        15: istore 4
        17: iload 4
        19: ldc #6 // int 1000000000
        21: if_icmpge 40
        24: iload_3
        25: iconst_2
        26: iload 4
        28: imul
        29: iload 4
        31: imul
        32: iadd
        33: istore_3
        34: iinc 4, 1
        37: goto 17


        FYI -



        java -version
        java version "1.8.0_121"
        Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
        Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)





        share|improve this answer

















        • 1




          A better answer and maybe you can vote to undelete - stackoverflow.com/a/53452836/1746118 ... Side note - I am not the downvoter anyway.
          – nullpointer
          yesterday












        • @nullpointer - I agree. I'd definitely vote to undelete, if I could. I'd also like to "double upvote" stefan for giving a quantitative definition of "significant"
          – paulsm4
          yesterday












        • That one was self-deleted since it measured the wrong thing - see that author's comment on the question above
          – Krease
          yesterday






        • 1




          Get a debug jre and run with -XX:+PrintOptoAssembly. Or just use vtune or alike.
          – rustyx
          yesterday






        • 1




          @ rustyx - If the problem is the JIT implementation ... then "getting a debug version" OF A COMPLETELY DIFFERENT JRE isn't necessarily going to help. Nevertheless: it sounds like what you found above with your JIT disassembly on your JRE also explains the behavior on the OP's JRE and mine. And also explains why other JRE's behave "differently". +1: thank you for the excellent detective work!
          – paulsm4
          22 hours ago













        Your Answer






        StackExchange.ifUsing("editor", function () {
        StackExchange.using("externalEditor", function () {
        StackExchange.using("snippets", function () {
        StackExchange.snippets.init();
        });
        });
        }, "code-snippets");

        StackExchange.ready(function() {
        var channelOptions = {
        tags: "".split(" "),
        id: "1"
        };
        initTagRenderer("".split(" "), "".split(" "), channelOptions);

        StackExchange.using("externalEditor", function() {
        // Have to fire editor after snippets, if snippets enabled
        if (StackExchange.settings.snippets.snippetsEnabled) {
        StackExchange.using("snippets", function() {
        createEditor();
        });
        }
        else {
        createEditor();
        }
        });

        function createEditor() {
        StackExchange.prepareEditor({
        heartbeatType: 'answer',
        convertImagesToLinks: true,
        noModals: true,
        showLowRepImageUploadWarning: true,
        reputationToPostImages: 10,
        bindNavPrevention: true,
        postfix: "",
        imageUploader: {
        brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
        contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
        allowUrls: true
        },
        onDemand: true,
        discardSelector: ".discard-answer"
        ,immediatelyShowMarkdownHelp:true
        });


        }
        });














         

        draft saved


        draft discarded


















        StackExchange.ready(
        function () {
        StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53452713%2fwhy-is-2-i-i-faster-than-2-i-i%23new-answer', 'question_page');
        }
        );

        Post as a guest















        Required, but never shown

























        5 Answers
        5






        active

        oldest

        votes








        5 Answers
        5






        active

        oldest

        votes









        active

        oldest

        votes






        active

        oldest

        votes








        up vote
        104
        down vote



        accepted










        There is a slight difference in the ordering of the bytecode



        2 * (i * i):



             iconst_2
        iload0
        iload0
        imul
        imul
        iadd


        vs 2 * i * i:



             iconst_2
        iload0
        imul
        iload0
        imul
        iadd


        At first sight this should not make a difference; if anything the second version is more optimal since it uses one slot less.



        So we need to dig deeper into the lower level (JIT).



        Remember that JIT tends to unroll small loops very aggressively. Indeed we observe a 16x unrolling:



        030   B2: # B2 B3 <- B1 B2  Loop: B2-B2 inner main of N18 Freq: 1e+006
        030 addl R11, RBP # int
        033 movl RBP, R13 # spill
        036 addl RBP, #14 # int
        039 imull RBP, RBP # int
        03c movl R9, R13 # spill
        03f addl R9, #13 # int
        043 imull R9, R9 # int
        047 sall RBP, #1
        049 sall R9, #1
        04c movl R8, R13 # spill
        04f addl R8, #15 # int
        053 movl R10, R8 # spill
        056 movdl XMM1, R8 # spill
        05b imull R10, R8 # int
        05f movl R8, R13 # spill
        062 addl R8, #12 # int
        066 imull R8, R8 # int
        06a sall R10, #1
        06d movl [rsp + #32], R10 # spill
        072 sall R8, #1
        075 movl RBX, R13 # spill
        078 addl RBX, #11 # int
        07b imull RBX, RBX # int
        07e movl RCX, R13 # spill
        081 addl RCX, #10 # int
        084 imull RCX, RCX # int
        087 sall RBX, #1
        089 sall RCX, #1
        08b movl RDX, R13 # spill
        08e addl RDX, #8 # int
        091 imull RDX, RDX # int
        094 movl RDI, R13 # spill
        097 addl RDI, #7 # int
        09a imull RDI, RDI # int
        09d sall RDX, #1
        09f sall RDI, #1
        0a1 movl RAX, R13 # spill
        0a4 addl RAX, #6 # int
        0a7 imull RAX, RAX # int
        0aa movl RSI, R13 # spill
        0ad addl RSI, #4 # int
        0b0 imull RSI, RSI # int
        0b3 sall RAX, #1
        0b5 sall RSI, #1
        0b7 movl R10, R13 # spill
        0ba addl R10, #2 # int
        0be imull R10, R10 # int
        0c2 movl R14, R13 # spill
        0c5 incl R14 # int
        0c8 imull R14, R14 # int
        0cc sall R10, #1
        0cf sall R14, #1
        0d2 addl R14, R11 # int
        0d5 addl R14, R10 # int
        0d8 movl R10, R13 # spill
        0db addl R10, #3 # int
        0df imull R10, R10 # int
        0e3 movl R11, R13 # spill
        0e6 addl R11, #5 # int
        0ea imull R11, R11 # int
        0ee sall R10, #1
        0f1 addl R10, R14 # int
        0f4 addl R10, RSI # int
        0f7 sall R11, #1
        0fa addl R11, R10 # int
        0fd addl R11, RAX # int
        100 addl R11, RDI # int
        103 addl R11, RDX # int
        106 movl R10, R13 # spill
        109 addl R10, #9 # int
        10d imull R10, R10 # int
        111 sall R10, #1
        114 addl R10, R11 # int
        117 addl R10, RCX # int
        11a addl R10, RBX # int
        11d addl R10, R8 # int
        120 addl R9, R10 # int
        123 addl RBP, R9 # int
        126 addl RBP, [RSP + #32 (32-bit)] # int
        12a addl R13, #16 # int
        12e movl R11, R13 # spill
        131 imull R11, R13 # int
        135 sall R11, #1
        138 cmpl R13, #999999985
        13f jl B2 # loop end P=1.000000 C=6554623.000000


        We observe already that there is 1 register that is "spilled" onto the stack.



        And for the 2 * i * i version:



        05a   B3: # B2 B4 <- B1 B2  Loop: B3-B2 inner main of N18 Freq: 1e+006
        05a addl RBX, R11 # int
        05d movl [rsp + #32], RBX # spill
        061 movl R11, R8 # spill
        064 addl R11, #15 # int
        068 movl [rsp + #36], R11 # spill
        06d movl R11, R8 # spill
        070 addl R11, #14 # int
        074 movl R10, R9 # spill
        077 addl R10, #16 # int
        07b movdl XMM2, R10 # spill
        080 movl RCX, R9 # spill
        083 addl RCX, #14 # int
        086 movdl XMM1, RCX # spill
        08a movl R10, R9 # spill
        08d addl R10, #12 # int
        091 movdl XMM4, R10 # spill
        096 movl RCX, R9 # spill
        099 addl RCX, #10 # int
        09c movdl XMM6, RCX # spill
        0a0 movl RBX, R9 # spill
        0a3 addl RBX, #8 # int
        0a6 movl RCX, R9 # spill
        0a9 addl RCX, #6 # int
        0ac movl RDX, R9 # spill
        0af addl RDX, #4 # int
        0b2 addl R9, #2 # int
        0b6 movl R10, R14 # spill
        0b9 addl R10, #22 # int
        0bd movdl XMM3, R10 # spill
        0c2 movl RDI, R14 # spill
        0c5 addl RDI, #20 # int
        0c8 movl RAX, R14 # spill
        0cb addl RAX, #32 # int
        0ce movl RSI, R14 # spill
        0d1 addl RSI, #18 # int
        0d4 movl R13, R14 # spill
        0d7 addl R13, #24 # int
        0db movl R10, R14 # spill
        0de addl R10, #26 # int
        0e2 movl [rsp + #40], R10 # spill
        0e7 movl RBP, R14 # spill
        0ea addl RBP, #28 # int
        0ed imull RBP, R11 # int
        0f1 addl R14, #30 # int
        0f5 imull R14, [RSP + #36 (32-bit)] # int
        0fb movl R10, R8 # spill
        0fe addl R10, #11 # int
        102 movdl R11, XMM3 # spill
        107 imull R11, R10 # int
        10b movl [rsp + #44], R11 # spill
        110 movl R10, R8 # spill
        113 addl R10, #10 # int
        117 imull RDI, R10 # int
        11b movl R11, R8 # spill
        11e addl R11, #8 # int
        122 movdl R10, XMM2 # spill
        127 imull R10, R11 # int
        12b movl [rsp + #48], R10 # spill
        130 movl R10, R8 # spill
        133 addl R10, #7 # int
        137 movdl R11, XMM1 # spill
        13c imull R11, R10 # int
        140 movl [rsp + #52], R11 # spill
        145 movl R11, R8 # spill
        148 addl R11, #6 # int
        14c movdl R10, XMM4 # spill
        151 imull R10, R11 # int
        155 movl [rsp + #56], R10 # spill
        15a movl R10, R8 # spill
        15d addl R10, #5 # int
        161 movdl R11, XMM6 # spill
        166 imull R11, R10 # int
        16a movl [rsp + #60], R11 # spill
        16f movl R11, R8 # spill
        172 addl R11, #4 # int
        176 imull RBX, R11 # int
        17a movl R11, R8 # spill
        17d addl R11, #3 # int
        181 imull RCX, R11 # int
        185 movl R10, R8 # spill
        188 addl R10, #2 # int
        18c imull RDX, R10 # int
        190 movl R11, R8 # spill
        193 incl R11 # int
        196 imull R9, R11 # int
        19a addl R9, [RSP + #32 (32-bit)] # int
        19f addl R9, RDX # int
        1a2 addl R9, RCX # int
        1a5 addl R9, RBX # int
        1a8 addl R9, [RSP + #60 (32-bit)] # int
        1ad addl R9, [RSP + #56 (32-bit)] # int
        1b2 addl R9, [RSP + #52 (32-bit)] # int
        1b7 addl R9, [RSP + #48 (32-bit)] # int
        1bc movl R10, R8 # spill
        1bf addl R10, #9 # int
        1c3 imull R10, RSI # int
        1c7 addl R10, R9 # int
        1ca addl R10, RDI # int
        1cd addl R10, [RSP + #44 (32-bit)] # int
        1d2 movl R11, R8 # spill
        1d5 addl R11, #12 # int
        1d9 imull R13, R11 # int
        1dd addl R13, R10 # int
        1e0 movl R10, R8 # spill
        1e3 addl R10, #13 # int
        1e7 imull R10, [RSP + #40 (32-bit)] # int
        1ed addl R10, R13 # int
        1f0 addl RBP, R10 # int
        1f3 addl R14, RBP # int
        1f6 movl R10, R8 # spill
        1f9 addl R10, #16 # int
        1fd cmpl R10, #999999985
        204 jl B2 # loop end P=1.000000 C=7419903.000000


        Here we observe much more "spilling" and more accesses to the stack [RSP + ...], due to more intermediate results that need to be preserved. But of course it is obvious that neither the first nor the second version is any good; the loop could literally be coded in 6 instructions.



        So it's an issue of the optimizer; as is often the case, it unrolls too aggressively and shoots itself in the foot.



        In fact, modern x86 CPU's break down these instructions further into micro-ops (µops) and with features like register renaming, µop caches and loop buffers the loop unrolling idea is now most probably counter-productive. According to Agner Fog's optimization guide:




        The gain in performance due to the µop cache can be quite
        considerable if the average instruction length is more than 4 bytes.
        The following methods of optimizing the use of the µop cache may
        be considered:




        • Make sure that critical loops are small enough to fit into the µop cache.

        • Align the most critical loop entries and function entries by 32.

        • Avoid unnecessary loop unrolling.

        • Avoid instructions that have extra load time

          . . .




        Regarding those load times - even the fastest L1D hit costs 4 cycles, so yes, even a few accesses to memory will hurt performance in tight loops.



        To see how fast it can be, we can compile a similar C application with GCC, which produces simply:



          xor edx, edx
        xor eax, eax
        .L2:
        mov ecx, edx
        imul ecx, edx
        add edx, 1
        lea eax, [rax+rcx*2]
        cmp edx, 1000000000
        jne .L2


        With a run time of... 0.08 s, or 5 times faster.






        share|improve this answer



























          up vote
          104
          down vote



          accepted










          There is a slight difference in the ordering of the bytecode



          2 * (i * i):



               iconst_2
          iload0
          iload0
          imul
          imul
          iadd


          vs 2 * i * i:



               iconst_2
          iload0
          imul
          iload0
          imul
          iadd


          At first sight this should not make a difference; if anything the second version is more optimal since it uses one slot less.



          So we need to dig deeper into the lower level (JIT).



          Remember that JIT tends to unroll small loops very aggressively. Indeed we observe a 16x unrolling:



          030   B2: # B2 B3 <- B1 B2  Loop: B2-B2 inner main of N18 Freq: 1e+006
          030 addl R11, RBP # int
          033 movl RBP, R13 # spill
          036 addl RBP, #14 # int
          039 imull RBP, RBP # int
          03c movl R9, R13 # spill
          03f addl R9, #13 # int
          043 imull R9, R9 # int
          047 sall RBP, #1
          049 sall R9, #1
          04c movl R8, R13 # spill
          04f addl R8, #15 # int
          053 movl R10, R8 # spill
          056 movdl XMM1, R8 # spill
          05b imull R10, R8 # int
          05f movl R8, R13 # spill
          062 addl R8, #12 # int
          066 imull R8, R8 # int
          06a sall R10, #1
          06d movl [rsp + #32], R10 # spill
          072 sall R8, #1
          075 movl RBX, R13 # spill
          078 addl RBX, #11 # int
          07b imull RBX, RBX # int
          07e movl RCX, R13 # spill
          081 addl RCX, #10 # int
          084 imull RCX, RCX # int
          087 sall RBX, #1
          089 sall RCX, #1
          08b movl RDX, R13 # spill
          08e addl RDX, #8 # int
          091 imull RDX, RDX # int
          094 movl RDI, R13 # spill
          097 addl RDI, #7 # int
          09a imull RDI, RDI # int
          09d sall RDX, #1
          09f sall RDI, #1
          0a1 movl RAX, R13 # spill
          0a4 addl RAX, #6 # int
          0a7 imull RAX, RAX # int
          0aa movl RSI, R13 # spill
          0ad addl RSI, #4 # int
          0b0 imull RSI, RSI # int
          0b3 sall RAX, #1
          0b5 sall RSI, #1
          0b7 movl R10, R13 # spill
          0ba addl R10, #2 # int
          0be imull R10, R10 # int
          0c2 movl R14, R13 # spill
          0c5 incl R14 # int
          0c8 imull R14, R14 # int
          0cc sall R10, #1
          0cf sall R14, #1
          0d2 addl R14, R11 # int
          0d5 addl R14, R10 # int
          0d8 movl R10, R13 # spill
          0db addl R10, #3 # int
          0df imull R10, R10 # int
          0e3 movl R11, R13 # spill
          0e6 addl R11, #5 # int
          0ea imull R11, R11 # int
          0ee sall R10, #1
          0f1 addl R10, R14 # int
          0f4 addl R10, RSI # int
          0f7 sall R11, #1
          0fa addl R11, R10 # int
          0fd addl R11, RAX # int
          100 addl R11, RDI # int
          103 addl R11, RDX # int
          106 movl R10, R13 # spill
          109 addl R10, #9 # int
          10d imull R10, R10 # int
          111 sall R10, #1
          114 addl R10, R11 # int
          117 addl R10, RCX # int
          11a addl R10, RBX # int
          11d addl R10, R8 # int
          120 addl R9, R10 # int
          123 addl RBP, R9 # int
          126 addl RBP, [RSP + #32 (32-bit)] # int
          12a addl R13, #16 # int
          12e movl R11, R13 # spill
          131 imull R11, R13 # int
          135 sall R11, #1
          138 cmpl R13, #999999985
          13f jl B2 # loop end P=1.000000 C=6554623.000000


          We observe already that there is 1 register that is "spilled" onto the stack.



          And for the 2 * i * i version:



          05a   B3: # B2 B4 <- B1 B2  Loop: B3-B2 inner main of N18 Freq: 1e+006
          05a addl RBX, R11 # int
          05d movl [rsp + #32], RBX # spill
          061 movl R11, R8 # spill
          064 addl R11, #15 # int
          068 movl [rsp + #36], R11 # spill
          06d movl R11, R8 # spill
          070 addl R11, #14 # int
          074 movl R10, R9 # spill
          077 addl R10, #16 # int
          07b movdl XMM2, R10 # spill
          080 movl RCX, R9 # spill
          083 addl RCX, #14 # int
          086 movdl XMM1, RCX # spill
          08a movl R10, R9 # spill
          08d addl R10, #12 # int
          091 movdl XMM4, R10 # spill
          096 movl RCX, R9 # spill
          099 addl RCX, #10 # int
          09c movdl XMM6, RCX # spill
          0a0 movl RBX, R9 # spill
          0a3 addl RBX, #8 # int
          0a6 movl RCX, R9 # spill
          0a9 addl RCX, #6 # int
          0ac movl RDX, R9 # spill
          0af addl RDX, #4 # int
          0b2 addl R9, #2 # int
          0b6 movl R10, R14 # spill
          0b9 addl R10, #22 # int
          0bd movdl XMM3, R10 # spill
          0c2 movl RDI, R14 # spill
          0c5 addl RDI, #20 # int
          0c8 movl RAX, R14 # spill
          0cb addl RAX, #32 # int
          0ce movl RSI, R14 # spill
          0d1 addl RSI, #18 # int
          0d4 movl R13, R14 # spill
          0d7 addl R13, #24 # int
          0db movl R10, R14 # spill
          0de addl R10, #26 # int
          0e2 movl [rsp + #40], R10 # spill
          0e7 movl RBP, R14 # spill
          0ea addl RBP, #28 # int
          0ed imull RBP, R11 # int
          0f1 addl R14, #30 # int
          0f5 imull R14, [RSP + #36 (32-bit)] # int
          0fb movl R10, R8 # spill
          0fe addl R10, #11 # int
          102 movdl R11, XMM3 # spill
          107 imull R11, R10 # int
          10b movl [rsp + #44], R11 # spill
          110 movl R10, R8 # spill
          113 addl R10, #10 # int
          117 imull RDI, R10 # int
          11b movl R11, R8 # spill
          11e addl R11, #8 # int
          122 movdl R10, XMM2 # spill
          127 imull R10, R11 # int
          12b movl [rsp + #48], R10 # spill
          130 movl R10, R8 # spill
          133 addl R10, #7 # int
          137 movdl R11, XMM1 # spill
          13c imull R11, R10 # int
          140 movl [rsp + #52], R11 # spill
          145 movl R11, R8 # spill
          148 addl R11, #6 # int
          14c movdl R10, XMM4 # spill
          151 imull R10, R11 # int
          155 movl [rsp + #56], R10 # spill
          15a movl R10, R8 # spill
          15d addl R10, #5 # int
          161 movdl R11, XMM6 # spill
          166 imull R11, R10 # int
          16a movl [rsp + #60], R11 # spill
          16f movl R11, R8 # spill
          172 addl R11, #4 # int
          176 imull RBX, R11 # int
          17a movl R11, R8 # spill
          17d addl R11, #3 # int
          181 imull RCX, R11 # int
          185 movl R10, R8 # spill
          188 addl R10, #2 # int
          18c imull RDX, R10 # int
          190 movl R11, R8 # spill
          193 incl R11 # int
          196 imull R9, R11 # int
          19a addl R9, [RSP + #32 (32-bit)] # int
          19f addl R9, RDX # int
          1a2 addl R9, RCX # int
          1a5 addl R9, RBX # int
          1a8 addl R9, [RSP + #60 (32-bit)] # int
          1ad addl R9, [RSP + #56 (32-bit)] # int
          1b2 addl R9, [RSP + #52 (32-bit)] # int
          1b7 addl R9, [RSP + #48 (32-bit)] # int
          1bc movl R10, R8 # spill
          1bf addl R10, #9 # int
          1c3 imull R10, RSI # int
          1c7 addl R10, R9 # int
          1ca addl R10, RDI # int
          1cd addl R10, [RSP + #44 (32-bit)] # int
          1d2 movl R11, R8 # spill
          1d5 addl R11, #12 # int
          1d9 imull R13, R11 # int
          1dd addl R13, R10 # int
          1e0 movl R10, R8 # spill
          1e3 addl R10, #13 # int
          1e7 imull R10, [RSP + #40 (32-bit)] # int
          1ed addl R10, R13 # int
          1f0 addl RBP, R10 # int
          1f3 addl R14, RBP # int
          1f6 movl R10, R8 # spill
          1f9 addl R10, #16 # int
          1fd cmpl R10, #999999985
          204 jl B2 # loop end P=1.000000 C=7419903.000000


          Here we observe much more "spilling" and more accesses to the stack [RSP + ...], due to more intermediate results that need to be preserved. But of course it is obvious that neither the first nor the second version is any good; the loop could literally be coded in 6 instructions.



          So it's an issue of the optimizer; as is often the case, it unrolls too aggressively and shoots itself in the foot.



          In fact, modern x86 CPU's break down these instructions further into micro-ops (µops) and with features like register renaming, µop caches and loop buffers the loop unrolling idea is now most probably counter-productive. According to Agner Fog's optimization guide:




          The gain in performance due to the µop cache can be quite
          considerable if the average instruction length is more than 4 bytes.
          The following methods of optimizing the use of the µop cache may
          be considered:




          • Make sure that critical loops are small enough to fit into the µop cache.

          • Align the most critical loop entries and function entries by 32.

          • Avoid unnecessary loop unrolling.

          • Avoid instructions that have extra load time

            . . .




          Regarding those load times - even the fastest L1D hit costs 4 cycles, so yes, even a few accesses to memory will hurt performance in tight loops.



          To see how fast it can be, we can compile a similar C application with GCC, which produces simply:



            xor edx, edx
          xor eax, eax
          .L2:
          mov ecx, edx
          imul ecx, edx
          add edx, 1
          lea eax, [rax+rcx*2]
          cmp edx, 1000000000
          jne .L2


          With a run time of... 0.08 s, or 5 times faster.






          share|improve this answer

























            up vote
            104
            down vote



            accepted







            up vote
            104
            down vote



            accepted






            There is a slight difference in the ordering of the bytecode



            2 * (i * i):



                 iconst_2
            iload0
            iload0
            imul
            imul
            iadd


            vs 2 * i * i:



                 iconst_2
            iload0
            imul
            iload0
            imul
            iadd


            At first sight this should not make a difference; if anything the second version is more optimal since it uses one slot less.



            So we need to dig deeper into the lower level (JIT).



            Remember that JIT tends to unroll small loops very aggressively. Indeed we observe a 16x unrolling:



            030   B2: # B2 B3 <- B1 B2  Loop: B2-B2 inner main of N18 Freq: 1e+006
            030 addl R11, RBP # int
            033 movl RBP, R13 # spill
            036 addl RBP, #14 # int
            039 imull RBP, RBP # int
            03c movl R9, R13 # spill
            03f addl R9, #13 # int
            043 imull R9, R9 # int
            047 sall RBP, #1
            049 sall R9, #1
            04c movl R8, R13 # spill
            04f addl R8, #15 # int
            053 movl R10, R8 # spill
            056 movdl XMM1, R8 # spill
            05b imull R10, R8 # int
            05f movl R8, R13 # spill
            062 addl R8, #12 # int
            066 imull R8, R8 # int
            06a sall R10, #1
            06d movl [rsp + #32], R10 # spill
            072 sall R8, #1
            075 movl RBX, R13 # spill
            078 addl RBX, #11 # int
            07b imull RBX, RBX # int
            07e movl RCX, R13 # spill
            081 addl RCX, #10 # int
            084 imull RCX, RCX # int
            087 sall RBX, #1
            089 sall RCX, #1
            08b movl RDX, R13 # spill
            08e addl RDX, #8 # int
            091 imull RDX, RDX # int
            094 movl RDI, R13 # spill
            097 addl RDI, #7 # int
            09a imull RDI, RDI # int
            09d sall RDX, #1
            09f sall RDI, #1
            0a1 movl RAX, R13 # spill
            0a4 addl RAX, #6 # int
            0a7 imull RAX, RAX # int
            0aa movl RSI, R13 # spill
            0ad addl RSI, #4 # int
            0b0 imull RSI, RSI # int
            0b3 sall RAX, #1
            0b5 sall RSI, #1
            0b7 movl R10, R13 # spill
            0ba addl R10, #2 # int
            0be imull R10, R10 # int
            0c2 movl R14, R13 # spill
            0c5 incl R14 # int
            0c8 imull R14, R14 # int
            0cc sall R10, #1
            0cf sall R14, #1
            0d2 addl R14, R11 # int
            0d5 addl R14, R10 # int
            0d8 movl R10, R13 # spill
            0db addl R10, #3 # int
            0df imull R10, R10 # int
            0e3 movl R11, R13 # spill
            0e6 addl R11, #5 # int
            0ea imull R11, R11 # int
            0ee sall R10, #1
            0f1 addl R10, R14 # int
            0f4 addl R10, RSI # int
            0f7 sall R11, #1
            0fa addl R11, R10 # int
            0fd addl R11, RAX # int
            100 addl R11, RDI # int
            103 addl R11, RDX # int
            106 movl R10, R13 # spill
            109 addl R10, #9 # int
            10d imull R10, R10 # int
            111 sall R10, #1
            114 addl R10, R11 # int
            117 addl R10, RCX # int
            11a addl R10, RBX # int
            11d addl R10, R8 # int
            120 addl R9, R10 # int
            123 addl RBP, R9 # int
            126 addl RBP, [RSP + #32 (32-bit)] # int
            12a addl R13, #16 # int
            12e movl R11, R13 # spill
            131 imull R11, R13 # int
            135 sall R11, #1
            138 cmpl R13, #999999985
            13f jl B2 # loop end P=1.000000 C=6554623.000000


            We observe already that there is 1 register that is "spilled" onto the stack.



            And for the 2 * i * i version:



            05a   B3: # B2 B4 <- B1 B2  Loop: B3-B2 inner main of N18 Freq: 1e+006
            05a addl RBX, R11 # int
            05d movl [rsp + #32], RBX # spill
            061 movl R11, R8 # spill
            064 addl R11, #15 # int
            068 movl [rsp + #36], R11 # spill
            06d movl R11, R8 # spill
            070 addl R11, #14 # int
            074 movl R10, R9 # spill
            077 addl R10, #16 # int
            07b movdl XMM2, R10 # spill
            080 movl RCX, R9 # spill
            083 addl RCX, #14 # int
            086 movdl XMM1, RCX # spill
            08a movl R10, R9 # spill
            08d addl R10, #12 # int
            091 movdl XMM4, R10 # spill
            096 movl RCX, R9 # spill
            099 addl RCX, #10 # int
            09c movdl XMM6, RCX # spill
            0a0 movl RBX, R9 # spill
            0a3 addl RBX, #8 # int
            0a6 movl RCX, R9 # spill
            0a9 addl RCX, #6 # int
            0ac movl RDX, R9 # spill
            0af addl RDX, #4 # int
            0b2 addl R9, #2 # int
            0b6 movl R10, R14 # spill
            0b9 addl R10, #22 # int
            0bd movdl XMM3, R10 # spill
            0c2 movl RDI, R14 # spill
            0c5 addl RDI, #20 # int
            0c8 movl RAX, R14 # spill
            0cb addl RAX, #32 # int
            0ce movl RSI, R14 # spill
            0d1 addl RSI, #18 # int
            0d4 movl R13, R14 # spill
            0d7 addl R13, #24 # int
            0db movl R10, R14 # spill
            0de addl R10, #26 # int
            0e2 movl [rsp + #40], R10 # spill
            0e7 movl RBP, R14 # spill
            0ea addl RBP, #28 # int
            0ed imull RBP, R11 # int
            0f1 addl R14, #30 # int
            0f5 imull R14, [RSP + #36 (32-bit)] # int
            0fb movl R10, R8 # spill
            0fe addl R10, #11 # int
            102 movdl R11, XMM3 # spill
            107 imull R11, R10 # int
            10b movl [rsp + #44], R11 # spill
            110 movl R10, R8 # spill
            113 addl R10, #10 # int
            117 imull RDI, R10 # int
            11b movl R11, R8 # spill
            11e addl R11, #8 # int
            122 movdl R10, XMM2 # spill
            127 imull R10, R11 # int
            12b movl [rsp + #48], R10 # spill
            130 movl R10, R8 # spill
            133 addl R10, #7 # int
            137 movdl R11, XMM1 # spill
            13c imull R11, R10 # int
            140 movl [rsp + #52], R11 # spill
            145 movl R11, R8 # spill
            148 addl R11, #6 # int
            14c movdl R10, XMM4 # spill
            151 imull R10, R11 # int
            155 movl [rsp + #56], R10 # spill
            15a movl R10, R8 # spill
            15d addl R10, #5 # int
            161 movdl R11, XMM6 # spill
            166 imull R11, R10 # int
            16a movl [rsp + #60], R11 # spill
            16f movl R11, R8 # spill
            172 addl R11, #4 # int
            176 imull RBX, R11 # int
            17a movl R11, R8 # spill
            17d addl R11, #3 # int
            181 imull RCX, R11 # int
            185 movl R10, R8 # spill
            188 addl R10, #2 # int
            18c imull RDX, R10 # int
            190 movl R11, R8 # spill
            193 incl R11 # int
            196 imull R9, R11 # int
            19a addl R9, [RSP + #32 (32-bit)] # int
            19f addl R9, RDX # int
            1a2 addl R9, RCX # int
            1a5 addl R9, RBX # int
            1a8 addl R9, [RSP + #60 (32-bit)] # int
            1ad addl R9, [RSP + #56 (32-bit)] # int
            1b2 addl R9, [RSP + #52 (32-bit)] # int
            1b7 addl R9, [RSP + #48 (32-bit)] # int
            1bc movl R10, R8 # spill
            1bf addl R10, #9 # int
            1c3 imull R10, RSI # int
            1c7 addl R10, R9 # int
            1ca addl R10, RDI # int
            1cd addl R10, [RSP + #44 (32-bit)] # int
            1d2 movl R11, R8 # spill
            1d5 addl R11, #12 # int
            1d9 imull R13, R11 # int
            1dd addl R13, R10 # int
            1e0 movl R10, R8 # spill
            1e3 addl R10, #13 # int
            1e7 imull R10, [RSP + #40 (32-bit)] # int
            1ed addl R10, R13 # int
            1f0 addl RBP, R10 # int
            1f3 addl R14, RBP # int
            1f6 movl R10, R8 # spill
            1f9 addl R10, #16 # int
            1fd cmpl R10, #999999985
            204 jl B2 # loop end P=1.000000 C=7419903.000000


            Here we observe much more "spilling" and more accesses to the stack [RSP + ...], due to more intermediate results that need to be preserved. But of course it is obvious that neither the first nor the second version is any good; the loop could literally be coded in 6 instructions.



            So it's an issue of the optimizer; as is often the case, it unrolls too aggressively and shoots itself in the foot.



            In fact, modern x86 CPU's break down these instructions further into micro-ops (µops) and with features like register renaming, µop caches and loop buffers the loop unrolling idea is now most probably counter-productive. According to Agner Fog's optimization guide:




            The gain in performance due to the µop cache can be quite
            considerable if the average instruction length is more than 4 bytes.
            The following methods of optimizing the use of the µop cache may
            be considered:




            • Make sure that critical loops are small enough to fit into the µop cache.

            • Align the most critical loop entries and function entries by 32.

            • Avoid unnecessary loop unrolling.

            • Avoid instructions that have extra load time

              . . .




            Regarding those load times - even the fastest L1D hit costs 4 cycles, so yes, even a few accesses to memory will hurt performance in tight loops.



            To see how fast it can be, we can compile a similar C application with GCC, which produces simply:



              xor edx, edx
            xor eax, eax
            .L2:
            mov ecx, edx
            imul ecx, edx
            add edx, 1
            lea eax, [rax+rcx*2]
            cmp edx, 1000000000
            jne .L2


            With a run time of... 0.08 s, or 5 times faster.






            share|improve this answer














            There is a slight difference in the ordering of the bytecode



            2 * (i * i):



                 iconst_2
            iload0
            iload0
            imul
            imul
            iadd


            vs 2 * i * i:



                 iconst_2
            iload0
            imul
            iload0
            imul
            iadd


            At first sight this should not make a difference; if anything the second version is more optimal since it uses one slot less.



            So we need to dig deeper into the lower level (JIT).



            Remember that JIT tends to unroll small loops very aggressively. Indeed we observe a 16x unrolling:



            030   B2: # B2 B3 <- B1 B2  Loop: B2-B2 inner main of N18 Freq: 1e+006
            030 addl R11, RBP # int
            033 movl RBP, R13 # spill
            036 addl RBP, #14 # int
            039 imull RBP, RBP # int
            03c movl R9, R13 # spill
            03f addl R9, #13 # int
            043 imull R9, R9 # int
            047 sall RBP, #1
            049 sall R9, #1
            04c movl R8, R13 # spill
            04f addl R8, #15 # int
            053 movl R10, R8 # spill
            056 movdl XMM1, R8 # spill
            05b imull R10, R8 # int
            05f movl R8, R13 # spill
            062 addl R8, #12 # int
            066 imull R8, R8 # int
            06a sall R10, #1
            06d movl [rsp + #32], R10 # spill
            072 sall R8, #1
            075 movl RBX, R13 # spill
            078 addl RBX, #11 # int
            07b imull RBX, RBX # int
            07e movl RCX, R13 # spill
            081 addl RCX, #10 # int
            084 imull RCX, RCX # int
            087 sall RBX, #1
            089 sall RCX, #1
            08b movl RDX, R13 # spill
            08e addl RDX, #8 # int
            091 imull RDX, RDX # int
            094 movl RDI, R13 # spill
            097 addl RDI, #7 # int
            09a imull RDI, RDI # int
            09d sall RDX, #1
            09f sall RDI, #1
            0a1 movl RAX, R13 # spill
            0a4 addl RAX, #6 # int
            0a7 imull RAX, RAX # int
            0aa movl RSI, R13 # spill
            0ad addl RSI, #4 # int
            0b0 imull RSI, RSI # int
            0b3 sall RAX, #1
            0b5 sall RSI, #1
            0b7 movl R10, R13 # spill
            0ba addl R10, #2 # int
            0be imull R10, R10 # int
            0c2 movl R14, R13 # spill
            0c5 incl R14 # int
            0c8 imull R14, R14 # int
            0cc sall R10, #1
            0cf sall R14, #1
            0d2 addl R14, R11 # int
            0d5 addl R14, R10 # int
            0d8 movl R10, R13 # spill
            0db addl R10, #3 # int
            0df imull R10, R10 # int
            0e3 movl R11, R13 # spill
            0e6 addl R11, #5 # int
            0ea imull R11, R11 # int
            0ee sall R10, #1
            0f1 addl R10, R14 # int
            0f4 addl R10, RSI # int
            0f7 sall R11, #1
            0fa addl R11, R10 # int
            0fd addl R11, RAX # int
            100 addl R11, RDI # int
            103 addl R11, RDX # int
            106 movl R10, R13 # spill
            109 addl R10, #9 # int
            10d imull R10, R10 # int
            111 sall R10, #1
            114 addl R10, R11 # int
            117 addl R10, RCX # int
            11a addl R10, RBX # int
            11d addl R10, R8 # int
            120 addl R9, R10 # int
            123 addl RBP, R9 # int
            126 addl RBP, [RSP + #32 (32-bit)] # int
            12a addl R13, #16 # int
            12e movl R11, R13 # spill
            131 imull R11, R13 # int
            135 sall R11, #1
            138 cmpl R13, #999999985
            13f jl B2 # loop end P=1.000000 C=6554623.000000


            We observe already that there is 1 register that is "spilled" onto the stack.



            And for the 2 * i * i version:



            05a   B3: # B2 B4 <- B1 B2  Loop: B3-B2 inner main of N18 Freq: 1e+006
            05a addl RBX, R11 # int
            05d movl [rsp + #32], RBX # spill
            061 movl R11, R8 # spill
            064 addl R11, #15 # int
            068 movl [rsp + #36], R11 # spill
            06d movl R11, R8 # spill
            070 addl R11, #14 # int
            074 movl R10, R9 # spill
            077 addl R10, #16 # int
            07b movdl XMM2, R10 # spill
            080 movl RCX, R9 # spill
            083 addl RCX, #14 # int
            086 movdl XMM1, RCX # spill
            08a movl R10, R9 # spill
            08d addl R10, #12 # int
            091 movdl XMM4, R10 # spill
            096 movl RCX, R9 # spill
            099 addl RCX, #10 # int
            09c movdl XMM6, RCX # spill
            0a0 movl RBX, R9 # spill
            0a3 addl RBX, #8 # int
            0a6 movl RCX, R9 # spill
            0a9 addl RCX, #6 # int
            0ac movl RDX, R9 # spill
            0af addl RDX, #4 # int
            0b2 addl R9, #2 # int
            0b6 movl R10, R14 # spill
            0b9 addl R10, #22 # int
            0bd movdl XMM3, R10 # spill
            0c2 movl RDI, R14 # spill
            0c5 addl RDI, #20 # int
            0c8 movl RAX, R14 # spill
            0cb addl RAX, #32 # int
            0ce movl RSI, R14 # spill
            0d1 addl RSI, #18 # int
            0d4 movl R13, R14 # spill
            0d7 addl R13, #24 # int
            0db movl R10, R14 # spill
            0de addl R10, #26 # int
            0e2 movl [rsp + #40], R10 # spill
            0e7 movl RBP, R14 # spill
            0ea addl RBP, #28 # int
            0ed imull RBP, R11 # int
            0f1 addl R14, #30 # int
            0f5 imull R14, [RSP + #36 (32-bit)] # int
            0fb movl R10, R8 # spill
            0fe addl R10, #11 # int
            102 movdl R11, XMM3 # spill
            107 imull R11, R10 # int
            10b movl [rsp + #44], R11 # spill
            110 movl R10, R8 # spill
            113 addl R10, #10 # int
            117 imull RDI, R10 # int
            11b movl R11, R8 # spill
            11e addl R11, #8 # int
            122 movdl R10, XMM2 # spill
            127 imull R10, R11 # int
            12b movl [rsp + #48], R10 # spill
            130 movl R10, R8 # spill
            133 addl R10, #7 # int
            137 movdl R11, XMM1 # spill
            13c imull R11, R10 # int
            140 movl [rsp + #52], R11 # spill
            145 movl R11, R8 # spill
            148 addl R11, #6 # int
            14c movdl R10, XMM4 # spill
            151 imull R10, R11 # int
            155 movl [rsp + #56], R10 # spill
            15a movl R10, R8 # spill
            15d addl R10, #5 # int
            161 movdl R11, XMM6 # spill
            166 imull R11, R10 # int
            16a movl [rsp + #60], R11 # spill
            16f movl R11, R8 # spill
            172 addl R11, #4 # int
            176 imull RBX, R11 # int
            17a movl R11, R8 # spill
            17d addl R11, #3 # int
            181 imull RCX, R11 # int
            185 movl R10, R8 # spill
            188 addl R10, #2 # int
            18c imull RDX, R10 # int
            190 movl R11, R8 # spill
            193 incl R11 # int
            196 imull R9, R11 # int
            19a addl R9, [RSP + #32 (32-bit)] # int
            19f addl R9, RDX # int
            1a2 addl R9, RCX # int
            1a5 addl R9, RBX # int
            1a8 addl R9, [RSP + #60 (32-bit)] # int
            1ad addl R9, [RSP + #56 (32-bit)] # int
            1b2 addl R9, [RSP + #52 (32-bit)] # int
            1b7 addl R9, [RSP + #48 (32-bit)] # int
            1bc movl R10, R8 # spill
            1bf addl R10, #9 # int
            1c3 imull R10, RSI # int
            1c7 addl R10, R9 # int
            1ca addl R10, RDI # int
            1cd addl R10, [RSP + #44 (32-bit)] # int
            1d2 movl R11, R8 # spill
            1d5 addl R11, #12 # int
            1d9 imull R13, R11 # int
            1dd addl R13, R10 # int
            1e0 movl R10, R8 # spill
            1e3 addl R10, #13 # int
            1e7 imull R10, [RSP + #40 (32-bit)] # int
            1ed addl R10, R13 # int
            1f0 addl RBP, R10 # int
            1f3 addl R14, RBP # int
            1f6 movl R10, R8 # spill
            1f9 addl R10, #16 # int
            1fd cmpl R10, #999999985
            204 jl B2 # loop end P=1.000000 C=7419903.000000


            Here we observe much more "spilling" and more accesses to the stack [RSP + ...], due to more intermediate results that need to be preserved. But of course it is obvious that neither the first nor the second version is any good; the loop could literally be coded in 6 instructions.



            So it's an issue of the optimizer; as is often the case, it unrolls too aggressively and shoots itself in the foot.



            In fact, modern x86 CPU's break down these instructions further into micro-ops (µops) and with features like register renaming, µop caches and loop buffers the loop unrolling idea is now most probably counter-productive. According to Agner Fog's optimization guide:




            The gain in performance due to the µop cache can be quite
            considerable if the average instruction length is more than 4 bytes.
            The following methods of optimizing the use of the µop cache may
            be considered:




            • Make sure that critical loops are small enough to fit into the µop cache.

            • Align the most critical loop entries and function entries by 32.

            • Avoid unnecessary loop unrolling.

            • Avoid instructions that have extra load time

              . . .




            Regarding those load times - even the fastest L1D hit costs 4 cycles, so yes, even a few accesses to memory will hurt performance in tight loops.



            To see how fast it can be, we can compile a similar C application with GCC, which produces simply:



              xor edx, edx
            xor eax, eax
            .L2:
            mov ecx, edx
            imul ecx, edx
            add edx, 1
            lea eax, [rax+rcx*2]
            cmp edx, 1000000000
            jne .L2


            With a run time of... 0.08 s, or 5 times faster.







            share|improve this answer














            share|improve this answer



            share|improve this answer








            edited 17 hours ago

























            answered yesterday









            rustyx

            24.9k695134




            24.9k695134
























                up vote
                30
                down vote













                When the multiplication is 2 * (i * i), the JVM is able to factor out the multiplication by 2 from the loop, resulting in this equivalent but more efficient code:



                int n = 0;
                for (int i = 0; i < 1000000000; i++) {
                n += i * i;
                }
                n *= 2;


                but when the multiplication is (2 * i) * i, the JVM doesn't optimize it since the multiplication by a constant is no longer right before the addition.



                Here are a few reasons why I think this is the case:




                • Adding an if (n == 0) n = 1 statement at the start of the loop results in both versions being as efficient, since factoring out the multiplication no longer guarantees that the result will be the same

                • The optimized version (by factoring out the multiplication by 2) is exactly as fast as the 2 * (i * i) version


                Here is the test code that I used to draw these conclusions:



                public static void main(String args) {
                long fastVersion = 0;
                long slowVersion = 0;
                long optimizedVersion = 0;
                long modifiedFastVersion = 0;
                long modifiedSlowVersion = 0;

                for (int i = 0; i < 10; i++) {
                fastVersion += fastVersion();
                slowVersion += slowVersion();
                optimizedVersion += optimizedVersion();
                modifiedFastVersion += modifiedFastVersion();
                modifiedSlowVersion += modifiedSlowVersion();
                }

                System.out.println("Fast version: " + (double) fastVersion / 1000000000 + " s");
                System.out.println("Slow version: " + (double) slowVersion / 1000000000 + " s");
                System.out.println("Optimized version: " + (double) optimizedVersion / 1000000000 + " s");
                System.out.println("Modified fast version: " + (double) modifiedFastVersion / 1000000000 + " s");
                System.out.println("Modified slow version: " + (double) modifiedSlowVersion / 1000000000 + " s");
                }

                private static long fastVersion() {
                long startTime = System.nanoTime();
                int n = 0;
                for (int i = 0; i < 1000000000; i++) {
                n += 2 * (i * i);
                }
                return System.nanoTime() - startTime;
                }

                private static long slowVersion() {
                long startTime = System.nanoTime();
                int n = 0;
                for (int i = 0; i < 1000000000; i++) {
                n += 2 * i * i;
                }
                return System.nanoTime() - startTime;
                }

                private static long optimizedVersion() {
                long startTime = System.nanoTime();
                int n = 0;
                for (int i = 0; i < 1000000000; i++) {
                n += i * i;
                }
                n *= 2;
                return System.nanoTime() - startTime;
                }

                private static long modifiedFastVersion() {
                long startTime = System.nanoTime();
                int n = 0;
                for (int i = 0; i < 1000000000; i++) {
                if (n == 0) n = 1;
                n += 2 * (i * i);
                }
                return System.nanoTime() - startTime;
                }

                private static long modifiedSlowVersion() {
                long startTime = System.nanoTime();
                int n = 0;
                for (int i = 0; i < 1000000000; i++) {
                if (n == 0) n = 1;
                n += 2 * i * i;
                }
                return System.nanoTime() - startTime;
                }


                And here are the results:



                Fast version: 5.7274411 s
                Slow version: 7.6190804 s
                Optimized version: 5.1348007 s
                Modified fast version: 7.1492705 s
                Modified slow version: 7.2952668 s





                share|improve this answer





















                • here is a benchmark: github.com/jawb-software/stackoverflow-53452713
                  – dit
                  yesterday










                • I think on the optimizedVersion, it should be n *= 2000000000;
                  – StefansArya
                  yesterday






                • 4




                  @StefansArya Incorrect.. 2(1) + 2(4) + 2(9) + 2(16) + ... = 2(1 + 4 + 9 + 16 + ...)
                  – Arth
                  16 hours ago















                up vote
                30
                down vote













                When the multiplication is 2 * (i * i), the JVM is able to factor out the multiplication by 2 from the loop, resulting in this equivalent but more efficient code:



                int n = 0;
                for (int i = 0; i < 1000000000; i++) {
                n += i * i;
                }
                n *= 2;


                but when the multiplication is (2 * i) * i, the JVM doesn't optimize it since the multiplication by a constant is no longer right before the addition.



                Here are a few reasons why I think this is the case:




                • Adding an if (n == 0) n = 1 statement at the start of the loop results in both versions being as efficient, since factoring out the multiplication no longer guarantees that the result will be the same

                • The optimized version (by factoring out the multiplication by 2) is exactly as fast as the 2 * (i * i) version


                Here is the test code that I used to draw these conclusions:



                public static void main(String args) {
                long fastVersion = 0;
                long slowVersion = 0;
                long optimizedVersion = 0;
                long modifiedFastVersion = 0;
                long modifiedSlowVersion = 0;

                for (int i = 0; i < 10; i++) {
                fastVersion += fastVersion();
                slowVersion += slowVersion();
                optimizedVersion += optimizedVersion();
                modifiedFastVersion += modifiedFastVersion();
                modifiedSlowVersion += modifiedSlowVersion();
                }

                System.out.println("Fast version: " + (double) fastVersion / 1000000000 + " s");
                System.out.println("Slow version: " + (double) slowVersion / 1000000000 + " s");
                System.out.println("Optimized version: " + (double) optimizedVersion / 1000000000 + " s");
                System.out.println("Modified fast version: " + (double) modifiedFastVersion / 1000000000 + " s");
                System.out.println("Modified slow version: " + (double) modifiedSlowVersion / 1000000000 + " s");
                }

                private static long fastVersion() {
                long startTime = System.nanoTime();
                int n = 0;
                for (int i = 0; i < 1000000000; i++) {
                n += 2 * (i * i);
                }
                return System.nanoTime() - startTime;
                }

                private static long slowVersion() {
                long startTime = System.nanoTime();
                int n = 0;
                for (int i = 0; i < 1000000000; i++) {
                n += 2 * i * i;
                }
                return System.nanoTime() - startTime;
                }

                private static long optimizedVersion() {
                long startTime = System.nanoTime();
                int n = 0;
                for (int i = 0; i < 1000000000; i++) {
                n += i * i;
                }
                n *= 2;
                return System.nanoTime() - startTime;
                }

                private static long modifiedFastVersion() {
                long startTime = System.nanoTime();
                int n = 0;
                for (int i = 0; i < 1000000000; i++) {
                if (n == 0) n = 1;
                n += 2 * (i * i);
                }
                return System.nanoTime() - startTime;
                }

                private static long modifiedSlowVersion() {
                long startTime = System.nanoTime();
                int n = 0;
                for (int i = 0; i < 1000000000; i++) {
                if (n == 0) n = 1;
                n += 2 * i * i;
                }
                return System.nanoTime() - startTime;
                }


                And here are the results:



                Fast version: 5.7274411 s
                Slow version: 7.6190804 s
                Optimized version: 5.1348007 s
                Modified fast version: 7.1492705 s
                Modified slow version: 7.2952668 s





                share|improve this answer





















                • here is a benchmark: github.com/jawb-software/stackoverflow-53452713
                  – dit
                  yesterday










                • I think on the optimizedVersion, it should be n *= 2000000000;
                  – StefansArya
                  yesterday






                • 4




                  @StefansArya Incorrect.. 2(1) + 2(4) + 2(9) + 2(16) + ... = 2(1 + 4 + 9 + 16 + ...)
                  – Arth
                  16 hours ago













                up vote
                30
                down vote










                up vote
                30
                down vote









                When the multiplication is 2 * (i * i), the JVM is able to factor out the multiplication by 2 from the loop, resulting in this equivalent but more efficient code:



                int n = 0;
                for (int i = 0; i < 1000000000; i++) {
                n += i * i;
                }
                n *= 2;


                but when the multiplication is (2 * i) * i, the JVM doesn't optimize it since the multiplication by a constant is no longer right before the addition.



                Here are a few reasons why I think this is the case:




                • Adding an if (n == 0) n = 1 statement at the start of the loop results in both versions being as efficient, since factoring out the multiplication no longer guarantees that the result will be the same

                • The optimized version (by factoring out the multiplication by 2) is exactly as fast as the 2 * (i * i) version


                Here is the test code that I used to draw these conclusions:



                public static void main(String args) {
                long fastVersion = 0;
                long slowVersion = 0;
                long optimizedVersion = 0;
                long modifiedFastVersion = 0;
                long modifiedSlowVersion = 0;

                for (int i = 0; i < 10; i++) {
                fastVersion += fastVersion();
                slowVersion += slowVersion();
                optimizedVersion += optimizedVersion();
                modifiedFastVersion += modifiedFastVersion();
                modifiedSlowVersion += modifiedSlowVersion();
                }

                System.out.println("Fast version: " + (double) fastVersion / 1000000000 + " s");
                System.out.println("Slow version: " + (double) slowVersion / 1000000000 + " s");
                System.out.println("Optimized version: " + (double) optimizedVersion / 1000000000 + " s");
                System.out.println("Modified fast version: " + (double) modifiedFastVersion / 1000000000 + " s");
                System.out.println("Modified slow version: " + (double) modifiedSlowVersion / 1000000000 + " s");
                }

                private static long fastVersion() {
                long startTime = System.nanoTime();
                int n = 0;
                for (int i = 0; i < 1000000000; i++) {
                n += 2 * (i * i);
                }
                return System.nanoTime() - startTime;
                }

                private static long slowVersion() {
                long startTime = System.nanoTime();
                int n = 0;
                for (int i = 0; i < 1000000000; i++) {
                n += 2 * i * i;
                }
                return System.nanoTime() - startTime;
                }

                private static long optimizedVersion() {
                long startTime = System.nanoTime();
                int n = 0;
                for (int i = 0; i < 1000000000; i++) {
                n += i * i;
                }
                n *= 2;
                return System.nanoTime() - startTime;
                }

                private static long modifiedFastVersion() {
                long startTime = System.nanoTime();
                int n = 0;
                for (int i = 0; i < 1000000000; i++) {
                if (n == 0) n = 1;
                n += 2 * (i * i);
                }
                return System.nanoTime() - startTime;
                }

                private static long modifiedSlowVersion() {
                long startTime = System.nanoTime();
                int n = 0;
                for (int i = 0; i < 1000000000; i++) {
                if (n == 0) n = 1;
                n += 2 * i * i;
                }
                return System.nanoTime() - startTime;
                }


                And here are the results:



                Fast version: 5.7274411 s
                Slow version: 7.6190804 s
                Optimized version: 5.1348007 s
                Modified fast version: 7.1492705 s
                Modified slow version: 7.2952668 s





                share|improve this answer












                When the multiplication is 2 * (i * i), the JVM is able to factor out the multiplication by 2 from the loop, resulting in this equivalent but more efficient code:



                int n = 0;
                for (int i = 0; i < 1000000000; i++) {
                n += i * i;
                }
                n *= 2;


                but when the multiplication is (2 * i) * i, the JVM doesn't optimize it since the multiplication by a constant is no longer right before the addition.



                Here are a few reasons why I think this is the case:




                • Adding an if (n == 0) n = 1 statement at the start of the loop results in both versions being as efficient, since factoring out the multiplication no longer guarantees that the result will be the same

                • The optimized version (by factoring out the multiplication by 2) is exactly as fast as the 2 * (i * i) version


                Here is the test code that I used to draw these conclusions:



                public static void main(String args) {
                long fastVersion = 0;
                long slowVersion = 0;
                long optimizedVersion = 0;
                long modifiedFastVersion = 0;
                long modifiedSlowVersion = 0;

                for (int i = 0; i < 10; i++) {
                fastVersion += fastVersion();
                slowVersion += slowVersion();
                optimizedVersion += optimizedVersion();
                modifiedFastVersion += modifiedFastVersion();
                modifiedSlowVersion += modifiedSlowVersion();
                }

                System.out.println("Fast version: " + (double) fastVersion / 1000000000 + " s");
                System.out.println("Slow version: " + (double) slowVersion / 1000000000 + " s");
                System.out.println("Optimized version: " + (double) optimizedVersion / 1000000000 + " s");
                System.out.println("Modified fast version: " + (double) modifiedFastVersion / 1000000000 + " s");
                System.out.println("Modified slow version: " + (double) modifiedSlowVersion / 1000000000 + " s");
                }

                private static long fastVersion() {
                long startTime = System.nanoTime();
                int n = 0;
                for (int i = 0; i < 1000000000; i++) {
                n += 2 * (i * i);
                }
                return System.nanoTime() - startTime;
                }

                private static long slowVersion() {
                long startTime = System.nanoTime();
                int n = 0;
                for (int i = 0; i < 1000000000; i++) {
                n += 2 * i * i;
                }
                return System.nanoTime() - startTime;
                }

                private static long optimizedVersion() {
                long startTime = System.nanoTime();
                int n = 0;
                for (int i = 0; i < 1000000000; i++) {
                n += i * i;
                }
                n *= 2;
                return System.nanoTime() - startTime;
                }

                private static long modifiedFastVersion() {
                long startTime = System.nanoTime();
                int n = 0;
                for (int i = 0; i < 1000000000; i++) {
                if (n == 0) n = 1;
                n += 2 * (i * i);
                }
                return System.nanoTime() - startTime;
                }

                private static long modifiedSlowVersion() {
                long startTime = System.nanoTime();
                int n = 0;
                for (int i = 0; i < 1000000000; i++) {
                if (n == 0) n = 1;
                n += 2 * i * i;
                }
                return System.nanoTime() - startTime;
                }


                And here are the results:



                Fast version: 5.7274411 s
                Slow version: 7.6190804 s
                Optimized version: 5.1348007 s
                Modified fast version: 7.1492705 s
                Modified slow version: 7.2952668 s






                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered yesterday









                Runemoro

                1,63611237




                1,63611237












                • here is a benchmark: github.com/jawb-software/stackoverflow-53452713
                  – dit
                  yesterday










                • I think on the optimizedVersion, it should be n *= 2000000000;
                  – StefansArya
                  yesterday






                • 4




                  @StefansArya Incorrect.. 2(1) + 2(4) + 2(9) + 2(16) + ... = 2(1 + 4 + 9 + 16 + ...)
                  – Arth
                  16 hours ago


















                • here is a benchmark: github.com/jawb-software/stackoverflow-53452713
                  – dit
                  yesterday










                • I think on the optimizedVersion, it should be n *= 2000000000;
                  – StefansArya
                  yesterday






                • 4




                  @StefansArya Incorrect.. 2(1) + 2(4) + 2(9) + 2(16) + ... = 2(1 + 4 + 9 + 16 + ...)
                  – Arth
                  16 hours ago
















                here is a benchmark: github.com/jawb-software/stackoverflow-53452713
                – dit
                yesterday




                here is a benchmark: github.com/jawb-software/stackoverflow-53452713
                – dit
                yesterday












                I think on the optimizedVersion, it should be n *= 2000000000;
                – StefansArya
                yesterday




                I think on the optimizedVersion, it should be n *= 2000000000;
                – StefansArya
                yesterday




                4




                4




                @StefansArya Incorrect.. 2(1) + 2(4) + 2(9) + 2(16) + ... = 2(1 + 4 + 9 + 16 + ...)
                – Arth
                16 hours ago




                @StefansArya Incorrect.. 2(1) + 2(4) + 2(9) + 2(16) + ... = 2(1 + 4 + 9 + 16 + ...)
                – Arth
                16 hours ago










                up vote
                9
                down vote













                ByteCodes: https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html

                ByteCodes Viewer: https://github.com/Konloch/bytecode-viewer



                On my JDK (Win10 64 1.8.0_65-b17) I can reproduce and explain:



                public static void main(String args) {
                int repeat = 10;
                long A = 0;
                long B = 0;
                for (int i = 0; i < repeat; i++) {
                A += test();
                B += testB();
                }

                System.out.println(A / repeat + " ms");
                System.out.println(B / repeat + " ms");
                }


                private static long test() {
                int n = 0;
                for (int i = 0; i < 1000; i++) {
                n += multi(i);
                }
                long startTime = System.currentTimeMillis();
                for (int i = 0; i < 1000000000; i++) {
                n += multi(i);
                }
                long ms = (System.currentTimeMillis() - startTime);
                System.out.println(ms + " ms A " + n);
                return ms;
                }


                private static long testB() {
                int n = 0;
                for (int i = 0; i < 1000; i++) {
                n += multiB(i);
                }
                long startTime = System.currentTimeMillis();
                for (int i = 0; i < 1000000000; i++) {
                n += multiB(i);
                }
                long ms = (System.currentTimeMillis() - startTime);
                System.out.println(ms + " ms B " + n);
                return ms;
                }

                private static int multiB(int i) {
                return 2 * (i * i);
                }

                private static int multi(int i) {
                return 2 * i * i;
                }


                Output:



                ...
                405 ms A 785527736
                327 ms B 785527736
                404 ms A 785527736
                329 ms B 785527736
                404 ms A 785527736
                328 ms B 785527736
                404 ms A 785527736
                328 ms B 785527736
                410 ms
                333 ms


                So why?
                The Byte code is this:



                 private static multiB(int arg0) { // 2 * (i * i)
                <localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>

                L1 {
                iconst_2
                iload0
                iload0
                imul
                imul
                ireturn
                }
                L2 {
                }
                }

                private static multi(int arg0) { // 2 * i * i
                <localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>

                L1 {
                iconst_2
                iload0
                imul
                iload0
                imul
                ireturn
                }
                L2 {
                }
                }


                The difference being:

                With brackets (2 * (i * i)):




                • push const stack

                • push local on stack

                • push local on stack

                • multiply top of stack

                • multiply top of stack


                Without brackets (2 * i * i):




                • push const stack

                • push local on stack

                • multiply top of stack

                • push local on stack

                • multiply top of stack


                Loading all on stack and then working back down is faster than switching between putting on stack and operating on it.






                share|improve this answer



























                  up vote
                  9
                  down vote













                  ByteCodes: https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html

                  ByteCodes Viewer: https://github.com/Konloch/bytecode-viewer



                  On my JDK (Win10 64 1.8.0_65-b17) I can reproduce and explain:



                  public static void main(String args) {
                  int repeat = 10;
                  long A = 0;
                  long B = 0;
                  for (int i = 0; i < repeat; i++) {
                  A += test();
                  B += testB();
                  }

                  System.out.println(A / repeat + " ms");
                  System.out.println(B / repeat + " ms");
                  }


                  private static long test() {
                  int n = 0;
                  for (int i = 0; i < 1000; i++) {
                  n += multi(i);
                  }
                  long startTime = System.currentTimeMillis();
                  for (int i = 0; i < 1000000000; i++) {
                  n += multi(i);
                  }
                  long ms = (System.currentTimeMillis() - startTime);
                  System.out.println(ms + " ms A " + n);
                  return ms;
                  }


                  private static long testB() {
                  int n = 0;
                  for (int i = 0; i < 1000; i++) {
                  n += multiB(i);
                  }
                  long startTime = System.currentTimeMillis();
                  for (int i = 0; i < 1000000000; i++) {
                  n += multiB(i);
                  }
                  long ms = (System.currentTimeMillis() - startTime);
                  System.out.println(ms + " ms B " + n);
                  return ms;
                  }

                  private static int multiB(int i) {
                  return 2 * (i * i);
                  }

                  private static int multi(int i) {
                  return 2 * i * i;
                  }


                  Output:



                  ...
                  405 ms A 785527736
                  327 ms B 785527736
                  404 ms A 785527736
                  329 ms B 785527736
                  404 ms A 785527736
                  328 ms B 785527736
                  404 ms A 785527736
                  328 ms B 785527736
                  410 ms
                  333 ms


                  So why?
                  The Byte code is this:



                   private static multiB(int arg0) { // 2 * (i * i)
                  <localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>

                  L1 {
                  iconst_2
                  iload0
                  iload0
                  imul
                  imul
                  ireturn
                  }
                  L2 {
                  }
                  }

                  private static multi(int arg0) { // 2 * i * i
                  <localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>

                  L1 {
                  iconst_2
                  iload0
                  imul
                  iload0
                  imul
                  ireturn
                  }
                  L2 {
                  }
                  }


                  The difference being:

                  With brackets (2 * (i * i)):




                  • push const stack

                  • push local on stack

                  • push local on stack

                  • multiply top of stack

                  • multiply top of stack


                  Without brackets (2 * i * i):




                  • push const stack

                  • push local on stack

                  • multiply top of stack

                  • push local on stack

                  • multiply top of stack


                  Loading all on stack and then working back down is faster than switching between putting on stack and operating on it.






                  share|improve this answer

























                    up vote
                    9
                    down vote










                    up vote
                    9
                    down vote









                    ByteCodes: https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html

                    ByteCodes Viewer: https://github.com/Konloch/bytecode-viewer



                    On my JDK (Win10 64 1.8.0_65-b17) I can reproduce and explain:



                    public static void main(String args) {
                    int repeat = 10;
                    long A = 0;
                    long B = 0;
                    for (int i = 0; i < repeat; i++) {
                    A += test();
                    B += testB();
                    }

                    System.out.println(A / repeat + " ms");
                    System.out.println(B / repeat + " ms");
                    }


                    private static long test() {
                    int n = 0;
                    for (int i = 0; i < 1000; i++) {
                    n += multi(i);
                    }
                    long startTime = System.currentTimeMillis();
                    for (int i = 0; i < 1000000000; i++) {
                    n += multi(i);
                    }
                    long ms = (System.currentTimeMillis() - startTime);
                    System.out.println(ms + " ms A " + n);
                    return ms;
                    }


                    private static long testB() {
                    int n = 0;
                    for (int i = 0; i < 1000; i++) {
                    n += multiB(i);
                    }
                    long startTime = System.currentTimeMillis();
                    for (int i = 0; i < 1000000000; i++) {
                    n += multiB(i);
                    }
                    long ms = (System.currentTimeMillis() - startTime);
                    System.out.println(ms + " ms B " + n);
                    return ms;
                    }

                    private static int multiB(int i) {
                    return 2 * (i * i);
                    }

                    private static int multi(int i) {
                    return 2 * i * i;
                    }


                    Output:



                    ...
                    405 ms A 785527736
                    327 ms B 785527736
                    404 ms A 785527736
                    329 ms B 785527736
                    404 ms A 785527736
                    328 ms B 785527736
                    404 ms A 785527736
                    328 ms B 785527736
                    410 ms
                    333 ms


                    So why?
                    The Byte code is this:



                     private static multiB(int arg0) { // 2 * (i * i)
                    <localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>

                    L1 {
                    iconst_2
                    iload0
                    iload0
                    imul
                    imul
                    ireturn
                    }
                    L2 {
                    }
                    }

                    private static multi(int arg0) { // 2 * i * i
                    <localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>

                    L1 {
                    iconst_2
                    iload0
                    imul
                    iload0
                    imul
                    ireturn
                    }
                    L2 {
                    }
                    }


                    The difference being:

                    With brackets (2 * (i * i)):




                    • push const stack

                    • push local on stack

                    • push local on stack

                    • multiply top of stack

                    • multiply top of stack


                    Without brackets (2 * i * i):




                    • push const stack

                    • push local on stack

                    • multiply top of stack

                    • push local on stack

                    • multiply top of stack


                    Loading all on stack and then working back down is faster than switching between putting on stack and operating on it.






                    share|improve this answer














                    ByteCodes: https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html

                    ByteCodes Viewer: https://github.com/Konloch/bytecode-viewer



                    On my JDK (Win10 64 1.8.0_65-b17) I can reproduce and explain:



                    public static void main(String args) {
                    int repeat = 10;
                    long A = 0;
                    long B = 0;
                    for (int i = 0; i < repeat; i++) {
                    A += test();
                    B += testB();
                    }

                    System.out.println(A / repeat + " ms");
                    System.out.println(B / repeat + " ms");
                    }


                    private static long test() {
                    int n = 0;
                    for (int i = 0; i < 1000; i++) {
                    n += multi(i);
                    }
                    long startTime = System.currentTimeMillis();
                    for (int i = 0; i < 1000000000; i++) {
                    n += multi(i);
                    }
                    long ms = (System.currentTimeMillis() - startTime);
                    System.out.println(ms + " ms A " + n);
                    return ms;
                    }


                    private static long testB() {
                    int n = 0;
                    for (int i = 0; i < 1000; i++) {
                    n += multiB(i);
                    }
                    long startTime = System.currentTimeMillis();
                    for (int i = 0; i < 1000000000; i++) {
                    n += multiB(i);
                    }
                    long ms = (System.currentTimeMillis() - startTime);
                    System.out.println(ms + " ms B " + n);
                    return ms;
                    }

                    private static int multiB(int i) {
                    return 2 * (i * i);
                    }

                    private static int multi(int i) {
                    return 2 * i * i;
                    }


                    Output:



                    ...
                    405 ms A 785527736
                    327 ms B 785527736
                    404 ms A 785527736
                    329 ms B 785527736
                    404 ms A 785527736
                    328 ms B 785527736
                    404 ms A 785527736
                    328 ms B 785527736
                    410 ms
                    333 ms


                    So why?
                    The Byte code is this:



                     private static multiB(int arg0) { // 2 * (i * i)
                    <localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>

                    L1 {
                    iconst_2
                    iload0
                    iload0
                    imul
                    imul
                    ireturn
                    }
                    L2 {
                    }
                    }

                    private static multi(int arg0) { // 2 * i * i
                    <localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>

                    L1 {
                    iconst_2
                    iload0
                    imul
                    iload0
                    imul
                    ireturn
                    }
                    L2 {
                    }
                    }


                    The difference being:

                    With brackets (2 * (i * i)):




                    • push const stack

                    • push local on stack

                    • push local on stack

                    • multiply top of stack

                    • multiply top of stack


                    Without brackets (2 * i * i):




                    • push const stack

                    • push local on stack

                    • multiply top of stack

                    • push local on stack

                    • multiply top of stack


                    Loading all on stack and then working back down is faster than switching between putting on stack and operating on it.







                    share|improve this answer














                    share|improve this answer



                    share|improve this answer








                    edited yesterday

























                    answered yesterday









                    DSchmidt

                    25329




                    25329






















                        up vote
                        4
                        down vote













                        I tried a JMH using the default archetype: I also added optimized version based Runemoro' explanation .



                        @State(Scope.Benchmark)
                        @Warmup(iterations = 2)
                        @Fork(1)
                        @Measurement(iterations = 10)
                        @OutputTimeUnit(TimeUnit.NANOSECONDS)
                        //@BenchmarkMode({ Mode.All })
                        @BenchmarkMode(Mode.AverageTime)
                        public class MyBenchmark {
                        @Param({ "100", "1000", "1000000000" })
                        private int size;

                        @Benchmark
                        public int two_square_i() {
                        int n = 0;
                        for (int i = 0; i < size; i++) {
                        n += 2 * (i * i);
                        }
                        return n;
                        }

                        @Benchmark
                        public int square_i_two() {
                        int n = 0;
                        for (int i = 0; i < size; i++) {
                        n += i * i;
                        }
                        return 2*n;
                        }

                        @Benchmark
                        public int two_i_() {
                        int n = 0;
                        for (int i = 0; i < size; i++) {
                        n += 2 * i * i;
                        }
                        return n;
                        }
                        }


                        The result are here:



                        Benchmark                           (size)  Mode  Samples          Score   Score error  Units
                        o.s.MyBenchmark.square_i_two 100 avgt 10 58,062 1,410 ns/op
                        o.s.MyBenchmark.square_i_two 1000 avgt 10 547,393 12,851 ns/op
                        o.s.MyBenchmark.square_i_two 1000000000 avgt 10 540343681,267 16795210,324 ns/op
                        o.s.MyBenchmark.two_i_ 100 avgt 10 87,491 2,004 ns/op
                        o.s.MyBenchmark.two_i_ 1000 avgt 10 1015,388 30,313 ns/op
                        o.s.MyBenchmark.two_i_ 1000000000 avgt 10 967100076,600 24929570,556 ns/op
                        o.s.MyBenchmark.two_square_i 100 avgt 10 70,715 2,107 ns/op
                        o.s.MyBenchmark.two_square_i 1000 avgt 10 686,977 24,613 ns/op
                        o.s.MyBenchmark.two_square_i 1000000000 avgt 10 652736811,450 27015580,488 ns/op


                        On my PC (Core i7 860, doing nothing much apart reading on my smartphone):





                        • n += i*i then n*2 is first


                        • 2 * (i * i) is second.


                        The JVM is clearly not optimizing the same way than a human does (based on Runemoro answer).



                        Now then, reading bytecode: javap -c -v ./target/classes/org/sample/MyBenchmark.class




                        • Differences between 2*(i*i) (left) and 2*i*i (right) here: https://www.diffchecker.com/cvSFppWI

                        • Differences between 2*(i*i) and the optimized version here: https://www.diffchecker.com/I1XFu5dP


                        I am not expert on bytecode but we iload_2 before we imul: that's probably where you get the difference: I can suppose that the JVM optimize reading i twice (i is already here, there is no need to load it again) whilst in the 2*i*i it can't.






                        share|improve this answer

























                          up vote
                          4
                          down vote













                          I tried a JMH using the default archetype: I also added optimized version based Runemoro' explanation .



                          @State(Scope.Benchmark)
                          @Warmup(iterations = 2)
                          @Fork(1)
                          @Measurement(iterations = 10)
                          @OutputTimeUnit(TimeUnit.NANOSECONDS)
                          //@BenchmarkMode({ Mode.All })
                          @BenchmarkMode(Mode.AverageTime)
                          public class MyBenchmark {
                          @Param({ "100", "1000", "1000000000" })
                          private int size;

                          @Benchmark
                          public int two_square_i() {
                          int n = 0;
                          for (int i = 0; i < size; i++) {
                          n += 2 * (i * i);
                          }
                          return n;
                          }

                          @Benchmark
                          public int square_i_two() {
                          int n = 0;
                          for (int i = 0; i < size; i++) {
                          n += i * i;
                          }
                          return 2*n;
                          }

                          @Benchmark
                          public int two_i_() {
                          int n = 0;
                          for (int i = 0; i < size; i++) {
                          n += 2 * i * i;
                          }
                          return n;
                          }
                          }


                          The result are here:



                          Benchmark                           (size)  Mode  Samples          Score   Score error  Units
                          o.s.MyBenchmark.square_i_two 100 avgt 10 58,062 1,410 ns/op
                          o.s.MyBenchmark.square_i_two 1000 avgt 10 547,393 12,851 ns/op
                          o.s.MyBenchmark.square_i_two 1000000000 avgt 10 540343681,267 16795210,324 ns/op
                          o.s.MyBenchmark.two_i_ 100 avgt 10 87,491 2,004 ns/op
                          o.s.MyBenchmark.two_i_ 1000 avgt 10 1015,388 30,313 ns/op
                          o.s.MyBenchmark.two_i_ 1000000000 avgt 10 967100076,600 24929570,556 ns/op
                          o.s.MyBenchmark.two_square_i 100 avgt 10 70,715 2,107 ns/op
                          o.s.MyBenchmark.two_square_i 1000 avgt 10 686,977 24,613 ns/op
                          o.s.MyBenchmark.two_square_i 1000000000 avgt 10 652736811,450 27015580,488 ns/op


                          On my PC (Core i7 860, doing nothing much apart reading on my smartphone):





                          • n += i*i then n*2 is first


                          • 2 * (i * i) is second.


                          The JVM is clearly not optimizing the same way than a human does (based on Runemoro answer).



                          Now then, reading bytecode: javap -c -v ./target/classes/org/sample/MyBenchmark.class




                          • Differences between 2*(i*i) (left) and 2*i*i (right) here: https://www.diffchecker.com/cvSFppWI

                          • Differences between 2*(i*i) and the optimized version here: https://www.diffchecker.com/I1XFu5dP


                          I am not expert on bytecode but we iload_2 before we imul: that's probably where you get the difference: I can suppose that the JVM optimize reading i twice (i is already here, there is no need to load it again) whilst in the 2*i*i it can't.






                          share|improve this answer























                            up vote
                            4
                            down vote










                            up vote
                            4
                            down vote









                            I tried a JMH using the default archetype: I also added optimized version based Runemoro' explanation .



                            @State(Scope.Benchmark)
                            @Warmup(iterations = 2)
                            @Fork(1)
                            @Measurement(iterations = 10)
                            @OutputTimeUnit(TimeUnit.NANOSECONDS)
                            //@BenchmarkMode({ Mode.All })
                            @BenchmarkMode(Mode.AverageTime)
                            public class MyBenchmark {
                            @Param({ "100", "1000", "1000000000" })
                            private int size;

                            @Benchmark
                            public int two_square_i() {
                            int n = 0;
                            for (int i = 0; i < size; i++) {
                            n += 2 * (i * i);
                            }
                            return n;
                            }

                            @Benchmark
                            public int square_i_two() {
                            int n = 0;
                            for (int i = 0; i < size; i++) {
                            n += i * i;
                            }
                            return 2*n;
                            }

                            @Benchmark
                            public int two_i_() {
                            int n = 0;
                            for (int i = 0; i < size; i++) {
                            n += 2 * i * i;
                            }
                            return n;
                            }
                            }


                            The result are here:



                            Benchmark                           (size)  Mode  Samples          Score   Score error  Units
                            o.s.MyBenchmark.square_i_two 100 avgt 10 58,062 1,410 ns/op
                            o.s.MyBenchmark.square_i_two 1000 avgt 10 547,393 12,851 ns/op
                            o.s.MyBenchmark.square_i_two 1000000000 avgt 10 540343681,267 16795210,324 ns/op
                            o.s.MyBenchmark.two_i_ 100 avgt 10 87,491 2,004 ns/op
                            o.s.MyBenchmark.two_i_ 1000 avgt 10 1015,388 30,313 ns/op
                            o.s.MyBenchmark.two_i_ 1000000000 avgt 10 967100076,600 24929570,556 ns/op
                            o.s.MyBenchmark.two_square_i 100 avgt 10 70,715 2,107 ns/op
                            o.s.MyBenchmark.two_square_i 1000 avgt 10 686,977 24,613 ns/op
                            o.s.MyBenchmark.two_square_i 1000000000 avgt 10 652736811,450 27015580,488 ns/op


                            On my PC (Core i7 860, doing nothing much apart reading on my smartphone):





                            • n += i*i then n*2 is first


                            • 2 * (i * i) is second.


                            The JVM is clearly not optimizing the same way than a human does (based on Runemoro answer).



                            Now then, reading bytecode: javap -c -v ./target/classes/org/sample/MyBenchmark.class




                            • Differences between 2*(i*i) (left) and 2*i*i (right) here: https://www.diffchecker.com/cvSFppWI

                            • Differences between 2*(i*i) and the optimized version here: https://www.diffchecker.com/I1XFu5dP


                            I am not expert on bytecode but we iload_2 before we imul: that's probably where you get the difference: I can suppose that the JVM optimize reading i twice (i is already here, there is no need to load it again) whilst in the 2*i*i it can't.






                            share|improve this answer












                            I tried a JMH using the default archetype: I also added optimized version based Runemoro' explanation .



                            @State(Scope.Benchmark)
                            @Warmup(iterations = 2)
                            @Fork(1)
                            @Measurement(iterations = 10)
                            @OutputTimeUnit(TimeUnit.NANOSECONDS)
                            //@BenchmarkMode({ Mode.All })
                            @BenchmarkMode(Mode.AverageTime)
                            public class MyBenchmark {
                            @Param({ "100", "1000", "1000000000" })
                            private int size;

                            @Benchmark
                            public int two_square_i() {
                            int n = 0;
                            for (int i = 0; i < size; i++) {
                            n += 2 * (i * i);
                            }
                            return n;
                            }

                            @Benchmark
                            public int square_i_two() {
                            int n = 0;
                            for (int i = 0; i < size; i++) {
                            n += i * i;
                            }
                            return 2*n;
                            }

                            @Benchmark
                            public int two_i_() {
                            int n = 0;
                            for (int i = 0; i < size; i++) {
                            n += 2 * i * i;
                            }
                            return n;
                            }
                            }


                            The result are here:



                            Benchmark                           (size)  Mode  Samples          Score   Score error  Units
                            o.s.MyBenchmark.square_i_two 100 avgt 10 58,062 1,410 ns/op
                            o.s.MyBenchmark.square_i_two 1000 avgt 10 547,393 12,851 ns/op
                            o.s.MyBenchmark.square_i_two 1000000000 avgt 10 540343681,267 16795210,324 ns/op
                            o.s.MyBenchmark.two_i_ 100 avgt 10 87,491 2,004 ns/op
                            o.s.MyBenchmark.two_i_ 1000 avgt 10 1015,388 30,313 ns/op
                            o.s.MyBenchmark.two_i_ 1000000000 avgt 10 967100076,600 24929570,556 ns/op
                            o.s.MyBenchmark.two_square_i 100 avgt 10 70,715 2,107 ns/op
                            o.s.MyBenchmark.two_square_i 1000 avgt 10 686,977 24,613 ns/op
                            o.s.MyBenchmark.two_square_i 1000000000 avgt 10 652736811,450 27015580,488 ns/op


                            On my PC (Core i7 860, doing nothing much apart reading on my smartphone):





                            • n += i*i then n*2 is first


                            • 2 * (i * i) is second.


                            The JVM is clearly not optimizing the same way than a human does (based on Runemoro answer).



                            Now then, reading bytecode: javap -c -v ./target/classes/org/sample/MyBenchmark.class




                            • Differences between 2*(i*i) (left) and 2*i*i (right) here: https://www.diffchecker.com/cvSFppWI

                            • Differences between 2*(i*i) and the optimized version here: https://www.diffchecker.com/I1XFu5dP


                            I am not expert on bytecode but we iload_2 before we imul: that's probably where you get the difference: I can suppose that the JVM optimize reading i twice (i is already here, there is no need to load it again) whilst in the 2*i*i it can't.







                            share|improve this answer












                            share|improve this answer



                            share|improve this answer










                            answered yesterday









                            NoDataFound

                            5,3541739




                            5,3541739






















                                up vote
                                2
                                down vote













                                I got similar results:



                                2 * (i * i): 0.458765943 s, n=119860736
                                2 * i * i: 0.580255126 s, n=119860736


                                I got the SAME results if both loops were in the same program, or each was in a separate .java file/.class, executed on a separate run.



                                Finally, here is a javap -c -v <.java> decompile of each:



                                     3: ldc           #3                  // String 2 * (i * i):
                                5: invokevirtual #4 // Method java/io/PrintStream.print:(Ljava/lang/String;)V
                                8: invokestatic #5 // Method java/lang/System.nanoTime:()J
                                8: invokestatic #5 // Method java/lang/System.nanoTime:()J
                                11: lstore_1
                                12: iconst_0
                                13: istore_3
                                14: iconst_0
                                15: istore 4
                                17: iload 4
                                19: ldc #6 // int 1000000000
                                21: if_icmpge 40
                                24: iload_3
                                25: iconst_2
                                26: iload 4
                                28: iload 4
                                30: imul
                                31: imul
                                32: iadd
                                33: istore_3
                                34: iinc 4, 1
                                37: goto 17


                                vs.



                                     3: ldc           #3                  // String 2 * i * i:
                                5: invokevirtual #4 // Method java/io/PrintStream.print:(Ljava/lang/String;)V
                                8: invokestatic #5 // Method java/lang/System.nanoTime:()J
                                11: lstore_1
                                12: iconst_0
                                13: istore_3
                                14: iconst_0
                                15: istore 4
                                17: iload 4
                                19: ldc #6 // int 1000000000
                                21: if_icmpge 40
                                24: iload_3
                                25: iconst_2
                                26: iload 4
                                28: imul
                                29: iload 4
                                31: imul
                                32: iadd
                                33: istore_3
                                34: iinc 4, 1
                                37: goto 17


                                FYI -



                                java -version
                                java version "1.8.0_121"
                                Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
                                Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)





                                share|improve this answer

















                                • 1




                                  A better answer and maybe you can vote to undelete - stackoverflow.com/a/53452836/1746118 ... Side note - I am not the downvoter anyway.
                                  – nullpointer
                                  yesterday












                                • @nullpointer - I agree. I'd definitely vote to undelete, if I could. I'd also like to "double upvote" stefan for giving a quantitative definition of "significant"
                                  – paulsm4
                                  yesterday












                                • That one was self-deleted since it measured the wrong thing - see that author's comment on the question above
                                  – Krease
                                  yesterday






                                • 1




                                  Get a debug jre and run with -XX:+PrintOptoAssembly. Or just use vtune or alike.
                                  – rustyx
                                  yesterday






                                • 1




                                  @ rustyx - If the problem is the JIT implementation ... then "getting a debug version" OF A COMPLETELY DIFFERENT JRE isn't necessarily going to help. Nevertheless: it sounds like what you found above with your JIT disassembly on your JRE also explains the behavior on the OP's JRE and mine. And also explains why other JRE's behave "differently". +1: thank you for the excellent detective work!
                                  – paulsm4
                                  22 hours ago

















                                up vote
                                2
                                down vote













                                I got similar results:



                                2 * (i * i): 0.458765943 s, n=119860736
                                2 * i * i: 0.580255126 s, n=119860736


                                I got the SAME results if both loops were in the same program, or each was in a separate .java file/.class, executed on a separate run.



                                Finally, here is a javap -c -v <.java> decompile of each:



                                     3: ldc           #3                  // String 2 * (i * i):
                                5: invokevirtual #4 // Method java/io/PrintStream.print:(Ljava/lang/String;)V
                                8: invokestatic #5 // Method java/lang/System.nanoTime:()J
                                8: invokestatic #5 // Method java/lang/System.nanoTime:()J
                                11: lstore_1
                                12: iconst_0
                                13: istore_3
                                14: iconst_0
                                15: istore 4
                                17: iload 4
                                19: ldc #6 // int 1000000000
                                21: if_icmpge 40
                                24: iload_3
                                25: iconst_2
                                26: iload 4
                                28: iload 4
                                30: imul
                                31: imul
                                32: iadd
                                33: istore_3
                                34: iinc 4, 1
                                37: goto 17


                                vs.



                                     3: ldc           #3                  // String 2 * i * i:
                                5: invokevirtual #4 // Method java/io/PrintStream.print:(Ljava/lang/String;)V
                                8: invokestatic #5 // Method java/lang/System.nanoTime:()J
                                11: lstore_1
                                12: iconst_0
                                13: istore_3
                                14: iconst_0
                                15: istore 4
                                17: iload 4
                                19: ldc #6 // int 1000000000
                                21: if_icmpge 40
                                24: iload_3
                                25: iconst_2
                                26: iload 4
                                28: imul
                                29: iload 4
                                31: imul
                                32: iadd
                                33: istore_3
                                34: iinc 4, 1
                                37: goto 17


                                FYI -



                                java -version
                                java version "1.8.0_121"
                                Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
                                Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)





                                share|improve this answer

















                                • 1




                                  A better answer and maybe you can vote to undelete - stackoverflow.com/a/53452836/1746118 ... Side note - I am not the downvoter anyway.
                                  – nullpointer
                                  yesterday












                                • @nullpointer - I agree. I'd definitely vote to undelete, if I could. I'd also like to "double upvote" stefan for giving a quantitative definition of "significant"
                                  – paulsm4
                                  yesterday












                                • That one was self-deleted since it measured the wrong thing - see that author's comment on the question above
                                  – Krease
                                  yesterday






                                • 1




                                  Get a debug jre and run with -XX:+PrintOptoAssembly. Or just use vtune or alike.
                                  – rustyx
                                  yesterday






                                • 1




                                  @ rustyx - If the problem is the JIT implementation ... then "getting a debug version" OF A COMPLETELY DIFFERENT JRE isn't necessarily going to help. Nevertheless: it sounds like what you found above with your JIT disassembly on your JRE also explains the behavior on the OP's JRE and mine. And also explains why other JRE's behave "differently". +1: thank you for the excellent detective work!
                                  – paulsm4
                                  22 hours ago















                                up vote
                                2
                                down vote










                                up vote
                                2
                                down vote









                                I got similar results:



                                2 * (i * i): 0.458765943 s, n=119860736
                                2 * i * i: 0.580255126 s, n=119860736


                                I got the SAME results if both loops were in the same program, or each was in a separate .java file/.class, executed on a separate run.



                                Finally, here is a javap -c -v <.java> decompile of each:



                                     3: ldc           #3                  // String 2 * (i * i):
                                5: invokevirtual #4 // Method java/io/PrintStream.print:(Ljava/lang/String;)V
                                8: invokestatic #5 // Method java/lang/System.nanoTime:()J
                                8: invokestatic #5 // Method java/lang/System.nanoTime:()J
                                11: lstore_1
                                12: iconst_0
                                13: istore_3
                                14: iconst_0
                                15: istore 4
                                17: iload 4
                                19: ldc #6 // int 1000000000
                                21: if_icmpge 40
                                24: iload_3
                                25: iconst_2
                                26: iload 4
                                28: iload 4
                                30: imul
                                31: imul
                                32: iadd
                                33: istore_3
                                34: iinc 4, 1
                                37: goto 17


                                vs.



                                     3: ldc           #3                  // String 2 * i * i:
                                5: invokevirtual #4 // Method java/io/PrintStream.print:(Ljava/lang/String;)V
                                8: invokestatic #5 // Method java/lang/System.nanoTime:()J
                                11: lstore_1
                                12: iconst_0
                                13: istore_3
                                14: iconst_0
                                15: istore 4
                                17: iload 4
                                19: ldc #6 // int 1000000000
                                21: if_icmpge 40
                                24: iload_3
                                25: iconst_2
                                26: iload 4
                                28: imul
                                29: iload 4
                                31: imul
                                32: iadd
                                33: istore_3
                                34: iinc 4, 1
                                37: goto 17


                                FYI -



                                java -version
                                java version "1.8.0_121"
                                Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
                                Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)





                                share|improve this answer












                                I got similar results:



                                2 * (i * i): 0.458765943 s, n=119860736
                                2 * i * i: 0.580255126 s, n=119860736


                                I got the SAME results if both loops were in the same program, or each was in a separate .java file/.class, executed on a separate run.



                                Finally, here is a javap -c -v <.java> decompile of each:



                                     3: ldc           #3                  // String 2 * (i * i):
                                5: invokevirtual #4 // Method java/io/PrintStream.print:(Ljava/lang/String;)V
                                8: invokestatic #5 // Method java/lang/System.nanoTime:()J
                                8: invokestatic #5 // Method java/lang/System.nanoTime:()J
                                11: lstore_1
                                12: iconst_0
                                13: istore_3
                                14: iconst_0
                                15: istore 4
                                17: iload 4
                                19: ldc #6 // int 1000000000
                                21: if_icmpge 40
                                24: iload_3
                                25: iconst_2
                                26: iload 4
                                28: iload 4
                                30: imul
                                31: imul
                                32: iadd
                                33: istore_3
                                34: iinc 4, 1
                                37: goto 17


                                vs.



                                     3: ldc           #3                  // String 2 * i * i:
                                5: invokevirtual #4 // Method java/io/PrintStream.print:(Ljava/lang/String;)V
                                8: invokestatic #5 // Method java/lang/System.nanoTime:()J
                                11: lstore_1
                                12: iconst_0
                                13: istore_3
                                14: iconst_0
                                15: istore 4
                                17: iload 4
                                19: ldc #6 // int 1000000000
                                21: if_icmpge 40
                                24: iload_3
                                25: iconst_2
                                26: iload 4
                                28: imul
                                29: iload 4
                                31: imul
                                32: iadd
                                33: istore_3
                                34: iinc 4, 1
                                37: goto 17


                                FYI -



                                java -version
                                java version "1.8.0_121"
                                Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
                                Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)






                                share|improve this answer












                                share|improve this answer



                                share|improve this answer










                                answered yesterday









                                paulsm4

                                75.7k898122




                                75.7k898122








                                • 1




                                  A better answer and maybe you can vote to undelete - stackoverflow.com/a/53452836/1746118 ... Side note - I am not the downvoter anyway.
                                  – nullpointer
                                  yesterday












                                • @nullpointer - I agree. I'd definitely vote to undelete, if I could. I'd also like to "double upvote" stefan for giving a quantitative definition of "significant"
                                  – paulsm4
                                  yesterday












                                • That one was self-deleted since it measured the wrong thing - see that author's comment on the question above
                                  – Krease
                                  yesterday






                                • 1




                                  Get a debug jre and run with -XX:+PrintOptoAssembly. Or just use vtune or alike.
                                  – rustyx
                                  yesterday






                                • 1




                                  @ rustyx - If the problem is the JIT implementation ... then "getting a debug version" OF A COMPLETELY DIFFERENT JRE isn't necessarily going to help. Nevertheless: it sounds like what you found above with your JIT disassembly on your JRE also explains the behavior on the OP's JRE and mine. And also explains why other JRE's behave "differently". +1: thank you for the excellent detective work!
                                  – paulsm4
                                  22 hours ago
















                                • 1




                                  A better answer and maybe you can vote to undelete - stackoverflow.com/a/53452836/1746118 ... Side note - I am not the downvoter anyway.
                                  – nullpointer
                                  yesterday












                                • @nullpointer - I agree. I'd definitely vote to undelete, if I could. I'd also like to "double upvote" stefan for giving a quantitative definition of "significant"
                                  – paulsm4
                                  yesterday












                                • That one was self-deleted since it measured the wrong thing - see that author's comment on the question above
                                  – Krease
                                  yesterday






                                • 1




                                  Get a debug jre and run with -XX:+PrintOptoAssembly. Or just use vtune or alike.
                                  – rustyx
                                  yesterday






                                • 1




                                  @ rustyx - If the problem is the JIT implementation ... then "getting a debug version" OF A COMPLETELY DIFFERENT JRE isn't necessarily going to help. Nevertheless: it sounds like what you found above with your JIT disassembly on your JRE also explains the behavior on the OP's JRE and mine. And also explains why other JRE's behave "differently". +1: thank you for the excellent detective work!
                                  – paulsm4
                                  22 hours ago










                                1




                                1




                                A better answer and maybe you can vote to undelete - stackoverflow.com/a/53452836/1746118 ... Side note - I am not the downvoter anyway.
                                – nullpointer
                                yesterday






                                A better answer and maybe you can vote to undelete - stackoverflow.com/a/53452836/1746118 ... Side note - I am not the downvoter anyway.
                                – nullpointer
                                yesterday














                                @nullpointer - I agree. I'd definitely vote to undelete, if I could. I'd also like to "double upvote" stefan for giving a quantitative definition of "significant"
                                – paulsm4
                                yesterday






                                @nullpointer - I agree. I'd definitely vote to undelete, if I could. I'd also like to "double upvote" stefan for giving a quantitative definition of "significant"
                                – paulsm4
                                yesterday














                                That one was self-deleted since it measured the wrong thing - see that author's comment on the question above
                                – Krease
                                yesterday




                                That one was self-deleted since it measured the wrong thing - see that author's comment on the question above
                                – Krease
                                yesterday




                                1




                                1




                                Get a debug jre and run with -XX:+PrintOptoAssembly. Or just use vtune or alike.
                                – rustyx
                                yesterday




                                Get a debug jre and run with -XX:+PrintOptoAssembly. Or just use vtune or alike.
                                – rustyx
                                yesterday




                                1




                                1




                                @ rustyx - If the problem is the JIT implementation ... then "getting a debug version" OF A COMPLETELY DIFFERENT JRE isn't necessarily going to help. Nevertheless: it sounds like what you found above with your JIT disassembly on your JRE also explains the behavior on the OP's JRE and mine. And also explains why other JRE's behave "differently". +1: thank you for the excellent detective work!
                                – paulsm4
                                22 hours ago






                                @ rustyx - If the problem is the JIT implementation ... then "getting a debug version" OF A COMPLETELY DIFFERENT JRE isn't necessarily going to help. Nevertheless: it sounds like what you found above with your JIT disassembly on your JRE also explains the behavior on the OP's JRE and mine. And also explains why other JRE's behave "differently". +1: thank you for the excellent detective work!
                                – paulsm4
                                22 hours ago




















                                 

                                draft saved


                                draft discarded



















































                                 


                                draft saved


                                draft discarded














                                StackExchange.ready(
                                function () {
                                StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53452713%2fwhy-is-2-i-i-faster-than-2-i-i%23new-answer', 'question_page');
                                }
                                );

                                Post as a guest















                                Required, but never shown





















































                                Required, but never shown














                                Required, but never shown












                                Required, but never shown







                                Required, but never shown

































                                Required, but never shown














                                Required, but never shown












                                Required, but never shown







                                Required, but never shown







                                Popular posts from this blog

                                Accessing regular linux commands in Huawei's Dopra Linux

                                Can't connect RFCOMM socket: Host is down

                                Kernel panic - not syncing: Fatal Exception in Interrupt