jit
Intro: I write high-performance code in C#. Yes, I know C++ would give me better optimization, but I still choose to use C#. I do not wish to debate that choice. Rather, I'd like to hear from those who, like me, are trying to write high-performance code on the .NET Framework.
Questions:
- Why is the operator in the code below slower than the equivalent method call?
- Why is the method passing two doubles in the code below faster than the equivalent method passing a struct that holds two doubles? (A: older JITs optimize structs poorly)
- Is there a way to get the .NET JIT Compiler to treat simple structs as efficiently as the members of the struct? (A: get a newer JIT)
What I think I know:
The original .NET JIT Compiler would not inline anything that involved a struct. Bizarre, given that structs should only be used where you need small value types that ought to be optimized like built-ins, but true. Fortunately, in .NET 3.5 SP1 and .NET 2.0 SP2, they made some improvements to the JIT Optimizer, including improvements to inlining, particularly for structs. (I am guessing they did that because otherwise the new Complex struct they were introducing would have performed horribly... so the Complex team was probably pounding on the JIT Optimizer team.) So, any documentation prior to .NET 3.5 SP1 is probably not too relevant to this issue.
What my testing shows:
I have verified that I do have the newer JIT Optimizer by checking that the C:\Windows\Microsoft.NET\Framework\v2.0.50727\mscorwks.dll file has version >= 3053, and so should have those improvements to the JIT optimizer. However, even with that, my timings and a look at the disassembly both show:
The JIT-produced code for passing a struct with two doubles is far less efficient than code that directly passes the two doubles.
The JIT-produced code for a struct method passes in 'this' far more efficiently than if you passed a struct as an argument.
The JIT still inlines better if you pass two doubles rather than a struct containing two doubles, even though the call site is clearly inside a loop (which should multiply the benefit of inlining).
The Timings:
Actually, looking at the disassembly I realize that most of the time in the loops is just spent accessing the test data out of the List. The difference between the four ways of making the same calls becomes dramatic if you factor out the overhead code of the loop and the accessing of the data. I get anywhere from 5x to 20x speedups for doing PlusEqual(double, double) instead of PlusEqual(Element). And 10x to 40x for doing PlusEqual(double, double) instead of operator +=. Wow. Sad.
Here's one set of timings:
Populating List<Element> took 320ms.
The PlusEqual() method took 105ms.
The 'same' += operator took 131ms.
The 'same' -= operator took 139ms.
The PlusEqual(double, double) method took 68ms.
The do nothing loop took 66ms.
The ratio of operator with constructor to method is 124%.
The ratio of operator without constructor to method is 132%.
The ratio of PlusEqual(double,double) to PlusEqual(Element) is 64%.
If we remove the overhead time for the loop accessing the elements from the List...
The ratio of operator with constructor to method is 166%.
The ratio of operator without constructor to method is 187%.
The ratio of PlusEqual(double,double) to PlusEqual(Element) is 5%.
The Code:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using Microsoft.VisualStudio.TestTools.UnitTesting;
namespace OperatorVsMethod
{
public struct Element
{
public double Left;
public double Right;
public Element(double left, double right)
{
this.Left = left;
this.Right = right;
}
public static Element operator +(Element x, Element y)
{
return new Element(x.Left + y.Left, x.Right + y.Right);
}
// Note: despite its name, this operator adds; it reuses x so the test can
// measure '+=' semantics without calling the Element constructor.
public static Element operator -(Element x, Element y)
{
x.Left += y.Left;
x.Right += y.Right;
return x;
}
/// <summary>
/// Like the += operator; but faster.
/// </summary>
public void PlusEqual(Element that)
{
this.Left += that.Left;
this.Right += that.Right;
}
/// <summary>
/// Like the += operator; but faster.
/// </summary>
public void PlusEqual(double thatLeft, double thatRight)
{
this.Left += thatLeft;
this.Right += thatRight;
}
}
[TestClass]
public class UnitTest1
{
[TestMethod]
public void TestMethod1()
{
Stopwatch stopwatch = new Stopwatch();
// Populate a List of Elements to multiply together
int seedSize = 4;
List<double> doubles = new List<double>(seedSize);
doubles.Add(2.5d);
doubles.Add(100000d);
doubles.Add(-0.5d);
doubles.Add(-100002d);
int size = 2500000 * seedSize;
List<Element> elts = new List<Element>(size);
stopwatch.Reset();
stopwatch.Start();
for (int ii = 0; ii < size; ++ii)
{
int di = ii % seedSize;
double d = doubles[di];
elts.Add(new Element(d, d));
}
stopwatch.Stop();
long populateMS = stopwatch.ElapsedMilliseconds;
// Measure speed of += operator (calls ctor)
Element operatorCtorResult = new Element(1d, 1d);
stopwatch.Reset();
stopwatch.Start();
for (int ii = 0; ii < size; ++ii)
{
operatorCtorResult += elts[ii];
}
stopwatch.Stop();
long operatorCtorMS = stopwatch.ElapsedMilliseconds;
// Measure speed of -= operator (+= without ctor)
Element operatorNoCtorResult = new Element(1d, 1d);
stopwatch.Reset();
stopwatch.Start();
for (int ii = 0; ii < size; ++ii)
{
operatorNoCtorResult -= elts[ii];
}
stopwatch.Stop();
long operatorNoCtorMS = stopwatch.ElapsedMilliseconds;
// Measure speed of PlusEqual(Element) method
Element plusEqualResult = new Element(1d, 1d);
stopwatch.Reset();
stopwatch.Start();
for (int ii = 0; ii < size; ++ii)
{
plusEqualResult.PlusEqual(elts[ii]);
}
stopwatch.Stop();
long plusEqualMS = stopwatch.ElapsedMilliseconds;
// Measure speed of PlusEqual(double, double) method
Element plusEqualDDResult = new Element(1d, 1d);
stopwatch.Reset();
stopwatch.Start();
for (int ii = 0; ii < size; ++ii)
{
Element elt = elts[ii];
plusEqualDDResult.PlusEqual(elt.Left, elt.Right);
}
stopwatch.Stop();
long plusEqualDDMS = stopwatch.ElapsedMilliseconds;
// Measure speed of doing nothing but accessing the Element
Element doNothingResult = new Element(1d, 1d);
stopwatch.Reset();
stopwatch.Start();
for (int ii = 0; ii < size; ++ii)
{
Element elt = elts[ii];
double left = elt.Left;
double right = elt.Right;
}
stopwatch.Stop();
long doNothingMS = stopwatch.ElapsedMilliseconds;
// Report results
Assert.AreEqual(1d, operatorCtorResult.Left, "The operator += did not compute the right result!");
Assert.AreEqual(1d, operatorNoCtorResult.Left, "The operator += did not compute the right result!");
Assert.AreEqual(1d, plusEqualResult.Left, "The operator += did not compute the right result!");
Assert.AreEqual(1d, plusEqualDDResult.Left, "The operator += did not compute the right result!");
Assert.AreEqual(1d, doNothingResult.Left, "The operator += did not compute the right result!");
// Report speeds
Console.WriteLine("Populating List<Element> took {0}ms.", populateMS);
Console.WriteLine("The PlusEqual() method took {0}ms.", plusEqualMS);
Console.WriteLine("The 'same' += operator took {0}ms.", operatorCtorMS);
Console.WriteLine("The 'same' -= operator took {0}ms.", operatorNoCtorMS);
Console.WriteLine("The PlusEqual(double, double) method took {0}ms.", plusEqualDDMS);
Console.WriteLine("The do nothing loop took {0}ms.", doNothingMS);
// Compare speeds
long percentageRatio = 100L * operatorCtorMS / plusEqualMS;
Console.WriteLine("The ratio of operator with constructor to method is {0}%.", percentageRatio);
percentageRatio = 100L * operatorNoCtorMS / plusEqualMS;
Console.WriteLine("The ratio of operator without constructor to method is {0}%.", percentageRatio);
percentageRatio = 100L * plusEqualDDMS / plusEqualMS;
Console.WriteLine("The ratio of PlusEqual(double,double) to PlusEqual(Element) is {0}%.", percentageRatio);
operatorCtorMS -= doNothingMS;
operatorNoCtorMS -= doNothingMS;
plusEqualMS -= doNothingMS;
plusEqualDDMS -= doNothingMS;
Console.WriteLine("If we remove the overhead time for the loop accessing the elements from the List...");
percentageRatio = 100L * operatorCtorMS / plusEqualMS;
Console.WriteLine("The ratio of operator with constructor to method is {0}%.", percentageRatio);
percentageRatio = 100L * operatorNoCtorMS / plusEqualMS;
Console.WriteLine("The ratio of operator without constructor to method is {0}%.", percentageRatio);
percentageRatio = 100L * plusEqualDDMS / plusEqualMS;
Console.WriteLine("The ratio of PlusEqual(double,double) to PlusEqual(Element) is {0}%.", percentageRatio);
}
}
}
The disassembly: (aka. the x86 the JIT produces for some of the above)
public void PlusEqual(Element that)
{
00000000 push ebp
00000001 mov ebp,esp
00000003 push edi
00000004 push esi
00000005 push ebx
00000006 sub esp,30h
00000009 xor eax,eax
0000000b mov dword ptr [ebp-10h],eax
0000000e xor eax,eax
00000010 mov dword ptr [ebp-1Ch],eax
00000013 mov dword ptr [ebp-3Ch],ecx
00000016 cmp dword ptr ds:[04C87B7Ch],0
0000001d je 00000024
0000001f call 753081B1
00000024 nop
this.Left += that.Left;
00000025 mov eax,dword ptr [ebp-3Ch]
00000028 fld qword ptr [ebp+8]
0000002b fadd qword ptr [eax]
0000002d fstp qword ptr [eax]
this.Right += that.Right;
0000002f mov eax,dword ptr [ebp-3Ch]
00000032 fld qword ptr [ebp+10h]
00000035 fadd qword ptr [eax+8]
00000038 fstp qword ptr [eax+8]
}
0000003b nop
0000003c lea esp,[ebp-0Ch]
0000003f pop ebx
00000040 pop esi
00000041 pop edi
00000042 pop ebp
00000043 ret 10h
public void PlusEqual(double thatLeft, double thatRight)
{
00000000 push ebp
00000001 mov ebp,esp
00000003 push edi
00000004 push esi
00000005 push ebx
00000006 sub esp,30h
00000009 xor eax,eax
0000000b mov dword ptr [ebp-10h],eax
0000000e xor eax,eax
00000010 mov dword ptr [ebp-1Ch],eax
00000013 mov dword ptr [ebp-3Ch],ecx
00000016 cmp dword ptr ds:[04C87B7Ch],0
0000001d je 00000024
0000001f call 75308159
00000024 nop
this.Left += thatLeft;
00000025 mov eax,dword ptr [ebp-3Ch]
00000028 fld qword ptr [ebp+10h]
0000002b fadd qword ptr [eax]
0000002d fstp qword ptr [eax]
this.Right += thatRight;
0000002f mov eax,dword ptr [ebp-3Ch]
00000032 fld qword ptr [ebp+8]
00000035 fadd qword ptr [eax+8]
00000038 fstp qword ptr [eax+8]
}
0000003b nop
0000003c lea esp,[ebp-0Ch]
0000003f pop ebx
00000040 pop esi
00000041 pop edi
00000042 pop ebp
00000043 ret 10h
Source: (StackOverflow)
Yesterday I found an article by Christoph Nahr titled ".NET Struct Performance" which benchmarked several languages (C++, C#, Java, JavaScript) for a method which adds two point structs (double tuples).
As it turned out, the C++ version takes about 1000ms to execute (1e9 iterations), while C# cannot get under ~3000ms on the same machine (and performs even worse in x64).
To test it myself, I took the C# code (and simplified it slightly to call only the method where parameters are passed by value), and ran it on an i7-3610QM machine (3.1 GHz boost for a single core), 8 GB RAM, Win8.1, using .NET 4.5.2, RELEASE build, 32-bit (x86 WoW64 since my OS is 64-bit). This is the simplified version:
public static class CSharpTest
{
private const int ITERATIONS = 1000000000;
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Point AddByVal(Point a, Point b)
{
return new Point(a.X + b.Y, a.Y + b.X);
}
public static void Main()
{
Point a = new Point(1, 1), b = new Point(1, 1);
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < ITERATIONS; i++)
a = AddByVal(a, b);
sw.Stop();
Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms",
a.X, a.Y, sw.ElapsedMilliseconds);
}
}
With Point defined as simply:
public struct Point
{
private readonly double _x, _y;
public Point(double x, double y) { _x = x; _y = y; }
public double X { get { return _x; } }
public double Y { get { return _y; } }
}
Running it produces results similar to those in the article:
Result: x=1000000001 y=1000000001, Time elapsed: 3159 ms
First strange observation
Since the method should be inlined, I wondered how the code would perform if I removed the structs altogether and simply inlined everything by hand:
public static class CSharpTest
{
private const int ITERATIONS = 1000000000;
public static void Main()
{
// not using structs at all here
double ax = 1, ay = 1, bx = 1, by = 1;
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < ITERATIONS; i++)
{
ax = ax + by;
ay = ay + bx;
}
sw.Stop();
Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms",
ax, ay, sw.ElapsedMilliseconds);
}
}
And I got practically the same result (actually 1% slower after several retries), meaning that the JITter seems to be doing a good job of optimizing away all the function calls:
Result: x=1000000001 y=1000000001, Time elapsed: 3200 ms
It also means that the benchmark doesn't seem to measure any struct performance; it actually only seems to measure basic double arithmetic (after everything else gets optimized away).
The weird stuff
Now comes the weird part. If I merely add another stopwatch outside the loop (yes, I narrowed it down to this crazy step after several retries), the code runs three times faster:
public static void Main()
{
var outerSw = Stopwatch.StartNew(); // <-- added
{
Point a = new Point(1, 1), b = new Point(1, 1);
var sw = Stopwatch.StartNew();
for (int i = 0; i < ITERATIONS; i++)
a = AddByVal(a, b);
sw.Stop();
Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms",
a.X, a.Y, sw.ElapsedMilliseconds);
}
outerSw.Stop(); // <-- added
}
Result: x=1000000001 y=1000000001, Time elapsed: 961 ms
That's ridiculous! And it's not like Stopwatch is giving me wrong results, because I can clearly see that it ends after a single second.
Can anyone tell me what might be happening here?
(Update)
I also presumed it has something to do with JITting, but repeating the test several times (without the outer stopwatch) is slow in all cases:
public static class CSharpTest
{
private const int ITERATIONS = 1000000000;
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Point AddByVal(Point a, Point b)
{
return new Point(a.X + b.Y, a.Y + b.X);
}
public static void Main()
{
Test();
Test();
Test();
}
private static void Test()
{
Point a = new Point(1, 1), b = new Point(1, 1);
var sw = Stopwatch.StartNew();
for (int i = 0; i < ITERATIONS; i++)
a = AddByVal(a, b);
sw.Stop();
Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms",
a.X, a.Y, sw.ElapsedMilliseconds);
}
}
Output:
Result: x=1000000001 y=1000000001, Time elapsed: 3251 ms
Result: x=1000000001 y=1000000001, Time elapsed: 3249 ms
Result: x=1000000001 y=1000000001, Time elapsed: 3250 ms
Again, with the outer stopwatch added, we get the magical boost:
public static void Main()
{
Test();
Test();
Test();
}
private static void Test()
{
var outerSw = Stopwatch.StartNew();
Point a = new Point(1, 1), b = new Point(1, 1);
var sw = Stopwatch.StartNew();
for (int i = 0; i < ITERATIONS; i++)
a = AddByVal(a, b);
sw.Stop();
Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms", a.X, a.Y, sw.ElapsedMilliseconds);
outerSw.Stop();
}
Result: x=1000000001 y=1000000001, Time elapsed: 979 ms
Result: x=1000000001 y=1000000001, Time elapsed: 982 ms
Result: x=1000000001 y=1000000001, Time elapsed: 978 ms
(Update 2)
Here are two methods in the same program, which shows that the reason is not JITting:
public static class CSharpTest
{
private const int ITERATIONS = 1000000000;
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Point AddByVal(Point a, Point b)
{
return new Point(a.X + b.Y, a.Y + b.X);
}
public static void Main()
{
Test1();
Test2();
Console.WriteLine();
Test1();
Test2();
}
private static void Test1()
{
Point a = new Point(1, 1), b = new Point(1, 1);
var sw = Stopwatch.StartNew();
for (int i = 0; i < ITERATIONS; i++)
a = AddByVal(a, b);
sw.Stop();
Console.WriteLine("Test1: x={0} y={1}, Time elapsed: {2} ms",
a.X, a.Y, sw.ElapsedMilliseconds);
}
private static void Test2()
{
var swOuter = Stopwatch.StartNew();
Point a = new Point(1, 1), b = new Point(1, 1);
var sw = Stopwatch.StartNew();
for (int i = 0; i < ITERATIONS; i++)
a = AddByVal(a, b);
sw.Stop();
Console.WriteLine("Test2: x={0} y={1}, Time elapsed: {2} ms",
a.X, a.Y, sw.ElapsedMilliseconds);
swOuter.Stop();
}
}
Output:
Test1: x=1000000001 y=1000000001, Time elapsed: 3242 ms
Test2: x=1000000001 y=1000000001, Time elapsed: 974 ms
Test1: x=1000000001 y=1000000001, Time elapsed: 3251 ms
Test2: x=1000000001 y=1000000001, Time elapsed: 972 ms
(Update 3)
Here is a pastebin. You need to run it as a 32-bit release on .NET 4.x (there are a couple of checks in the code to ensure this).
(Solution, perhaps?) (read @HansPassant's answer for details)
According to @Hans, the slowdown is happening due to 64-bit variables not always being aligned in 32-bit apps. If I add a 32-bit variable to one of the test methods, it will perform as fast as the C++ version:
private static void Test3()
{
var magical_speed_booster = "whatever";
{
Point a = new Point(1, 1), b = new Point(1, 1);
var sw = Stopwatch.StartNew();
for (int i = 0; i < ITERATIONS; i++)
a = AddByVal(a, b);
sw.Stop();
Console.WriteLine("Test2: x={0} y={1}, Time elapsed: {2} ms",
a.X, a.Y, sw.ElapsedMilliseconds);
}
GC.KeepAlive(magical_speed_booster);
}
(Update 4)
Following @usr's comments on @Hans' answer, I checked the optimized disassembly for both methods, and they are rather different:

This seems to show that the difference might be due to the compiler acting funny in the first case, rather than double field alignment?
Also, if I add two variables (a total offset of 8 bytes), I still get the same speed boost - and it no longer seems to be related to field alignment:
// this is still fast?
private static void Test3()
{
var magical_speed_booster_1 = "whatever";
var magical_speed_booster_2 = "whatever";
{
Point a = new Point(1, 1), b = new Point(1, 1);
var sw = Stopwatch.StartNew();
for (int i = 0; i < ITERATIONS; i++)
a = AddByVal(a, b);
sw.Stop();
Console.WriteLine("Test2: x={0} y={1}, Time elapsed: {2} ms",
a.X, a.Y, sw.ElapsedMilliseconds);
}
GC.KeepAlive(magical_speed_booster_1);
GC.KeepAlive(magical_speed_booster_2);
}
Source: (StackOverflow)
The canonical JVM implementation from Sun applies some pretty sophisticated optimization to bytecode to obtain near-native execution speeds after the code has been run a few times. The question is, why isn't this compiled code cached to disk for use during subsequent uses of the same function/class? As it stands, every time a program is executed, the JIT compiler kicks in afresh, rather than using a pre-compiled version of the code. Wouldn't adding this feature add a significant boost to the initial run time of the program, when the bytecode is essentially being interpreted?
Source: (StackOverflow)
I was trying to figure out hands-on how tail calls are handled by the C# compiler.
(Answer: They're not. But the 64bit JIT(s) WILL do TCE (tail call elimination). Restrictions apply.)
So I wrote a small test using a recursive call that prints how many times it gets called before the StackOverflowException kills the process.
using System;
class Program
{
static void Main(string[] args)
{
Rec();
}
static int sz = 0;
static Random r = new Random();
static void Rec()
{
sz++;
//uncomment for faster, more imprecise runs
//if (sz % 100 == 0)
{
//some code to keep this method from being inlined
var zz = r.Next();
Console.Write("{0} Random: {1}\r", sz, zz);
}
//uncommenting this stops TCE from happening
//else
//{
// Console.Write("{0}\r", sz);
//}
Rec();
}
}
Right on cue, the program ends with SO Exception on any of:
- 'Optimize build' OFF (either Debug or Release)
- Target: x86
- Target: AnyCPU + "Prefer 32 bit" (this is new in VS 2012 and the first time I saw it. More here.)
- Some seemingly innocuous branch in the code (see commented 'else' branch).
Conversely, using 'Optimize build' ON + (Target = x64 or AnyCPU with 'Prefer 32bit' OFF (on a 64bit CPU)), TCE happens and the counter keeps spinning up forever (ok, it arguably spins down each time its value overflows).
But I noticed a behaviour I can't explain in the StackOverflowException case: it never (?) happens at exactly the same stack depth. Here are the outputs of a few 32-bit runs, Release build:
51600 Random: 1778264579
Process is terminated due to StackOverflowException.
51599 Random: 1515673450
Process is terminated due to StackOverflowException.
51602 Random: 1567871768
Process is terminated due to StackOverflowException.
51535 Random: 2760045665
Process is terminated due to StackOverflowException.
And Debug build:
28641 Random: 4435795885
Process is terminated due to StackOverflowException.
28641 Random: 4873901326 //never say never
Process is terminated due to StackOverflowException.
28623 Random: 7255802746
Process is terminated due to StackOverflowException.
28669 Random: 1613806023
Process is terminated due to StackOverflowException.
The stack size is constant (defaults to 1 MB). The stack frames' sizes are constant.
So then, what can account for the (sometimes non-trivial) variation of stack depth when the StackOverflowException hits?
UPDATE
Hans Passant raises the issue of Console.WriteLine touching P/Invoke, interop and possibly non-deterministic locking.
So I simplified the code to this:
class Program
{
static void Main(string[] args)
{
Rec();
}
static int sz = 0;
static void Rec()
{
sz++;
Rec();
}
}
I ran it in Release/32bit/Optimization ON without a debugger. When the program crashes, I attach the debugger and check the value of the counter.
And it still isn't the same on several runs. (Or my test is flawed.)
UPDATE: Closure
As suggested by fejesjoco, I looked into ASLR (Address space layout randomization).
It's a security technique that makes it hard for buffer overflow attacks to find the precise location of (e.g.) specific system calls, by randomizing various things in the process address space, including the stack position and, apparently, its size.
The theory sounds good. Let's put it into practice!
In order to test this, I used a Microsoft tool specific for the task: EMET or The Enhanced Mitigation Experience Toolkit. It allows setting the ASLR flag (and a lot more) on a system- or process-level.
(There is also a system-wide, registry hacking alternative that I didn't try)

In order to verify the effectiveness of the tool, I also discovered that Process Explorer duly reports the status of the ASLR flag in the 'Properties' page of the process. Never saw that until today :)

Theoretically, EMET can (re)set the ASLR flag for a single process. In practice, it didn't seem to change anything (see above image).
However, I disabled ASLR for the entire system and (one reboot later) I could finally verify that indeed, the SO exception now always happens at the same stack depth.
BONUS
ASLR-related, in older news: How Chrome got pwned
Source: (StackOverflow)
What does a JIT compiler specifically do as opposed to a non-JIT compiler? Can someone give a succinct and easy to understand description?
Source: (StackOverflow)
I have heard that Java must use a JIT to be fast. This makes perfect sense when comparing to interpretation, but why can't someone make an ahead-of-time compiler that generates fast Java code? I know about gcj, but I don't think its output is typically faster than HotSpot, for example.
Are there things about the language that make this difficult? I think it comes down to just these things:
What am I missing? If I avoid these features, would it be possible to compile Java code once to native machine code and be done?
Source: (StackOverflow)
Outside of ensuring that they cannot be changed (to the tune of a compiler error), does the JIT make any optimisations for const locals?
E.g.:
public static int Main(string[] args)
{
const int timesToLoop = 50;
for (int i=0; i<timesToLoop; i++)
{
// ...
}
return 0;
}
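To see what the JIT actually receives here, a minimal sketch that can be compared with ildasm (the method names and bodies below are illustrative, not from the original question): the C# compiler substitutes a const local into the IL as a literal, so the JIT never sees timesToLoop as a variable at all in the first method.
static int LoopConst()
{
    // The const is baked into the IL as the literal 50, giving the JIT
    // a fixed loop bound it is free to fold, unroll, or hoist.
    const int timesToLoop = 50;
    int total = 0;
    for (int i = 0; i < timesToLoop; i++)
    {
        total += i;
    }
    return total;
}
static int LoopVariable(int timesToLoop)
{
    // Here the bound is a runtime value; the IL loads an argument at the
    // comparison instead of a literal, so within this method the JIT must
    // treat the bound as unknown.
    int total = 0;
    for (int i = 0; i < timesToLoop; i++)
    {
        total += i;
    }
    return total;
}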
Source: (StackOverflow)
I've been hearing a lot about the PyPy project. On their site, they claim it is 6.3 times faster than the CPython interpreter.
Whenever we talk about dynamic languages like Python, speed is one of the top issues. To solve this, they say PyPy is 6.3 times faster.
The second issue is parallelism, the infamous Global Interpreter Lock (GIL). For this, PyPy says it can give GIL-less Python.
If PyPy can solve these great challenges, what are its weaknesses that are preventing wider adoption? That is to say, what's preventing someone like me, a typical Python developer, from switching to PyPy right now?
Source: (StackOverflow)
The following code gives different output when running the release inside Visual Studio, and running the release outside Visual Studio. I'm using Visual Studio 2008 and targeting .NET 3.5. I've also tried .NET 3.5 SP1.
When running outside Visual Studio, the JIT optimizer should kick in. Either (a) there's something subtle going on with C# that I'm missing or (b) the JIT is actually in error. I'm doubtful that the JIT can go wrong, but I'm running out of other possibilities...
Output when running inside Visual Studio:
0 0,
0 1,
1 0,
1 1,
Output when running release outside of Visual Studio:
0 2,
0 2,
1 2,
1 2,
What is the reason?
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace Test
{
struct IntVec
{
public int x;
public int y;
}
interface IDoSomething
{
void Do(IntVec o);
}
class DoSomething : IDoSomething
{
public void Do(IntVec o)
{
Console.WriteLine(o.x.ToString() + " " + o.y.ToString()+",");
}
}
class Program
{
static void Test(IDoSomething oDoesSomething)
{
IntVec oVec = new IntVec();
for (oVec.x = 0; oVec.x < 2; oVec.x++)
{
for (oVec.y = 0; oVec.y < 2; oVec.y++)
{
oDoesSomething.Do(oVec);
}
}
}
static void Main(string[] args)
{
Test(new DoSomething());
Console.ReadLine();
}
}
}
Source: (StackOverflow)
I was going through a document which talks about just-in-time compiler (JIT) optimization techniques for Java. One of them was "loop inversion". And the document says:
You replace a regular while loop with a do-while loop, and the do-while loop is set within an if clause. This replacement leads to two fewer jumps.
How does loop inversion work and how does it optimize our code path?
N.B.: It would be great if somebody could explain, with an example of Java code, how the JIT optimizes it to native code and why this is optimal on modern processors.
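To make the quoted description concrete, here is a minimal sketch of the transformation itself (written in C# for consistency with the other snippets on this page; the shape is identical in Java, and Work is just a placeholder for the loop body):
static void Work(int i) { /* placeholder loop body */ }
static void Regular(int n)
{
    int i = 0;
    while (i < n)           // test at the top: a conditional exit plus an
    {                       // unconditional jump back on every iteration
        Work(i);
        i++;
    }
}
static void Inverted(int n)
{
    int i = 0;
    if (i < n)              // the guard test runs only once
    {
        do
        {
            Work(i);
            i++;
        } while (i < n);    // one conditional backward jump per iteration
    }
}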
Source: (StackOverflow)
Let's say the bottleneck of my Java program really is some tight loops to compute a bunch of vector dot products. Yes I've profiled, yes it's the bottleneck, yes it's significant, yes that's just how the algorithm is, yes I've run Proguard to optimize the byte code, etc.
The work is, essentially, dot products. As in, I have two float[50] and I need to compute the sum of pairwise products. I know processor instruction sets exist to perform these kinds of operations quickly and in bulk, like SSE or MMX.
Yes I can probably access these by writing some native code in JNI. The JNI call turns out to be pretty expensive.
I know you can't guarantee what a JIT will compile or not compile. Has anyone ever heard of a JIT generating code that uses these instructions? and if so, is there anything about the Java code that helps make it compilable this way?
Probably a "no"; worth asking.
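For reference, the kernel in question is just the following (sketched in C# to match the other snippets on this page; the original is Java, but the loop shape is identical). This is the pattern a vectorizing JIT would have to turn into SSE/AVX code:
static float Dot(float[] a, float[] b)
{
    // Sum of pairwise products over two equal-length arrays
    // (float[50] in the question).
    float sum = 0f;
    for (int i = 0; i < a.Length; i++)
    {
        sum += a[i] * b[i];
    }
    return sum;
}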
Source: (StackOverflow)
While writing an answer to another question, I noticed a strange border case for JIT optimization.
The following program is not a "Microbenchmark" and not intended to reliably measure an execution time (as pointed out in the answers to the other question). It is solely intended as an MCVE to reproduce the issue:
class MissedLoopOptimization
{
public static void main(String args[])
{
for (int j=0; j<3; j++)
{
for (int i=0; i<5; i++)
{
long before = System.nanoTime();
runWithMaxValue();
long after = System.nanoTime();
System.out.println("With MAX_VALUE : "+(after-before)/1e6);
}
for (int i=0; i<5; i++)
{
long before = System.nanoTime();
runWithMaxValueMinusOne();
long after = System.nanoTime();
System.out.println("With MAX_VALUE-1 : "+(after-before)/1e6);
}
}
}
private static void runWithMaxValue()
{
final int n = Integer.MAX_VALUE;
int i = 0;
while (i++ < n) {}
}
private static void runWithMaxValueMinusOne()
{
final int n = Integer.MAX_VALUE-1;
int i = 0;
while (i++ < n) {}
}
}
It basically runs the same loop, while (i++ < n) {}, where the limit n is once set to Integer.MAX_VALUE, and once to Integer.MAX_VALUE-1.
When executing this on Win7/64 with JDK 1.7.0_21 and
java -server MissedLoopOptimization
the timing results are as follows:
...
With MAX_VALUE : 1285.227081
With MAX_VALUE : 1274.36311
With MAX_VALUE : 1282.992203
With MAX_VALUE : 1292.88246
With MAX_VALUE : 1280.788994
With MAX_VALUE-1 : 6.96E-4
With MAX_VALUE-1 : 3.48E-4
With MAX_VALUE-1 : 0.0
With MAX_VALUE-1 : 0.0
With MAX_VALUE-1 : 3.48E-4
Obviously, for the case of MAX_VALUE-1, the JIT does what one could expect: it detects that the loop is useless and completely eliminates it. However, it does not remove the loop when it is running up to MAX_VALUE.
This observation is confirmed by a look at the JIT assembly output when starting with
java -server -XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:+PrintAssembly MissedLoopOptimization
The log contains the following assembly for the method that runs up to MAX_VALUE:
Decoding compiled method 0x000000000254fa10:
Code:
[Entry Point]
[Verified Entry Point]
[Constants]
# {method} 'runWithMaxValue' '()V' in 'MissedLoopOptimization'
# [sp+0x20] (sp of caller)
0x000000000254fb40: sub $0x18,%rsp
0x000000000254fb47: mov %rbp,0x10(%rsp) ;*synchronization entry
; - MissedLoopOptimization::runWithMaxValue@-1 (line 29)
0x000000000254fb4c: mov $0x1,%r11d
0x000000000254fb52: jmp 0x000000000254fb63
0x000000000254fb54: nopl 0x0(%rax,%rax,1)
0x000000000254fb5c: data32 data32 xchg %ax,%ax
0x000000000254fb60: inc %r11d ; OopMap{off=35}
;*goto
; - MissedLoopOptimization::runWithMaxValue@11 (line 30)
0x000000000254fb63: test %eax,-0x241fb69(%rip) # 0x0000000000130000
;*goto
; - MissedLoopOptimization::runWithMaxValue@11 (line 30)
; {poll}
0x000000000254fb69: cmp $0x7fffffff,%r11d
0x000000000254fb70: jl 0x000000000254fb60 ;*if_icmpge
; - MissedLoopOptimization::runWithMaxValue@8 (line 30)
0x000000000254fb72: add $0x10,%rsp
0x000000000254fb76: pop %rbp
0x000000000254fb77: test %eax,-0x241fb7d(%rip) # 0x0000000000130000
; {poll_return}
0x000000000254fb7d: retq
0x000000000254fb7e: hlt
0x000000000254fb7f: hlt
[Exception Handler]
[Stub Code]
0x000000000254fb80: jmpq 0x000000000254e820 ; {no_reloc}
[Deopt Handler Code]
0x000000000254fb85: callq 0x000000000254fb8a
0x000000000254fb8a: subq $0x5,(%rsp)
0x000000000254fb8f: jmpq 0x0000000002528d00 ; {runtime_call}
0x000000000254fb94: hlt
0x000000000254fb95: hlt
0x000000000254fb96: hlt
0x000000000254fb97: hlt
One can clearly see the loop, with the comparison to 0x7fffffff and the jump back to inc. In contrast to that, the assembly for the case where it runs up to MAX_VALUE-1:
Decoding compiled method 0x000000000254f650:
Code:
[Entry Point]
[Verified Entry Point]
[Constants]
# {method} 'runWithMaxValueMinusOne' '()V' in 'MissedLoopOptimization'
# [sp+0x20] (sp of caller)
0x000000000254f780: sub $0x18,%rsp
0x000000000254f787: mov %rbp,0x10(%rsp) ;*synchronization entry
; - MissedLoopOptimization::runWithMaxValueMinusOne@-1 (line 36)
0x000000000254f78c: add $0x10,%rsp
0x000000000254f790: pop %rbp
0x000000000254f791: test %eax,-0x241f797(%rip) # 0x0000000000130000
; {poll_return}
0x000000000254f797: retq
0x000000000254f798: hlt
0x000000000254f799: hlt
0x000000000254f79a: hlt
0x000000000254f79b: hlt
0x000000000254f79c: hlt
0x000000000254f79d: hlt
0x000000000254f79e: hlt
0x000000000254f79f: hlt
[Exception Handler]
[Stub Code]
0x000000000254f7a0: jmpq 0x000000000254e820 ; {no_reloc}
[Deopt Handler Code]
0x000000000254f7a5: callq 0x000000000254f7aa
0x000000000254f7aa: subq $0x5,(%rsp)
0x000000000254f7af: jmpq 0x0000000002528d00 ; {runtime_call}
0x000000000254f7b4: hlt
0x000000000254f7b5: hlt
0x000000000254f7b6: hlt
0x000000000254f7b7: hlt
So my question is: What is so special about Integer.MAX_VALUE that prevents the JIT from optimizing the loop in the same way as it does for Integer.MAX_VALUE-1? My guess would be that it has to do with the cmp instruction, which is intended for signed arithmetic, but that alone is not really a convincing reason. Can anybody explain this, and maybe even give a pointer to the OpenJDK HotSpot code where this case is treated?
(An aside: I hope that the answer will also explain the different behavior between i++ and ++i that was asked about in the other question, assuming that the missing optimization is (obviously) actually caused by the Integer.MAX_VALUE loop limit.)
Source: (StackOverflow)
I am updating some old code, and have found several instances where the same object is being cast repeatedly each time one of its properties or methods needs to be called. Example:
if (recDate != null && recDate > ((System.Windows.Forms.DateTimePicker)ctrl).MinDate)
{
((System.Windows.Forms.DateTimePicker)ctrl).CustomFormat = "MM/dd/yyyy";
((System.Windows.Forms.DateTimePicker)ctrl).Value = recDate;
}
else
{
((System.Windows.Forms.DateTimePicker)ctrl).CustomFormat = " ";
}
((System.Windows.Forms.DateTimePicker)ctrl).Format = DateTimePickerFormat.Custom;
My inclination is to fix this monstrosity, but given my limited time I don't want to bother with anything that's not affecting functionality or performance.
So what I'm wondering is, are these redundant casts getting optimized away by the compiler? I tried figuring it out myself by using ildasm on a simplified example, but not being familiar with IL I only ended up more confused.
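For reference, the fix itself is mechanical: cast once into a local and reuse it. A sketch of the same logic (making the same assumption the original code already makes, namely that ctrl really is a DateTimePicker at this point):
var picker = (System.Windows.Forms.DateTimePicker)ctrl;  // cast once
if (recDate != null && recDate > picker.MinDate)
{
    picker.CustomFormat = "MM/dd/yyyy";
    picker.Value = recDate;
}
else
{
    picker.CustomFormat = " ";
}
picker.Format = DateTimePickerFormat.Custom;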
UPDATE
So far, the consensus seems to be that (a) no, the casts are not optimized away, but (b) while there may possibly be some small performance hit as a result, it is not likely significant, and (c) I should consider fixing them anyway. I have come down on the side of resolving to fix these someday, if I have time. Meanwhile, I won't worry about them.
Thanks everyone!
Source: (StackOverflow)
When I read about the performance of JITted languages like C# or Java, authors usually say that they should/could theoretically outperform many native-compiled applications. The theory being that native applications are usually just compiled for a processor family (like x86), so the compiler cannot make certain optimizations as they may not truly be optimizations on all processors. On the other hand, the CLR can make processor-specific optimizations during the JIT process.
Does anyone know if Microsoft's (or Mono's) CLR actually performs processor-specific optimizations during the JIT process? If so, what kind of optimizations?
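One concrete, hedged illustration (this postdates the era of the original question; it assumes .NET 4.6+ with RyuJIT and the System.Numerics vector types): the JIT decides, on the machine it runs on, whether Vector<T> operations map to SSE/AVX instructions, and you can observe that decision at runtime:
using System;
using System.Numerics;  // System.Numerics.Vectors
static class SimdProbe
{
    static void Main()
    {
        // True only if the JIT emitted hardware SIMD instructions for
        // Vector<T> on this particular processor; the same IL on an older
        // machine or runtime falls back to scalar code.
        Console.WriteLine("Hardware accelerated: " + Vector.IsHardwareAccelerated);
        Console.WriteLine("Vector<float> lanes:  " + Vector<float>.Count);
    }
}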
Source: (StackOverflow)
Why does commenting out the first two lines of this for loop and uncommenting the third result in a 42% speedup?
int count = 0;
for (uint i = 0; i < 1000000000; ++i) {
var isMultipleOf16 = i % 16 == 0;
count += isMultipleOf16 ? 1 : 0;
//count += i % 16 == 0 ? 1 : 0;
}
Behind the timing difference is vastly different assembly code: 13 vs. 7 instructions in the loop. The platform is Windows 7 running .NET 4.0 x64. Code optimization is enabled, and the test app was run outside VS2010. [Update: Repro project, useful for verifying project settings.]
Eliminating the intermediate boolean is a fundamental optimization, one of the simplest in my 1980's era Dragon Book. How did the optimization not get applied when generating the CIL or JITing the x64 machine code?
Is there a "Really compiler, I would like you to optimize this code, please" switch? While I sympathize with the sentiment that premature optimization is akin to the love of money, I could see the frustration in trying to profile a complex algorithm that had problems like this scattered throughout its routines. You'd work through the hotspots but have no hint of the broader warm region that could be vastly improved by hand tweaking what we normally take for granted from the compiler. I sure hope I'm missing something here.
Update: Speed differences also occur for x86, but depend on the order that methods are just-in-time compiled. See Why does JIT order affect performance?
Assembly code (as requested):
var isMultipleOf16 = i % 16 == 0;
00000037 mov eax,edx
00000039 and eax,0Fh
0000003c xor ecx,ecx
0000003e test eax,eax
00000040 sete cl
count += isMultipleOf16 ? 1 : 0;
00000043 movzx eax,cl
00000046 test eax,eax
00000048 jne 0000000000000050
0000004a xor eax,eax
0000004c jmp 0000000000000055
0000004e xchg ax,ax
00000050 mov eax,1
00000055 lea r8d,[rbx+rax]
count += i % 16 == 0 ? 1 : 0;
00000037 mov eax,ecx
00000039 and eax,0Fh
0000003c je 0000000000000042
0000003e xor eax,eax
00000040 jmp 0000000000000047
00000042 mov eax,1
00000047 lea edx,[rbx+rax]
Source: (StackOverflow)