I remember reading somewhere that to really optimize & speed up certain section of the code, programmers write that section in Assembly language. My questions are -
- Is this practice still done? and How does one do this?
- Isn't writing in Assembly Language a bit too cumbersome & archaic?
- When we compile C code (with or without -O3 flag), the compiler does some code optimization & links all libraries & converts the code to binary object file. So when we run the program it is already in its most basic form i.e. binary. So how does inducing 'Assembly Language' help?
I am trying to understand this concept & any help or links is much appreciated.
UPDATE: Rephrasing point 3 as requested by dbemerlin- Because you might be able to write more effective assembly code than the compiler generates but unless you are an assembler expert your code will propably run slower because often the compiler optimizes the code better than most humans can.
The only time it's useful to revert to assembly language is when
the CPU instructions don't have functional equivalents in C++ (e.g. atomic operations, single-instruction-multiple-data instructions, explicit memory cache synchronisation, single instructions that give div and mod results)
- AND the compiler doesn't provide extra functions to wrap these operations
- AND there isn't a good third-party library (e.g. http://mitpress.mit.edu/catalog/item/default.asp?tid=3952&ttype=2)
OR
for some inexplicable reason - the optimiser is failing to use the best CPU instructions
...AND...
- the use of those CPU instructions would give some significant and useful performance boost to bottleneck code.
Simply using inline assembly to do an operation that can easily be expressed in C++ - like adding two values or searching in a string - is actively counterproductive, because:
- the compiler knows how to do this equally well
- to verify this, look at its assembly output (e.g. gcc -S) or disassemble the machine code
- you're artificially restricting its choices regarding register allocation, CPU instructions etc., so it may take longer to prepare the CPU registers with the values needed to execute your hardcoded instruction, then longer to get back to an optimal allocation for future instructions
- compiler optimisers can choose between equivalent-performance instructions specifying different registers to minimise copying between them, and may choose registers in such a way that a single core can process multiple instructions during one cycle, whereas forcing everythingt through specific registers would serialise it
- in fairness, GCC has ways to express needs for specific types of registers without constraining the CPU to an exact register, still allowing such optimisations, but it's the only inline assembly I've ever seen that address this
- compiler optimisers can choose between equivalent-performance instructions specifying different registers to minimise copying between them, and may choose registers in such a way that a single core can process multiple instructions during one cycle, whereas forcing everythingt through specific registers would serialise it
- if a new CPU model comes out next year with another instruction that's 1000% faster for that same logical operation, then the compiler vendor is more likely to update their compiler to use that instruction, and hence your program to benefit once recompiled, than you are
- the compiler will select an optimal approach for the target architecture its told about: if you hardcode one solution then it will need to be a lowest-common-denominator or #ifdef-ed for your platforms
- assembly language isn't as portable as C++, both across CPUs and across compilers, and even if you seemingly port an instruction, it's possible to make a mistake re registers that are safe to clobber, argument passing conventions etc.
- other programmers may not know or be comfortable with assembly