Dead Code: Everything You Need To Know
The term “dead code” is more jargon than a scientific term. It refers to sections of a program that control can never reach and that therefore never run. Of course, a normal program should not contain such sections. But since programming languages are getting more and more complicated (and programmers are getting dumber and dumber, kidding!), anything can turn up in a program’s code.
Therefore, one of the usual tasks of a compiler at the optimization stage is to identify and eliminate the sections of a program that can never be executed.
By the way, I know an example of “dead code” that performed a useful function. In the kernel of the ancient MS-DOS, several instructions from an even older version of the OS were inserted among the data. Control, of course, never reached them, but very old resident programs such as the Sidekick editor searched for the code of these instructions (that is, for their signature) in order to find the address of the MS-DOS busy flag. Such “dead code”, left in for compatibility, could therefore not be thrown away. But this is an exceptional case; usually the compiler needs to find and remove all “dead code”.
In my work I use a very small compiler, which I maintain and improve myself as best I can. The optimizer in this compiler works at the lowest level, almost at the level of x86-64 instructions. Such local or, as I call it, “tactical” optimization has more modest capabilities than, for example, the LLVM optimizers, but some local optimization tasks, including throwing out “dead code”, become trivial.
In my compiler, while the internal representation of the program is being processed, each operation of that representation immediately produces one or more x86 instructions. Initially, however, these instructions are just internal compiler structures organized in a doubly linked list. The fields for the future x86 binary code in these structures are still empty, and the code length is zero.
Such instruction blanks can therefore easily be rearranged, excluded, or replaced during optimization. Even after the binary code has been generated, an instruction can still be excluded, not by removing it from the linked list, but simply by resetting its length field to zero (i.e. making it “empty” again).
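To make the idea of instruction blanks concrete, here is a minimal Python sketch. It is not the actual compiler's data structure: the names Instr, link, generate, blank_out and emit are all hypothetical, and the only properties taken from the text are the doubly linked list, the initially empty code field, and exclusion by zeroing the length.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(eq=False)
class Instr:
    """Hypothetical instruction blank: one node of a doubly linked list."""
    op: str                 # mnemonic or pseudo-op such as "label"
    code: bytes = b""       # binary code, empty until it is generated
    length: int = 0         # code length; zero means "emits nothing"
    prev: Optional["Instr"] = field(default=None, repr=False)
    next: Optional["Instr"] = field(default=None, repr=False)

def link(instrs):
    """Chain the blanks into a doubly linked list and return its head."""
    for a, b in zip(instrs, instrs[1:]):
        a.next, b.prev = b, a
    return instrs[0]

def generate(node, code):
    """Fill in the binary code of one blank."""
    node.code, node.length = code, len(code)

def blank_out(node):
    """Exclude an instruction after generation by zeroing its length."""
    node.length = 0

def emit(head):
    """Concatenate the code of every instruction that still has a length."""
    out, node = b"", head
    while node is not None:
        out += node.code[:node.length]
        node = node.next
    return out

mov, nop, ret = Instr("mov"), Instr("nop"), Instr("ret")
head = link([mov, nop, ret])
generate(mov, b"\x89\xd8")   # mov eax, ebx
generate(nop, b"\x90")
generate(ret, b"\xc3")
blank_out(nop)               # still in the list, but contributes no bytes
print(emit(head).hex())      # 89d8c3
```

The blanked-out node keeps its place in the list, so its neighbours do not need to be relinked.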
I remember someone once asked: how many processor instructions do compilers actually use? In my case (not counting the FPU instructions) there are 90, and not all of them are real x86 instructions. For example, this number also includes “label” instructions, which of course have no code but do have an address in the code, like regular x86 instructions. There are two varieties of “label” instructions: the compiler’s label, which it places, for example, when generating code for conditional statements, and the programmer’s label, which the programmer is entitled to place almost anywhere in the source text. Of course, the names of the subroutines and functions described in the source text are also “label” instructions.
So, at one of the first stages of compilation, the register allocation stage, the linked list of future x86 instructions is scanned. If the list contains a return or an unconditional jump, then the instruction following it must be one of the “label” instructions mentioned above. Otherwise, what follows is exactly that unreachable “dead code”: without a label, control cannot get there, so everything up to the next “label” instruction can safely be deleted without even starting to generate binary code for it.
Thus a single check, performed along the way while the compiler prepares to generate x86 code, makes it possible to identify and immediately remove unreachable code sections without cumbersome analysis of the program’s source code.
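The check can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a plain list stands in for the linked list, and the Instr tuple, the op names "ret", "jmp" and "label", and the function remove_unreachable are hypothetical, not taken from the real compiler.

```python
from collections import namedtuple

# Hypothetical instruction record: an operation name and one argument.
Instr = namedtuple("Instr", "op arg")

def remove_unreachable(instrs):
    """Drop everything between a ret/jmp and the next label."""
    kept, removed = [], []
    skipping = False
    for ins in instrs:
        if ins.op == "label":
            skipping = False          # a label makes the code reachable again
        if skipping:
            removed.append(ins)
            continue
        kept.append(ins)
        if ins.op in ("ret", "jmp"):  # control never falls through these
            skipping = True
    return kept, removed

code = [
    Instr("mov", "eax, 1"),
    Instr("jmp", "done"),
    Instr("mov", "eax, 2"),   # unreachable
    Instr("add", "eax, 3"),   # unreachable
    Instr("label", "done"),
    Instr("ret", None),
]
kept, removed = remove_unreachable(code)
print([i.op for i in kept])     # ['mov', 'jmp', 'label', 'ret']
print([i.op for i in removed])  # ['mov', 'add']
```

Note that the pass never asks whether anything jumps to a label. Any label ends the deleted run, which is what keeps the check to a single linear scan.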
True, the matter does not end there. The compiler still needs to determine who created the unreachable code: the compiler itself or the programmer in the source code?
If the compiler itself is responsible, it just needs to throw out the unreachable instructions and keep quiet. But if the unreachable code follows from the source text, a warning must be issued to the programmer, since this is often the result of an error in the program.
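One way to picture this distinction is to tag every instruction with its origin. The Python sketch below assumes a hypothetical from_source flag and source line number on each instruction; how the real compiler tracks this is not described here, so treat the field names and the warning text as illustrative only.

```python
from collections import namedtuple

# from_source is True when the instruction comes from a statement the
# programmer wrote, False when the compiler added it on its own.
Instr = namedtuple("Instr", "op arg from_source line")

def prune_with_warnings(instrs):
    """Remove unreachable instructions; warn only about the programmer's."""
    kept, warnings, warned_lines = [], [], set()
    skipping = False
    for ins in instrs:
        if ins.op == "label":
            skipping = False
        if skipping:
            if ins.from_source and ins.line not in warned_lines:
                warned_lines.add(ins.line)   # one warning per source line
                warnings.append(f"line {ins.line}: unreachable statement removed")
            continue
        kept.append(ins)
        if ins.op in ("ret", "jmp"):
            skipping = True
    return kept, warnings

program = [
    Instr("label", "loop", False, 4),
    Instr("call", "read", True, 5),
    Instr("jmp", "loop", False, 6),    # closes the endless loop
    Instr("mov", "x", True, 7),        # written by the programmer
    Instr("add", "x", True, 7),        # same source statement
    Instr("ret", None, False, 8),      # implicit return added by the compiler
]
kept, warnings = prune_with_warnings(program)
print(len(kept))   # 3
print(warnings)    # ['line 7: unreachable statement removed']
```

The implicit return on line 8 disappears silently, while the statement on line 7 produces exactly one warning.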
The simplest example: the whole program consists of one “endless” loop that reads and processes a file. An end-of-file handler is set up before the loop.
test:proc main;
   dcl f file;
   on endfile(f) stop;      /* end-of-file handler: stop the program */
   do repeat;               /* “endless” loop */
   end repeat;
end test;
Generally speaking, the compiler adds an implicit return at the end of each subroutine, so when control reaches the end of the subroutine’s text, it automatically exits. The example here shows the main program. It can be exited in the same way: before the main program starts, the address of the system call that terminates all work is always placed on the stack, so an explicit or implicit return from the main program ends the entire program and returns to the operating system. However, because of the infinite loop, control will never reach the implicit return at the end of the program. The compiler will therefore silently discard this implicit return (I put it there myself, I threw it out myself: a typical optimization), without any warning.
But if you put any unlabeled statement before the line “end test;”, that statement will also be deleted, this time with a warning.
But what should the compiler do if you place a labeled statement there and nothing in the program text jumps to that label? To delete or not to delete?
I believe that any label set by a programmer should be “sacred” to the compiler, regardless of whether the program contains a jump to it. For example, I sometimes place such labels purely for interactive debugging, during which I can use the debugger to force control to such a label even though the program has no corresponding jump statements.
In some cases “dead code” can even be useful. In the example above, the implicit return at the end of a subroutine could not cause an error, whether the compiler discarded it or not. But in the PL/1 language (whose compiler I am describing), for example, there is a risk of unpleasant errors associated with the definition of functions.
In an ordinary subroutine (a procedure in PL/1 terms, or a function returning void in C terms), the exit occurs either at an explicit return or on reaching the end of the subroutine’s “body”, where the compiler always inserts an implicit return. A function, however, requires an explicit return statement with a value. In PL/1, the definitions of procedures and functions differ only in the heading and in the form of the return statements, which can be placed anywhere and in any way. The source text of a procedure may contain no return at all, but the text of a function must contain one. This is where the risk of error comes in.
Of course, the compiler checks that there is at least one return in the source code of the function, but the error can be more subtle.
For example, if the text of some function f ends with statements like:
… if x>0 then return(1); else return(-1); end f;
then everything is fine: the function always exits with a value.
But if I, out of stupidity, wrote something like:
… if x>0 then return(1); if x<0 then return(-1); end f;
then for x=0 it becomes possible to reach the end of the function’s source text without computing any return value, even though return statements are present. And whether or not there is also an implicit return at the end, nothing good will come of it.
To detect such unpleasant errors without cumbersome source code analysis, the compiler makes use of “dead code”. Namely, the pseudo-instruction func, which is also one of the 90 instructions mentioned above, is always placed at the end of each function. Its binary code is the breakpoint instruction (byte 0CCH), and at the later stages of compilation it is expected to turn out to be “dead code” and be thrown away. If it remains in the linked list of future x86 instructions, then this code is not “dead” and control could potentially reach this point. In that case the compiler can warn that such-and-such a function may exit without a value. The error in this example could be corrected like this:
… if x>0 then return(1); if x<0 then return(-1); return(0); end f;
and then both the warning and the 0CCH code disappear.
If you ignore the warning and run the program with such a potential error anyway, then when the error actually occurs, the remaining 0CCH code makes the program crash with a breakpoint exception (or drop into the interactive debugger), which is much better than a silent failure in an unpredictable place.
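The sentinel trick can be modelled with the same kind of linear scan. In this Python sketch only the op name "func" and the byte 0CCH come from the description above. The Instr tuple, the helper names, and the simplified instruction sequences for function f are assumptions made for illustration.

```python
from collections import namedtuple

Instr = namedtuple("Instr", "op arg")
INT3 = b"\xcc"   # breakpoint byte that a surviving "func" would emit

def remove_unreachable(instrs):
    """Drop everything between a ret/jmp and the next label."""
    kept, skipping = [], False
    for ins in instrs:
        if ins.op == "label":
            skipping = False
        if not skipping:
            kept.append(ins)
            if ins.op in ("ret", "jmp"):
                skipping = True
    return kept

def check_function(name, body):
    """Append the sentinel and the implicit return, prune, then look for it."""
    full = body + [Instr("func", name), Instr("ret", None)]
    kept = remove_unreachable(full)
    warnings = [f"function {ins.arg} may exit without a value"
                for ins in kept if ins.op == "func"]
    return kept, warnings

# if x>0 then return(1); if x<0 then return(-1);
buggy = [
    Instr("jle", "L1"), Instr("mov", 1), Instr("ret", None),
    Instr("label", "L1"),
    Instr("jge", "L2"), Instr("mov", -1), Instr("ret", None),
    Instr("label", "L2"),
]
fixed = buggy + [Instr("mov", 0), Instr("ret", None)]   # ... return(0);

print(check_function("f", buggy)[1])   # ['function f may exit without a value']
print(check_function("f", fixed)[1])   # []
```

In the buggy version the sentinel sits right after label L2, so it survives the pruning and triggers the warning. In the fixed version it follows a return and is deleted together with the implicit return.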
Thus, “dead code” is one of the harsh realities of programming. It can arise both from the work of the compiler (especially during optimization) and from errors in the program. In some cases an unreachable section of the program may even be created intentionally: when debugging, for example, a jump or return statement can be inserted into the source text on purpose so that some part of the program is not executed for now.
In any case, compilers are usually required to detect such sections of code and, if they arise through the programmer’s fault, to warn the programmer about them. From the point of view of program size it is, of course, advisable to remove all “dead code”, although its presence in itself should not affect the operation of the program.