decompiler/doc/cfg.tex

\section{Control flow analysis}

The control flow analysis is a multi-step process:

\begin{itemize}
\item Create a graph with one instruction per vertex, and edges going from instructions to their possible successors
\item Do a depth-first search to determine the expected stack level at each vertex
\item Optionally detect functions
\item Merge vertices to form groups
\item Perform analysis on vertices
\end{itemize}

Calls to in-script functions are not represented with edges in the graph. This is done to keep functions separate from one another, so if your engine uses a jump as part of calling functions, you need to make sure you represent the jump using a \code{CallInstruction} instead of a \code{JumpInstruction}.

The first step is handled in the constructor, while the next three steps are handled by the \code{createGroups()} method. The last step is handled by the \code{analyze} method.

\subsection{Function detection}
\label{sec:autofunc}
Prior to grouping, the control flow can be used to detect new functions. This detection is automatically activated if the Engine method \code{detectMoreFuncs} returns true.

When function detection is enabled, unreachable blocks of code will be treated as functions, unless the presumed entry point is located within the range of another function. The end point of the function will then be the last instruction reachable from the entry point.

Functions detected this way will be given the name \code{auto\_}. You can use this as a prefix to the actual name to signify that the function may not actually be a function, or you can ignore it and just replace it with the name you would normally use.

You can also have the function detection determine the end point of an existing function. To do so, \code{\_endIt} must be the same as \code{\_startIt} for that function. In this case, only the end point will be changed within the function; the name will stay the same.

Note that the control flow analysis has no way of determining an appropriate name, number of arguments, return value, or metadata. You will have to fill that in yourself using \code{postCFG} in the engine.

If this step is not enabled, and no functions have been defined before the control flow analysis is started, there will still be added a single function covering the entire script. This is done to avoid having a special case in the code, and it will not affect the output of your script in any way.

\subsection{Group generation}
\label{sec:groupgen}
Groups are initially formed according to these rules:
\begin{itemize}
\item If the next instruction is a jump or a return, end the group here.
\item If the next instruction has multiple predecessors, end the group here.
\item If the current instruction brings the stack to a lower level than the start of the current group, make the new level the expected stack level (to support clean-up after control structures).
\item If the current instruction brings the stack level to the same as the start of the current group, and at least one instruction with a non-negative stack effect exists in the group, end the group here.
\end{itemize}

The final two rules are based on a property of stack-based engines: when the stack becomes balanced, as defined by these two rules, this indicates that the current group corresponds to a single line of code. This property of the groups can be helpful when performing code generation, since it adds some context -- e.g., conditions are placed in a group of their own.

However, this only works for stack-based engines; for a non-stack-based engine, each instruction ends up in a group of its own, which is particularly bad for visualization of the code flow, since dot tends to choke on the large amount of vertices resulting from this. For this reason, engines are able to request \emph{pure grouping} via the virtual \code{usePureGrouping} method, mentioned in Section~\vref{sec:addingengine}. In pure grouping, only the first two rules are applied, creating the minimum number of groups for any grouping algorithm.

Prior to applying these rules, a depth-first search is performed to calculate the expected stack level at each instruction. If multiple paths are found to the same instruction, a warning will be output if the expected stack level from each path differs.

Unreachable code will not be processed during the group generation, but will remain as individual instructions.

\subsection{Short-circuit detection}
\emph{NOTE: This feature is currently disabled.}

As part of the group generation, the decompiler can combine multiple, consecutive groups if it detects them as being part of a single condition check that are merely split up due to short-circuiting.

The rules used to detect if two consecutive groups $A$ and $B$ are oart of the same short-circuited check are as follows:
\begin{itemize}
\item Both group $A$ and $B$ end with a conditional jump (they have two possible successors each)
\item For each outgoing edge from A, the edge must either go to B, or to a successor of B -- in other words, merging may not add any new successors. More formally, $\text{succ}(A) \subseteq \{B\} \cup \text{succ}(B)$, where $\text{succ}(X)$ is the set of possible successors for the group $X$.
\end{itemize}

\subsection{Construct detection}
After the groups have been created, we then analyze the groups to find the various kind of control flow constructs. The constructs are detected in multiple steps, with one construct per step, in the following order:
\begin{itemize}
\item \code{do-while}
\item \code{while}
\item \code{break}
\item \code{continue}
\item \code{if}
\item \code{else}
\end{itemize}

Each of these five constructs are marked using a \code{GroupType} member of the \code{Group} type, while \code{else} blocks are flagged using two booleans, \code{\_startElse} and \code{\_endElse}. If \code{\_startElse} is true, then an \code{else} block starts with this group, and should be output prior to the code in this group. If \code{\_endElse} is true, an \code{else} block ends with this group, and the end should be output after the code in this group.

Once a group has been flagged as being some construct, it is skipped for the other constructs.

The criteria used for each construct are as follows:

\paragraph{Do-while detection}
Group must end with a conditional jump (i.e., have two outgoing edges). Jump must go to an earlier place in the code.

\paragraph{While detection}
Group must end with conditional jump. Block must have an ingoing edge from some group later in the code, unless that edge comes from a do-while condition (in which case it is assumed to be an \code{if} instead).

\paragraph{Break detection}
Unconditional jump to some place later in the code. That place must either be the group immediately after a \code{do-while} condition, or the jump target of a \code{while} condition. Additionally, the jump is verified to go to the appropriate loop (so it does not exit multiple loops at once).

\paragraph{Continue detection}
Unconditional jump to a \code{while} or \code{do-while} condition, unless it is targeting a \code{while} condition which jumps to the next sequential group (in which case it is merely the end of the \code{while}-loop). Just as with \code{break}s, the jump is verified to go to the appropriate loop.

\paragraph{If detection}
All unflagged conditional jumps are flagged as \code{if}s.

\paragraph{Else detection}
All \code{if}s are processed to see if they may have an \code{else} attached. If the jump target of an \code{if} is immediately preceded by an unconditional jump, which is neither a break or a continue, and that jump goes to later in the code, this signifies a possible \code{else} block, starting with the jump target of the if condition and ending with the group immediately before the target of the jump immediately before the jump target of the if condition. To avoid false positives, this block is then validated to not cross block boundaries. If the check passes, the \code{else} block data is added to the graph.

\subsection{Graph output}
The graph can be output in DOT format by using the \code{-g} switch. In the graph, arrows on edges will be hollow if the edge is a jump, and filled if the edge represents the usual sequential order.

\subsection{Limitations}
Currently, only unconditional jumps are supported for \code{break} and \code{continue}; however, for code of the form \code{if (...) break;} or \code{if (...) continue;}, it is a pretty straight-forward optimization to use the \code{break}/\code{continue} jump as the conditional jump for the \code{if} condition check. Since \code{if}s are found last, it should be possible to simply check unmarked conditional jumps as well and see if they meet the other criteria for a \code{break}/\code{continue}, although there might be some false positives for an if that stretches to the end of the loop it is placed in.

It is currently assumed that all conditional jumps in \code{if} condition checks go to a later place in the code. If optimized continue statements are used in a while (as described above), this will cause the analysis to be incorrect.