p153 yunhe

2020-02-27 58浏览

1.Virtual MachineShowdown:Stack Versus Registers Yunhe Shi, David Gregg, Andrew Beatty M. Anton Ertl Department of Computer Science University of Dublin, Trinity College Dublin 2, Ireland Institut für Computersprachen TU Wien Argentinierstraße 8 A-1040 Wien, Austria {yshi, David.Gregg, Andrew.Beatty}@cs.tcd.ie anton@complang.tuwien.ac.at ABSTRACT be interpreted or compiled. The most popular VMs, such as the Java VM, use a virtual stack architecture, rather than the register architecture that dominates in real processors. A long-running question in the design of VMs is whether stack architecture or register architecture can be implemented more eﬃciently with an interpreter. On the one hand stack architectures allow smaller VM code so less code must be fetched per VM instruction executed. On the other hand, stack machines require more VM instructions for a given computation, each of which requires an expensive (usually unpredictable) indirect branch for VM instruction dispatch. Several authors have discussed the issue [12, 15, 11, 16] and presented small examples where each architecture performs better, but no general conclusions can be drawn without a larger study. The ﬁrst large-scale quantitative results on this question were presented by Davis et al. [5, 10] who translated Java VM stack code to a corresponding register machine code. A straightforward translation strategy was used, with simple compiler optimizations to eliminate instructions which become unnecessary in register format. The resulting register code required around 35% fewer executed VM instructions to perform the same computation than the stack architecture. However, the resulting register VM code was around 45% larger than the original stack code and resulted in a similar increase in bytecodes fetched. Given the high cost of unpredictable indirect branches, these results strongly suggest that register VMs can be implemented more eﬃciently than stack VMs using an interpreter. However, Davis et al’s work did not include an implementation of the virtual register architecture, so no real running times could be presented. This paper extends the work of Davis et al. in two respects. First, our translation from stack to register code is much more sophisticated. We use a more aggressive copy propagation approach to eliminate almost all of the stack load and store VM instructions. We also optimize constant load instructions, to eliminate common constant loads and move constant loads out of loops. The result is that an average of more than 47% of executed VM instructions are eliminated. The resulting register VM code is roughly 25% larger than the original stack code, compared with 45% for Davis et al. We ﬁnd that the increased cost of fetching more VM code involves only 1.07 extra real machine loads per VM instruction eliminated. Given that VM dispatches are much more expensive than real machine loads, this indicates strongly that register VM code is likely to be much Virtual machines (VMs) are commonly used to distribute programs in an architecture-neutral format, which can easily be interpreted or compiled. A long-running question in the design of VMs is whether stack architecture or register architecture can be implemented more eﬃciently with an interpreter. We extend existing work on comparing virtual stack and virtual register architectures in two ways. Firstly, our translation from stack to register code is much more sophisticated. The result is that we eliminate an average of more than 47% of executed VM instructions, with the register machine bytecode size only 25% larger than that of the corresponding stack bytecode. Secondly we present an implementation of a register machine in a fully standardcompliant implementation of the Java VM. We ﬁnd that, on the Pentium 4, the register architecture requires an average of 32.3% less time to execute standard benchmarks if dispatch is performed using a C switch statement. Even if more eﬃcient threaded dispatch is available (which requires labels as ﬁrst class values), the reduction in running time is still approximately 26.5% for the register architecture. Categories and Subject Descriptors D.3 [Software]: Programming Language; D.3.4 [Programming Language]: Processor—Interpreter General Terms Performance, Language Keywords Interpreter, Virtual Machine, Register Architecture, Stack Architecture 1. MOTIVATION Virtual machines (VMs) are commonly used to distribute programs in an architecture-neutral format, which can easily Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. VEE’05, June 11-12, 2005, Chicago, Illinois, USA. Copyright 2005 ACM 1-59593-047-7/05/0006...$5.00. 153
2.more time-eﬃcient when implemented with an interpreter, although at the cost of increased VM code size. The second contribution of our work is an implementation of a register machine in a fully standard-compliant implementation of the Java VM. While implementing the register VM interpreter is simple, integrating it with the garbage collection, exception handling and threading systems is more complicated. We present experimental results on the behaviour of the stack and register versions of JVMs, including hardware performance counter results. We ﬁnd that on the Pentium 4, the register architecture requires an average of 32.3% less time to execute standard benchmarks if dispatch is performed using a C switch statement. Even if more efﬁcient threaded dispatch is available (which requires labels as ﬁrst class values), the reduction in running time is still about 26.5% for the register architecture. The rest of this paper is organised as follows. In section 2 we describe the main diﬀerences between virtual stack and virtual register machines from the point of view of the interpreter. In section 3, we show how stack-based Java bytecode is translated into register-based bytecode. In sections 4 and 5, our copy propagation and constant instruction optimization algorithms are presented. Finally, in section 6, we analyze the static and dynamic code behaviour before and after optimization, and we show the performance improvement in our register-based JVM when compared to original stack-based JVM. 2. More sophisticated approaches, such as Piumarta and Ricardi’s [14] approach of copying executable code just-in-time further reduce dispatch costs, at a further cost in simplicity, portability and memory consumption. Context threading [2] uses subroutine threading to change indirect branch to call/return, which better exploits hardware’s return-address stack, to reduce the cost of dispatches. As the cost of dispatches falls, any beneﬁt from using a register VM instead of a stack VM falls. However, switch and simple threaded dispatch are the most commonly used interpreter techniques, and switch is the only eﬃcient alternative if ANSI C must be used. The second cost component of executing a VM instruction is accessing the operands. The location of the operands must appear explicitly in register code, whereas in stack code operands are found relative to the stack pointer. Thus, the average register instruction is longer than the corresponding stack instruction; register code is larger than stack code; and register code requires more memory fetches to execute. Small code size, and small number of memory fetches are the main reasons why stack architectures are so popular for VMs. The ﬁnal cost component of executing a VM instruction is performing the computation. Given that most VM instructions perform a simple computation, such as an add or load, this is usually the smallest part of the cost. The basic computation has to be performed, regardless of the format of the intermediate representation. However, eliminating invariant and common expressions is much easier on a register machine, which we exploit to eliminate repeated loads of identical constants (see section 5). STACK VERSUS REGISTERS The cost of executing a VM instruction in an interpreter consists of threecomponents:• Dispatching the instruction 3. TRANSLATING STACK TO REGISTER • Accessing the operands In this section we describe a system of translating JVM stack code to register code just-in-time. However, it is important to note that we do not advocate run-time translation from stack to register format as the best or only way to use virtual register machines. This is clearly a possibility, maybe even an attractive one, but our main intention in doing this work is to evaluate free-standing virtual register machines. Run-time translation is simply a mechanism we use to compare stack and register versions of the JVM easily. In a real system, we would use only the register machine, and compile for that directly. Our implementation of the JVM pushes a new Java frame onto a run-time stack for each method call. The Java frame contains local variables, frame data, and the operand stack for the method(See ﬁgure 1). In the stack-based JVM, a local variable is accessed using an index, and the operand stack is accessed via the stack pointer. In the register-based JVM both the local variables and operand stack can be considered as virtual registers for the method. There is a simple mapping from stack locations to register numbers, because the height and contents of the JVM operand stack are known at any point in a program [9]. All values on the operand stack in a Java frame can be considered as temporary variables (registers) for a method and therefore are short-lived. Their scope of life is between the instructions that push them onto the operand stack and the instruction that consumes the value on the operand stack. On the other hand, local variables (also registers) are longlived and their life scope is the time of method execution. In the stack-based JVM, most operands of an instruction • Performing the computation Instruction dispatch involves fetching the next VM instruction from memory, and jumping to the corresponding segment of interpreter code that implements the VM instruction. A given task can often be expressed using fewer register machine instructions than stack ones. For example, the local variable assignment a = b + c might be translated to stack JVM code as ILOAD c, ILOAD b, IADD, ISTORE a. In a virtual register machine, the same code would be a single instruction IADD a, b, c. Thus, virtual register machines have the potential to signiﬁcantly reduce the number of instruction dispatches. In C, dispatch is typically implemented with a large switch statement, with one case for each opcode in the VM instruction set. Switch dispatch is simple to implement, but is rather ineﬃcient. Most compilers produce a range check, and an additional unconditional branch in the generated code for the switch. In addition, the indirect branch generated by most compilers is highly (around 95% [7]) unpredictable on current architectures. The main alternative to the switch statement is threaded dispatch. Threaded dispatch takes advantage of languages with labels as ﬁrst class values (such as GNU C and assembly language) to optimize the dispatch process. This allows the range check and additional unconditional branches to be eliminated, and allows the code to be restructured to improve the predictability of the dispatch indirect branch (to around 45% [7]). 154
3.Table 1: Bytecode translation.Assumption:current stack pointer before the code shown below is 10. In most cases, the first operand in an instruction is the destination register Stack-based bytecode Register-based bytecode iload 1 move r10, r1 iload 2 move r11, r2 iadd iadd r10, r10, r11 istore 3 move r3, r10 JVM instruction set (multianewarray, tableswitch, and lookupswitch). In addition to the original three variable-length instructions, all method call instructions become variable in length after the translation to register-based bytecode format. Here is the instruction format for methodcall:Figure 1: The structure of a Java frame op cpi1 cpi2 ret_reg arg1 arg2 ... op is the opcode of a method call. cpi1 and cpi2 are the two-byte constant-pool indexes. ret reg is the return value register number. arg1, arg2, . . . are the argument register numbers. The number of arguments, which can be determined when the method call instructions are executed in the interpreter loop, are not part of the instruction format. The main reason for doing so is to reduce the codesize. are implicit; they are found on the top of the operand stack. Most of the stack-based JVM instructions are translated into corresponding register-based virtual machine instructions, with implicit operands translated to explicit operand registers. The new register-based instructions use one byte for the opcode and one byte for each operand register (similar to the stack JVM). Table 1 shows a simple example of bytecode translation. The function of the bytecode is to add two integers from two local variables and store the result back into another local variable. There are a few exceptions to the above one-to-one translationrule:4. COPY PROPAGATION • Operand stack pop instructions(pop and pop2 ) are translated into nop because they are not needed in registerbased code. • Instructions related to loading of a local variable onto operand stack and storing data from operand stack into a local variable are translated into move instructions • Stack manipulation instructions(e.g. dup, dup2 . . . ) are translated into appropriate sequences of move instructions by tracking the state of the operand stack 3.1 Parameter Passing A common way to implement stack-based JVM is to overlap the current Java frame’s operand stack (which contains a method call’s parameters) and a new Java frame’s local variables. The parameters on the stack in the current Java frame will become the start of the called method’s local variables. Although this provides eﬃcient parameter passing, it prevents us from copy propagating into the source registers (parameters) of a method call. To solve this problem, we change the parameter passing mechanism in the register VM to non-overlapping and copy all the parameters to the location where the new Java frame will start. The beneﬁt is that we can eliminate more move instructions. The drawback is that we need to copy all the parameters before we push a new Java frame onto the Java stack. In the stack-based JVM, operands are pushed from local variables onto the operand stack before they can be used, and results must be stored from the stack to local variables. More than 40% of executed instructions in common Java benchmarks consist of loads and stores between local variables and the stack [5]. Most of these stack push and pop operations are redundant in our register-based JVM as instructions can directly use local variables (registers) without going through the stack. In the translation stage, all loads and stores from/to local variables are translated into register move instructions. In order to remove these redundant move instructions, we apply both forward and backward copy propagation. We take advantage of the stack-based JVM’s stack operation semantics to help implement both varieties of copy propagation. During copy propagation, we use the stack pointer information after each instruction, which tells us which values on the stack are still alive. 4.1 Forward Copy Propagation The convention in Java is that the operand stack is usually empty at the end of each source statement, so the lifetimes of values on the stack are usually short. Values pushed onto the operand stack are almost immediately consumed by a following instruction. Thus, we mainly focus on copy propagation optimization on basic blocks. We separate move instructions into diﬀerent categories and apply diﬀerent types of copy propagation depending on the location of the source and destination operands in the original JVM stack code. We apply forward propagation to the following categories of moveinstructions:• Local variables → stack 3.2 Variable Length Instructions • Local variables → local variables (these do not exist in the original translated code but will appear after forward or backward copy propagation) Most of the instructions in Java bytecode are ﬁxed-length. There are three variable-length instructions in stack-based 155
4.• All dup (such as dup, dup2 x2) instructions are translated into one or more move instructions which allows them to be eliminated using our algorithm. Table 2: Forward copy propagation algorithm. X is a move instruction being copy propagated and Y is a following instruction in the same basic block. src and dest are source and destination registers of these instructions X Y src dest X.src = Y.src X.dest = Y.src src Do nothing Replace Y.src with X.src X.src = Y.dest X.dest = Y.dest dest X.src redeﬁned after Y X.dest redeﬁned after Y Can’t remove X / stop Can remove X / stop • All iinc instructions are moved as far towards the end of a basic block as possible because iinc instructions are commonly used to increment an index into an array. The push of the index onto the stack and iinc instruction used to increase the index are usually next to each other and thus prevent us from forward copy propagation. • In a few special cases, forward copy propagation across basic block boundaries is used to eliminate more move instructions. If a move instruction’s forward copy propagation reaches the end of a basic block and its destination operand is on the stack, we can follow its successor basic blocks to ﬁnd all the usages of the operand and then trace back from the operand consumption instruction to the deﬁnition instruction. If we don’t ﬁnd any other instructions except the one instruction being copy propagated forward, then we can continue the cross basic block copy propagation. • Stack → stack (these originate from the translation of dup instrutions) The main beneﬁt of forward copy propagation is to collapse dependencies on move operations. In most cases, this allows the move to be eliminated as dead code. While doing forward copy propagation, we try to copy forward and identify whether a move instruction can be removed (See Table 2). X is a move instruction which is being copied forward and Y is a following instruction in the same basic block. X.dest is the destination register of the move instruction and X.src is the source register of the move instruction. In a following Y instruction, Y.src represents all the source registers and Y.dest is the destination register. The forward copy propagation algorithm is implemented with additional operand stack pointer information after each instruction to help to decide whether a register is alive or redeﬁned. The following outlines our algorithm for copypropagation:4.2 Backward Copy Propagation Backward copy propagation is used to backward copy and eliminate the following types of moveinstructions:• Stack → local variables Most stack JVM instructions put their result on the stack and a stores instruction stores the result into a local variable. The role of backward copy propagation is to store the result directly into the local variable without going through the operand stack. In reality, we can’t copy forward this type of move instruction because after the instruction the source register is above the top of the stack pointer. Due to the characteristics of this type of move instruction, a lot of criteria required by backward copy propagation are already satisﬁed. Suppose Z is a move instruction being considered for backward copy propagation. Y is a previous instruction in the same basic block which has Y.dest = Z.src. Whether we can do the backward copy propagation and remove instruction Z depends on the followingcriteria:• Y.dest = X.dest. X.dest is redeﬁned, stop copy propagation and remove instruction X. • After instruction Y, stack pointer is below X.dest if X.dest is a register on stack. X.dest can be considered to be redeﬁned, stop copy propagation, and remove instruction X. • If Y is a return instruction, stop copy propagation and remove instruction X. • If Y is an athrow and X.dest is on operand stack, stop copy propagation and remove instruction X because the operand stack will be cleared during exception handling. 1. Y.dest is a register 2. Z is a move instruction • Y.dest = X.src. X.src is redeﬁned and value in X.dest would still be be used after Y instruction. Stop copy propagation and don’t remove instruction X. However, We can continue to ﬁnd out whether X.dest is not used in the following instructions and then is redeﬁned. If so, remove instruction X. 3. Z.dest is a register 4. Z.src = Y.dest 5. Z.dest is not consumed between Y..Z 6. Z.dest is not redeﬁned between Y..Z • After instruction Y, stack pointer is below X.src if X.src is a register on stack. X.src can be considered as being redeﬁned, stop copy propagation and don’t remove instruction X. We ignore this rules for the second run of forward copy propagation; it is quite similar to above rule. 7. Y.dest is not alive out of the basic block, which is satisﬁed because Y.dest = Z.src and Z.src is above the top of stack pointer after Z 8. After the copy propagation, original Y.dest(Z.src) is not used anymore. It is satisﬁed as long as 5 and 6 are satisﬁed because Y.dest(Z.src) is above the top of stack pointer after the instruction Z. Several techniques are used to improve the ability of our algorithm to eliminate move insructions. 156
5.instruction with all registers. This allows us to combine sequences of instructions which add a small integer to an value on the stack. We scan the translated register-based instructions to ﬁnd all those iadd and isub instructions which has one of its operands pushed by a constant instruction with a byte constant value , due to the byte immediate value in iinc instruction. Then we use an iinc instruction to replace an iadd (or isub) instruction and a constant instruction. Another way to think of the backward copy propagation in our case is that some computation puts the result on the operand stack and then a move instruction stores the result from the stack to a local variable in the stack-based Java virtual machine. In a register-based Java virtual machine, we can shortcut the steps and save the result directly into a local variable. A simple version of across basic-block backward copy propagation is also used. If a backward copy instruction reaches the beginning of a basic block, we need to ﬁnd out whether we can backward copy to all its predecessors. If so, we backward copy to all its predecessors. 5.2 Move Constant Instructions out of Loop and Eliminate Duplicate Constant Instruction 4.3 Example Because the storage locations of most constant instructions are on the stack, they are temporary variables, and are quickly consumed by a following instruction. The only way that we can reuse the constant value is to allocate a dedicated register for the same constant value above the operand stack. We only optimize those constant instructions that store a constant values onto the stack locations and those constant values are consumed in the same basic block. The constant instructions that store value into local variables, which have wider scope, are not targeted by our optimization. A constant instruction which stores directly into a local variable can appear after backward copy propagation. The following steps are carried out to optimize constantinstructions:The following example demonstrates both forward and backward copy propagation. We assume that the ﬁrst operand register in each instruction is the destination register. 1. 2. 3. 4. move move iadd move r10, r1 r11, r2 r10, r10, r11 r3, r10 //iload_1 //iload_2 //iadd //istore_3 Instructions 1 and 2 move the values of registers r1 and r2 (local variables) to registers r10 and r11 (stack) respectively. Instruction 3 adds the values in register r10 and r11 (stack) and put the result back into register r10 (stack). Instruction 4 moves the register r10 (stack) into register r3 (local variable). This is typical of stack-based Java virtual machine code. We can apply forward copy propagation to instructions 1 and 2 and their source are copy propagated into instruction 3’s sources. We can apply backward copy propagation to instruction 4 and backward copy progagate into instruction 3’s destination which is replaced by instruction 4’s destination. After both copy propagations, instructions 1, 2, and 4 can be removed. The only remaining instructionis:• Scan all basic blocks in a method to ﬁnd (1) multiple constant VM instructions which push the same constant and (2) constant VM instructions that are inside a loop. All constant values pushed by these VM instructions onto the operand stack must be consumed by a following instruction in the same basic block for our optimization to be applied. • A dedicated virtual register is allocated for each constant value used in the method1 . The constant VM instruction’s destination virtual register will be updated to the new dedicated virtual register, as will the VM instruction(s) that consume the constant. 3. iadd r3, r1, r2 5. CONSTANT INSTRUCTIONS In stack-based Java virtual machine, there are a large number of constant instructions pushing immediate constant values or constant values from constant pool of a class onto the operand stack. For example, we have found that an average of more than 6% of executed instructions in the SPECjvm98 and Java Grande benchmarks push constants onto the stack. In many cases the same constant is pushed onto the stack every iteration of a loop. Unfortunately, it is diﬃcult to reuse constants in a stack VM, because VM instructions which take values from the stack also destroy those values. Virtual register machines have no such problems. Once a value is loaded to a register, it can be used repeatedly until the end of the method. To remove redundant loads of constant values, we apply the following optimizations. • All load constant VM instructions are moved to the beginning of the basic block. All load constant VM instructions inside a loop are moved to a loop preheader. • The immediate dominator tree is used to eliminate redundant initializations of dedicated constant registers. The above procedure produces two beneﬁts. First, redundant loads of the same constant are eliminated. If there are more than two constant instructions that try to initialize the same dedicated constant registers in the same basic block or in two basic blocks in which one dominates the other, the duplicate dedicated constant register initialization instruction can be removed. The other beneﬁt is to allow us to move the constant instructions out of loops. 5.1 Combine Constant Instruction and iinc Instruction 1 Given that we use one byte indices to specify the virtual register, a method can use up to 256 virtual registers. Thus, our current implementation does not attempt to minimize register usage, because we have far more registers than we need. A simple register allocator could greatly reduce register requirements. In the stack-based JVM, the iinc instruction can only be used to increase a local variable by an immediate value. However, in the register machine we make no distinction between stack and local variables, so we can use the iinc 157
6.6. 7. 8. 9. 10. 11. 12. 13. 14. Figure 2: The control flow of the medium-size example 5.3 Putting it all together The runtime process for translating stack-based bytecode and optimizing the resulting register-based instructions for a Java method is asfollows:• Find basic blocks, their predecessors and successors • Translate stack-based bytecode into intermediate registerbased bytecode representation. • Find loops and build the dominator matrix. move r17, r0 agetfield_quick 2, 0, r17, r17 move r3, r17 move r17, r0 getfield_quick 4, 0, r17, r17 move r4, r17 iconst_0 r17 move r5, r17 goto 0, 2 //jump to basic block 2 basic block(1) 15. bipush r17, 31 16. move r18, r1 17. imul r17, r17, r18 18. move r18, r3 19. move r19, r2 20. iinc r2, r2, r1 21. caload r18, r18, r19 22. iadd r17, r17, r18, 23. move r1, r17 24. iinc r5, r5, 1, basic block(2): 25. move r17, r5 26. move r18, r4 27. if_icmplt 0, 1, r17, r18 // jump to basic block 1 • Apply forward copy propagation. • Apply backward copy propagation. • Combine constant instruction and iadd /isub instructions into iinc instructions. basic block(3): 28. move r17, r1 29. ireturn r17 • Move iinc instructions as far down their basic block as possible. The intermediate code afteroptimizations:'>optimizations: