Reproducing two critical bugs in the Move VM and understanding why they happen
#1: The Crash Bug
A bug in the Move virtual machine can crash Aptos applications.
Introduction
The Move programming language is used in many well-known projects, like Aptos and Sui.
The Move virtual machine (MoveVM) works like the Ethereum Virtual Machine (EVM): source code is compiled into bytecode, which is then executed in the virtual machine. Running a script goes through the following steps:
- The bytecode is loaded through the function execute_script.
- The load_script function deserializes the bytecode and verifies that it is legal; if verification fails, it returns a failure.
- After successful verification, the bytecode is executed.
- Execution accesses or modifies the state of global storage, including resources and modules.
Verification Module
Before the bytecode is executed, it is verified. Verification is divided into several sub-processes:
- *BoundsChecker* checks the bounds safety of modules and scripts, including the bounds of signatures, constants, etc.
- *DuplicationChecker* verifies that each vector in a CompiledModule contains distinct values.
- *SignatureChecker* checks that signatures are well-formed when they are used for function parameters, local variables, and struct fields.
- *InstructionConsistency* verifies instruction consistency.
- *Constants* verifies that constants are of primitive types and that the data of each constant is correctly serialized for its type.
- *CodeUnitVerifier* verifies the correctness of function body code, via stack_usage_verifier.rs and abstract_interpreter.rs.
- *script_signature* verifies that a script or entry function has a valid signature.
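The passes above can be thought of as a chain in which the first failure rejects the script. As a rough illustration only, the following Rust sketch chains three of them. `CompiledScript` here is a hypothetical stand-in with boolean flags, and `VerifyError`, `Pass`, and `verify_script` are made-up names for this sketch, not the real move-bytecode-verifier API.

```rust
/// Hypothetical stand-in for a deserialized script. Each flag says whether the
/// script would satisfy the pass of the same name.
struct CompiledScript {
    signatures_ok: bool,
    constants_ok: bool,
    code_ok: bool,
}

/// Hypothetical error type: carries the name of the pass that rejected the script.
#[derive(Debug, PartialEq)]
struct VerifyError(&'static str);

/// Every pass has the same shape, so the passes can be stored in a list.
type Pass = fn(&CompiledScript) -> Result<(), VerifyError>;

fn check(ok: bool, pass_name: &'static str) -> Result<(), VerifyError> {
    if ok { Ok(()) } else { Err(VerifyError(pass_name)) }
}

fn signature_checker(s: &CompiledScript) -> Result<(), VerifyError> {
    check(s.signatures_ok, "SignatureChecker")
}

fn constants(s: &CompiledScript) -> Result<(), VerifyError> {
    check(s.constants_ok, "Constants")
}

fn code_unit_verifier(s: &CompiledScript) -> Result<(), VerifyError> {
    check(s.code_ok, "CodeUnitVerifier")
}

/// Runs the passes in order and stops at the first one that fails.
fn verify_script(script: &CompiledScript) -> Result<(), VerifyError> {
    let passes: [Pass; 3] = [signature_checker, constants, code_unit_verifier];
    for pass in passes.iter() {
        pass(script)?;
    }
    Ok(())
}
```

In this sketch a script that fails both the constants and the code checks is reported as rejected by `Constants`, because that pass runs first.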
CodeUnitVerifier
The vulnerability was in the stack size check performed when verifying blocks of code. Specifically, stack_size_increment can be indirectly controlled by constructing an oversized num_pushes, resulting in an integer overflow.
Reproduction
Install the aptos-cli tool:
$ curl -fsSL "https://aptos.dev/scripts/install_cli.py" | python3
Run an Aptos node:
$ git clone git@github.com:aptos-labs/aptos-core.git ~/aptos-core && cd ~/aptos-core
PoC bytecode file:
Disassemble the bytecode. We can see that lines 1 and 2 of the code are two VecUnpack instructions.
VecUnpack pushes the elements of a vector onto the stack. In other words, the instruction unpacks a statically known number of elements onto the stack.
Thus, the two instructions make the VM add 3315214543476364830 and 18394158839224997406 via stack_size_increment += num_pushes;. Their sum exceeds the maximum value of u64 (18446744073709551615), which triggers Rust's add-overflow panic and crashes the Aptos node.
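To make the arithmetic concrete, here is a minimal Rust sketch of the accumulation. The `Bytecode` enum and both helper functions are simplified, hypothetical stand-ins for the verifier's stack-usage logic; only the two `num_pushes` values come from the PoC. The sketch uses `wrapping_add` to show the overflow without panicking, and `checked_add` as one possible way to reject the code (not necessarily how the actual fix was written).

```rust
/// Simplified stand-in for the Move instruction set: only what the sketch needs.
enum Bytecode {
    /// Unpacks a vector into `num_pushes` stack values (the type index is omitted).
    VecUnpack(u64),
    Ret,
}

/// The two VecUnpack instructions from the PoC, followed by a return.
fn poc_code() -> Vec<Bytecode> {
    vec![
        Bytecode::VecUnpack(3315214543476364830),
        Bytecode::VecUnpack(18394158839224997406),
        Bytecode::Ret,
    ]
}

/// Mirrors `stack_size_increment += num_pushes`, but with wrapping arithmetic
/// so that the sketch can observe the overflow instead of panicking.
fn stack_increment_wrapping(code: &[Bytecode]) -> u64 {
    let mut stack_size_increment: u64 = 0;
    for instr in code {
        if let Bytecode::VecUnpack(num_pushes) = instr {
            stack_size_increment = stack_size_increment.wrapping_add(*num_pushes);
        }
    }
    stack_size_increment
}

/// Overflow-aware variant: `checked_add` yields `None` when the sum does not
/// fit in a u64, so the verifier could return an error instead of crashing.
fn stack_increment_checked(code: &[Bytecode]) -> Option<u64> {
    let mut stack_size_increment: u64 = 0;
    for instr in code {
        if let Bytecode::VecUnpack(num_pushes) = instr {
            stack_size_increment = stack_size_increment.checked_add(*num_pushes)?;
        }
    }
    Some(stack_size_increment)
}
```

With the PoC values, `stack_increment_checked(&poc_code())` returns `None`, whereas a plain `+=` on the same values panics in any Rust build that has overflow checks enabled.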
move new Test
Crash the node:
aptos move run-script --compiled-script-path 1.mv
The node backend logs show:
#2: The Loss of Funds Bug / Type Confusion Bug
A bug in the Move virtual machine can cause loss of funds in Aptos applications through type confusion.
Introduction
Let's first look at the README in the Move repo for the bytecode verifier: https://github.com/move-language/move/tree/96d7dd69c5fe2e1aa2c36831c8d0154c3e3acfe0/language/move-bytecode-verifier
The bug occurs at the type safety checking stage, which the README describes as follows.
Type Safety
The second phase of the analysis checks that each operation, primitive or defined function, is invoked with arguments of appropriate types. The operands of an operation are values located either in a local variable or on the stack. The types of local variables of a function are already provided in the bytecode. However, the types of stack values are inferred. This inference and the type checking of each operation can be done separately for each block. Since the stack height at the beginning of each block is n and does not go below n during the execution of the block, we only need to model the suffix of the stack starting at n for type checking the block instructions. We model this suffix using a stack of types on which types are pushed and popped as the instruction stream in a block is processed. Only the type stack and the statically-known types of local variables are needed to type check each instruction.
The fix commit adds a check to the VecPack operator: the previous code did not check that the operand types are consistent with the vector's element type, while the fixed code does.
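The effect of the missing check can be sketched in a few lines of Rust. This is an illustration under simplifying assumptions, not the code from the fix commit: `Ty`, `vec_pack_unchecked`, and `vec_pack_checked` are hypothetical names, and the real verifier tracks much richer type information than this toy type stack.

```rust
/// Hypothetical abstract types tracked on the verifier's type stack.
#[derive(Clone, Debug, PartialEq)]
enum Ty {
    U64,
    Struct(&'static str),
    Vector(Box<Ty>),
}

/// Pre-fix behaviour as described above: pop `n` operands without comparing
/// their types to the declared element type, then push `vector<elem>`.
fn vec_pack_unchecked(stack: &mut Vec<Ty>, elem: &Ty, n: usize) -> Result<(), String> {
    if stack.len() < n {
        return Err("stack underflow".to_string());
    }
    let new_len = stack.len() - n;
    stack.truncate(new_len);
    stack.push(Ty::Vector(Box::new(elem.clone())));
    Ok(())
}

/// Post-fix behaviour: every popped operand must have exactly the element type.
fn vec_pack_checked(stack: &mut Vec<Ty>, elem: &Ty, n: usize) -> Result<(), String> {
    if stack.len() < n {
        return Err("stack underflow".to_string());
    }
    let start = stack.len() - n;
    if stack[start..].iter().any(|operand| operand != elem) {
        return Err("VecPack operand type mismatch".to_string());
    }
    stack.truncate(start);
    stack.push(Ty::Vector(Box::new(elem.clone())));
    Ok(())
}
```

With a `Struct("TwoCap")` on the stack and `Struct("OneCap")` as the element type, the unchecked variant happily produces a `vector<OneCap>`, while the checked variant returns an error.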
Let's see how VecPack works in the Move VM.
aptos move init --name test_move
module test::test_move{
move disassemble --name test_move
module cafe.test_move {
The first instruction is VecPack, which initializes a vector. The first parameter is a type index, which is resolved to a specific type at compile time; in this example, the element type of the vector is u64. The second parameter is the number of elements used to initialize the vector. Because we use vector::empty<u64>(), there are no elements to initialize the vector with and the parameter is 0. If we used vector<u64>[0,1,2]; instead, the second parameter would be 3.
// Move bytecode v5
That concludes the introduction to VecPack. Now let's reproduce the vulnerability with an example.
Reproduction
First, clone the Aptos project:
git clone https://github.com/aptos-labs/aptos-core.git
Switch to an old commit:
git checkout 649fb13e021f8e6a4d28c3410767a9af6106dbb1
The code at this commit already includes the fix, so we need to replace the Move version it depends on:
Replace every occurrence of "77750b37bb3663d00a7c4058937fed42ceb3089e" with "59265662d0a44ba53b09ba3c4b2248efdf08c622" in the repo.
Next, open a tmux session and set up a local Aptos test environment with the following command:
cargo run --release -p aptos -- node run-local-testnet --with-faucet --faucet-port 8081 --force-restart --assume-yes
This command also builds the aptos-cli tool at aptos-core/target/release/aptos, which we use to interact with the validator node.
We can copy this tool into our shell path: cp target/release/aptos ~/bin
Open another tmux session and publish the module to the Aptos network.
aptos init
We have now published a coin called TestCoin to the local Aptos network.
TestCoin code:
module TestCoin::test_coin {
We mainly focus on the two structs defined here: one is called OneCap and the other TwoCap.
There is also a function called test, which prints "hello" when it is invoked with a OneCap parameter.
aptos move publish --named-addresses TestCoin=default
As the fix commit shows, we can use VecPack to pack an element of one type into a vector of another, targeted type; unpacking this vector later gives us a value of the targeted type.
Our original script looks like this:
script {
This is a valid script: we get a OneCap instance and call test with it.
Here are the details of the bytecode above (Move provides a document that explains its bytecode format: https://github.com/move-language/move/blob/main/language/documentation/spec/vm.md)
MAGIC
Let’s directly run this script first.
aptos move run-script --compiled-script-path build/TestCoin/bytecode_scripts/poc.mv --framework-local-dir ~/move-language/aptos-core/aptos-move/framework/aptos-framework
The test function is invoked normally.
Now, let's try to call the test function directly with a TwoCap instance.
We can do this by modifying the code section in bytecode like below.
Code Content
(Don't forget to also change the number of instructions in the code header.)
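When patching the count by hand, it helps to know how the number is encoded. The sketch below assumes the instruction count is stored as a ULEB128 integer, which is how the Move binary format encodes counts and indices; `encode_uleb128` and `decode_uleb128` are hypothetical helper names written for this illustration, not functions from the Move codebase.

```rust
/// Encodes `value` as ULEB128: 7 payload bits per byte, least significant
/// group first, with the high bit set on every byte except the last.
fn encode_uleb128(mut value: u64) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return out;
        }
        out.push(byte | 0x80);
    }
}

/// Decodes a ULEB128 integer from the start of `bytes`.
/// Returns the value and the number of bytes consumed, or `None` if the
/// input ends before the final byte or is too long for a u64.
fn decode_uleb128(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    for (i, b) in bytes.iter().enumerate() {
        let shift = 7 * (i as u32);
        if shift >= 64 {
            return None;
        }
        value |= u64::from(b & 0x7f) << shift;
        if (b & 0x80) == 0 {
            return Some((value, i + 1));
        }
    }
    None
}
```

Under this assumption, any count below 128 occupies a single byte, so changing the number of instructions in a small script means editing one byte and does not shift the rest of the file.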
After modification, we get the bytecode:
aptos move run-script --compiled-script-path test_poc.mv --framework-local-dir ~/move-language/aptos-core/aptos-move/framework/aptos-framework
Unsurprisingly, we get an error:
{
This is because the element in the vector is of type TwoCap: we unpack this TwoCap onto the stack and pass it as the parameter to test.
But as mentioned before, VecPack misses a check when packing the elements. We can change the type index in the VecPack instruction and also in VecUnpack, and thereby forge a OneCap from a TwoCap using vector operations.
aptos move run-script --compiled-script-path test_poc1.mv --framework-local-dir ~/move-language/aptos-core/aptos-move/framework/aptos-framework
We successfully call the test() function.
Now that we understand the bug, we can simplify the code:
- Call get_two_cap(); the return value is pushed onto the stack.
- VecPack the value on the stack, using the OneCap index.
- Unpack the value back onto the stack; the value becomes a OneCap.
- Call test() using the value on the stack.
- Ret.
1101 CALL(index->FUNCTION_HANDLES)
Our PoC works.
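The five steps can be traced through a toy type checker. The Rust sketch below is a simplification under stated assumptions: `Ty`, `Instr`, `type_check`, and `poc` are hypothetical names, `test` is assumed to take its OneCap by value and return nothing, and the `check_vec_pack` flag switches between the pre-fix and post-fix behaviour of VecPack.

```rust
#[derive(Clone, Debug, PartialEq)]
enum Ty {
    OneCap,
    TwoCap,
    VecOf(Box<Ty>),
}

/// Hypothetical, heavily reduced instruction set for the trace.
enum Instr {
    Call { params: Vec<Ty>, ret: Option<Ty> },
    VecPack(Ty, usize),
    VecUnpack(Ty, usize),
    Ret,
}

/// Type checks `code` with an abstract stack of types.
/// `check_vec_pack == false` models the verifier before the fix.
fn type_check(code: &[Instr], check_vec_pack: bool) -> Result<(), String> {
    let mut stack: Vec<Ty> = Vec::new();
    for instr in code {
        match instr {
            Instr::Call { params, ret } => {
                // Arguments are popped in reverse order of the parameters.
                for expected in params.iter().rev() {
                    if stack.pop().as_ref() != Some(expected) {
                        return Err("call argument has the wrong type".to_string());
                    }
                }
                if let Some(ty) = ret {
                    stack.push(ty.clone());
                }
            }
            Instr::VecPack(elem, n) => {
                for _ in 0..*n {
                    match stack.pop() {
                        None => return Err("stack underflow".to_string()),
                        Some(operand) => {
                            if check_vec_pack && operand != *elem {
                                return Err("VecPack operand has the wrong type".to_string());
                            }
                        }
                    }
                }
                stack.push(Ty::VecOf(Box::new(elem.clone())));
            }
            Instr::VecUnpack(elem, n) => {
                if stack.pop() != Some(Ty::VecOf(Box::new(elem.clone()))) {
                    return Err("VecUnpack expects a vector of the given type".to_string());
                }
                for _ in 0..*n {
                    stack.push(elem.clone());
                }
            }
            Instr::Ret => {
                return if stack.is_empty() {
                    Ok(())
                } else {
                    Err("values left on the stack".to_string())
                };
            }
        }
    }
    Err("code does not end with Ret".to_string())
}

/// The simplified PoC: get_two_cap, VecPack, VecUnpack, test, Ret.
fn poc() -> Vec<Instr> {
    vec![
        Instr::Call { params: vec![], ret: Some(Ty::TwoCap) }, // get_two_cap()
        Instr::VecPack(Ty::OneCap, 1),
        Instr::VecUnpack(Ty::OneCap, 1),
        Instr::Call { params: vec![Ty::OneCap], ret: None }, // test(OneCap)
        Instr::Ret,
    ]
}
```

In this model `type_check(&poc(), false)` accepts the script, because the TwoCap is relabelled as a OneCap on its way through the vector, while `type_check(&poc(), true)` rejects it at the VecPack instruction.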
Moreover, it seems that this only works with struct types. I further tested changing u64 to u128, and it failed: the internal implementation raises INTERNAL_TYPE_ERROR when casting u64 to u128.
This can happen in a real-world application when OneCap is a capability type that should be restricted, like MintCapability, and TwoCap is not, like ViewBalanceCapability.
It can then easily cause loss of funds or other severe consequences.