Solana Co-Founder: Building a Global State Machine and Analyzing Solana's Ultimate Architecture

**Author: Anatoly Yakovenko, Co-Founder & CEO, Solana**

Translated by: 1912212.eth, Foresight News

Solana’s goal is to synchronize a single, permissionless global state machine as fast as the laws of physics allow. I believe the architecture capable of achieving this will look like the following:

  • A large number of full nodes: more than 10,000 (N > 10,000)

In order for the network to function as a global state machine, it needs to support numerous full nodes. Turbine has proven that fast replication to very large networks is scalable on modern hardware and networks.

  • A large number of block-producing leaders: more than 10,000 (N > 10,000)
  • 4 to 16 randomly selected concurrent leaders producing blocks at the same time

Concurrent leaders give the network multiple locations around the globe at which user transactions can be ordered. This reduces the distance between users and the network and removes the need for full-node verification before transactions are added to the chain.

  • A block time of 120 milliseconds

Short block times create fast finality points, enhance censorship resistance, improve user experience, reduce windows for reordering transactions, and accelerate the network overall.

  • A voting quorum (a consensus subcommittee) of 200 to 400 nodes, randomly sampled and rotated each epoch of 4 to 8 hours

Consensus is essential for choosing among forks, which arise from network partitions. A sample of 200 or more nodes is statistically representative of every major partition in the network and closely matches its actual distribution, so votes from all full nodes are not required; 200 is sufficient. Limiting the quorum to a subcommittee reduces the memory and network bandwidth required to support 120 ms blocks: shortening the block time naturally increases the number of votes sent per second, putting pressure on the resources allocated to consensus.
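As a rough illustration of the quorum described above, here is a minimal sketch of stake-weighted sampling of a fixed-size committee from a large validator set, re-drawn deterministically from a per-epoch seed. The function name, seed mechanism, and stake distribution are all assumptions for illustration, not Solana's actual sampling algorithm.

```python
import random

def sample_quorum(stake_by_node: dict, size: int, epoch_seed: int) -> list:
    """Stake-weighted sampling without replacement of a fixed-size quorum
    from the full validator set, re-drawn each epoch from a shared seed.
    Illustrative only; not Solana's actual committee selection."""
    rng = random.Random(epoch_seed)  # deterministic: every node draws the same quorum
    nodes = list(stake_by_node)
    weights = [stake_by_node[n] for n in nodes]
    chosen = []
    for _ in range(min(size, len(nodes))):
        # draw one node with probability proportional to stake, then remove it
        pick = rng.choices(range(len(nodes)), weights=weights, k=1)[0]
        chosen.append(nodes.pop(pick))
        weights.pop(pick)
    return chosen

# hypothetical validator set: >10,000 nodes with varying stake
validators = {f"node{i}": 1 + (i % 50) for i in range(10_000)}
quorum = sample_quorum(validators, size=200, epoch_seed=42)
```

Because the draw is seeded, every honest node computes the same 200-member quorum for the epoch without extra communication.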

The real challenge with 120 ms blocks is replaying all user transactions. Since the network is permissionless, it is extremely difficult to guarantee a homogeneous execution environment with reliable execution times for arbitrary user code. It is possible, but only by limiting the compute resources available to user transactions and over-provisioning every node for the worst case.

However, there is no reason to require the full state from consensus nodes voting on a fork, or from a leader building on a fork. To keep the quorum of consensus nodes and the leaders in sync, the state only needs to be computed once per epoch.

Asynchronous execution

Motivation

Synchronous execution requires every node that votes and produces blocks to be over-provisioned for the worst-case execution time of any block. Asynchronous execution is one of the rare cases with very few trade-offs. Consensus nodes can do less work before voting. Execution work can be aggregated and batched, making it efficient with virtually no cache misses. It can even run on a completely different machine than the consensus node or leader. Users who want synchronous execution can allocate enough hardware to perform every state transition in real time without waiting for the rest of the network.

Given the diversity of applications and core developers, it’s worth planning for one major protocol change per year. If I had to pick one, my choice would be asynchronous execution.

Overview

Currently, validators replay all transactions in each block as fast as they can, and vote only after the complete state for the block has been computed. The goal of this proposal is to separate the vote on fork choice from the computation of the block’s full state transition.

Validators voting in the quorum only need to select a fork; they don’t need to compute any state at all. They need state only once per epoch, to compute the next quorum.

The vote program is adjusted so that it can be executed independently. Nodes execute only the vote program before voting. Since vote accounts take up little space, memory requirements should be relatively small, and because votes have very predictable execution times, there should be little to no jitter in executing the vote program.

All non-vote transactions can be computed asynchronously. This allows replay to execute all non-vote transactions in batches, prefetching and JIT-compiling all programs ahead of time and eliminating virtually all cache misses. The long-term goal is that only machines that require real-time, low-latency, full-state computation are provisioned for that task. Presumably, users will pay for the extra hardware.

Once the fork selection and state execution are separated, it becomes easier to speed things up:

  • Asynchronous execution
  • A fixed-size voting quorum rotated each epoch
  • 200 ms block time

Since user-transaction replay no longer blocks fork choice, jitter stops being a concern when shrinking the block time. The only consideration is that at 200 milliseconds the validators’ vote rate doubles. A fairly straightforward change to how the quorum is computed will let us fix the quorum size at 200 or 400, or whatever number seems appropriate.

It is also natural to fully separate execution from consensus. A consensus node that only needs to check the vote program accounts of a fixed-size quorum will be faster to restart.

In fact, I expect confirmation times to improve, because the vast majority of the quorum votes as quickly as possible, and while those votes propagate, the nodes that serve users full-state execution results can execute transactions concurrently. Any replay jitter we see today should therefore overlap with vote network propagation.

Vote

  • Vote accounts must hold enough SOL to cover 2 epochs of votes.
  • Vote transactions must be simple. A vote that is not simple must fail, and block producers should drop complex votes.
  • Withdrawing SOL from a vote account is allowed as long as the balance does not drop below 1 epoch of votes.
  • To zero out all lamports, the Vote CLOSE instruction must require a full epoch to elapse: a vote account marked CLOSE in epoch 1 can only be closed in epoch 2. CLOSE withdraws all SOL and deletes the vote account. Once an account is marked CLOSE, it can only be deleted and can never be reopened.
  • Votes contain a VoteBankHash instead of the regular BankHash.
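The vote-account rules above can be sketched as a small state machine. The class, the per-epoch vote cost, and the return conventions are all illustrative assumptions, not Solana's actual vote program.

```python
# Hypothetical lamport cost of one epoch of votes (illustrative constant).
VOTES_PER_EPOCH_COST = 100

class VoteAccount:
    """Sketch of the rules: 2-epoch minimum to vote, 1-epoch minimum
    after withdrawal, and an epoch-delayed, irreversible CLOSE."""

    def __init__(self, lamports: int):
        self.lamports = lamports
        self.close_marked_epoch = None

    def can_vote(self) -> bool:
        # must cover 2 epochs of votes and not be marked CLOSE
        return (self.close_marked_epoch is None
                and self.lamports >= 2 * VOTES_PER_EPOCH_COST)

    def withdraw(self, amount: int) -> bool:
        # balance may not drop below 1 epoch of votes
        if self.lamports - amount < VOTES_PER_EPOCH_COST:
            return False
        self.lamports -= amount
        return True

    def mark_close(self, epoch: int) -> None:
        # marking CLOSE is one-way; the account can never be reopened
        if self.close_marked_epoch is None:
            self.close_marked_epoch = epoch

    def close(self, epoch: int) -> bool:
        # a full epoch must elapse between marking and closing
        if self.close_marked_epoch is None or epoch <= self.close_marked_epoch:
            return False
        self.lamports = 0  # all SOL withdrawn, account deleted
        return True
```

For example, an account marked CLOSE in epoch 1 fails `close(1)` but succeeds at `close(2)`, matching the one-epoch delay described above.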

Leader schedule and quorum

Only validators that meet the following criteria:

  • Stake amount > X
  • SOL balance > 2 epochs of votes
  • and not marked as CLOSE

enter the leader schedule and count toward the quorum. For version 2, we can separate the LeaderSchedule from the Quorum; they don’t need to share the same requirements.
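The eligibility criteria above reduce to a simple filter. The field names, the stand-in for the stake threshold "X", and the per-epoch vote cost below are hypothetical, chosen only to make the sketch runnable.

```python
# Illustrative constants: "X" and the per-epoch vote cost are assumptions.
MIN_STAKE = 1_000
VOTES_PER_EPOCH = 100

def eligible(v: dict) -> bool:
    """Leader-schedule / quorum eligibility per the three criteria above."""
    return (v["stake"] > MIN_STAKE
            and v["sol"] > 2 * VOTES_PER_EPOCH
            and not v["close_marked"])

candidates = [
    {"name": "a", "stake": 5_000, "sol": 300, "close_marked": False},
    {"name": "b", "stake": 5_000, "sol": 150, "close_marked": False},  # too little SOL
    {"name": "c", "stake": 500,   "sol": 300, "close_marked": False},  # too little stake
    {"name": "d", "stake": 5_000, "sol": 300, "close_marked": True},   # marked CLOSE
]
quorum_members = [v["name"] for v in candidates if eligible(v)]
```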

VoteBankHash calculation

Unlike the BankHash, which covers all transactions, validators compute the VoteBankHash over only the simple vote transactions from validators in the LeaderSchedule. All other transactions are ignored. After replaying all the votes, the VoteBankHash is computed in the same format as the current BankHash.

The VoteBankHash should accumulate the previous VoteBankHash, not the full BankHash.
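The accumulation rule above can be sketched as a hash chain: each block's VoteBankHash commits to the previous VoteBankHash plus only that block's simple vote transactions. SHA-256 and the byte layout here are assumptions for illustration; the actual BankHash format differs.

```python
import hashlib

def vote_bank_hash(prev_vote_bank_hash: bytes, vote_txs: list) -> bytes:
    """Sketch of VoteBankHash chaining: accumulate the previous
    VoteBankHash (not the full BankHash) and this block's simple votes."""
    h = hashlib.sha256()
    h.update(prev_vote_bank_hash)
    for tx in vote_txs:  # only simple votes from LeaderSchedule validators
        h.update(tx)
    return h.digest()
```

A node that has only ever replayed votes can still extend this chain, which is what lets quorum members skip user-transaction replay entirely.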

BankHash calculations

For all optimistically confirmed blocks (configurable to all blocks), validators begin computing the UserBankHash, which covers all state transitions except the transactions already counted in the VoteBankHash calculation.

The BankHash is then derived by accumulating (VoteBankHash, UserBankHash). 99.5% of validators submit the BankHash as part of their votes every 100 slots; although it is submitted every 100 slots, it is computed at every slot. It may also be worthwhile for a small percentage of nodes to continuously submit the BankHash in gossip as a soft signal that no non-determinism was observed.

If fewer than 67% of validators submit full BankHash calculations, leaders should cut the block space available for user transactions and writable accounts by 50%. This protects the chain from abuse that could excessively inflate replay time.

BankHash should accumulate previous BankHashes.
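The two rules above can be sketched together: the BankHash chains over the previous BankHash plus the (VoteBankHash, UserBankHash) pair, and block space halves when BankHash submission drops below 67%. The hash function, byte layout, and CU figures are assumptions for illustration.

```python
import hashlib

def bank_hash(prev_bank_hash: bytes, vote_bank_hash: bytes,
              user_bank_hash: bytes) -> bytes:
    """Sketch: each BankHash accumulates the previous BankHash and
    commits to the (VoteBankHash, UserBankHash) pair."""
    h = hashlib.sha256()
    h.update(prev_bank_hash)
    h.update(vote_bank_hash)
    h.update(user_bank_hash)
    return h.digest()

def user_block_space(base_cus: int, submit_fraction: float) -> int:
    """Halve user block space when fewer than 67% of validators
    submit full BankHash calculations (the protective rule above)."""
    return base_cus if submit_fraction >= 0.67 else base_cus // 2
```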

Bankless leader

During block production, the leader will likely not have the state being used to build the block, and executing all transactions during block production is not ideal anyway.

  • Leaders maintain a cache of fee-payer account balances.
  • If a fee-payer account is used as the source of a system transfer, or is passed as a writable account alongside the system program into another program, its cached balance is set to 0.
  • Blocks are packed by declared compute units (CUs) in local fee-priority order until the block is full.
  • Fees are deducted from the fee-payer balance cache.
  • The fee-payer balance cache is replenished from the BankHash calculations.
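The packing loop above can be sketched as follows. The field names, the CU limit, and the cache structure are assumptions; the real scheduler is more involved.

```python
# Hypothetical per-block compute budget (illustrative, not a protocol constant).
BLOCK_CU_LIMIT = 48_000_000

def pack_block(txs: list, fee_payer_cache: dict,
               cu_limit: int = BLOCK_CU_LIMIT) -> list:
    """Sketch of bankless-leader packing: sort by local fee priority,
    admit by declared CUs, and deduct fees from the cached fee-payer
    balances, skipping payers the cache says cannot cover the fee."""
    block, used = [], 0
    for tx in sorted(txs, key=lambda t: t["priority_fee"], reverse=True):
        if used + tx["cus"] > cu_limit:
            continue  # declared CUs would overflow the block
        payer = tx["payer"]
        if fee_payer_cache.get(payer, 0) < tx["fee"]:
            continue  # cached balance can't cover the fee
        fee_payer_cache[payer] -= tx["fee"]
        used += tx["cus"]
        block.append(tx)
    return block
```

Because the cache is only advisory, a stale entry produces at worst a failed (spam) transaction in the block, which is exactly the low-cost failure mode the next paragraph describes.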

The network’s cost for failed spam transactions is relatively small: only the bytes stored in the archive and the bandwidth needed to propagate them in blocks.

Since validators already seek to maximize their earnings, they have ample incentive to maintain an accurate fee-payer cache. In the absence of a penalty mechanism, any node in the network could serve this cache over the long run, and if a server goes bad, a bankless leader operator should be able to easily switch to, or sample from, multiple sources.

Trade-offs

The main trade-off is the absence of a quorum signature over the full state that a full node serving users could check to confirm the state it serves matches what the rest of the quorum computed. The single authoritative definition of the state remains unchanged: replay every transaction in the ledger sequentially; no performance optimization may alter the result. So once the fork is finalized, there is only one correct state left to compute, as long as the runtime implementation is bug-free.

Nodes intended to reliably serve state should run multiple machines and clients and halt if their state executions disagree. This is essentially what operators should be doing today, because relying solely on the rest of the network introduces an honest-majority assumption.

Users can also sign transactions that assert a BankHash or trigger an abort. The rest of the network will execute such a transaction only if the BankHash it computes exactly matches the BankHash the RPC provider gave the user.
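A minimal sketch of that assertion check follows. The field name `asserted_bank_hash` and the string return values are illustrative assumptions; the actual instruction format is not specified in this proposal.

```python
def execute_with_assertion(tx: dict, computed_bank_hash: bytes) -> str:
    """Sketch: execute a transaction only if its (optional) asserted
    BankHash matches the one the network computed; otherwise abort."""
    asserted = tx.get("asserted_bank_hash")
    if asserted is not None and asserted != computed_bank_hash:
        return "abort"    # the RPC served the user a divergent state
    return "execute"      # assertion matched, or no assertion present
```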

Long-term stateless consensus node roadmap

A network with a fixed-size quorum needs only a tiny amount of state to boot: the quorum itself with its stake weights, plus all vote account balances. This fits in very little memory and a tiny snapshot file that can be distributed quickly and initialized quickly on restart.

If the quorum diverges from the full nodes, any full node tracking both the quorum and the state will halt. That means exchanges, fiat on-ramps, RPCs, bridges, and so on would all stop operating whenever the quorum and the state disagree, and such a failure would require only a very small percentage of faulty stateless consensus nodes.

Bankless leaders can rely on a sample of multiple full nodes to supply the initial fee-payer balance cache. Even if that cache is flawed, the result is spam in the block rather than a consensus failure. Operators should be able to monitor their leaders’ health and the percentage of spam injected into their blocks, and respond quickly to failures.
