On February 25th, 2023, the Solana Mainnet Beta cluster experienced long block finalization times that degraded the network’s performance. According to a recent report by Solana, the root cause of the outage was congestion within the primary block-propagation protocol, known as “Turbine.”
The unusual network load overwhelmed Turbine’s capacity, forcing the majority of block data to be transferred over the significantly slower fallback Block Repair protocol, Solana explained. Validator operators assumed that the freshly upgraded validator software was responsible and attempted a live downgrade of the cluster, without success. Validators eventually restarted the cluster manually on the last known stable validator software version, which resolved the issue.
Further analysis by Solana indicated that the abnormal Turbine traffic was caused by block-forwarding services that failed when they encountered an unexpectedly large block. The block overwhelmed validator deduplication filters, so its data was constantly re-forwarded across validators. As fresh blocks were produced, they aggravated the problem until the protocol became saturated.
The degradation of the network began when a malfunctioning validator broadcast an abnormally large block during its leader slot. Core developers identified that Turbine was flooded with data and that the Block Repair protocol was seeing abnormally high traffic.
The developers determined that the huge block had been built on a parent slot far in the past. Its enormous size was most likely caused by padding out proof of history with virtual ticks, which were required to show that the leader had observed the intervening period of time.
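For a sense of why such a block balloons in size, here is a minimal Rust sketch of the arithmetic, assuming Solana’s default of 64 proof-of-history ticks per slot; the function name and slot numbers are illustrative and not taken from the validator codebase.

```rust
// Hypothetical back-of-the-envelope sketch (not Solana source code): it
// illustrates why building on a very old parent inflates a block. Every
// skipped slot still has to be represented by proof-of-history ticks, so
// the tick count grows linearly with the distance to the parent slot.
const TICKS_PER_SLOT: u64 = 64; // Solana's default ticks per slot

/// Number of virtual ticks a leader must emit to cover all slots
/// between its parent slot and its own leader slot.
fn virtual_ticks_required(parent_slot: u64, leader_slot: u64) -> u64 {
    leader_slot.saturating_sub(parent_slot) * TICKS_PER_SLOT
}

fn main() {
    // Illustrative numbers only: a parent tens of thousands of slots back.
    let parent_slot = 100_000;
    let leader_slot = 150_000;
    println!(
        "{} virtual ticks needed to span the gap",
        virtual_ticks_required(parent_slot, leader_slot)
    );
}
```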
During the event, data shreds from the large block were filtered properly; however, recovery shreds did not include the parent slot metadata. As a result, the large block’s recovery shreds overwhelmed the deduplication logic, producing false negatives, so duplicate shreds were not discarded but were instead retransmitted in a continuous cycle.
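The following rough Rust sketch shows how a deduplication filter can miss duplicates when recovery shreds lack the parent slot metadata; the key layout, type names, and fields are assumptions made for illustration, not the Solana Labs implementation.

```rust
use std::collections::HashSet;

// Hypothetical sketch: a retransmit stage that drops shreds it has already
// forwarded. The dedup key here is assumed to include the parent-slot
// metadata. When a recovery shred arrives without that field, its key
// differs from the one recorded for the original data shred, the lookup
// misses (a false negative), and the same shred is forwarded again.
#[derive(Hash, PartialEq, Eq, Clone)]
struct ShredKey {
    slot: u64,
    index: u32,
    parent_offset: Option<u16>, // None for recovery shreds in this sketch
}

struct RetransmitFilter {
    seen: HashSet<ShredKey>,
}

impl RetransmitFilter {
    fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// Returns true if the shred has not been seen and should be forwarded.
    fn should_forward(&mut self, key: ShredKey) -> bool {
        self.seen.insert(key)
    }
}

fn main() {
    let mut filter = RetransmitFilter::new();

    let data_shred = ShredKey { slot: 42, index: 7, parent_offset: Some(3) };
    let recovered = ShredKey { slot: 42, index: 7, parent_offset: None };

    assert!(filter.should_forward(data_shred.clone())); // first sighting: forward
    assert!(!filter.should_forward(data_shred));        // exact duplicate: dropped
    // Same logical shred, but the missing parent metadata changes the key,
    // so it slips past the filter and is retransmitted again.
    assert!(filter.should_forward(recovered));
    println!("recovered shred slipped past the dedup filter");
}
```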
Turbine was swamped by the looping shreds, and block propagation was relegated to Block Repair, a slower protocol designed for retrieving shreds that fail to arrive via Turbine and for receiving block data during initial validator catch-up.
These continuously recurring shreds hindered block propagation, which prevented consensus block finalization, although optimistic block confirmation persisted for a while. When a validator notices that its most recently voted-on slot is 400 or more slots ahead of the last finalized slot, it enters a safety state in which it includes only votes, and no economic transactions, in the blocks it produces.
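Below is a minimal Rust sketch of that safety check; the 400-slot threshold comes from the description above, while the function and constant names are hypothetical.

```rust
// Minimal sketch of the safety-state condition described above, with
// assumed names. Only the 400-slot threshold comes from the article.
const SAFETY_THRESHOLD_SLOTS: u64 = 400;

/// In this sketch, a validator whose latest vote is 400 or more slots
/// ahead of the last finalized slot stops packing economic (non-vote)
/// transactions into the blocks it produces and includes only votes.
fn votes_only_mode(last_voted_slot: u64, last_finalized_slot: u64) -> bool {
    last_voted_slot.saturating_sub(last_finalized_slot) >= SAFETY_THRESHOLD_SLOTS
}

fn main() {
    assert!(!votes_only_mode(1_050, 1_000)); // finalization keeping up: normal mode
    assert!(votes_only_mode(1_400, 1_000));  // 400+ slots behind: votes only
    println!("safety-state check behaves as described");
}
```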
Following the root cause analysis, Solana stated that improvement strategies are being implemented. Specifically, “enhancements to the deduplication logic are now in place to mitigate saturation of this filter in the Solana Labs validator client v1.13.7 and v1.14.17,” Solana said.