[CCS'19] Learning to Fuzz from Symbolic Execution with Application to Smart Contracts

Posted on 2021-02-23 Edited on 2024-03-27

Abstract

Fuzzing and symbolic execution are two complementary techniques for discovering software vulnerabilities. Fuzzing is fast and scalable, but can be ineffective when it fails to randomly select the right inputs. Symbolic execution is thorough but slow and often does not scale to deep program paths with complex path conditions.

In this work, we propose to learn an effective and fast fuzzer from symbolic execution, by phrasing the learning task in the framework of imitation learning. During learning, a symbolic execution expert generates a large number of quality inputs improving coverage on thousands of programs. Then, a fuzzing policy, represented with a suitable architecture of neural networks, is trained on the generated dataset. The learned policy can then be used to fuzz new programs.

We instantiate our approach to the problem of fuzzing smart contracts, a domain where contracts often implement similar func- tionality (facilitating learning) and security is of utmost importance. We present an end-to-end system, ILF (for Imitation Learning based Fuzzer), and an extensive evaluation over >18K contracts. Our re- sults show that ILF is effective: (i) it is fast, generating 148 transac- tions per second, (ii) it outperforms existing fuzzers (e.g., achieving 33% more coverage), and (iii) it detects more vulnerabilities than existing fuzzing and symbolic execution tools for Ethereum.

Outline

Summary

Fuzzing is simple but tends to be stuck after running for a while, whereas symbolic execution is thorough but slow and expensive-to-execute. The authors notice that the inputs generated by a symbolic execution expert can form a valuable knowledge base for fuzzer to learn how to generate high-quality inputs, hence propose to bring imitation learning into the context of program testing.

Since smart contracts often implement similar functionality, the imitation learning based fuzzing idea is instantiated as a system so-called ILF on smart contracts. However, smart contracts are stateful programs, whose deep states can only be reached after executing a sequence of transactions in a valid order. Such features make pure fuzzing and symbolic execution technique struggle to achieve high coverage.

To address such challenge, ILF first leverages symbolic execution to generate effective inputs as dataset, then trains a neural network on the dataset to play the role of fuzzer. Extensive evaluation shows that the learned fuzzer preserves fast inputs generation capability while becomes more effective even on unseen programs. Specifically, it achieves more coverage than the existing smart contract fuzzer and uncovers more vulnerabilities than both pure fuzzing and symbolic execution tools.

Question: Could this system be enhanced with a coverage-guided policy? How to take care of the dependencies between transactions in a single sequence? Do they consider the influence between different sequences? For example, different sequences should cover different codes?

Overview

Architecture

Github link: https://github.com/eth-sri/ilf

MDP Mapping

Network

\(f(\bar{x})\)

sender

Amount