Usages and problems of Taint Tools

1. Tool Categories

Existing taint analysis tools can be divided into three categories.

a. Dynamic Taint Analysis

The first category of tools track the information flow from taint source to taint sink at runtime following the execution trace. Most of these dynamic analysis tools are built on the top of dynamic binary instrumentation (DBI) framework such as Pin and Valgrind. These tools dynamically insert analysis code into the target executable while the executable is running. Some other tools like PolyTracker work as an LLVM Pass that instruments taint tracking logic into the programs during compilation to perform taint analysis with lower overhead. Such tools always ship with their own compiler, for example, polybuild and polybuild++ of Tool PolyTracker. Given a specific input, the dynamic analysis tools will track how the taint is propagated along the executed path.

Tools Foundation Data Dependency
Dytan DBI Platform: Pin Explicit and Implicit
Triton DBI Platform: Pin Explicit
LibDFT DBI Platform: Pin Explicit
LibDFT64 DBI Platform: Pin Explicit
Taintgrind DBI Platform: Valgrind Explicit
DataFlow Sanitizer LLVM Pass
PolyTracker LLVM Pass Explicit and Implicit
TaintInduce
b. Static Taint Analysis but Emulate the Execution

The second category of tools work as an interpreter or a CPU emulator, which means these tools do not execute the target program, but emulate the functionalities and collect the essential information for analysis. Usually, we need to set up the environment such as stack, heap, the content of memory, and program input before performing analysis. During analysis, these tools process machine code one by one and transform the code into their pre-defined intermediate language, then the memory and register will be manipulated based on the intermediate language, with the taint propagated at the same time. At each program point during execution, the value of registers and memory cells can be monitored to check if they are tainted or not. As we can see, tools in this category also maintain an execution trace and only take care of the data and instructions along this execution trace, which is the same as dynamic taint analysis.

Tools Description Data Dependency
Bincat Plugin for IDA Explicit
Triton Convert instruction to IR to emulate execution Explicit
c. Static Taint Analysis

The third category of tools analyze the target program statically and wholly. All paths are taken into consideration. The taint source and sink are specified using annotations. Generally speaking, the sources are the return value of user-controllable functions, like input(), and the sinks are the parameters of functions that should be treated carefully or may lead to potential security issues, like os.command(). The tracking granularity of tools in this category is quite larger compared with the two before mentioned. Except for the taint source and sink, the tools in this category also support for sanitizing functions. Especially, Psalm, a tool mainly aiming for security analysis of web applications, will automatically remove the taint from data when executing functions like stripslashes, which removes backslashes used to prevent cross-site-scripting attacks.

Tools Description Data Dependency
Pysa reasons about data flows in Python applications Explicit
Psalm find errors in PHP applications Explicit

In the following sections, I will demonstrate the usage of several tools to illustrate the difference between these three kinds of tools. In the end, with the guidance of the idea that if the output changes with the different inputs, the output should be tainted once the input is tainted, I will show the potential difficulties in testing these analysis tools.

2. Dynamic Analysis Tool: Taintgrind

This tool Taintgrind is built on the top of Valgrind. There are two ways to track taint. One is using TNT_TAINT client request to introduce taint into a variable or memory location in the program. Take the following program as an example, we taint variable a, then pass it to get_sign function. The return value of get_sign implicitly depends on the parameter.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
#include "taintgrind.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int get_sign(int x){
if (x == 0){
return 0;
}
if (x < 0){
return -1;
}
return 1;
}

int main(int argc, char **argv){
unsigned int t;
int a = atoi(argv[1]);
//Defines int as tainted
TNT_TAINT(&a,sizeof(a));
int s = get_sign(a);
TNT_IS_TAINTED(t, &s, sizeof(s));

if (t)
printf("s is_tainted: %08x\n", t);
else
printf("s is not tainted\n");
return 0;
}

Compile the above program and run it with input 1. We check if the return value of get_sign is tainted using the client request TNT_IS_TAINT. The output of dynamic analysis shows how taint flows in the execution path through instructions one by one in the following form.

1
Location | Assembly instruction | Instruction type | Runtime value(s) | Information flow

However, we can see that Taintgrind does not support to analyze the implicit information flow, even though the taint has flowed into the conditional expression.

As to the explicit information flow like below, the return value of get_sign can be successfully tainted.

1
2
3
int get_sign(int x){
return x;
}

Another usage of Taintgrind is to taint the file input given an executable, such as gzip, and show how taint flows along the execution path.

1
./build/bin/taintgrind --file-filter=FAQ.txt gzip -c FAQ.txt

3. Dynamic Analysis Tool: PolyTracker

Unlike Taintgrind, PolyTracker instruments the taint analysis logic into programs during compilation to track which bytes of an input file are operated by which functions. Therefore, compared with dynamic binary instrumentation, such tools impose negligible performance overhead for almost all inputs, and is capable of tracking every byte of input at once. PolyTracker is built on the top of the LLVM DataFlowSanitizer.

PolyTrack works as LLVM Pass. The generated shared object .so can be loaded either through opt tool to instrument the LLVM bitcode, or through clang when the source code is avaliable.

1
2
3
4
# Through opt tool
opt -local -mypass.so -mypass code.bc
# Through clang
clang -Xclang -load -Xclang mypass.so code.c

PolyTracker provides compilers polybuild and polybuild++, which are wrappers around gclang and gclang++. When the source code is avaiable, we can instrument it by simply replacing the compiler with PolyTracker and passing --instrument-target option before the cflags during the normal build process.

1
2
3
4
# export CC=`which polybuild`
${CC} --instrument-target -g -o my_target my_target.c
# export CXX=`which polybuild++`
${CXX} --instrument-target -g -o my_target my_target.cpp

If only the binary executable is given, we can first extract the bitcode from the target, then instrument it by passing --instrument-bitcode option.

1
2
get-bc -b target
${CC}/{CXX} --instrument-bitcode target.bc -o target_track --libs <any libs go here>

To run the instrumented program, POLYPATH environment variable is requited to specify which input file's bytes are meant to be tracked. Currently, polytracker only expose the capability to specify a single file as the tainted input. Assume we have the following C program. Given an input file, this program computes a key based on the first three strings, then compares the key with 0x80000000, and finally outputs the result.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/utsname.h>
#include <sys/fcntl.h>
#include <fcntl.h>

struct user_info {
char *company;
char *dep;
char *name;
char *progname;
};

int custom_crc32(int len, char *buf) {
unsigned int poly = 0x04C11DB7;
unsigned int sr = ~0;
int i, j, k;

for (i = 0; i < len; i++) {
for (j = 0; j < 8; j++) {
k = sr >> 31;
sr <<= 1;
if ((k ^ *buf) & 1) {
sr ^= poly;
}
}
buf++;
}
return sr;
}

int compute_key(struct user_info *info) {
unsigned int k_comp, k_dept, k_name, k_progname;

k_comp = custom_crc32(strlen(info->company), info->company);
k_dept = custom_crc32(strlen(info->dep), info->dep);
k_name = custom_crc32(strlen(info->name), info->name);
k_progname = custom_crc32(strlen(info->progname), info->progname);

return k_comp*k_name;
}


int main(int argc, char *argv[]) {
int fd, i = 0;
unsigned int key;
char *provided_lic, *ptr, *temp[5];
char str[128], delim[] = " ";
struct user_info info;

info.company = info.name = info.dep = NULL;

if(argc < 2) {
printf("Usage: %s inputfile\n", argv[0]);
return 1;
}
// if use fopen and fread, polytracker will fail
fd = open(argv[1], O_RDONLY);
read(fd, str, 128);
ptr = strtok(str, delim);

// example file content:
// company department name 419B0F77970C4AC453973A5D99B995C989EBF204
//
// spilt the file content on whitespace
while (ptr != NULL) {
temp[i] = malloc(strlen(ptr));
memcpy(temp[i], ptr, strlen(ptr));
printf("%s\n", temp[i]);
ptr = strtok(NULL, delim);
i++;
}

info.progname = argv[0];
info.company = temp[0];
info.dep = temp[1];
info.name = temp[2];
provided_lic = temp[3];

// compute the key
key = compute_key(&info);
printf("Key=>[0x%x]\n", key);

// output result
if (key >= 0x8000000)
printf("Invalid serial %s\n", provided_lic);
else
printf("Thank you for registering !\n");
}

We can find that the last string (provided_lic) is actually not used when computing the key. Further, the three strings are only used in the condition statement (line 25) at function custom_crc32. We compile it using polybuild, the output shows several functions are instrumented.

image-20200926235948282

With the instrumented binary my_target,run the binary with POLYPATH environment to taint the input file.

1
2
3
4
POLYPATH=inputfile ./my_target inputfile

# inputfile:
# company department name 419B0F77970C4AC453973A5D99B995C989EBF204

The binary is executed as normal, also two artifacts polytracker_process_set.json and polytracker_forest.bin are created in current directory at the same time.

image-20200927000434633

The analysis results are saved inpolytracker_process_set.json, which contains the information of taint_sources, runtime_cfg, tainted_function, etc. The json file generated by my_target is shown as below:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
{
"canonical_mapping": { // mapping [taint label, original file offset]
"inputfile": [ // Totoally 65 bytes in inputfile
[65, 64],
[64, 63],
...
[2, 1],
[1, 0],
]
},
"runtime_cfg": { // the control flow graph made of instrumented function
"dfs$custom_crc32": [
"dfs$compute_key" // caller
],
"dfs$compute_key": [
"main" // caller
],
"main": [
"" // no caller for main
]
},
"taint_sources": { // taint source
"inputfile": {
"end_byte": 64,
"start_byte": 0
},
// stdin is the default taint source,
// however, without environment POLYPATH, the
// program cannot be executed, which leaves
// no space for tracking the taint from stdin
"stdin": {
"end_byte": 2147483646,
"start_byte": 0
}
},
"tainted_functions": { // how the tainted function process taint label
"dfs$custom_crc32": {
"cmp_bytes": [ // the taint label are used at condition statements (line 25)
1,
... // 1-7: company
7,
9,
... // 9-18: department
18,
20,
... // 20-23: name
23
],
"input_bytes": [ // the taint labels are the input of function
1,
... // 1-7: company
7,
9,
... // 9-18: department
18,
20,
... // 20-23: name
23
]
},
"main": {
"input_bytes": [
1,
2,
...
65
]
}
},
"tainted_input_blocks": {
"inputfile": [
[0, 64]
]
},
"version": "2.2.0"
}

Polytrack outputs function-to-input-bytes mapping to indicate which bytes of an input file are operated on by which functions and how. If an input byte flows into a function as parameter but has never been read by the function, the function will not be counted as a tainted function.

1
2
3
4
5
6
void testfunc(int len, char *buf) { // not a tainted function
for (int i = 0; i < len; i++) {
*buf = 1; // only write into the memory location buf
buf++;
}
}

Besides, Polytracker can identify whether the input bytes are used in condition statements (listed under the key tainted_functions/function/cmp_bytes in JSON file). However, there are some exceptions. The following function uses the length of buf to decide the value of a, then returns a to its caller. Polytracker will not label this function as tainted, even if the value of a is implicitly depends on input buf.

1
2
3
4
5
6
7
8
9
10
int testfunc(char *buf) { // not be labeled as tainted
int a = 0;

if (strlen(buf) > 5){
a = 1;
} else {
a = 2;
}
return a;
}

In fact, Polytrack is designed to be used in conjunction with PolyFile to automatically determine the semantic purpose of the functions in a parser. PolyFile is a file identification utility to parse the file structure and map each byte of the file into the corresponding semantic. Take PDF file as an example, PolyFile first identifies that the given file is in PDF format, then outputs the sematic of each byte in the file, such as PDFComment, PDFObject.

Assume we want to reverse-engineer the PDF parser MuPDF. PolyTracker is used to instrument the parser to output the associate each byte of PDF file with each function in that parser. With the file format parsed by PolyFile as ground truth, we can easily infer the sematic purpose of each function in MuPDF by merging the two mappings (function <-> input byte <-> semantic). Based on this design consideration, it's straightforward to answer why Polytracker only use function as taint sink granularity to perform analysis.

4. Static Analysis Tool but Emulate Execution: Bincat

The following keygen-me-style program takes a few arguments as command-line parameters, then generates a hash depending on these parameters, and compares it to an expected license value.

Open executable get_key_x86 in IDA Pro. We can see address 0x93B is the start point of main function.

Focus on the BinCAT Configuration pane, set the start address as 0x93B as the start address. Edit the configuration to initialize the stack, setup the content of command-line parameters, and corresponding memory.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
mem[0xb8000000*4099] = |00|?0xFF # initialize the stack to unknown value
mem[0xb8001004] = 5 # argc = 5
mem[0xb8001008] = 0x200000 # argv = 0x200000

mem[0x200000] = 0x300100 # argv[0]
mem[0x200004] = 0x300140 # argv[1]
mem[0x200008] = 0x300180 # argv[2]
mem[0x20000C] = 0x3001C0 # argv[3]
mem[0x200010] = 0x300200 # argv[4]

mem[0x300100] = |6c6f6c3300| # argv[0] = "lol3"
mem[0x300140] = |636f6d70616e7900| # argv[1] = "company"
mem[0x300180] = |64657000| # argv[2] = "dep"
mem[0x3001C0] = |6c6f6c3100| # argv[3] = "lol1"
mem[0x300200] = |6c6f6c2100| # argv[4] = "lol!"

image-20200921214609971

Click start to perform analysis. When analysis finishes, open the BinCAT Memory view, we can see the five parameters are located at the corresponding memory addresses.

image-20200921220518093

Switch to the IDA View-A view, go to an arbitrary address such as 0x98C, we can check the value of each register. From the BinCAT IL view, the intermediate language of instruction mov [ebp+var_22C], edi is (32)[ebp+0xfffffdd4]<-edi.

image-20200921222843992

Override the taint of every byte at addresses 0x300140-0x300147 which contains the null-terminated company string.

image-20200921220721353

Set it to !|FF| (this fully taints the byte, without changing its value).

image-20200921220823674

We can check that the user-defined taint overrides at BinCAT Overrides view. Re-run the analysis.

image-20200921221039291

Advance to the instruction at address 0x93F, and observe that this memory range is indeed tainted: both the ASCII and hexadecimal representations of this string are displayed as green text.

image-20200921221127514

In the IDA View-A view, notice that some instructions are displayed against a non-gray background, since they manipulate tainted data. For example, the instruction at address 0x9E6 (push eax) is highlighted because eaxis tainted.

image-20200921221444701

In the IDA View-A view, we can see some instructions are displayed with a white background, since these instructions are not executed with the given parameters.

image-20200921224200323

5. Static Analysis Tool: Psalm

Psalm is a static analysis tool for PHP applications. Assume we plan to analyze the following target.php program using Psalm. $_POST is user-controlled input, which is defined as a taint source by default. echo is the place that we don't want untrusted user data to end up, which is also defined as taint sink by default. In this program, the data explicitly flows from $_POST to echo $name and implicitly from $_POST to echo $a.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
<?php
$a = 0;
$name = "";
$input = "";

if ($_SERVER["REQUEST_METHOD"] == "POST") {
$input = $_POST["name"];
$name = test_input($input);

if (strlen($input) >= 2) {
$a = 1;
} else {
$a = 2;
}
echo $a;
echo $name;
}

function test_input($data) {
$data = trim($data);
$data = stripslashes($data);
return $data;
}
?>

In the directory of the above program, run the following command to add a psalm.xml config file.

1
./vendor/bin/psalm --init

Psalm will scan the project, set the project files to be analyzed and figure out an appropriate error level for the codebase.

1
2
3
4
5
6
7
8
9
10
11
12
13
<?xml version="1.0"?>
<psalm
errorLevel="3"
...
>
<projectFiles>
<file name="target.php" />
<ignoreFiles>
<directory name="vendor" />
</ignoreFiles>
</projectFiles>
</psalm>

To perform taint analysis, run the ./vendor/bin/psalm with flag --taint-analysis. However, the analysis shows that no errors found in the target.php program.

image-20200922144006760

There are two reasons. First, Psalm not support implicit data flow tracking. Second, operation stripslashes remove taints from data because this function removes backslashes in the string, which eliminates the risk of suffering cross-site-scripting attacks. In that case, there is no need to track how taint flows anymore. We can feel that Psalm is developed mainly for security analysis of web applications. However, this is not consistent with our idea: If the output changes with the different inputs, the output should be tainted once the input is tainted.

If we put echo $data ahead of stripslashes function, the error can be detected.

image-20200922145913877

Except for the three default taint sources $_POST/$_GET/$_COOKIE, we can define our custom taint sources using annotations @psalm-taint-source <taint-type>, which are always the return value of user-controlled functions. The following example defines the return value of getUser function as a taint source whose type is input. When the taint flows into the taint sink echo, an error will be reported.

image-20200922151007945

Also, except for a number of default builtin taint sinks such as echo, we can define our custom taint sinks using annotation @psalm-taint-sink <taint-type> <param-name>, which are always the parameters of functions that should be carefully treated. The following example defines the parameter $str of myExec function as taint sink whose type is shell.

image-20200922152410422

6. Static Analysis Tool: Pysa

Pysa is a static analysis tool for python. We first define different types of taint sources and sink in taint.config file. In the following config, we define source type UserControlled, sink type RemoteCodeExecution, as well as a rule showing that we concern about the data flows from source UserControlled to sink RemoteCodeExecution.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
{
"sources": [
{
"name": "UserControlled",
"comment": "use to annotate user input"
}
],

"sinks": [
{
"name": "RemoteCodeExecution",
"comment": "use to annotate execution of code"
}
],
"features": [],

"rules": [
{
"name": "Possible shell injection",
"code": 5001,
"sources": [ "UserControlled" ],
"sinks": [ "RemoteCodeExecution" ],
"message_format": "Data from [{$sources}] source(s) may reach [{$sinks}] sink(s)"
}
]
}

Then we need to link the source and sink type with target program. The following .pysa file using python annotation syntax indicates that the return value of input is taint source with type UserControlled, and the parameter of print is taint sink with type RemoteCodeExecution. Once there is a data flow path from input to print, Pysa will report the potential issue.

1
2
3
4
5
6
7
# taint/general.pysa

# model for raw_input
def input(__prompt = ...) -> TaintSource[UserControlled]: ...

# model for print
def print(values: TaintSink[ReturnSign], sep = ..., end = ..., file = ...) : ...

Therefore, for the target program, the input function is a taint source since it gets input directly from the user. The print function is a taint sink, since we do not want user-controlled values to flow into it.

1
2
3
def func():
a = input()
print(a)

Execute command pyre analyze to perform taint analysis, here are the results. The information shows that the taint source a flows into sink at line 3, in which a is used as the parameter of print function. The message also shows that the detected issue points to function source.func. Note that Pysa only analyzes data flow in functions, which means if the following statement 2 and 3 are in main function or even not any function, then Pysa will say that there is no issue.

image-20200922172253744

Pysa cannot track implicit information flow, currently, only conditional tests are supported. In the following case, the taint source a is used in conditional test expression in line 4. Pysa can report this problem, but cannot detect that the taint has implicitly flowed into sink print at line 5 and line 7.

image-20200922223740657

7. Conclusion

We can feel that even though there are a bunch of taint analysis tools, they are developed for different purposes, which leads to different user interfaces, different taint granularity, different features. We may need to be careful to infer the taint "ground truth" based on the output of the program. Furthermore, these taint tools prefer to under-taint and even make efforts to reduce false positive. In details, the implicit dependence such as if(x){ y = 2; } are not supported by majority tools. Besides, for dynamic tools and those static tools that simulate execution, they only focus on one path that depends on the user input. Mutations that don't affect the information flow of that path may not be useful to test the analysis tool. Therefore, test case construction is another aspect that should be treated seriously.