Compiling

Here is how to compile a C program to AT&T syntax assembly:

$ cat test.c
int
main(int argc, char** argv) {
  int x = 2;
  int y = 3;
  return(x + y);
}

$ gcc -S test.c

A file named test.s is created.

Disassembling

To disassemble an executable on Linux:

$ objdump -d /bin/ls

The -D flag will disassemble all sections, not just those expected to contain instructions.

This does not create something that can be assembled with gcc, though...

To disassemble an executable on Mac OS X:

$ otool -tV /bin/ls

Assembly Language

AT&T syntax assembly instructions usually consist of (1) a mnemonic for the op code, (2) the source, and (3) the destination. The destination is modified, but the source is not:

mov $5, %eax

Operands which start with a dollar sign $ refer to immediate values. These are read-only values; in the case of a mov instruction they cannot be used as the destination. Immediate values are integers. Decimal, octal, hex, or binary notation can be used: $42, $052, $0x2a, $0b101010. They can be negative: $-42.

Operands which start with a percent sign % refer to a register. If the operand is a source, the value is read from the register. If the operand is a destination, the value produced by the instruction is stored in the register.

The size of the operands is determined by the instruction suffix. If there is no instruction suffix, the size of the destination register is used. The suffixes are:

  • b: byte (8 bits)
  • w: word (16 bits)
  • q: quad (64 bits)
  • s: short/single (16 bit integer or 32 bit float)
  • l: long (32 bit integer or 64 bit float)
  • t: ten bytes (80 bit float)

Locations in memory are referred to using effective addresses, which have an offset(base) syntax. This instruction copies 8 bytes from the %rsi register to a location 16 bytes before the address in %rbp:

movq	 %rsi, -16(%rbp)
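As a rough illustration of the offset(base) form, the C sketch below models memory as a byte array and a register as an index into it. The names store_quad and load_quad and the 64-byte array are made up for this example; real hardware works on virtual addresses, not array indices.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: memory is a byte array and %rbp holds an index into it. */
enum { MEM_SIZE = 64 };

/* Mimics `movq %rsi, offset(%rbp)`: copy the 8 bytes of rsi to mem[rbp + offset]. */
void store_quad(uint8_t *mem, int64_t rbp, int64_t offset, uint64_t rsi) {
    int64_t addr = rbp + offset;          /* effective address = base + offset */
    assert(addr >= 0 && addr + 8 <= MEM_SIZE);
    memcpy(mem + addr, &rsi, sizeof rsi); /* the q suffix means 8 bytes */
}

/* Reads the 8 bytes back, like `movq offset(%rbp), %rax`. */
uint64_t load_quad(const uint8_t *mem, int64_t rbp, int64_t offset) {
    int64_t addr = rbp + offset;
    uint64_t value;
    assert(addr >= 0 && addr + 8 <= MEM_SIZE);
    memcpy(&value, mem + addr, sizeof value);
    return value;
}
```

With rbp set to 32 and an offset of -16, the store touches bytes 16 through 23 of the array and nothing else.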

The op codes and registers that are available depend on the architecture of the processor. We assume the x64 architecture used on most modern desktops and laptops.

The Intel architectures are a series of mostly backward-compatible architectures. Each new architecture expanded on the previous one by adding new instructions and registers. Here are some of the instructions supported by the 80486 architecture.

The MOV instruction can copy data from a register, memory, or an immediate value to a register or memory. Thanks to the L1 cache, this can sometimes be done in a single clock cycle (that is about half a nanosecond) even when the source or the destination is memory. If the source is not in the L1 cache, then the CPU stalls while the data is fetched. If the data can be found in the L2 or L3 cache, the access time is about 7ns. If the data must be fetched from memory, the access time is about 100ns. Making a program or a section of code small enough to fit in the L2 cache (perhaps 256k to 8M) or better yet the L1 cache (maybe 64k) is one way to keep memory access fast.

It appears that the assembler will fail with an error when MOV operands are not of the correct size.

Lines which end with a colon are labels:

        jmp     foo
        movl    $4, -24(%rbp)
foo:
        movl    -20(%rbp), %edi

One use of a label is as the argument of JMP, which unconditionally transfers execution to the instruction after the label. Presumably the linker replaces references to labels with the actual memory addresses. However, if you assemble the code and then run nm on it, the label names might still be there; the strip command can remove them. The function where a C program starts is labeled _main on Mac OS X and main on Linux. Other uses of labels are discussed below.

The basic registers are:

name               32-bit  64-bit
accumulator        EAX     RAX
base               EBX     RBX
counter            ECX     RCX
data               EDX     RDX
base pointer       EBP     RBP
stack pointer      ESP     RSP
source index       ESI     RSI
destination index  EDI     RDI

The second column registers are the lower 4 bytes of the third column registers.

In particular, the EAX register is the lower 4 bytes of the RAX register. The AX register is the lower 2 bytes of the EAX register, and AL and AH are the lower and upper bytes of AX.

Similarly for BX, BL, BH, CX, CL, CH, DX, DL, and DH.
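The overlap between the registers can be sketched in C with masks and shifts. This is only a model of the naming scheme; the struct reg_views and the function split_rax are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the narrower views of one 64-bit register. */
struct reg_views {
    uint32_t eax; /* lower 4 bytes of RAX */
    uint16_t ax;  /* lower 2 bytes of EAX */
    uint8_t  al;  /* lower byte of AX */
    uint8_t  ah;  /* upper byte of AX */
};

/* Computes the sub-register values that a given RAX value implies. */
void split_rax(uint64_t rax, struct reg_views *out) {
    assert(out != NULL);
    out->eax = (uint32_t)(rax & 0xFFFFFFFFu);
    out->ax  = (uint16_t)(rax & 0xFFFFu);
    out->al  = (uint8_t)(rax & 0xFFu);
    out->ah  = (uint8_t)((rax >> 8) & 0xFFu);
}
```

For a RAX value of 0x1122334455667788, this gives EAX = 0x55667788, AX = 0x7788, AL = 0x88, and AH = 0x77.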

On the x64 architecture, there are 8 additional 64-bit registers: R8, R9, ..., R15.

The EAX, EBX, ECX, and EDX registers are mostly general purpose, though they are special to some instructions.

JCXZ, JECXZ, and JRCXZ are conditional jumps. The jump is performed if the CX, ECX, or RCX register, respectively, is zero.

The CPU has bits called flags which it can set to communicate information with a running program.

bit (hex)  name                  code
0          carry                 CF
1          (reserved, always 1)
2          parity                PF
3          (reserved, always 0)
4          auxiliary             AF
5          (reserved, always 0)
6          zero                  ZF
7          sign                  SF
8          trap                  TF
9          interrupt             IF
A          direction             DF
B          overflow              OF
C          i/o privilege level   IOPL
D          i/o privilege level   IOPL
E          nested task           NT
F          (reserved, always 0)
10         resume                RF
11         virtual mode          VM
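A saved copy of the flags register can be decoded with shifts and masks. The following C sketch assumes the flags value has already been obtained somehow (for example from a debugger); the enum names and the function flag_is_set are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions taken from the table above. */
enum {
    FLAG_CF = 0,  /* carry */
    FLAG_PF = 2,  /* parity */
    FLAG_ZF = 6,  /* zero */
    FLAG_SF = 7,  /* sign */
    FLAG_OF = 11  /* overflow (bit 0xB) */
};

/* Returns 1 if the flag at bit position `bit` is set in `flags`, else 0. */
int flag_is_set(uint32_t flags, int bit) {
    assert(bit >= 0 && bit < 32);
    return (int)((flags >> bit) & 1u);
}
```

The value 0x246 has bits 1, 2, 6, and 9 set, so flag_is_set reports PF and ZF as set and CF, SF, and OF as clear.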

Jxx

strings MOVSB and SI, DI

PUSH, POP and SP, BP

Statements that start with a period are assembler directives. They do not correspond to instructions in the executable. Here are the GNU assembler directives.

The AT&T syntax supports # and /* */ style comments. Blank lines are ignored. Although the compiler and the disassembler use tabs when generating assembly, tabs can be replaced by spaces and leading whitespace on a line can be removed.

Function Calls

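This function will be used to illustrate how the C compiler implements function calls. The description follows the 32-bit convention, as the %ebp and %eip registers in the diagram suggest; 64-bit code passes the first several arguments in registers instead.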

void func(char *arg1, int arg2, int arg3) {
    char loc1[4];
    int loc2;
    ...
}

If caller data is already on the stack, then everything to the left of it is pushed onto the stack when func is called.

stack, larger addresses are to the right
... loc2 loc1 %ebp %eip arg1 arg2 arg3 caller data 0xffffffff

First the arguments passed to the function are pushed onto the stack. They are pushed in reverse order, so that the later arguments are at higher addresses.

Next the return address is pushed onto the stack. This is the value that the instruction pointer register %eip must take when func returns, that is, the address of the instruction after the call. It is restored to the register when func returns.

Next the base pointer of the calling function is pushed onto the stack so it can be restored when func returns. The base pointer is used to get the addresses of the local variables and the arguments.

Finally space for the local variables of the called function is reserved on the stack.
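The steps above can be traced with a toy model of a 32-bit stack. This is a sketch, not what the compiler emits: the struct machine, the slot-indexed stack, and the function names are all hypothetical, and each slot stands for 4 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 32-bit stack: a small array of 4-byte slots that
   grows toward lower indices, with esp and ebp held as slot indices. */
enum { STACK_SLOTS = 16 };

struct machine {
    uint32_t stack[STACK_SLOTS];
    int esp; /* index of the most recently pushed slot */
    int ebp; /* frame base of the current function */
};

static void push(struct machine *m, uint32_t value) {
    assert(m->esp > 0);
    m->esp -= 1;               /* the stack grows toward lower addresses */
    m->stack[m->esp] = value;
}

/* Performs the steps described above for func(arg1, arg2, arg3). */
void enter_func(struct machine *m, uint32_t arg1, uint32_t arg2,
                uint32_t arg3, uint32_t return_address) {
    push(m, arg3);             /* arguments go on in reverse order */
    push(m, arg2);
    push(m, arg1);
    push(m, return_address);   /* what CALL pushes */
    push(m, (uint32_t)m->ebp); /* save the caller's base pointer */
    m->ebp = m->esp;           /* new frame base */
    assert(m->esp >= 2);
    m->esp -= 2;               /* reserve slots for loc1 and loc2 */
}

/* Argument n (1-based) sits n + 1 slots above the frame base. */
uint32_t read_arg(const struct machine *m, int n) {
    return m->stack[m->ebp + 1 + n];
}
```

In byte terms the model puts arg1 at 8(%ebp), arg2 at 12(%ebp), and arg3 at 16(%ebp), with the return address at 4(%ebp) and the saved base pointer at 0(%ebp).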

Syscalls

The traditional way to make an operating system call was to put an integer in %eax and then use the INT $0x80 instruction. INT is the interrupt instruction. It takes as an argument a byte, so there can be at most 256 types of interrupt. When the syscall returns, it puts the return value in %eax.

One must know the number of each syscall. Also, registers can be used to pass values to some of the syscalls. A chart of the syscalls as of the Linux 2.2 kernel is here:

Linux System Call Table
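From C, a syscall can be made by number through the syscall() wrapper in the C library. The sketch below assumes Linux with glibc; the wrapper uses whatever entry mechanism the library was built for, which on x86-64 is not INT $0x80. The function name getpid_by_number is made up for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h> /* SYS_getpid and the other syscall numbers */
#include <unistd.h>      /* syscall(), getpid() */

/* Asks the kernel for the process id by syscall number. */
long getpid_by_number(void) {
    long result = syscall(SYS_getpid); /* the kernel's return value */
    assert(result > 0);                /* getpid cannot fail */
    return result;
}
```

The result should match what the ordinary getpid() library function returns.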

Intel Syntax

On Linux, use the following command to generate Intel syntax assembly:

$ gcc -S -masm=intel test.c

On Mac OS X, use:

$ clang -S -mllvm --x86-asm-syntax=intel test.c

The main differences of Intel syntax are:

  • no $ and % prefixes for immediate values and registers
  • immediate values use h suffix for hex, o or q for octal, and b suffix for binary. d and t suffixes are optional for decimal.
  • destination is first operand, source is second operand
  • [base - offset] syntax refers to memory location offset bytes before base.
  • semicolons ; start comments which go to the end of the line.

Timing

Executable Formats

One can use gcc to convert source code in assembly to an executable:

$ gcc -o test test.s

The original Unix executable format was called a.out. System V introduced the COFF format, which is also used by Windows, and later the ELF format, which is used by Linux and most other modern Unix variants, such as BSD and Solaris.

The PEF format was used by the classic Mac OS in its later years. The format was also supported by Mac OS X when running on a PowerPC. The standard format on Mac OS X is Mach-O, which it inherited from NeXTSTEP.

ELF

The ELF format consists broadly of four parts: (1) the ELF (or file) header, (2) the program header table, (3) sections describing the text and data of the program, and (4) the section header table.

The ELF header is 64 bytes in the 64-bit format (52 bytes in the 32-bit format). To display it:

$ readelf -h /bin/ls

The lengths and offsets of the fields depend upon whether the format is 32-bit or 64-bit:

ELF header (64 bit)
offset  length in bytes  field        description
0x00    4                e_ident      magic header: 0x7F followed by ASCII "ELF"
0x04    1                e_ident      1 or 2 to indicate 32-bit or 64-bit
0x05    1                e_ident      1 or 2 to indicate little-endian or big-endian; affects values starting at 0x10; little-endian means least significant byte in smallest address
0x06    1                e_ident      ELF version number (always 1?)
0x07    1                e_ident      OS ABI (often set to 0, but value for Linux is 0x03)
0x08    1                e_ident      interpretation depends on previous field
0x09    7                e_ident      padding; unused
0x10    2                e_type       1, 2, 3, or 4 for "relocatable", "executable", "shared", or "core"
0x12    2                e_machine    instruction set architecture: 0x03 for x86, 0x3E for x86-64
0x14    4                e_version    always 1?
0x18    8                e_entry      first instruction in process to execute
0x20    8                e_phoff      start of program header table
0x28    8                e_shoff      start of section header table
0x30    4                e_flags
0x34    2                e_ehsize     header size (64 bytes)
0x36    2                e_phentsize  size of program header table entries
0x38    2                e_phnum      number of program header table entries
0x3A    2                e_shentsize  size of section header table entries
0x3C    2                e_shnum      number of section header table entries
0x3E    2                e_shstrndx   index of section header table entry containing section names
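A few of these fields can be pulled out of a buffer by hand. The C sketch below reads only the magic number, the class and byte-order bytes, e_type, and e_machine; the struct elf_summary and the function parse_elf_header are hypothetical names, and a real program would more likely use the definitions in the system's elf.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of a few fields from the table above. */
struct elf_summary {
    int is_64bit;         /* byte 0x04 == 2 */
    int is_little_endian; /* byte 0x05 == 1 */
    uint16_t e_type;      /* offset 0x10 */
    uint16_t e_machine;   /* offset 0x12 */
};

/* Reads a 2-byte field honoring the byte order announced at offset 0x05. */
static uint16_t read_u16(const unsigned char *p, int little_endian) {
    return little_endian ? (uint16_t)(p[0] | (p[1] << 8))
                         : (uint16_t)(p[1] | (p[0] << 8));
}

/* Returns 0 on success, -1 if the buffer is too short or the magic is wrong. */
int parse_elf_header(const unsigned char *buf, size_t len,
                     struct elf_summary *out) {
    assert(buf != NULL && out != NULL);
    if (len < 0x14) return -1;
    if (buf[0] != 0x7F || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F')
        return -1;
    out->is_64bit = (buf[0x04] == 2);
    out->is_little_endian = (buf[0x05] == 1);
    out->e_type = read_u16(buf + 0x10, out->is_little_endian);
    out->e_machine = read_u16(buf + 0x12, out->is_little_endian);
    return 0;
}
```

To try it on a real file, read the first 64 bytes of an executable into a buffer and pass that buffer in.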

To display the program header table:

$ readelf -l /bin/ls

To display the section header table:

$ readelf -S /bin/ls

The sections are numbered starting from zero. To display section 29 in hex and to display the strings in section 29:

$ readelf -x 29 /bin/ls
$ readelf -p 29 /bin/ls

Mach-O

OS X ABI Mach-O File Format

A Mach-O file consists of 3 parts: (1) the Mach-O header, (2) the load commands, and (3) the segments.

Show the Mach-O header:

$ otool -h /bin/ls

Show the load commands:

$ otool -l /bin/ls

Show the text and data sections. The -V flag disassembles the text section:

$ otool -tV /bin/ls

$ otool -d /bin/ls

Linkers and Symbols

A linker takes input files and produces an executable output file. The inputs and output are in executable formats, often the same type of format, though some linkers may support different formats. When the formats are ELF, the e_type is 1 for "relocatable", i.e. an object file, and 2 for "executable", i.e. the output file that can be run.

link editor command language; compiler drivers

Symbols can be displayed using the nm command or (on Linux) objdump -t:

$ nm /bin/ls
$ objdump -t /bin/ls

nm uses single letters to indicate the type of symbol. These can vary between operating systems, but the following are common:

B/b: BSS section (i.e. uninitialized, statically allocated data)
D/d: data section
T/t: text section
U: undefined

Uppercase letters are global symbols; lowercase letters are local to an object.
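A small C file that should produce one symbol of each kind is sketched below. The letters in the comments are what nm typically reports for an unoptimized object file built with gcc or clang; they are expectations, not guarantees, and the identifier names are made up for this example.

```c
#include <assert.h>

int zeroed_counter;           /* uninitialized: typically B (some compilers emit C, "common") */
int start_value = 7;          /* initialized global: typically D */
static int private_value = 3; /* initialized, local to this file: typically d */

/* Global function: typically T. */
int add_private(int x) {
    assert(x >= 0);           /* assert may pull in an undefined (U) library symbol */
    return x + private_value + start_value + zeroed_counter;
}
```

Compiling this with the -c flag and running nm on the resulting object file shows the letters next to each name.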

Symbols are used by the linker and the debugger. If the final executable does not need to be debugged, the symbols can be removed. On Linux, the first of the following commands removes all symbols and the second removes just the debugger symbols:

$ strip -s foo
$ strip -d foo

how symbols are represented in memory

If the code was compiled from C++, the names will be mangled. Mangling is used for (1) function overloading, (2) namespaces, and (3) templates.

The scheme used for mangling identifiers varies from compiler to compiler; we describe the scheme used by g++. First, how to see the original identifier:

$ c++filt __Z7reversePim
reverse(int*, unsigned long)

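The _Z prefix marks a mangled name. Identifiers that begin with an underscore followed by an uppercase letter are reserved in C, so the prefix avoids collisions with C code identifiers. The additional leading underscore is added by Mac OS X to every C and C++ symbol; on Linux the same symbol is _Z7reversePim.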

What follows are digits giving the length of the identifier for the function name.

After the function name are symbols for the types of the arguments: Pi for a pointer to an integer, and m for an unsigned long.

Here are the letters used for primitive types:

c: char
d: double
f: float
h: unsigned char
i: int
j: unsigned int
l: long
m: unsigned long
x: long long
y: unsigned long long

Here is an example of an identifier in a namespace. The fully qualified identifier starts with an N and ends with an E. Each label in the fully qualified identifier starts with digits indicating the length of the label:

$ c++filt __ZN4util7reverseEPim
util::reverse(int*, unsigned long)

A template symbol:

$ c++filt __ZN3FooIiEC1Ei
Foo<int>::Foo(int)
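The simplest case, a function outside any namespace, can be reproduced with a few lines of C. This sketch covers only that case and takes the parameter type codes as a ready-made string; the function name mangle_simple is hypothetical. It emits the _Z form used on Linux, without the extra leading underscore that Mac OS X adds.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds "_Z" + <name length> + <name> + <parameter codes> into out.
   Handles only plain, non-namespaced functions. Returns 0 on success,
   -1 if the buffer is too small. */
int mangle_simple(const char *name, const char *param_codes,
                  char *out, size_t out_size) {
    assert(name != NULL && param_codes != NULL && out != NULL);
    int written = snprintf(out, out_size, "_Z%zu%s%s",
                           strlen(name), name, param_codes);
    return (written >= 0 && (size_t)written < out_size) ? 0 : -1;
}
```

Calling mangle_simple("reverse", "Pim", buf, sizeof buf) fills buf with _Z7reversePim, which c++filt should turn back into reverse(int*, unsigned long).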

C

calling the standard library CALL, RET
calling C standard library from assembly
calling assembly in a different translation unit from C
embedding assembly in C code

Hardware Integers

x64 architecture processors provide integer operations on 8-bit, 16-bit, 32-bit, and 64-bit integers.

The integers can be signed or unsigned. Because the signed integers are represented using two's complement notation, the same instructions can often be used on signed or unsigned operands.

In two's complement notation, the most significant bit determines the sign. If it is zero, the signed integer is zero or positive, otherwise the integer is negative.

For zero and positive signed integers, the interpretation is the same as for unsigned integers. 0b00001111 represents 8 + 4 + 2 + 1 = 15.

For negative integers, one takes the value the integer would have if the sign bit were not set and subtracts 2^(N-1) from it, where N is the number of bits in the type. 0b10001111 represents (8 + 4 + 2 + 1) - 2^7 = 15 - 128 = -113.
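The interpretation of a two's complement bit pattern can be checked with a short C sketch. The function name twos_complement_value is made up for this example, and it handles widths of 1 to 63 bits so that the arithmetic stays within a 64-bit signed integer.

```c
#include <assert.h>
#include <stdint.h>

/* Interprets the low n bits of `bits` as an n-bit two's complement integer. */
int64_t twos_complement_value(uint64_t bits, int n) {
    assert(n >= 1 && n <= 63);
    uint64_t sign_bit = (uint64_t)1 << (n - 1);
    uint64_t low = bits & (sign_bit - 1);        /* value with the sign bit cleared */
    if (bits & sign_bit)
        return (int64_t)low - (int64_t)sign_bit; /* subtract 2^(n-1) */
    return (int64_t)low;
}
```

For the 8-bit pattern 0b10001111 (0x8F) the function returns 15 - 128 = -113, and for 0b00001111 it returns 15.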

The instructions INC and DEC can be used to increment or decrement a value in a register by one. They work for both signed and unsigned integers. Incrementing the largest possible unsigned integer produces zero.

Incrementing the largest possible positive signed integer produces the largest (in absolute magnitude) possible negative integer. This also sets the overflow flag. The JO and JNO instructions perform a conditional jump if the overflow flag is set or not set, respectively. The SETO instruction sets the value of its operand to 1 if the overflow flag is set, otherwise 0. The SETNO instruction sets the value of its operand to 0 if the overflow flag is set, otherwise 1.
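The wraparound behavior of INC can be imitated in C for 8-bit values. This is a model, not a way to read the real overflow flag: the functions inc_u8 and inc_i8 are hypothetical, and inc_i8 simply computes what the flag would be.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unsigned 8-bit increment: wraps from 255 to 0. */
uint8_t inc_u8(uint8_t x) {
    return (uint8_t)(x + 1u);
}

/* Signed 8-bit increment. Stores the wrapped result in *result and returns
   what the overflow flag would be: 1 only when x was the largest positive value. */
int inc_i8(int8_t x, int8_t *result) {
    assert(result != NULL);
    uint8_t wrapped = (uint8_t)((uint8_t)x + 1u);
    /* Reinterpret the bit pattern as signed without relying on
       implementation-defined narrowing conversions. */
    *result = (wrapped >= 0x80) ? (int8_t)(wrapped - 0x100) : (int8_t)wrapped;
    return x == INT8_MAX;
}
```

Incrementing 127 with inc_i8 stores -128 and reports overflow, while incrementing -1 stores 0 and reports none.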

NEG

ADD SUB

MUL DIV

The one-operand MUL instruction produces a result twice as wide as the multiplicands, so the result is split across two registers of the same size as the multiplicands. The result is written into AH:AL (that is, AX), DX:AX, EDX:EAX, or RDX:RAX when the multiplication is 8 bit, 16 bit, 32 bit, or 64 bit, respectively. One of the multiplicands must be in AL, AX, EAX, or RAX depending upon the size of the multiplication.
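The way a 32-bit unsigned multiplication spreads its result over two registers can be modeled in C with a 64-bit intermediate. The function name mul32 and its output parameters are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models a 32-bit unsigned MUL: eax * operand, with the 64-bit product
   split into a high half (EDX) and a low half (EAX). */
void mul32(uint32_t eax, uint32_t operand, uint32_t *edx_out, uint32_t *eax_out) {
    assert(edx_out != NULL && eax_out != NULL);
    uint64_t product = (uint64_t)eax * operand;
    *edx_out = (uint32_t)(product >> 32);
    *eax_out = (uint32_t)(product & 0xFFFFFFFFu);
}
```

Multiplying 0xFFFFFFFF by itself gives 0xFFFFFFFE00000001, so the high half is 0xFFFFFFFE and the low half is 0x00000001.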

overflow flag, carry flag (which also serves as the borrow flag for subtraction): OF, CF

bit operations NOT AND OR XOR ROR ROL SHL SHR

Hardware Floats

The standard for floating point arithmetic is IEEE 754. It was revised as IEEE 754-2008 and again as IEEE 754-2019.

The first floating point arithmetic standard was IEEE 754-1985. The Intel 8087 coprocessor, released in 1980, implemented a draft of the standard well before it was adopted by the IEEE. The original IBM PC included a slot on the motherboard so that an 8087 could be installed as an upgrade. With the 80486 chip Intel integrated the floating point instructions into the main CPU.

types of binary floats, half, single, double, quadruple (16, 32, 64, 128 bits)

10 byte (80 bit) floats

Floats are an approximation of real numbers. However, they can only represent a finite subset of the dyadic numbers exactly.

special values: positive and negative zero, subnormal numbers, positive and negative infinity, and two varieties of NaN.

The layout of 32-bit and 64-bit binary floats is:

  • sign: 1 bit, 1 bit
  • exponent: 8 bits, 11 bits
  • significand: 23 bits, 52 bits

The exponent bits are all zeros for zeros and subnormal numbers. The exponent bits are all ones for infinity and NaN.
Otherwise they encode the base two exponent of a non-zero, finite number, stored with a bias of 127 (32-bit) or 1023 (64-bit). The exponent itself ranges from -126 to 127 (32-bit) or -1022 to 1023 (64-bit).
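The three fields of a 32-bit float can be extracted in C by copying its bit pattern into an integer. The sketch assumes that float is a 32-bit IEEE 754 type, which holds on x64; the struct float_fields and the function decode_float are hypothetical names. The exponent field is returned as stored, which is the true exponent plus a bias of 127.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct float_fields {
    unsigned sign;        /* 1 bit */
    unsigned exponent;    /* 8 bits, as stored (biased by 127) */
    uint32_t significand; /* 23 bits, without the implicit leading 1 */
};

struct float_fields decode_float(float f) {
    struct float_fields out;
    uint32_t bits;
    assert(sizeof f == sizeof bits); /* assumes a 32-bit IEEE 754 float */
    memcpy(&bits, &f, sizeof bits);  /* copy the bit pattern without conversion */
    out.sign = bits >> 31;
    out.exponent = (bits >> 23) & 0xFFu;
    out.significand = bits & 0x7FFFFFu;
    return out;
}
```

For 1.0f the stored exponent is 127 and the significand bits are all zero; for -2.5f, which is -1.25 times 2 to the power 1, the sign is 1, the stored exponent is 128, and the significand is 0x200000.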

subnormal numbers

rounding rules

exceptions

using floats to represent integers

GPU

Installed CUDA 6.5 on my Mac.

$ export PATH=/Developer/NVIDIA/CUDA-6.5/bin:$PATH
$ export DYLD_LIBRARY_PATH=/Developer/NVIDIA/CUDA-6.5/lib:$DYLD_LIBRARY_PATH
$ cd /Developer/NVIDIA/CUDA-6.5/samples
$ make -C 0_Simple/vectorAdd
$ ./bin/x86_64/darwin/release/vectorAdd