



Material Type: Notes; Class: Prin Cmpt Oper Sys; Subject: Computer Science (CSC) ; University: University of Miami; Term: Unknown 1989;
ak@suse.de
x86-64 is a 64-bit extension for the IA32 architecture, which is supported by the next generation of AMD CPUs. New features include 64-bit pointers, a 48-bit address space, 16 general purpose 64-bit integer registers, 16 SSE (Streaming SIMD Extensions) registers, and a compatibility mode to support old binaries.
The Linux kernel port to x86-64 is based on the existing IA32 port with some extensions, including a new syscall mechanism, 64-bit support and use of interrupt stacks. It also adds a translation layer to allow execution of the system calls of old IA32 binaries.
This paper gives a short overview of the x86-64 architecture and the new x86-64 ABI and then discusses internals of the kernel port.
x86-64 is a new architecture developed by AMD. It is an extension to the existing IA32 architecture. The main new features over IA32 are 64-bit pointers, a 48-bit address space, 16 64-bit integer registers, and 16 SSE2 registers. This paper describes the Linux port to this new architecture. The new 64-bit kernel is based on the existing i386 port. It is ambitious in that it tries to exploit new features, not just do a minimum port, and redesigns parts of the i386 port as necessary. The x86-64 kernel is developed by AMD and SuSE as a free software project.
I will start with a short overview of the x86-64 extensions. This section assumes that the reader has basic knowledge about IA32, as only changes are explained. For an introduction to IA32, see [Intel].
x86-64 CPUs support new modes: legacy mode and long mode. When they are in legacy mode, they are fully IA32 compatible and should run all existing IA32 operating systems and application software unchanged. Optionally, the operating system can switch on long mode, which enables 64-bit operation. In the following, only long mode is discussed. The x86-64 Linux port runs in long mode only.
Certain unprivileged programs can be run in compatibility mode in a special code segment, which allows existing IA32 programs to be executed unchanged. Other programs can run in long mode and exploit all new features. The kernel and all interrupts run in long mode.
A significant new feature is support for 64-bit addresses, so that more than 4GB of memory can be addressed directly. All registers and other structures dealing with addresses have been enlarged to 64-bit. Eight new integer registers (R8-R15) have been added, so that there is now a total of 16 general purpose 64-bit registers. Without prefixes, code usually defaults to 32-bit accesses to registers and memory, except for the stack, which is always 64-bit aligned, and jumps. 32-bit operations on 64-bit registers zero-extend the result. 64-bit immediates are only supported by the new movabs instruction.
A new addressing mode, RIP-relative, has been added which allows addressing of memory relative to the current program counter.
x86-64 supports the SSE2 SIMD extensions. Eight new SSE2 registers (XMM8-XMM15) have been added over the existing XMM0-XMM7. The x87 register stack is unchanged.
Some obsolete features of IA32 are gone in long mode. Some rarely used instructions have been removed to make space for the new 64-bit prefixes. Segmentation is mostly gone: segment bases and limits are ignored in long mode. fs and gs can still be used as a kind of address register, with some limitations and with kernel support. vm86 mode and 16-bit segments are also gone. Automatic task switching is not supported anymore.
Page size stays at 4KB. Page tables have been extended to four levels to cover the full 48-bit address space of the first implementations.
For more information see the x86-64 architecture manual [AMD2000].
3 ABI
As x86-64 has more registers than IA32, and does not support direct calling of IA32 code, a new modern ABI was designed for it. The basic type sizes are similar to other 64-bit Unix environments: long and pointers are 64-bit, int stays 32-bit. All data types are aligned to their natural size^1.
The ABI uses register arguments extensively. Up to six integer and nine 64-bit floating point arguments are passed in registers, in addition to arguments passed on the stack.
Structures are passed in registers where possible. Non-prototyped functions have a slight penalty, as the caller needs to initialize an argument count register to allow argument saving for variable argument support. Most registers are caller saved to save code space in callees.
Floating point is by default passed in SSE2 XMM registers now. This means doubles are always calculated in 64-bit, unlike IA32. The x87 stack with 80-bit precision is only used for long double. The frame pointer has been replaced by an unwind table. An area 128 bytes below the stack pointer is reserved for scratch space to save more space for leaf functions.
(^1) On IA32, 64-bit long long was not aligned to 64-bit.
Several code models have been defined: small, medium, large, and kernel. Small is the default; while it allows full 64-bit access to the heap and the stack, all code and preinitialized data in the executable is limited to 4GB, as only RIP-relative addressing is used. It is expected that most programs will run in small mode. Medium is the same as small, but allows a full 64-bit range of preinitialized data, at the cost of slower and larger code. Code is still limited to 4GB. Large allows unlimited^2 code and initialized data, but is even slower than medium. Kernel is a special variant of the small model. It uses negative addresses to run the kernel with 32-bit displacements at the upper end of the address space. It is used by the kernel only.
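To make the kernel model's use of negative addresses more concrete, the following sketch checks whether an address can be expressed as a sign-extended 32-bit value. Only addresses in the lowest 2GB or the highest 2GB of the 64-bit space qualify, and the highest 2GB corresponds to negative 32-bit values. The function names and the example addresses in the usage note are illustrative assumptions; the text does not state where the kernel is actually linked.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if `addr` lies in the top 2GB of the 64-bit address space,
 * i.e. it equals a negative 32-bit value after sign extension. */
static bool in_top_2gb(uint64_t addr)
{
    return addr >= 0xffffffff80000000ULL;
}

/* True if `addr` can be encoded as a sign-extended 32-bit value:
 * either a non-negative value below 2GB or a value in the top 2GB. */
static bool fits_sign_extended_32(uint64_t addr)
{
    return addr <= 0x7fffffffULL || in_top_2gb(addr);
}
```

For example, a hypothetical address such as 0xffffffff80100000 passes the check, while 0x0000000100000000 (just above 4GB) does not.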
So far, the ABI's goal of saving code size has been met: gcc using it generates code sizes comparable to 32-bit IA32^3.
For more information on the x86-64 ABI see [Hubicka2000].
4 Compiler
A basic port of the gcc 3 compiler and binutils to x86-64 has been done by Jan Hubicka. This includes implementation of SSE2 support for gcc and full support for the long mode extensions and the new 64-bit ABI. The compiler and tool chain are stable enough for kernel compiling and system porting.
5 Kernel
The x86-64 kernel is a new Linux port. It was originally based on the existing i386 architecture code, but is now independently maintained. The following discusses the most important changes over the 32-bit i386 kernel and some interesting implementation details.
(^2) Unlimited in the 64-bit, or rather 39-bit address space, of the first kernel.
(^3) Not counting the unwind table sizes.
tained by the kernel. To isolate this code from user space, vsyscalls have been added by Andrea Arcangeli. A special code area is mapped into every user process by the kernel. The functions in there can be directly called by the user via a special offset table at a magic address, avoiding the overhead of a system call.
Vsyscalls have some problems with signal and exception handling. The x86-64 ABI requires a DWARF unwind table to do a backtrace in case of a crash, and the kernel needs to provide an unwind table for the user mode vsyscall pages in case a signal or exception occurs while they run. This is still work in progress.
8 Processor Data Area
To solve the SYSCALL supervisor stack bootstrap problem described above, a data structure called the Per-processor Data Area (PDA) is used. A pointer to the PDA is stored on bootup in a hidden register of the CPU using the KERNEL_GS_BASE MSR. Each time the kernel is entered from user space via exceptions, system calls or interrupts, the SWAPGS instruction is executed. It swaps the userland value of the GS register with the PDA value from the hidden register. The original contents of the GS register are restored on exit from the kernel.
The PDA is currently used to store information for fast syscall entry (such as the kernel stack pointer of the current task), a pointer to the current task itself, and the old user stack pointer on a system call. It also contains the per-CPU stack.
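The exchange performed by SWAPGS can be modelled in plain C. The sketch below is a simulation under stated assumptions, not kernel code: the structure layouts, field names and helper functions are hypothetical, and the two base values are kept in an ordinary struct instead of CPU registers.

```c
#include <stdint.h>

/* Hypothetical PDA layout holding the items listed in the text. */
struct pda {
    uint64_t kernel_stack;  /* kernel stack pointer of the current task */
    void    *current_task;  /* pointer to the current task */
    uint64_t old_user_rsp;  /* user stack pointer saved on a system call */
};

/* Model of the two values that SWAPGS exchanges on one CPU. */
struct cpu_model {
    uint64_t gs_base;        /* value currently used for GS accesses */
    uint64_t kernel_gs_base; /* hidden value held in the MSR */
};

/* Exchange the active GS value with the hidden one. */
static void swapgs(struct cpu_model *cpu)
{
    uint64_t tmp = cpu->gs_base;
    cpu->gs_base = cpu->kernel_gs_base;
    cpu->kernel_gs_base = tmp;
}

/* On entry from user space: after the swap, GS points at the PDA. */
static struct pda *kernel_entry(struct cpu_model *cpu)
{
    swapgs(cpu);
    return (struct pda *)(uintptr_t)cpu->gs_base;
}

/* On exit to user space: the second swap restores the user value. */
static void kernel_exit(struct cpu_model *cpu)
{
    swapgs(cpu);
}
```

Because the operation is a pure exchange, applying it once on entry and once on exit leaves the userland value exactly as it was.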
It is hoped that future Linux versions will move more information into a central generic PDA structure that is used by the architecture-independent kernel. As of Linux 2.4, various subsystems keep their own private arrays padded to cache lines and indexed by CPU number. Accessing such arrays is costly, as the CPU number first has to be retrieved and the index computed, and each entry has to be padded to a full cache line to avoid false sharing. The PDA offers a faster alternative, at the disadvantage of being less modular because PDA data structures have to be maintained in a central include file.
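The difference between the two access styles can be sketched as follows. The CPU count, cache line size and all names are assumptions made for the example; they are not taken from the Linux sources.

```c
#include <stddef.h>

#define NR_CPUS         8   /* assumed CPU count for the sketch */
#define CACHE_LINE_SIZE 64  /* assumed cache line size in bytes */

/* One counter per CPU, padded so that no two CPUs share a cache line. */
struct padded_counter {
    unsigned long value;
    char pad[CACHE_LINE_SIZE - sizeof(unsigned long)];
};

static struct padded_counter counters[NR_CPUS];

/* Array style: needs the CPU number first, then an index computation. */
static unsigned long *counter_by_cpu(unsigned cpu)
{
    return &counters[cpu].value;
}

/* PDA style: the counter is a field of the per-CPU structure, so a
 * pointer to that structure is all that is needed to reach it. */
struct pda_sketch {
    unsigned long counter;
};

static unsigned long *counter_by_pda(struct pda_sketch *pda)
{
    return &pda->counter;
}
```

In the array style every subsystem pays for its own padding, whereas in the PDA style the fields of one CPU sit together in a structure that only that CPU touches.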
9 Partial stack frame
To speed up interrupts and system calls, the kernel entry code only saves registers that are actually clobbered by the C code in the portable part. Some system calls and kernel functions need to see a full register state. These include, for example, fork, which has to copy all registers to the child process; exec, which has to restore all registers; and signal handling, which needs to present all registers to the signal handler. Special stubs are used to save the full register set in these cases.
After a fast system call entry through SYSCALL, the kernel stack frame is partially uninitialized. Some information, such as the user program counter (RIP) and the user stack pointer (RSP), is saved in the PDA or in special registers. On other entry points (like for the i386 syscall emulation), they are on the normal stack frame on the kernel stack. To shield C code from these differences, the CPU part of the stack frame is always fixed by a special stub before calling any function that looks at the kernel stack frame. After the system call returns to the emulation layer, the PDA state is restored using the stack frame to handle context switches.
10 Kernel stack
On Linux, every process and kernel thread has its own kernel stack. This stack is also used for interrupts while the process runs.
Over time, the Linux memory allocator will encounter problems allocating more than two consecutive pages reliably due to memory fragmentation. Every process needs a contiguous kernel stack that should be directly mapped for efficiency. Like the i386, the x86-64 has a 4K page size. This limits the kernel stack in practice on i386 and x86-64 to 6-8K. This also helps to keep the per-thread overhead of LinuxThreads (the most common threads package under Linux) low, which uses a separate kernel stack for each thread.
On i386, the 6K stack available is already tight under heavy interrupt load. 64-bit code needs more stack space than 32-bit code because the stack is always 64-bit aligned and the data structures on the stack are bigger. To avoid stack overflow for nested interrupts, the x86-64 port uses a separate per-CPU interrupt stack.
The x86-64 architecture supports interrupt stacks in hardware. Unfortunately, this mechanism causes problems with nested interrupts, which are common in Linux. Instead of the hardware mechanism, a more flexible software stack switching scheme using an interrupt counter in the PDA is used.
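One plausible way such a counter-based scheme can work is sketched below. This is a simulation under stated assumptions: the text does not describe the exact encoding of the counter, and the structure, field and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical subset of the PDA used for interrupt stack switching. */
struct pda_irq {
    int       irq_count;     /* nesting depth, 0 = not in an interrupt */
    uintptr_t irq_stack_top; /* top of this CPU's interrupt stack */
};

/* Returns the stack pointer the handler should run on. Only the
 * outermost interrupt switches stacks; a nested interrupt is already
 * on the interrupt stack and simply continues there. */
static uintptr_t irq_enter(struct pda_irq *pda, uintptr_t current_sp)
{
    if (pda->irq_count++ == 0)
        return pda->irq_stack_top;
    return current_sp;
}

/* Leave one nesting level. */
static void irq_exit(struct pda_irq *pda)
{
    pda->irq_count--;
}
```

The point of the counter is that a nested interrupt must not restart at the top of the interrupt stack, since that would overwrite the frames of the interrupt it interrupted.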
For double fault and stack fault exceptions, the hardware interrupt stacks are used to handle invalid kernel stack pointers with a debugging message instead of silently rebooting the system.
11 Finding yourself
On a machine with multiple CPUs it can be quite complicated to find the current process. A global variable cannot be used, as it is CPU-local information. i386 uses a special trick to solve this problem: the task structure is always stored at the bottom of the two aligned kernel stack pages^4 and can be efficiently accessed using an AND operation on the current stack pointer.
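The AND trick used by i386 amounts to masking off the low bits of the stack pointer. A minimal sketch, assuming an 8K kernel stack aligned to 8K (the function and macro names are hypothetical):

```c
#include <stdint.h>

#define KSTACK_SIZE 8192u  /* two aligned 4K pages */

/* Round the stack pointer down to the start of its 8K stack area,
 * which is where the task structure is stored. */
static uintptr_t task_from_sp(uintptr_t sp)
{
    return sp & ~((uintptr_t)KSTACK_SIZE - 1);
}
```

Any stack pointer inside the same 8K area yields the same result, so no per-CPU lookup is required.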
One disadvantage of this is that the task structures of all processes end up on the same cache sets for not-fully-associative CPU and chipset caches, because the lower 13 bits of their addresses are always zero. This can cause cache thrashing in the scheduler for some workloads.
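The effect can be illustrated with a simple set-index calculation. The cache geometry below (64-byte lines, and 128 sets in the usage note) is an assumption chosen for the example and does not describe any particular CPU.

```c
#include <stdint.h>

/* Set index of `addr` in a cache with the given line size and number
 * of sets: consecutive lines map to consecutive sets, wrapping around. */
static unsigned cache_set(uintptr_t addr, unsigned line_size, unsigned num_sets)
{
    return (unsigned)((addr / line_size) % num_sets);
}
```

With 64-byte lines and 128 sets, one way of the cache spans exactly 8K, so every 8K-aligned task structure starts in set 0, for example `cache_set(0x12344000, 64, 128)` and `cache_set(0x12346000, 64, 128)` both select the same set.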
In the 64-bit kernel, accessing the task structure through the stack pointer doesn’t work, as interrupts running on the special interrupt stack also need to access it, for example, to maintain the per-process system and user time statistics.
On x86-64 the current process pointer is stored in the PDA, which is efficiently accessed using the GS register. This will also allow the task struct to be moved to a separate cache coloring slab cache, working around the cache problems described above, and giving the 64-bit kernel in user context 8K of stack space instead of 6K.
This setup is still experimental. If it turns out in further tests that an 8K stack is not enough for the 64-bit user context kernel code without interrupts, then the port will have to move to a kernel stack that is not physically contiguous, which will be slower due to increased use of TLB resources, but can be made bigger without stressing the page allocator. This will also require auditing drivers to ensure they do not perform DMA from the kernel stack.
(^4) Which is why i386 can use only 6K of the 8K available from the two kernel stack pages.
12 Context switch
The basic context switch of x86-64 is very similar to the i386 port except that it also saves and restores the extended R8-R15 integer registers. The extended SSE registers are handled transparently by the FXSAVE instruction.
d drivers still need work. It is hoped that in the future, 32/64-bit translation will be a generic feature of Linux drivers, to avoid a hard-to-maintain central translation layer.
This 64-bit conversion is currently done in an architecture-specific module for the x86-64, but it is expected to be moved into architecture-independent code in 2.5, as it is a common problem.
Legacy mode i386 applications see the full 4GB of virtual space reachable by 32-bit pointers. A 32-bit i386 kernel only gives them part of the 4GB address space (usually 3GB), as it also needs some address space of its own. Therefore, on a 64-bit kernel, even 32-bit applications can use more address space.
13 Status
The kernel, compiler and tool chain work. The kernel boots and works on the simulator, which is used for the porting of userland code and for running programs.
14 Availability
All the code discussed in this paper can be downloaded from http://www.x86-64.org. The gcc port will be part of gcc 3.1. The x86-64 toolchain is part of the standard GNU binutils sources. Gdb and glibc ports are being worked on and they are available in