NOTE: this paper describes the PRE PEND_INTS algorithm code.
      It will be updated once PEND_INTS is finalized.

======================================================================

		A Description of the APIC I/O Subsystem

		Copyright (c) 1997 Steve Passe

The implementation of symmetric I/O in the FreeBSD SMP kernel is based
upon the Intel(tm) Advanced Programmable Interrupt Controller (APIC).
This part is used as a replacement for the functionality of the 8259 ICUs
used in traditional PCs.  For a complete description of the hardware
aspects of a motherboard using APICs see [Intel MP spec, 1].

-------------
Introduction:

The APIC differs from an (8259) ICU in several respects.  The most notable
is its support for distributed CPUs.  This is accomplished with its bus
structure: there are multiple APICs on a motherboard, all connected via a
private APIC bus.  Each CPU has an APIC, referred to as a "LOCAL APIC".
A motherboard has at least one APIC that is referred to as an "IO APIC".

External hardware INTerrupt sources are routed to pins of the IO APIC.
These include motherboard sources such as the 8254 timer, as well as
ISA/PCI card slots.  INTerrupts are "collected" by this part, and
forwarded via the APIC bus to one or more CPUs when they occur.

A class of interrupt known as an InterProcessor Interrupt (IPI) is also
supported.  The IPI is a method for one CPU to directly send an INTerrupt
"message" to one or more CPUs, including itself.

---------
The Code:

All the APIC specific code (except a small portion of mandatory setup
code) is bracketed in the source with "#ifdef APIC_IO / #if
defined(APIC_IO)".  It was kept separate from SMP specific code in
anticipation that the APIC may someday become common on UP motherboards,
as it supports more INTs than the standard pair of 8259 ICUs.  It is
enabled by adding "options APIC_IO" to a kernel config file.  It has only
been tested in the context of an SMP enabled kernel, and probably would
fail to work if used on its own.

The primary files implementing APIC support are:

 - /sys/i386/include/apic.h:
	register defines, etc.

 - /sys/i386/include/mpapic.h:
	header file for mpapic.c.
	default defines, macros.
	inline functions.

 - /sys/i386/i386/mpapic.c:
	APIC specific functions:
	  IO APIC initialization.
	  low-level InterProcessor Interrupt (IPI) functions.
	  INTerrupt mask functions (for IO APIC).

 - /sys/i386/i386/mp_machdep.c:
	LOCAL APIC setup.
	parsing of the motherboard MP table.
	vm mapping of LOCAL/IO APICs.
	utility functions for the "APIC sub-system":
	  functions that return "device to INTerrupt" associations, etc.
	functions that install/enable MP bootstrap code.
	functions that startup the Application Processors (APs).
	TLB invalidation IPI functions.

---------------
Boottime Setup:

During boot, in init386(), the function mp_start() is called.  mp_start()
looks for and (if found) parses the MP table provided by the motherboard.
If the MP table is good, or a "default" configuration is found, it calls
mp_enable(); if not, it panics.

If APIC_IO is enabled, mp_enable() programs the APIC sub-system.  This
includes the programming of individual IO APIC registers, as well as
setting the appropriate vectors in the kernel Interrupt Descriptor Table
(IDT).

mp_enable() finishes by installing bootstrap code for the Application
Processors (APs) and starting them up.  Each AP bootstraps itself to an
initial run state and then blocks on a lock that will later be released
by code in init_smp.c.

---------------------
INTerrupt Management:

The APIC sub-system attempts to steer INTerrupts to CPUs as appropriate
for efficient kernel operation.  In general this means sending hardware
INTerrupts from the IO APIC to the CPU most able to service the
INTerrupt.  Once we have fine-grained locking (for this discussion I'm
referring to one lock per INTerrupt source) this will mean the CPU
operating at the lowest priority level.

As long as we use the current "giant-lock" model we are stuck with a
different algorithm.
Specifically this means sending the INTerrupt to the CPU currently
holding the lock, as any other CPU would just "busy-spin" waiting to get
the lock.  We attempt to enforce this behaviour by using the Task
Priority Register (TPR) to keep all CPUs at an arbitrary "middle level
priority".  When a CPU acquires the lock it sets its TPR to a lower
priority, making it the first candidate for receiving INTerrupts from the
IO APIC.

This works in the case where a CPU holds the lock as a result of a system
call/trap, but FAILS when it holds the lock as a result of being in an
INTerrupt Service Routine (ISR).  This is because the Processor Priority
Register (PPR) is set to the HIGHER of the TPR and the In Service
Register Vector (ISRV).  Thus when servicing an INT, the CPU's PPR is
bumped up from the TPR setting to the priority of the highest bit set in
the In Service Register.  For more info see [PPro Devel Manual, 2].

------------------
Future Directions:

When "fine-grained" locking is achieved the above algorithm will be
changed.  Details are still being determined, but essentially we will let
the APICs behave as they were designed to, i.e. let INTerrupts be
delivered according to the basic hardware priority algorithms, without
attempting to fool them with the current "lower my priority" scheme.

An IO APIC has 16 (Neptune) or 24 (Triton/Natoma) distinct channels.  We
currently map the traditional IRQ0-IRQ15 INTs to the IDT vectors from 32
thru 47, and the additional APIC INT sources (IRQ16-IRQ23) to the
following 8 IDT entries.  As discussed previously [PPro Devel Manual, 1]
the 240 supported vectors (16 thru 255) are grouped by 16:

	priority = vector / 16

This dense grouping invites deadlock thru LOCAL APIC fifo saturation.  It
is also a performance bottleneck.  The future design will spread out the
ISRs thru all the 240 available IDT vectors.  Ideally this would include
a guarantee that no more than 2 vectors (2 deep fifo) are assigned to any
one "slot".
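
The density of the current mapping can be checked with a few lines of C.
The following is only a sketch: the function and macro names
(vector_priority, irq_to_vector, irqs_in_slot, IRQ_VECTOR_BASE, etc.) are
hypothetical and do not appear in the kernel source.  It assumes the
mapping described above (IRQ0 at IDT vector 32, 24 IO APIC pins) and the
rule priority = vector / 16.

```c
#include <assert.h>

#define IDT_FIRST_USABLE 16	/* supported vectors are 16 thru 255 */
#define IDT_LAST	 255
#define IRQ_VECTOR_BASE	 32	/* assumption: IRQ0 is mapped to IDT vector 32 */
#define NUM_APIC_IRQS	 24	/* assumption: 24 pin IO APIC, IRQ0-IRQ23 */

/* Priority "slot" of an IDT vector: priority = vector / 16. */
static int
vector_priority(int vector)
{
	assert(vector >= IDT_FIRST_USABLE && vector <= IDT_LAST);
	return (vector / 16);
}

/* IDT vector assigned to an IRQ under the dense mapping. */
static int
irq_to_vector(int irq)
{
	assert(irq >= 0 && irq < NUM_APIC_IRQS);
	return (IRQ_VECTOR_BASE + irq);
}

/* Number of IRQ vectors that share the given priority slot. */
static int
irqs_in_slot(int slot)
{
	int irq, count = 0;

	for (irq = 0; irq < NUM_APIC_IRQS; irq++)
		if (vector_priority(irq_to_vector(irq)) == slot)
			count++;
	return (count);
}
```

Under these assumptions irqs_in_slot(2) is 16 and irqs_in_slot(3) is 8:
all traditional IRQs share one slot and the additional APIC sources share
the next, well beyond the 2 vectors per slot that a 2 deep fifo can
absorb.
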

We have only 13 available slots:

	240 / 16 == 15 slots
	          -  1 slot used by the linux syscall
	          -  1 slot at the top which is "uninterruptable"
	            --
	            13 available slots

Since we need 1 or 2 slots for the exclusive use of IPIs, we will
probably have about 10 slots left for hardware INTs.  There may be some
grouping along the line of swi_tty, swi_net, _softclock, swi_ast.  We
might also group all network cards in one slot, all sio INTs in another,
etc.

The code in vector.s, as well as portions of icu.[ch], will be abandoned
and APIC specific functionality rewritten from the ground up.  vector.s
will continue to exist, both for the UP kernel and for "mixed-mode"
programming.  The details for achieving this are as yet undefined.  It is
envisioned that a scheme for boot-time modification of the APIC ISR
functions may be used, as opposed to the MACRO model used in vector.s.

---

There will be a set of "per-CPU" pages that allow each CPU to have a
"private" copy of certain variables.  For example 'my_id' would be the
CPU's logical APIC id.  This eliminates the need to "read APIC_ID, shift,
mask, convert" to determine which CPU we are.  This affects every place
that currently uses cpunumber()/GETCPUID to do this.

_curproc etc. will move into the private pages as well, along with the
local run queues, _curpcb, _runtime, the clock variables, etc.  The big
"schedule" routine would be responsible for placing processes in each
cpu's local run queue.  This supports binding a pid to a particular
processor and making processes "prefer" a particular cpu in the normal
case.  (It won't cause one cpu to be overloaded while another runs idle,
because the idle one will cause a reschedule.  The reschedule can factor
in a "cost" of moving a process from one cpu to another.)

Mapping of the LOCAL/IO APICs at "known" virtual addresses.  This will
simplify code from:

	movl	_apic_base, %ecx		/* - get cpu# */
	movl	APIC_ID(%ecx), %ecx

to:

	movl	$_apic_base + APIC_TPR, %ecx

in many places.
It will allow C code to change from:

	extern volatile u_int* apic_base;
	id = apic_base[APIC_ID];
	io0id = io_apic_base[APIC_ID];

to:

	extern volatile local_apic_t *local_apic;
	extern volatile io_apic_t *io_apic[NAPIC];
	id = local_apic->id;
	io0id = io_apic[0]->id;

-------
Fubars:

The Intel MP spec is fairly "loose" in many places.  It provides support
for older Intel parts at the cost of "design elegance".  Many vendors
fail to adhere to the details of the specification.  All of these facts
make parsing/using the MP table a real pain!

An edge triggered (i.e. ISA) INTerrupt that occurs while it is masked in
the IO APIC is LOST!  For details see [82489DX APIC, 1].  To deal with
this we use "lazy masking", where an INTerrupt is left unmasked on the
first occurrence, only being masked if it re-occurs while the first
occurrence is still pending or being serviced.

Some motherboards do NOT connect the 8254 timer to the IO APIC.  The MP
spec allows this (see above comments on the spec) since it is impossible
to get this signal "off-chip" on some chipsets.  This is dealt with by
using what is referred to as "mixed-mode" programming.  Essentially this
involves routing the lower 8259 ICU INTR out to IRQ0 of the IO APIC.  The
IO APIC is then programmed to expect an external device (ExtInt in MP
spec terminology) on that pin.  When an INTerrupt is received via this
channel the IO APIC does the IntAck dance with the 8259 to fetch the
vector.  The end result is that timer INTs are "noticed" by the
traditional 8259, and then passed thru the IO APIC's IRQ0 pin.

The real fubar is that some motherboards show the 8254 timer connected
directly to the APIC, when in reality IT IS NOT!  This is dealt with as a
"rogue hardware" situation and requires the use of "options SMP_TIMER_NC"
in the kernel config file to inform the kernel to take corrective action.

There is a 2-deep fifo per priority level for received INTerrupts in the
LOCAL APICs.
This means that deadlocks can occur when an INT cannot be delivered
because a fifo is full.  Careful programming can avoid this problem; the
details will be covered elsewhere.  For details see
[PPro Devel Manual, 1].

-------------------------------------------------------------------------------
References:

---
Intel MP spec, 1:

	Intel MultiProcessor Specification
	version 1.4, July 1995

	section 3.6: Multiprocessor Interrupt Control

---
82489DX APIC, 1:

	Intel486 Microprocessors and Related Products
	ISBN 1-55512-235-3

	82489DX Advanced Programmable Interrupt Controller
	Consideration 5, page 4-302

---
PPro Devel Manual, 1:

	Pentium Pro Family Developer's Manual
	ISBN 1-55512-261-2

	Multiple Processor Management
	section 7.4.2: Valid Interrupts.

 -
PPro Devel Manual, 2:

	Multiple Processor Management
	section 7.4.10.2/3: Task Priority Reg. / Processor Priority Reg.

---