COM 314 – COMPUTER ARCHITECTURE
INTRODUCTION TO COMPUTER ARCHITECTURE
INTRODUCTION TO COMPUTER SYSTEMS
Computer systems have conventionally been defined through their interfaces at a number of layered abstraction levels, each providing functional support to its predecessor. Included among the levels are the application programs, the high-level languages, and the set of machine instructions. Based on the interface between different levels of the system, a number of computer architectures can be defined.
The interface between the application programs and a high-level language is referred to as a language architecture. A different definition of computer architecture is built on four basic viewpoints. These are the structure, the organization, the implementation, and the performance. In this definition, the structure defines the interconnection of various hardware components, the organization defines the dynamic interplay and management of the various components, the implementation defines the detailed design of hardware components, and the performance specifies the behaviour of the computer system.
OVERVIEW OF COMPUTER ORGANISATION AND ARCHITECTURE
Computer architecture deals with the functional behaviour of a computer system as viewed by a programmer. This view includes aspects such as the sizes of data types (e.g. using 16 binary digits to represent an integer), and the types of operations that are supported (like addition, subtraction, and subroutine calls). Computer organization deals with structural relationships that are not visible to the programmer, such as interfaces to peripheral devices, the clock frequency, and the technology used for the memory.
Computer Architecture refers to those attributes of a system visible to a programmer, that is, those attributes that have a direct impact on the logical execution of a program. Computer Organisation refers to the operational units and their interconnections that realise the architectural specifications. Examples of architectural attributes include the instruction set, the number of bits used to represent various data types (e.g. numbers, characters), I/O mechanisms, and techniques for addressing memory. Organisational attributes include those hardware details transparent to the programmer, such as control signals, interfaces between the computer and peripherals, and the memory technology used.
As an example, it is an architectural design issue whether a computer will have a multiply instruction. It is an organisational issue whether that instruction will be implemented by a special multiply unit or by a mechanism that makes repeated use of the add unit of the system.
PERFORMANCE MEASURES
There are various facets to the performance of a computer. For example, a user of a computer measures its performance based on the time taken to execute a given job (program). On the other hand, a laboratory engineer measures the performance of the system by the total amount of work done in a given time. While the user considers the program execution time a measure of performance, the laboratory engineer considers the throughput a more important measure of performance.
Performance analysis should help answer questions such as: how fast can a program be executed on a given computer? In order to answer such a question, we need to determine the time taken by a computer to execute a given job. We define the clock cycle time as the time between two consecutive rising (or falling) edges of a periodic clock signal (Fig. 1.1). Clock cycles are a convenient unit for counting computations, because the storage of computation results is synchronized with the clock edges. The time required to execute a job by a computer is therefore often expressed in terms of clock cycles.
We denote the number of CPU clock cycles for executing a job to be the cycle count (CC), the cycle time by CT, and the clock frequency by f = 1/CT. The time taken by the CPU to execute a job can be expressed as
CPU time = CC × CT = CC/f
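As a quick illustration (the cycle count and clock rate below are assumed figures, not taken from the text), this relationship can be evaluated directly in a few lines of Python:

# Assumed figures for illustration only: a job that needs 5 million
# clock cycles on a 200 MHz processor.
cycle_count = 5_000_000              # CC: clock cycles needed by the job
clock_rate = 200_000_000             # f in Hz; the cycle time is CT = 1 / f

cpu_time = cycle_count / clock_rate  # CPU time = CC x CT = CC / f
print(f"CPU time = {cpu_time} s")    # 0.025 s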
It may be easier to count the number of instructions executed in a given program as compared to counting the number of CPU clock cycles needed for executing that program.
Therefore, the average number of clock cycles per instruction (CPI) has been used as an alternate performance measure. The following equation shows how to compute the CPI.
CPI = CPU clock cycles for the program / Instruction count

CPU time = Instruction count × CPI × Clock cycle time
         = (Instruction count × CPI) / Clock rate
It is known that the instruction set of a given machine consists of a number of instruction categories: ALU (simple assignment and arithmetic and logic instructions), load, store, branch, and so on. In the case that the CPI for each instruction category is known, the overall CPI can be computed as
CPI = (Σ_{i=1..n} CPI_i × I_i) / Instruction count
where I_i is the number of times an instruction of type i is executed in the program and CPI_i is the average number of clock cycles needed to execute such an instruction.
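As a minimal sketch of this weighted average in Python (the function name and data layout are illustrative choices, not part of the text):

def overall_cpi(mix):
    """Weighted-average CPI from a list of (I_i, CPI_i) pairs, one per category."""
    total_instructions = sum(count for count, _ in mix)
    total_cycles = sum(count * cpi for count, cpi in mix)
    return total_cycles / total_instructions

# Example: two hypothetical categories, 60 ALU instructions at 1 cycle each
# and 40 load/store instructions at 3 cycles each -> CPI = (60 + 120) / 100 = 1.8
print(overall_cpi([(60, 1), (40, 3)]))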
Example 1: Moore's law, which is attributed to Intel founder Gordon Moore, states that computing power doubles every 18 months for the same price. An unrelated observation is that floating point instructions are executed 100 times faster in hardware than via emulation. Using Moore's law as a guide, how long will it take for computing power to improve to the point that floating point instructions are emulated as quickly as their (earlier) hardware counterparts?
SOLUTION: Computing power increases by a factor of 2 every 18 months, which generalizes to a factor of 2^x every 18x months. To find the time at which computing power increases by a factor of 100, we need to solve 2^x = 100, which gives x = log2(100) ≈ 6.644. We thus have 18x = 18 × 6.644 ≈ 120 months, which is 10 years.
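The same calculation can be checked in a couple of lines of Python (a sketch, assuming only the 18-month doubling period and the factor of 100 from the problem statement):

import math

doublings = math.log2(100)       # solve 2**x = 100  ->  x ≈ 6.644
months = 18 * doublings          # one doubling every 18 months
print(f"{doublings:.3f} doublings, about {months:.0f} months (10 years)")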
Example 2: Consider computing the overall CPI for a machine A for which the following performance measures were recorded when executing a set of benchmark programs. Assume that the clock rate of the CPU is 200 MHz.
Instruction Category    Percentage of occurrence    No. of cycles per instruction
ALU                     38                          1
Load & Store            15                          3
Branch                  42                          4
Others                   5                          5
Assuming the execution of 100 instructions, the overall CPI can be computed as
CPI_A = (Σ_{i=1..n} CPI_i × I_i) / Instruction count = (38×1 + 15×3 + 42×4 + 5×5) / 100 = 2.76
It should be noted that the CPI reflects the organization and the instruction set architecture of the processor while the instruction count reflects the instruction set architecture and compiler technology used.
A different performance measure that has received a lot of attention in recent years is MIPS (million instructions per second), the rate of instruction execution per unit time, which is defined as

MIPS = Instruction count / (Execution time × 10^6) = Clock rate / (CPI × 10^6)
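The two forms of the definition can be written as small helper functions (the names below are just illustrative); they agree because execution time = (instruction count × CPI) / clock rate:

def mips_from_cpi(clock_rate_hz, cpi):
    # MIPS = Clock rate / (CPI x 10^6)
    return clock_rate_hz / (cpi * 1e6)

def mips_from_time(instruction_count, execution_time_s):
    # MIPS = Instruction count / (Execution time x 10^6)
    return instruction_count / (execution_time_s * 1e6)

print(mips_from_cpi(200e6, 2.76))    # machine A from Example 2 -> about 72.46 MIPS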
Example 3: Suppose that the same set of benchmark programs considered above were executed on another machine, call it machine B, for which the following measures were recorded.
Instruction Category    Percentage of occurrence    No. of cycles per instruction
ALU                     35                          1
Load & Store            30                          2
Branch                  15                          3
Others                  20                          5
What is the MIPS rating for the machine considered in the previous example (machine A) and machine B assuming a clock rate of 200 MHz?
CPI_A = (Σ_{i=1..n} CPI_i × I_i) / Instruction count = (38×1 + 15×3 + 42×4 + 5×5) / 100 = 2.76

MIPS_A = Clock rate / (CPI_A × 10^6) = (200 × 10^6) / (2.76 × 10^6) ≈ 72.46

CPI_B = (Σ_{i=1..n} CPI_i × I_i) / Instruction count = (35×1 + 30×2 + 15×3 + 20×5) / 100 = 2.40

MIPS_B = Clock rate / (CPI_B × 10^6) = (200 × 10^6) / (2.40 × 10^6) ≈ 83.33
Thus MIPS_B > MIPS_A.
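The following short sketch recomputes both machines from the instruction mixes above (clock rate 200 MHz); exact division gives 72.46 and 83.33 MIPS:

# Instruction mixes over 100 instructions: (percentage of occurrence, CPI) per category.
machine_a = [(38, 1), (15, 3), (42, 4), (5, 5)]
machine_b = [(35, 1), (30, 2), (15, 3), (20, 5)]
clock_rate = 200e6                      # 200 MHz

for name, mix in (("A", machine_a), ("B", machine_b)):
    cpi = sum(i * c for i, c in mix) / sum(i for i, _ in mix)
    mips = clock_rate / (cpi * 1e6)
    print(f"machine {name}: CPI = {cpi:.2f}, MIPS = {mips:.2f}")
# machine A: CPI = 2.76, MIPS = 72.46
# machine B: CPI = 2.40, MIPS = 83.33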
It is interesting to note here that although MIPS has been used as a performance measure for machines, one has to be careful in using it to compare machines having different instruction sets. This is because MIPS does not track execution time. Consider, for example, the following measurements made on two different machines running a given set of benchmark programs. Assume that the clock rate is 200 MHz.
Machine (A)
Instruction Category    No. of instructions (in millions)    No. of cycles per instruction
ALU                      8                                   1
Load & Store             4                                   3
Branch                   2                                   4
Others                   4                                   3

Machine (B)
Instruction Category    No. of instructions (in millions)    No. of cycles per instruction
ALU                     10                                   1
Load & Store             8                                   2
Branch                   2                                   4
Others                   4                                   3
CPI_A = (Σ_{i=1..n} CPI_i × I_i) / Instruction count = ((8×1 + 4×3 + 2×4 + 4×3) × 10^6) / ((8 + 4 + 2 + 4) × 10^6) = 40/18 ≈ 2.2

MIPS_A = Clock rate / (CPI_A × 10^6) = (200 × 10^6) / (2.2 × 10^6) ≈ 90.9

CPU time_A = (Instruction count × CPI_A) / Clock rate = (18 × 10^6 × 2.2) / (200 × 10^6) = 0.198 s

CPI_B = (Σ_{i=1..n} CPI_i × I_i) / Instruction count = ((10×1 + 8×2 + 2×4 + 4×3) × 10^6) / ((10 + 8 + 2 + 4) × 10^6) = 46/24 ≈ 1.9

MIPS_B = Clock rate / (CPI_B × 10^6) = (200 × 10^6) / (1.9 × 10^6) ≈ 105.3

CPU time_B = (Instruction count × CPI_B) / Clock rate = (24 × 10^6 × 1.9) / (200 × 10^6) = 0.228 s

Thus MIPS_B > MIPS_A while CPU time_B > CPU time_A.
The example shows that although machine B has a higher MIPS compared to machine A, it requires longer CPU time to execute the same set of benchmark programs.
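The following sketch recomputes both machines from the tables above (all figures come from the example; the small differences from the rounded numbers in the text arise because the text rounds CPI to one decimal place before dividing):

clock_rate = 200e6                                   # 200 MHz

# (instruction count in millions, CPI) per category, taken from the tables above.
machine_a = [(8, 1), (4, 3), (2, 4), (4, 3)]
machine_b = [(10, 1), (8, 2), (2, 4), (4, 3)]

for name, mix in (("A", machine_a), ("B", machine_b)):
    instructions = sum(i for i, _ in mix) * 1e6
    cycles = sum(i * c for i, c in mix) * 1e6
    cpi = cycles / instructions
    mips = clock_rate / (cpi * 1e6)
    cpu_time = instructions * cpi / clock_rate       # = cycles / clock_rate
    print(f"machine {name}: CPI = {cpi:.2f}, MIPS = {mips:.1f}, CPU time = {cpu_time:.3f} s")
# machine A: CPI = 2.22, MIPS = 90.0, CPU time = 0.200 s
# machine B: CPI = 1.92, MIPS = 104.3, CPU time = 0.230 s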
The million floating-point operations per second rate, MFLOPS (the rate of floating-point operation execution per unit time), has also been used as a measure of machine performance. It is defined as

MFLOPS = Number of floating-point operations in a program / (Execution time × 10^6)
BASIC PROCESSOR ARCHITECTURE
A processor's architecture refers to the way in which its memory and control are organized. Most early computers were designed as accumulator machines: the processor has a single register, called the accumulator, in which arithmetic, logic and comparison operations take place. All other values and variables are stored in memory and transferred to the accumulator register when needed.
Figure 1.2 Typical Microprocessor Architecture
Inside the CPU
The basic function of a CPU is to fetch, decode and execute instructions held in ROM or RAM. To accomplish this it must fetch data from an external memory source and transfer it into its own internal memory, each addressable component of which is called a register. It must also be able to distinguish between instructions and operands, that is, the read/write memory locations containing the data to be operated on. These may be byte-addressable locations in ROM, RAM or in the CPU's own registers. In addition, the CPU must perform further tasks such as responding to external events like resets and interrupts, providing memory management facilities to the operating system, and so on. Figure 1.2 illustrates a typical microprocessor architecture.
Microprocessors must perform the following activities:
Provide temporary storage for addresses and data
Perform arithmetic and logic operations
Control and schedule all operations
Registers
Registers are used for a variety of purposes such as holding the address of instructions and data, storing the result of an operation, signalling the result of a logic operation, or indicating the status of the program or the CPU itself. Registers are locations where data or control information is temporarily stored. Some registers may be accessible to programmers, while others are reserved for use by the CPU itself. Registers store binary values (1 or 0) as electrical voltages of, say, 5 volts (or less) for a 1 and 0 volts for a 0. They consist of several integrated transistors configured as flip-flop circuits, each of which can be switched into a 1 or 0 state. They remain in that state until changed under control of the CPU or until power is removed from the processor. Each register has a specific name and is addressable; some are dedicated to specific tasks, while the majority are 'general purpose'. The width of a register depends on the type of CPU, e.g. a 16, 32 or 64 bit microprocessor. In order to provide backward compatibility, registers may be sub-divided. For example, the Pentium processor is a 32 bit CPU, and its registers are 32 bits wide. Some of these are sub-divided and named as 8 and 16 bit registers in order to run 8 and 16 bit applications designed for earlier x86 microprocessors.
The 8 general-purpose registers are:
1. Accumulator Register (AX): used in arithmetic operations.
2. Counter Register (CX): used in shift/rotate instructions and loops.
3. Data Register (DX): used in arithmetic operations and I/O operations.
4. Base Register (BX): used as a pointer to data (located in segment register DS, when in segmented mode).
5. Stack Pointer Register (SP): points to the top of the stack.
6. Stack Base Pointer Register (BP): used to point to the base of the stack.
7. Source Index Register (SI): used as a pointer to a source in stream operations.
8. Destination Index Register (DI): used as a pointer to a destination in stream operations.
Stack Pointer
A 'stack' is a small area of reserved memory used to store the data in the CPU's registers when: (1) system calls are made by a process to operating system routines; (2) hardware interrupts are generated by input/output (I/O) transactions on peripheral devices; (3) a process initiates an I/O transfer; and (4) a process rescheduling event occurs as a result of a hardware timer interrupt. This transfer of register contents is called a 'context switch'. The stack pointer is the register which holds the address of the most recent 'stack' entry. Hence, when a system call is made by a process (to, say, print a document) and its context is stored on the stack, the called system routine uses the stack pointer to reload the register contents when it has finished printing. Thus the process can continue where it left off.
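As a toy illustration of how a stack and stack pointer behave during a context switch (the memory size, register names and values below are invented for the example; a real CPU does this with dedicated push/pop and interrupt micro-operations):

memory = [0] * 16            # a small reserved stack area
sp = len(memory)             # stack pointer starts just past the top of the area

def push(value):
    global sp
    sp -= 1                  # the stack grows toward lower addresses
    memory[sp] = value

def pop():
    global sp
    value = memory[sp]
    sp += 1
    return value

# Context switch sketch: save the register contents, then restore them later.
registers = {"AX": 7, "BX": 42}
for name in registers:                   # save ("push") the context
    push(registers[name])
restored = {name: pop() for name in reversed(list(registers))}   # restore it
print(restored)                          # {'BX': 42, 'AX': 7}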
Instruction Decoder
The Instruction Decoder is an arrangement of logic elements which act on the bits that constitute the instruction. Simple instructions with corresponding logic hard-wired into the execution unit are passed directly to the Execution Unit; complex instructions are decoded so that related microcode modules can be transferred from the CPU's microcode ROM to the execution unit. The Instruction Decoder will also store referenced operands in appropriate registers so that data at the memory locations referenced can be fetched.
Program or Instruction Counter
The Program Counter (PC) is the register that stores the address in primary memory (RAM or ROM) of the next instruction to be executed. In 32 bit systems, this is a 32 bit linear or virtual memory address that references a byte (the first of the 4 required to store a 32 bit instruction) in the process's virtual memory address space. This value is translated to determine the real memory address in which the instruction is stored. When the referenced instruction is fetched, the address in the PC is incremented to the address of the next instruction to be executed. Remember that each byte in RAM is individually addressable; however, each complete instruction is 32 bits (4 bytes), so the address of the next instruction in the process will be 4 bytes on.
Accumulator
The accumulator may contain data to be used in a mathematical or logical operation, or it may contain the result of an operation. General purpose registers are used to support the accumulator by holding data to be loaded to/from the accumulator.
Arithmetic and Logic Unit
The Arithmetic and Logic Unit (ALU) performs all arithmetic, comparisons and logic operations in a microprocessor viz. addition, subtraction, multiplication, division, logical AND, OR, X-OR, etc. A typical ALU is connected to the accumulator and general purpose registers and other CPU components that help transfer the result of its operations to RAM via the Bus Interface Unit and the system bus.
Machine cycle
For every instruction, a processor repeats a set of four basic operations, which comprise a machine cycle (a sketch of this loop follows below):
Step 1: Fetching – obtaining a program instruction or data item from memory.
Step 2: Decoding – translating the instruction into signals the computer can execute.
Step 3: Executing – carrying out the commands.
Step 4: Storing (if necessary) – writing the result to memory (not to a storage medium).
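A minimal sketch of this cycle for a toy accumulator machine (the instruction format, opcodes and memory contents are invented purely for illustration):

# Toy program: load 5, add 7, store the result, halt.
memory = {0: ("LOAD", 10), 1: ("ADD", 11), 2: ("STORE", 12), 3: ("HALT", None),
          10: 5, 11: 7, 12: 0}
acc, pc, running = 0, 0, True

while running:
    opcode, operand = memory[pc]     # Step 1: fetch the instruction at the PC
    pc += 1                          #         and advance the program counter
    if opcode == "LOAD":             # Step 2: decode the opcode ...
        acc = memory[operand]        # Step 3: ... and execute it
    elif opcode == "ADD":
        acc = acc + memory[operand]
    elif opcode == "STORE":
        memory[operand] = acc        # Step 4: store the result back in memory
    elif opcode == "HALT":
        running = False

print(memory[12])                    # 12, i.e. 5 + 7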
Control Unit
The control unit coordinates and manages CPU activities, in particular the execution of instructions by the arithmetic and logic unit (ALU). It also coordinates the input and output devices of a computer system and fetches the microcode corresponding to each instruction from the microprogram store. In short, the control unit is the circuitry that controls the flow of data through the processor and coordinates the activities of the other units within it.
Instruction Cycle
An instruction cycle consists of the activities required to fetch and execute an instruction. The length of time taken to fetch and execute is measured in clock cycles. When the CPU finishes the execution of an instruction it transfers the content of the program counter into the Bus Interface Unit. This is then gated onto the system address bus and the read signal is asserted on the control bus. This is a signal to the RAM controller that the value at this address is to be read from memory and loaded onto the data bus. The instruction is read in from the data bus and decoded. The fetch and decode activities constitute the first machine cycle of the instruction cycle. The second machine cycle begins when the instruction's operand is read from RAM and ends when the instruction is executed and the result written back to memory.
The System Clock
The processor relies on a small quartz crystal circuit called the system clock to control the timing of all computer operations. The system clock generates regular electronic pulses, or ticks, that set the operating pace of the components of the system unit. Each tick equates to a clock cycle.
The pace of the system clock, called the clock speed, is measured by the number of ticks per second. A hertz is one cycle per second; a computer that operates at 3 GHz completes 3 billion clock cycles in one second. The faster the clock speed, the more instructions the processor can execute per second, so the speed of the system clock directly influences a computer's performance.
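A quick illustration (the 3 GHz clock and the CPI of 2 are assumed figures, not taken from the text):

clock_rate = 3_000_000_000                   # 3 GHz -> 3 billion cycles per second
cycle_time_ns = 1e9 / clock_rate             # duration of one tick in nanoseconds
print(f"one clock cycle lasts about {cycle_time_ns:.3f} ns")    # ~0.333 ns

# With an assumed average CPI of 2, the raw instruction rate would be about:
print(f"{clock_rate / 2 / 1e6:.0f} million instructions per second")   # 1500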
PRACTICE QUESTIONS
Consider having a program that runs in 50 s on computer A, which has a 500 MHz clock. We would like to run the same program on another machine, B, in 20 s. If machine B requires 2.5 times as many clock cycles as machine A for the same program, what clock rate must machine B have in MHz?
Suppose that we have two implementations of the same instruction set architecture. Machine A has a clock cycle time of 50 ns and a CPI of 4.0 for some program, and machine B has a clock cycle of 65 ns and a CPI of 2.5 for the same program. Which machine is faster and by how much?
A compiler designer is trying to decide between two code sequences for a particular machine. The hardware designers have supplied the following facts:
Instruction class    CPI of the instruction class
A                    1
B                    3
C                    4
For a particular high-level language, the compiler writer is considering two sequences that require the following instruction counts:
                     Instruction counts (in millions)
Code Sequence        A    B    C
1                    2    1    2
2                    4    3    1
What is the CPI for each sequence? Which code sequence is faster? By how much?
Consider a machine with three instruction classes and CPI measurements as follows:
Instruction class    CPI of the instruction class
A                    2
B                    5
C                    7
Suppose that we measured the code for a given program in two different compilers and obtained the following data:
                     Instruction counts (in millions)
Code Sequence        A    B    C
Compiler 1           15   5    3
Compiler 2           25   2    2
Assume that the machine's clock rate is 500 MHz. Which code sequence will execute faster according to MIPS? And according to execution time?
Explain Moore's Law.
Briefly describe the basic operational functions of processor architecture with a well-annotated diagram.