
R 402

Computer Organization

Computer Science & Engineering Dept.

SJCET, Palai

COMPUTER ORGANIZATION

R 402

2+1+0

Module 1 Introduction: Organization and Architecture, Review of basic operational concepts, CPU: single bus and two bus organization, Execution of a complete instruction, Interconnection structures, Layered view of a computer system.

Module 2 CPU - Arithmetic: Signed addition and subtraction, Serial and parallel adder, BCD adder, Carry look ahead adder, Multiplication: Array multiplier, Booth's Algorithm, Division: Restoring and non-restoring division, Floating point arithmetic, ALU Design.

Module 3 Control Unit Organization: Processor Logic Design, Processor Organization, Control Logic Design, Control Organization, Hardwired control, Microprogram control, PLA control, Microprogram sequencer, Horizontal and vertical micro instructions, Nano instructions.

Module 4 Memory: Memory hierarchy, RAM and ROM, Memory system considerations, Associative memory, Virtual memory, Cache memory, Memory interleaving.

Module 5 Input Output: Printers, Plotters, Displays, Keyboard, Mouse, OMR and OCR, Device interface, I/O processor, Standard I/O interfaces: RS 232 C, IEEE 488.2 (GPIB).

References
1. Computer Organization - Hamacher, Vranesic and Zaky, McGraw-Hill
2. Digital Logic and Computer Design - Morris Mano, PHI
3. Computer Organization and Architecture - William Stallings, Pearson Education Asia
4. Computer Organization and Design - Pal Chaudhuri, PHI
5. Computer Organization and Architecture - M Morris Mano, PHI
6. Computer Architecture and Organization - John P Hayes, McGraw-Hill


1.1 Introduction to Computer organization and architecture


In describing a computer system, a distinction is often made between computer architecture and computer organization. Computer architecture refers to those attributes of a system visible to a programmer or, put another way, those attributes that have a direct impact on the logical execution of a program. Computer organization refers to the operational units and their interconnections that realize the architectural specification.

Examples of architectural attributes include the instruction set, the number of bits used to represent various data types (e.g., numbers and characters), I/O mechanisms, and techniques for addressing memory. Examples of organizational attributes include those hardware details transparent to the programmer, such as control signals, interfaces between the computer and peripherals, and the memory technology used.

As an example, it is an architectural design issue whether a computer will have a multiply instruction. It is an organizational issue whether that instruction will be implemented by a special multiply unit or by a mechanism that makes repeated use of the add unit of the system. The organizational decision may be based on the anticipated frequency of use of the multiply instruction, the relative speed of the two approaches, and the cost and physical size of a special multiply unit.

Historically, and still today, the distinction between architecture and organization has been an important one. Many computer manufacturers offer a family of computer models, all with the same architecture but with differences in organization. Consequently, the different models in the family have different price and performance characteristics. Furthermore, an architecture may survive many years, but its organization changes with changing technology.

Basic Structure of a Computer

Figure 1 shows the general structure of the IAS computer. It consists of:
o A main memory, which stores both data and instructions.
o An arithmetic-logical unit (ALU) capable of operating on binary data.
o A control unit, which interprets the instructions in memory and causes them to be executed.
o Input and output (I/O) equipment operated by the control unit.


Fig.1 Basic structure of a computer.

1.2 Review of basic operational concepts


Now we focus on the processing unit, which executes machine instructions and coordinates the activities of other units. This unit is often called the instruction set processor (ISP), or simply the processor. We examine its internal structure and how it performs the tasks of fetching, decoding, and executing the instructions of a program. The processing unit used to be called the central processing unit (CPU). The term central is less appropriate today because many modern computer systems include several processing units.

The organization of processors has evolved over the years, driven by developments in technology and the need to provide high performance. A common strategy in the development of high-performance processors is to make various functional units operate in parallel as much as possible. High-performance processors have a pipelined organization, where the execution of one instruction is started before the execution of the preceding instruction is completed. In another approach, known as superscalar operation, several instructions are fetched and executed at the same time. Pipelining and superscalar architectures are discussed later.

A typical computing task consists of a series of steps specified by a sequence of machine instructions that constitute a program. An instruction is executed by carrying out a sequence of more rudimentary operations. These operations and the means by which they are controlled are the main topic of this chapter.

1.3 CPU- single bus organization


To execute a program, the processor fetches one instruction at a time and performs the operations specified. Instructions are fetched from successive memory locations until a branch or a jump instruction is encountered. The processor keeps track of the address of the memory location containing the next instruction to be fetched using the program counter, PC. After fetching an instruction, the contents of the PC are updated to point to the next instruction in the sequence. A branch instruction may load a different value into the PC. Another key register in the processor is the instruction register, IR. Suppose that each instruction comprises 4 bytes, and that it is stored in one memory word. To execute an instruction, the processor has to perform the following three steps:


1. Fetch the contents of the memory location pointed to by the PC. The contents of this location are interpreted as an instruction to be executed. Hence, they are loaded into the IR. Symbolically, this can be written as
   IR ← [[PC]]
2. Assuming that the memory is byte addressable, increment the contents of the PC by 4, that is,
   PC ← [PC] + 4
3. Carry out the actions specified by the instruction in the IR.

In cases where an instruction occupies more than one word, steps 1 and 2 must be repeated as many times as necessary to fetch the complete instruction. These two steps are usually referred to as the fetch phase; step 3 constitutes the execution phase.

To study these operations in detail, we first need to examine the internal organization of the processor. Its main building blocks can be organized and interconnected in a variety of ways. We will start with a very simple organization. Later in this chapter and in Chapter 8 we will present more complex structures that provide high performance. Figure 1.1 shows an organization in which the arithmetic and logic unit (ALU) and all the registers are interconnected via a single common bus. This bus is internal to the processor and should not be confused with the external bus that connects the processor to the memory and I/O devices.

The data and address lines of the external memory bus are shown in Figure 1.1 connected to the internal processor bus via the memory data register, MDR, and the memory address register, MAR, respectively. Register MDR has two inputs and two outputs. Data may be loaded into MDR either from the memory bus or from the internal processor bus. The data stored in MDR may be placed on either bus. The input of MAR is connected to the internal bus, and its output is connected to the external bus. The control lines of the memory bus are connected to the instruction decoder and control logic block. This unit is responsible for issuing the signals that control the operation of all the units inside the processor and for interacting with the memory bus.

The number and use of the processor registers R0 through R(n - 1) vary considerably from one processor to another. Registers may be provided for general-purpose use by the programmer. Some may be dedicated as special-purpose registers, such as index registers or stack pointers. Three registers, Y, Z, and TEMP in Figure 1.1, have not been mentioned before. These registers are transparent to the programmer, that is, the programmer need not be concerned with them because they are never referenced explicitly by any instruction. They are used by the processor for temporary storage during execution of some instructions. These registers are never used for storing data generated by one instruction for later use by another instruction.

The multiplexer MUX selects either the output of register Y or a constant value 4 to be provided as input A of the ALU. The constant 4 is used to increment the contents of the program counter. We will refer to the two possible values of the MUX control input Select as Select4 and SelectY for selecting the constant 4 or register Y, respectively.

As instruction execution progresses, data are transferred from one register to another, often passing through the ALU to perform some arithmetic or logic operation. The instruction decoder and control logic unit is responsible for implementing the actions specified by the instruction loaded in the IR register. The decoder generates the control signals needed to select the registers involved and direct the transfer of data. The registers, the ALU, and the interconnecting bus are collectively referred to as the datapath.

With few exceptions, an instruction can be executed by performing one or more of the following operations in some specified sequence:
o Transfer a word of data from one processor register to another or to the ALU
o Perform an arithmetic or a logic operation and store the result in a processor register
o Fetch the contents of a given memory location and load them into a processor register
o Store a word of data from a processor register into a given memory location

We now consider in detail how each of these operations is implemented, using the simple processor model in Figure 1.1.
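The fetch phase described at the start of this section (load IR from the location addressed by PC, then add 4 to PC) can be sketched in a few lines of Python. This is only an illustration: the memory contents, the dictionary-based processor state, and the function name fetch are assumptions made for the example, not part of the processor model in Figure 1.1.

```python
# Byte-addressable memory holding one 4-byte instruction per word.
# The instruction strings are made-up placeholders for this sketch.
memory = {0: "Add (R3),R1", 4: "Move (R1),R2", 8: "Branch -12"}


def fetch(state):
    """Fetch phase: IR <- [[PC]], then PC <- [PC] + 4."""
    state["IR"] = memory[state["PC"]]  # step 1: load the word addressed by PC
    state["PC"] += 4                   # step 2: point at the next instruction
    return state["IR"]


state = {"PC": 0, "IR": None}
first = fetch(state)
second = fetch(state)
print(first, "|", second, "| PC =", state["PC"])
```

After two fetches the PC holds 8, the address of the third instruction. Step 3 (execution) is deliberately left out here.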

Fig.1.1 Single bus organization of the data path inside a processor


1.3.1 REGISTER TRANSFERS


Instruction execution involves a sequence of steps in which data are transferred from one register to another. For each register, two control signals are used to place the contents of that register on the bus or to load the data on the bus into the register. This is represented symbolically in Figure 1.2. The input and output of register Ri are connected to the bus via switches controlled by the signals Riin and Riout, respectively. When Riin is set to 1, the data on the bus are loaded into Ri. Similarly, when Riout is set to 1, the contents of register Ri are placed on the bus. While Riout is equal to 0, the bus can be used for transferring data from other registers.

Suppose that we wish to transfer the contents of register R1 to register R4. This can be accomplished as follows:
o Enable the output of register R1 by setting R1out to 1. This places the contents of R1 on the processor bus.
o Enable the input of register R4 by setting R4in to 1. This loads data from the processor bus into register R4.

All operations and data transfers within the processor take place within time periods defined by the processor clock. The control signals that govern a particular transfer are asserted at the start of the clock cycle. In our example, R1out and R4in are set to 1. The registers consist of edge-triggered flip-flops. Hence, at the next active edge of the clock, the flip-flops that constitute R4 will load the data present at their inputs. At the same time, the control signals R1out and R4in will return to 0. We will use this simple model of the timing of data transfers for the rest of this chapter. However, we should point out that other schemes are possible. For example, data transfers may use both the rising and falling edges of the clock. Also, when edge-triggered flip-flops are not used, two or more clock signals may be needed to guarantee proper transfer of data. This is known as multiphase clocking.
An implementation for one bit of register Ri is shown in Figure 1.3 as an example. A two-input multiplexer is used to select the data applied to the input of an edge-triggered D flip-flop. When the control input Riin is equal to 1, the multiplexer selects the data on the bus. This data will be loaded into the flip-flop at the rising edge of the clock. When Riin is equal to 0, the multiplexer feeds back the value currently stored in the flip-flop.
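The behaviour of one register bit, and the R1 to R4 transfer described above, can be modelled with a small Python class. This is a behavioural sketch only: the class name RegisterBit, its method names, and the use of None to mean "this register is not driving the bus" are assumptions made for illustration.

```python
class RegisterBit:
    """One bit of register Ri: a 2-input mux feeding an edge-triggered D flip-flop."""

    def __init__(self, q=0):
        self.q = q  # value currently stored in the flip-flop

    def clock_edge(self, bus, ri_in):
        # Mux: select the bus when Riin = 1, otherwise feed back the stored value.
        d = bus if ri_in else self.q
        self.q = d  # the flip-flop loads D at the active clock edge

    def drive(self, ri_out):
        # Output switch: when Riout = 0 the register does not drive the bus (None).
        return self.q if ri_out else None


r1, r4 = RegisterBit(q=1), RegisterBit(q=0)
bus = r1.drive(ri_out=1)     # R1out = 1 places the R1 bit on the bus
r4.clock_edge(bus, ri_in=1)  # R4in = 1 loads the bus value into R4
r1.clock_edge(bus, ri_in=0)  # R1in = 0, so R1 keeps its stored value
print(r1.q, r4.q)
```

A full n-bit register would simply be n such bits sharing the same Riin and Riout signals.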


Fig.1.2 Input and output gating for the registers in fig 1.1


Fig. 1.3 Input and output gating for one register bit

The Q output of the flip-flop is connected to the bus via a tri-state gate. When Riout is equal to 0, the gate's output is in the high-impedance (electrically disconnected) state. This corresponds to the open-circuit state of a switch. When Riout = 1, the gate drives the bus to 0 or 1, depending on the value of Q.

1.3.2 PERFORMING AN ARITHMETIC OR LOGIC OPERATION

The ALU is a combinational circuit that has no internal storage. It performs arithmetic and logic operations on the two operands applied to its A and B inputs. In Figures 1.1 and 1.2, one of the operands is the output of the multiplexer MUX and the other operand is obtained directly from the bus. The result produced by the ALU is stored temporarily in register Z. Therefore, a sequence of operations to add the contents of register R1 to those of register R2 and store the result in register R3 is

1. R1out, Yin
2. R2out, SelectY, Add, Zin
3. Zout, R3in

The signals whose names are given in any step are activated for the duration of the clock cycle corresponding to that step. All other signals are inactive. Hence, in step 1, the output of register R1 and the input of register Y are enabled, causing the contents of R1 to be transferred over the bus to Y. In step 2, the multiplexer's Select signal is set to SelectY, causing the multiplexer to gate the contents of register Y to input A of the ALU. At the same time, the contents of register R2 are gated onto the bus and, hence, to input B. The function performed by the ALU depends on the signals applied to its control lines. In this case, the Add line is set to 1, causing the output of the ALU to be the sum of the two numbers at inputs A and B. This sum is loaded into register Z because its input control signal is activated. In step 3, the contents of register Z are transferred to the destination register, R3. This last transfer cannot be carried out during step 2, because only one register output can be connected to the bus during any clock cycle.

In this introductory discussion, we assume that there is a dedicated signal for each function to be performed. For example, we assume that there are separate control signals to specify individual ALU operations, such as Add, Subtract, XOR, and so on. In reality, some degree of encoding is likely to be used. For example, if the ALU can perform eight different operations, three control signals would suffice to specify the required operation.

1.3.3 FETCHING A WORD FROM MEMORY

To fetch a word of information from memory, the processor has to specify the address of the memory location where this information is stored and request a Read operation. This applies whether the information to be fetched represents an instruction in a program or an operand specified by an instruction. The processor transfers the required address to the MAR, whose output is connected to the address lines of the memory bus. At the same time, the processor uses the control lines of the memory bus to indicate that a Read operation is needed. When the requested data are received from the memory they are stored in register MDR, from where they can be transferred to other registers in the processor.

The connections for register MDR are illustrated in Figure 1.4. It has four control signals: MDRin and MDRout control the connection to the internal bus, and MDRinE and MDRoutE control the connection to the external bus. The circuit in Figure 1.3 is easily modified to provide the additional connections. A three-input multiplexer can be used, with the memory bus data line connected to the third input. This input is selected when MDRinE = 1. A second tri-state gate, controlled by MDRoutE, can be used to connect the output of the flip-flop to the memory bus.

During memory Read and Write operations, the timing of internal processor operations must be coordinated with the response of the addressed device on the memory bus. The processor completes one internal data transfer in one clock cycle. The speed of operation of the addressed device, on the other hand, varies with the device. We saw in Chapter 5 that modern processors include a cache memory on the same chip as the processor.
Typically, a cache will respond to a memory read request in one clock cycle. However, when a cache miss occurs, the request is forwarded to the main memory, which introduces a delay of several clock cycles. A read or write request may also be intended for a register in a memory-mapped I/O device.

Fig. 1.4 Connection and control signals for register MDR

Such I/O registers are not cached, so their accesses always take a number of clock cycles. To accommodate the variability in response time, the processor waits until it receives an indication that the requested Read operation has been completed. We will assume that a control signal called Memory-Function-Completed (MFC) is used for this purpose. The addressed device sets this signal to 1 to indicate that the contents of the specified location have been read and are available on the data lines of the memory bus. As an example of a read operation, consider the instruction Move (R1),R2. The actions needed to execute this instruction are:

1. MAR ← [R1]
2. Start a Read operation on the memory bus
3. Wait for the MFC response from the memory
4. Load MDR from the memory bus
5. R2 ← [MDR]

These actions may be carried out as separate steps, but some can be combined into a single step. Each action can be completed in one clock cycle, except action 3, which requires one or more clock cycles, depending on the speed of the addressed device. For simplicity, let us assume that the output of MAR is enabled all the time. Thus, the contents of MAR are always available on the address lines of the memory bus. This is the case when the processor is the bus master. When a new address is loaded into MAR, it will appear on the memory bus at the beginning of the next clock cycle, as shown in Figure 1.5. A Read control signal is activated at the same time MAR is loaded. This signal will cause the bus interface circuit to send a read command, MR, on the bus. With this arrangement, we have combined actions 1 and 2 above into a single control step. Actions 3 and 4 can also be combined by activating control signal MDRinE while waiting for a response from the memory. Thus, the data received from the memory are loaded into MDR at the end of the clock cycle in which the MFC signal is received. In the next clock cycle, MDRout is activated to transfer the data to register R2. This means that the memory read operation requires three steps, which can be described by the signals being activated as follows:

1. R1out, MARin, Read
2. MDRinE, WMFC
3. MDRout, R2in

where WMFC is the control signal that causes the processor's control circuitry to wait for the arrival of the MFC signal. Figure 1.5 shows that MDRinE is set to 1 for exactly the same period as the read command, MR. Hence, in subsequent discussion, we will not specify the value of MDRinE explicitly, with the understanding that it is always equal to MR.
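The effect of a variable memory response time on the three-step read can be illustrated with a short cycle count. This is a sketch under stated assumptions: the signal lists in the comments are inferred from the prose above, the function name read_cycles is hypothetical, and the latency values are arbitrary illustrative numbers rather than figures from the text.

```python
def read_cycles(memory_latency):
    """Count clock cycles for Move (R1),R2 given the memory's response time.

    memory_latency is the assumed number of cycles spent in the WMFC step
    before MFC arrives; it must be at least 1.
    """
    steps = [
        ("R1out, MARin, Read", 1),         # address into MAR, start the read
        ("MDRinE, WMFC", memory_latency),  # wait for MFC, capture data in MDR
        ("MDRout, R2in", 1),               # move the data from MDR to R2
    ]
    return sum(cycles for _, cycles in steps)


cache_hit = read_cycles(1)    # memory responds within one clock cycle
cache_miss = read_cycles(10)  # 10 is an arbitrary latency chosen for illustration
print(cache_hit, cache_miss)
```

Only the middle step stretches; the two internal transfers always take one clock cycle each.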


Fig.1.5 Timing of a memory Read operation.

1.3.4 STORING A WORD IN MEMORY

Writing a word into a memory location follows a similar procedure. The desired address is loaded into MAR. Then, the data to be written are loaded into MDR, and a Write command is issued. Hence, executing the instruction Move R2,(R1) requires the following sequence:

1. R1out, MARin
2. R2out, MDRin, Write
3. MDRoutE, WMFC

As in the case of the read operation, the Write control signal causes the memory bus interface hardware to issue a Write command on the memory bus. The processor remains in step 3 until the memory operation is completed and an MFC response is received.
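The store operation can be traced step by step in Python. This is a minimal sketch: the signal names in the trace are inferred from the description above (address into MAR, data into MDR with a Write command, then a wait for MFC), and the function name store_word and the dictionary-based registers and memory are assumptions made for illustration.

```python
def store_word(regs, mem):
    """Trace Move R2,(R1): write the contents of R2 to the address held in R1."""
    trace = []
    regs["MAR"] = regs["R1"]        # step 1: address onto the bus and into MAR
    trace.append("R1out, MARin")
    regs["MDR"] = regs["R2"]        # step 2: data into MDR, Write command issued
    trace.append("R2out, MDRin, Write")
    mem[regs["MAR"]] = regs["MDR"]  # step 3: MDR drives the memory bus until MFC
    trace.append("MDRoutE, WMFC")
    return trace


regs = {"R1": 0x40, "R2": 99, "MAR": None, "MDR": None}
mem = {}
trace = store_word(regs, mem)
print(trace, mem)
```

The sketch treats the memory as responding immediately; in the model above, step 3 would be held for as many clock cycles as the memory needs.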

1.4 MULTIPLE-BUS ORGANIZATION


We used the simple single-bus structure of Figure 1.1 to illustrate the basic ideas. The resulting control sequences in Figures 1.6 and 1.7 are quite long because only one data item can be transferred over the bus in a clock cycle. To reduce the number of steps needed, most commercial processors provide multiple internal paths that enable several transfers to take place in parallel.

Fig. 1.8 Three-bus organization of the data path.

Figure 1.8 depicts a three-bus structure used to connect the registers and the ALU of a processor. All general-purpose registers are combined into a single block called the register file. The register file in Figure 1.8 is said to have three ports. There are two outputs, allowing the contents of two different registers to be accessed simultaneously and have their contents placed on buses A and B. The third port allows the data on bus C to be loaded into a third register during the same clock cycle. Buses A and B are used to transfer the source operands to the A and B inputs of the ALU, where an arithmetic or logic operation may be performed. The result is transferred to the destination over bus C. If needed, the ALU may simply pass one of its two input operands unmodified to bus C. We will call the ALU control signals for such an operation R=A or R=B. The three-bus arrangement obviates the need for registers Y and Z in Figure 1.1.

A second feature in Figure 1.8 is the introduction of the Incrementer unit, which is used to increment the PC by 4. Using the Incrementer eliminates the need to add 4 to the PC using the main ALU, as was done in Figures 1.6 and 1.7. The source for the constant 4 at the ALU input multiplexer is still useful. It can be used to increment other addresses, such as the memory addresses in LoadMultiple and StoreMultiple instructions.

Fig. 1.9 Control sequence for the instruction Add R4,R5,R6 for the three-bus organization in Fig. 1.8

Consider the three-operand instruction Add R4,R5,R6. The control sequence for executing this instruction is given in Figure 1.9. In step 1, the contents of the PC are passed through the ALU, using the R=B control signal, and loaded into the MAR to start a memory read operation. At the same time the PC is incremented by 4. Note that the value loaded into MAR is the original contents of the PC. The incremented value is loaded into the PC at the end of the clock cycle and will not affect the contents of MAR. In step 2, the processor waits for MFC and loads the data received into MDR, then transfers them to IR in step 3. Finally, the execution phase of the instruction requires only one control step to complete, step 4. By providing more paths for data transfer, a significant reduction in the number of clock cycles needed to execute an instruction is achieved.
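The reason the execution phase fits in one step is that the register file supplies two operands and accepts one result in the same clock cycle. A minimal behavioural sketch in Python follows. The class name RegisterFile, the method cycle, the extra register R7, and the register values are all assumptions made for illustration; the ALU operation is passed in as an ordinary function.

```python
class RegisterFile:
    """Three-port register file: two read ports (buses A, B), one write port (bus C)."""

    def __init__(self, values):
        self.regs = dict(values)

    def cycle(self, src_a, src_b, dest, alu_op):
        # Both source registers are read in the same clock cycle.
        bus_a, bus_b = self.regs[src_a], self.regs[src_b]
        bus_c = alu_op(bus_a, bus_b)  # the ALU result travels on bus C
        self.regs[dest] = bus_c       # written into the destination at end of cycle
        return bus_c


rf = RegisterFile({"R4": 10, "R5": 32, "R6": 0, "R7": 0})
total = rf.cycle("R4", "R5", "R6", lambda a, b: a + b)  # Add R4,R5,R6 in one cycle
passed = rf.cycle("R4", "R5", "R7", lambda a, b: b)     # R=B: pass operand B through
print(total, passed)
```

On the single-bus datapath the same addition needs three steps, because each operand and the result must cross the one bus in separate clock cycles.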

1.5 EXECUTION OF A COMPLETE INSTRUCTION


Let us now put together the sequence of elementary operations required to execute one instruction. Consider the instruction Add (R3),R1, which adds the contents of a memory location pointed to by R3 to register R1. Executing this instruction requires the following actions:

1. Fetch the instruction.
2. Fetch the first operand (the contents of the memory location pointed to by R3).
3. Perform the addition.
4. Load the result into R1.

Figure 1.6 gives the sequence of control steps required to perform these operations for the single-bus architecture of Figure 1.1. Instruction execution proceeds as follows. In step 1, the instruction fetch operation is initiated by loading the contents of the PC into the MAR and sending a Read request to the memory. The Select signal is set to Select4, which causes the multiplexer MUX to select the constant 4. This value is added to the operand at input B, which is the contents of the PC, and the result is stored in register Z. The updated value is moved from register Z back into the PC during step 2, while waiting for the memory to respond. In step 3, the word fetched from the memory is loaded into the IR.

Steps 1 through 3 constitute the instruction fetch phase, which is the same for all instructions. The instruction decoding circuit interprets the contents of the IR at the beginning of step 4. This enables the control circuitry to activate the control signals for steps 4 through 7, which constitute the execution phase. The contents of register R3 are transferred to the MAR in step 4, and a memory read operation is initiated.

Fig. 1.6 Control signals for the execution of the instruction Add (R3),R1.

Then the contents of R1 are transferred to register Y in step 5, to prepare for the addition operation. When the Read operation is completed, the memory operand is available in register MDR, and the addition operation is performed in step 6. The contents of MDR are gated to the bus, and thus also to the B input of the ALU, and register Y is selected as the second input to the ALU by choosing SelectY. The sum is stored in register Z, then transferred to R1 in step 7. The End signal causes a new instruction fetch cycle to begin by returning to step 1.

This discussion accounts for all control signals in Figure 1.6 except Yin in step 2. There is no need to copy the updated contents of PC into register Y when executing the Add instruction. But in Branch instructions the updated value of the PC is needed to compute the branch target address. To speed up the execution of Branch instructions, this value is copied into register Y in step 2. Since step 2 is part of the fetch phase, the same action will be performed for all instructions. This does not cause any harm because register Y is not used for any other purpose at that time.
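The seven steps described above can be traced with a small interpreter for control-signal lists. This is a sketch under stated assumptions: the per-step signal lists are reconstructed from the prose (the figure itself is not reproduced here), the memory is assumed to respond during the step that asserts WMFC, and the function name run, the addresses, and the operand values are made up for illustration.

```python
def run(steps, regs, mem):
    """Execute control steps on a single-bus datapath, one step per clock cycle."""
    for signals in steps:
        outs = [s for s in signals if s.endswith("out")]
        assert len(outs) <= 1, "only one register may drive the bus per cycle"
        bus = regs[outs[0][:-3]] if outs else None
        alu = None
        if "Add" in signals:
            a = 4 if "Select4" in signals else regs["Y"]  # MUX: constant 4 or Y
            alu = a + bus                                 # input B comes from the bus
        for s in signals:
            if s == "Zin":
                regs["Z"] = alu       # Z captures the ALU output
            elif s.endswith("in"):
                regs[s[:-2]] = bus    # any other register captures the bus
        if "WMFC" in signals:
            regs["MDR"] = mem[regs["MAR"]]  # assumed: memory answers in this step
    return regs


add_indirect = [
    ["PCout", "MARin", "Read", "Select4", "Add", "Zin"],  # 1: start fetch, PC + 4
    ["Zout", "PCin", "Yin", "WMFC"],                      # 2: update PC, copy to Y
    ["MDRout", "IRin"],                                   # 3: instruction into IR
    ["R3out", "MARin", "Read"],                           # 4: operand address
    ["R1out", "Yin", "WMFC"],                             # 5: R1 into Y, wait
    ["MDRout", "SelectY", "Add", "Zin"],                  # 6: add operand to Y
    ["Zout", "R1in", "End"],                              # 7: result into R1
]
mem = {100: "Add (R3),R1", 200: 25}
regs = {"PC": 100, "R1": 17, "R3": 200, "Y": 0, "Z": 0,
        "MAR": 0, "MDR": 0, "IR": None}
regs = run(add_indirect, regs, mem)
print(regs["R1"], regs["PC"], regs["IR"])
```

Representing each step as a list of asserted signals mirrors the way the control sequences are written in the text; the assertion enforces the single-bus rule that only one register output is enabled per cycle.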


Branch Instruction
A branch instruction replaces the contents of the PC with the branch target address. This address is usually obtained by adding an offset X, which is given in the branch instruction, to the updated value of the PC. Figure 1.7 gives a control sequence that implements an unconditional branch instruction. Processing starts, as usual, with the fetch phase. This phase ends when the instruction is loaded into the IR in step 3. The offset value is extracted from the IR by the instruction decoding circuit, which will also perform sign extension if required. Since the value of the updated PC is already available in register Y, the offset X is gated onto the bus in step 4, and an addition operation is performed. The result, which is the branch target address, is loaded into the PC in step 5. The offset X used in a branch instruction is usually the difference between the branch target address and the address immediately following the branch instruction.

Fig. 1.7 Control sequence for an unconditional branch instruction.

For example, if the branch instruction is at location 2000 and the branch target address is 2050, the value of X must be 46. The reason for this can be readily appreciated from the control sequence in Figure 1.7. The PC is incremented during the fetch phase, before knowing the type of instruction being executed. Thus, when the branch address is computed in step 4, the PC value used is the updated value, which points to the instruction following the branch instruction in the memory.

Consider now a conditional branch. In this case, we need to check the status of the condition codes before loading a new value into the PC. For example, for a Branch-on-negative (Branch<0) instruction, step 4 in Figure 1.7 is replaced with

Offset-field-of-IRout, Add, Zin, If N = 0 then End

Thus, if N = 0 the processor returns to step 1 immediately after step 4. If N = 1, step 5 is performed to load a new value into the PC, thus performing the branch operation.
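The offset arithmetic in the example above can be checked directly. This is a small sketch assuming 4-byte instructions, as stated earlier in the chapter; the function names branch_offset and next_pc are hypothetical.

```python
INSTRUCTION_BYTES = 4  # each instruction occupies one 4-byte word


def branch_offset(branch_address, target_address):
    """Offset X = target address minus the address of the instruction after the branch."""
    updated_pc = branch_address + INSTRUCTION_BYTES
    return target_address - updated_pc


def next_pc(branch_address, offset, n_flag):
    """Branch<0: take the branch only when the N condition code is 1."""
    updated_pc = branch_address + INSTRUCTION_BYTES
    return updated_pc + offset if n_flag else updated_pc


x = branch_offset(2000, 2050)
print(x, next_pc(2000, x, 1), next_pc(2000, x, 0))
```

With N = 1 the PC becomes 2050; with N = 0 execution simply continues at 2004, the instruction following the branch.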

1.6 Interconnection structures


A computer consists of a set of components or modules (processor, memory, I/O) that communicate with each other. A computer is a network of modules. There must be paths for connecting these modules. The collection of paths connecting the various modules is called the interconnection structure.


Memory
o Consists of N words of equal length
o Each word is assigned a unique numerical address (0, 1, ..., N-1)
o A word of data can be read or written
o Operation specified by control signals
o Location specified by address signals

I/O Module
o Similar to memory from the computer's viewpoint
o Consists of M external device ports (0, 1, ..., M-1)
o External data paths for input and output
o Sends interrupt signals to the processor

Processor
o Reads in instructions and data
o Writes out data after processing
o Uses control signals to control the overall operation of the system
o Receives interrupt signals

The preceding list defines the data to be exchanged. The interconnection structure must support the following types of transfers:
o Memory to processor: the processor reads an instruction or a unit of data from memory.
o Processor to memory: the processor writes a unit of data to memory.
o I/O to processor: the processor reads data from an I/O device via an I/O module.
o Processor to I/O: the processor sends data to the I/O device via an I/O module.
o I/O to or from memory: an I/O module is allowed to exchange data directly with memory, without going through the processor, using direct memory access (DMA).

Over the years, a number of interconnection structures have been tried. By far the most common is the bus and various multiple-bus structures.

Bus Interconnection
A bus is a communication pathway connecting two or more devices. Multiple devices can be connected to the same bus at the same time. Typically, a bus consists of multiple communication pathways, or lines. Each line is capable of transmitting signals representing binary 1 or binary 0. A bus that connects major computer components (processor, memory, I/O) is called a system bus.

Bus Structure
Typically, a bus consists of 50 to hundreds of separate lines. On any bus the lines are grouped into three main function groups: data, address, and control. There may also be power distribution lines for attached modules.

Data lines
o Path for moving data and instructions between modules.
o Collectively are called the data bus.
o Consists of 8, 16, 32, 64, etc. lines; the width is a key factor in overall system performance.

Address lines
o Identify the source or destination of the data on the data bus, for example when the CPU needs to read an instruction or data from a given memory location.
o Bus width determines the maximum possible memory capacity for the system.
o The 8080 has 16-bit addresses, giving access to 64K addresses.

Control lines
o Used to control the access to and the use of the data and address lines.
o Transmit command and timing information between modules.

Typical control lines include the following:
- Memory write: causes data on the bus to be written to the addressed memory location.
- Memory read: causes data from the addressed memory location to be placed on the bus.
- I/O write: causes data on the bus to be output to the addressed I/O port.
- I/O read: causes data from the addressed I/O port to be placed on the bus.
- Transfer ACK: indicates that data have been accepted from or placed on the bus.
- Bus request: indicates that a module needs to gain control of the bus.
- Bus grant: indicates that a requesting module has been granted control of the bus.
- Interrupt request: indicates that an interrupt is pending.
- Interrupt ACK: indicates that the pending interrupt has been recognized.
- Clock: used to synchronize operations.
- Reset: initializes all modules.
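The relation between address-bus width and addressable memory can be checked with a short sketch. The function name addressable_locations is hypothetical and used only for illustration.

```python
def addressable_locations(address_bits: int) -> int:
    """Number of distinct addresses that an address bus of this width can select."""
    return 2 ** address_bits


# A 16-bit address bus, as on the 8080, reaches 2^16 = 65536 locations (64K).
print(addressable_locations(16))
```

Each extra address line doubles the number of locations that can be addressed.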

What does a bus look like?
- Parallel lines on a circuit board.
- Ribbon cables.
- Strip connectors on a circuit board.
  o PCI, AGP, PCI Express, SCSI, etc.
- Sets of wires.

1.7 Layered view of a computer system.

PIPELINING

Pipelining is used in modern computers to achieve high performance. We begin by explaining the basics of pipelining and how it can lead to improved performance. Then we examine machine instruction features that facilitate pipelined execution, and we show that the choice of instructions and instruction sequencing can have a significant effect on performance. Pipelined organization requires sophisticated compilation techniques, and optimizing compilers have been developed for this purpose. Among other things, such compilers rearrange the sequence of operations to maximize the benefits of pipelined execution.

BASIC CONCEPTS

The speed of execution of programs is influenced by many factors. One way to improve performance is to use faster circuit technology to build the processor and the main memory. Another possibility is to arrange the hardware so that more than one operation can be performed at the same time. In this way, the number of operations performed per second is increased even though the elapsed time needed to perform any one operation is not changed.

We have encountered concurrent activities several times before. In multiprogramming, I/O transfers and computational activities can proceed simultaneously; DMA devices make this possible because they can perform I/O transfers independently once these transfers are initiated by the processor.

Pipelining is a particularly effective way of organizing concurrent activity in a computer system. The basic idea is very simple. It is frequently encountered in manufacturing plants, where pipelining is commonly known as an assembly-line operation. Readers are undoubtedly familiar with the assembly line used in car manufacturing. The first station in an assembly line may prepare the chassis of a car, the next station adds the body, the next one installs the engine, and so on. While one group of workers is installing the engine on one car, another group is fitting a car body on the chassis of another car, and yet another group is preparing a new chassis for a third car. It may take days to complete work on a given car, but it is possible to have a new car rolling off the end of the assembly line every few minutes.

Consider how the idea of pipelining can be used in a computer. The processor executes a program by fetching and executing instructions, one after the other. Let Fi and Ei refer to the fetch and execute steps for instruction Ii. Execution of a program consists of a sequence of fetch and execute steps, as shown in Figure 1.10a.
Now consider a computer that has two separate hardware units, one for fetching instructions and another for executing them, as shown in Figure 1.10b. The instruction fetched by the fetch unit is deposited in an intermediate storage buffer, B1. This buffer is needed to enable the execution unit to execute the instruction while the fetch unit is fetching the next instruction. The results of execution are deposited in the destination location specified by the instruction. For the purposes of this discussion, we assume that both the source and the destination of the data operated on by the instructions are inside the block labeled Execution unit.

The computer is controlled by a clock whose period is such that the fetch and execute steps of any instruction can each be completed in one clock cycle. Operation of the computer proceeds as in Figure 1.10c. In the first clock cycle, the fetch unit fetches an instruction I1 (step F1) and stores it in buffer B1 at the end of the clock cycle. In the second clock cycle, the instruction fetch unit proceeds with the fetch operation for instruction I2 (step F2). Meanwhile, the execution unit performs the operation specified by instruction I1, which is available to it in buffer B1 (step E1). By the end of the second clock cycle, the execution of instruction I1 is completed and instruction I2 is available. Instruction I2 is stored in B1, replacing I1, which is no longer needed. Step E2 is performed by the execution unit during the third clock cycle, while instruction I3 is being fetched by the fetch unit. In this manner, both the fetch and execute units are kept busy all the time. If the pattern in Figure 1.10c can be sustained for a long time, the completion rate of instruction execution will be twice that achievable by the sequential operation depicted in Figure 1.10a.

In summary, the fetch and execute units in Figure 1.10b constitute a two-stage pipeline in which each stage performs one step in processing an instruction. An inter-stage storage buffer, B1, is needed to hold the information being passed from one stage to the next. New information is loaded into this buffer at the end of each clock cycle.
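The two-stage overlap described above can be sketched as a small timing model. This is only an illustration under the text's assumption that every fetch and every execute step takes exactly one clock cycle; the function names are hypothetical.

```python
def sequential_cycles(n):
    """Cycles for n instructions when each is fetched and then executed in turn."""
    return 2 * n


def two_stage_cycles(n):
    """Cycles for n instructions when fetch of I(k+1) overlaps execute of I(k)."""
    return n + 1 if n > 0 else 0


def two_stage_schedule(n):
    """For each clock cycle, return (instruction fetched, instruction executed)."""
    schedule = []
    for cycle in range(1, n + 2):
        fetch = cycle if cycle <= n else None        # F1 in cycle 1, F2 in cycle 2, ...
        execute = cycle - 1 if cycle >= 2 else None  # E1 in cycle 2, E2 in cycle 3, ...
        schedule.append((fetch, execute))
    return schedule


for cycle, (f, e) in enumerate(two_stage_schedule(3), start=1):
    print(cycle, "fetch:", f, "execute:", e)
```

For a long instruction stream the ratio sequential_cycles(n) / two_stage_cycles(n) approaches 2, which is the doubling of the completion rate stated above.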

Fig. 1.10 Basic idea of instruction pipelining

The processing of an instruction need not be divided into only two steps. For example, a pipelined processor may process each instruction in four steps, as follows:

F Fetch: read the instruction from the memory.
D Decode: decode the instruction and fetch the source operand(s).
E Execute: perform the operation specified by the instruction.
W Write: store the result in the destination location.

Fig 1.11 A 4-stage pipeline.

The sequence of events for this case is shown in Figure 1.11a. Four instructions are in progress at any given time. This means that four distinct hardware units are needed, as shown in Figure 1.11b. These units must be capable of performing their tasks simultaneously and without interfering with one another. Information is passed from one unit to the next through a storage buffer. As an instruction progresses through the pipeline, all the information needed by the stages downstream must be passed along. For example, during clock cycle 4, the information in the buffers is as follows:

- Buffer B1 holds instruction I3, which was fetched in cycle 3 and is being decoded by the instruction-decoding unit.
- Buffer B2 holds both the source operands for instruction I2 and the specification of the operation to be performed. This is the information produced by the decoding hardware in cycle 3. The buffer also holds the information needed for the write step of instruction I2 (step W2). Even though it is not needed by stage E, this information must be passed on to stage W in the following clock cycle to enable that stage to perform the required Write operation.
- Buffer B3 holds the results produced by the execution unit and the destination information for instruction I1.

PIPELINE PERFORMANCE

The pipelined processor in Figure 1.11 completes the processing of one instruction in each clock cycle, which means that the rate of instruction processing is four times that of sequential operation. The potential increase in performance resulting from pipelining is proportional to the number of pipeline stages. However, this increase would be achieved only if pipelined operation as depicted in Figure 1.11a could be sustained without interruption throughout program execution. Unfortunately, this is not the case.

For a variety of reasons, one of the pipeline stages may not be able to complete its processing task for a given instruction in the time allotted. For example, stage E in the four-stage pipeline of Figure 1.11b is responsible for arithmetic and logic operations, and one clock cycle is assigned for this task. Although this may be sufficient for most operations, some operations, such as divide, may require more time to complete. Figure 1.12 shows an example in which the operation specified in instruction I2 requires three cycles to complete, from cycle 4 through cycle 6. Thus, in cycles 5 and 6, the Write stage must be told to do nothing, because it has no data to work with. Meanwhile, the information in buffer B2 must remain intact until the Execute stage has completed its operation. This means that stage 2 and, in turn, stage 1 are blocked from accepting new instructions because the information in B1 cannot be overwritten. Thus, steps D4 and F5 must be postponed as shown. Pipelined operation in Figure 1.12 is said to have been stalled for two clock cycles. Normal pipelined operation resumes in cycle 7.

Any condition that causes the pipeline to stall is called a hazard. We have just seen an example of a data hazard. A data hazard is any condition in which either the source or the destination operands of an instruction are not available at the time expected in the pipeline. As a result some operation has to be delayed, and the pipeline stalls.

Fig. 1.12 Effect of an execution operation taking more than one clock cycle.
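The stall in this example can be reproduced with a small scheduling sketch. It is a simplified model, not a description of any real processor: each stage holds one instruction, an instruction moves on only when the next stage has been vacated, and only the Execute stage may take more than one cycle. The names STAGES and pipeline_schedule are hypothetical.

```python
STAGES = ["F", "D", "E", "W"]


def pipeline_schedule(execute_cycles):
    """execute_cycles[i] is the number of cycles instruction i spends in stage E.

    Returns (enter, done): enter[i][s] and done[i][s] are the first and last
    clock cycles that instruction i occupies stage s (cycles start at 1).
    """
    last = len(STAGES) - 1
    enter, done = [], []
    for i, e_cycles in enumerate(execute_cycles):
        enter_i, done_i = [], []
        for s, stage in enumerate(STAGES):
            # Earliest start: right after this instruction finishes its previous stage.
            ready = done_i[s - 1] + 1 if s > 0 else 1
            if i > 0:
                # Stage s is free once the previous instruction has moved on.
                free = enter[i - 1][s + 1] if s < last else done[i - 1][s] + 1
                ready = max(ready, free)
            enter_i.append(ready)
            done_i.append(ready + (e_cycles if stage == "E" else 1) - 1)
        enter.append(enter_i)
        done.append(done_i)
    return enter, done


# Five instructions; I2 needs three cycles in the Execute stage.
enter, done = pipeline_schedule([1, 3, 1, 1, 1])
for i, row in enumerate(enter, start=1):
    print(f"I{i}:", dict(zip(STAGES, row)))
```

With these inputs D4 and F5 both slip from cycle 5 to cycle 7, and the last instruction completes in cycle 10 instead of cycle 8, a stall of two cycles.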

CPU-ARITHMETIC

Binary number system

- The number system followed by computers.
- The base is two, and any number is represented as an array of 1s and 0s representing coefficients of powers of two.
- Used in computer systems because of the ease of representing 1 and 0 as two levels of voltage/power: high and low.
- To represent the decimal system, 10 levels of voltage would be required, with correspondingly complex hardware.

Binary arithmetic
- Addition
- Subtraction
- Multiplication
- Division

Binary addition
Four basic rules for elementary addition:
0 + 0 = 0; 0 + 1 = 1; 1 + 0 = 1; 1 + 1 = 10
Carry-overs are performed in the same manner as in decimal addition.
11001 + 1001 = ?
How to add multiple (more than two) binary numbers?

Binary subtraction
Four rules for elementary subtraction:
0 - 0 = 0; 1 - 0 = 1; 1 - 1 = 0; 0 - 1 = 1, but with a borrow of 1 from the next column of the minuend
1101 - 1100 = 0001
1100 - 1001 = ?
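The column-by-column rules for addition and subtraction can be sketched directly on bit strings. This is a minimal illustration; the function names are hypothetical, and subtract_binary assumes the minuend is not smaller than the subtrahend.

```python
def add_binary(a: str, b: str) -> str:
    """Add two unsigned binary strings column by column, propagating carries."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry, digits = 0, []
    for x, y in zip(reversed(a), reversed(b)):
        total = int(x) + int(y) + carry
        digits.append(str(total % 2))   # sum bit for this column
        carry = total // 2              # carry into the next column
    if carry:
        digits.append("1")
    return "".join(reversed(digits))


def subtract_binary(a: str, b: str) -> str:
    """Subtract b from a (assumes a >= b), borrowing from the next column."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    borrow, digits = 0, []
    for x, y in zip(reversed(a), reversed(b)):
        diff = int(x) - int(y) - borrow
        borrow = 1 if diff < 0 else 0
        digits.append(str(diff + 2 * borrow))
    return "".join(reversed(digits))


print(add_binary("11001", "1001"))
print(subtract_binary("1100", "1001"))
```

More than two numbers can be added by applying add_binary repeatedly to a running total.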

Signed binary numbers

- As in decimal, we need to represent negative numbers in the binary system too.
- The decimal system uses a - sign, but computers understand only 1s and 0s.
- A solution is to add a digit (sign bit) to represent the sign; this approach is called Signed Magnitude Representation.
- 0 marks positive and 1 marks negative.
- Problem: the number of bits in the number has to be specified to avoid misinterpretation.

Complements
- With signed magnitude, representing a number is simple but arithmetic is complex.
- It is nice to have a representation in which addition is simple and the other operations can be done using addition.
- Multiplication is repeated addition; division is repeated subtraction.
- Using complements we can perform subtraction using addition.

1's complement
- Formed by complementing (changing 1 to 0 and 0 to 1) each bit in the number.
- The most significant bit tells us the sign of the number.
- Example: 9 => 01001 (base 2), -9 => 10110 (base 2)

Subtraction using 1's complement
To subtract, add the 1's complement of the number. If there is an overflow bit, add the overflow bit to the remaining part.

0111 - 0101 => 0111 + 1010

  0111
+ 1010
------
1 0001

  0001
+    1
------
  0010

2's complement
Add 1 to the 1's complement form.
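The 1's complement subtraction rule, including the step of adding the overflow bit back in, can be sketched as follows. The function names are hypothetical, and both operands are assumed to be bit strings of the same length.

```python
def ones_complement(bits: str) -> str:
    """Flip every bit: 1 becomes 0 and 0 becomes 1."""
    return "".join("1" if b == "0" else "0" for b in bits)


def subtract_ones_complement(a: str, b: str) -> str:
    """Compute a - b by adding the 1's complement of b to a."""
    n = len(a)
    total = int(a, 2) + int(ones_complement(b), 2)
    if total >> n:                                # an overflow (carry-out) bit appeared
        total = (total & ((1 << n) - 1)) + 1      # add it to the remaining part
    return format(total, f"0{n}b")


print(subtract_ones_complement("0111", "0101"))   # 7 - 5
```

When no overflow bit appears, the result is negative and is left in 1's complement form; for example 0101 - 0111 gives 1101, the 1's complement of 0010.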

Addition
1. Represent both operands in signed-2's complement format. (If operand X >= 0, keep its original binary form. If operand X < 0, take the 2's complement of |X|: 2^n - |X|.)
2. Add the operands and discard the carry-out of the sign bit (MSB), if any.
3. The result is automatically in signed-2's complement form.

Example (n = 6 bits):
 6 = 000110    -6 = 111010
 9 = 001001    -9 = 110111

ADDITION/SUBTRACTION

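Adding these operands in the four sign combinations gives:

   6 + 9:      000110 + 001001 =   001111, which is +15.

 (-6) + 9:     111010 + 001001 = 1 000011. The 1 in the 7th bit is automatically dropped, leaving 000011 = +3.

   6 + (-9):   000110 + 110111 =   111101. The MSB of the result is 1, indicating it is a negative result represented in signed 2's complement form. Its value can be found by 2's complement to be -(000011) = -3.

 (-6) + (-9):  111010 + 110111 = 1 110001. The 1 in the 7th bit is automatically dropped. The MSB of the result is 1, indicating it is a negative result represented in signed 2's complement form. Its value can be found by 2's complement to be -(001111) = -15.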
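The three addition steps can be sketched with integer bit masks, using the 6-bit operands 6, -6, 9 and -9 from the example. The helper names to_twos, from_twos and add_twos are hypothetical; masking with 2^n - 1 plays the role of dropping the bit beyond the MSB.

```python
def to_twos(value: int, n: int) -> int:
    """n-bit signed 2's complement pattern of value (2^n - |value| when negative)."""
    return value & ((1 << n) - 1)


def from_twos(pattern: int, n: int) -> int:
    """Interpret an n-bit pattern as a signed 2's complement number."""
    return pattern - (1 << n) if pattern >> (n - 1) else pattern


def add_twos(x: int, y: int, n: int = 6) -> int:
    """Add two signed values in n bits, discarding the carry-out of the MSB."""
    return (to_twos(x, n) + to_twos(y, n)) & ((1 << n) - 1)


for x, y in [(6, 9), (-6, 9), (6, -9), (-6, -9)]:
    pattern = add_twos(x, y)
    print(x, "+", y, "->", format(pattern, "06b"), "=", from_twos(pattern, 6))
```

The sketch does not check for overflow, so it is only meaningful when the true sum fits in n bits.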

Subtraction

1. Represent both operands in signed-2's complement format.
2. Take the 2's complement of the subtrahend B (which may be in complement form already if it is negative).
3. Add it to the minuend A.
4. The result is automatically in signed-2's complement form.

Example: represent both operands in signed 2's complement, complement the subtrahend, and add it to the minuend; the sum is the difference, in signed 2's complement form.

Why does it work? Consider the following three cases (where A > 0, B > 0):

- A + B: this is a normal binary addition with a positive sum.
- A - B (or A + (-B)): the negative value -B is represented by its 2's complement 2^n - B, so the adder forms A + (2^n - B) = 2^n + (A - B). If A - B >= 0, the result is A - B in binary form with 2^n automatically dropped. If A - B < 0, the result 2^n - (B - A) is the 2's complement representation of the negative value A - B.
- (-A) + (-B): both negative values -A and -B are represented in 2's complement form as 2^n - A and 2^n - B, and their sum is (2^n - A) + (2^n - B) = 2^n + (2^n - (A + B)). The first 2^n is automatically dropped and the second term is the 2's complement representation of the negative value -(A + B).

We see that signed 2's complement representation can properly deal with both addition and subtraction with negative operands as well as positive ones.

Example: (n=4 bits)

- Two positive operands whose sum is too large: the MSB of the result is 1, indicating a negative result represented in signed 2's complement form, a wrong result!
- A positive and a negative operand with a negative sum: the MSB of the result is 1, indicating it is a negative result represented in signed 2's complement form. Its value can be found by taking the 2's complement.
- A positive and a negative operand with a positive sum: the 1 in the 5th bit is dropped.
- Two negative operands whose sum is too small: the 1 in the 5th bit is dropped. The result is 5, another wrong result!

The wrong results are caused by the overflow problem. Given n = 4 bits, the range of valid values representable is -8 to +7, that is, -2^(n-1) to 2^(n-1) - 1.

The overflow problem can be detected by checking whether the carry-in Cin to the MSB and the carry-out Cout from the MSB are the same. Consider the sign bit in the following six cases of addition:

Sign of A   Sign of B   Cin   Sign of sum   Cout   Result
    0           0        0         0         0     correct
    0           0        1         1         0     overflow
    0           1        0         1         0     correct
    0           1        1         0         1     correct
    1           1        0         0         1     overflow
    1           1        1         1         1     correct

It is obvious that when Cin ≠ Cout, the result is incorrect due to overflow.
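The carry comparison can be sketched in a few lines. The operand values 5 and 6 below are illustrative choices for a 4-bit word, not values taken from the notes, and the function name is hypothetical.

```python
def add_with_overflow(x: int, y: int, n: int = 4):
    """Add two signed values in n bits.

    Returns (result pattern, Cin to the MSB, Cout from the MSB, overflow flag).
    """
    mask = (1 << n) - 1
    a, b = x & mask, y & mask
    low_mask = (1 << (n - 1)) - 1
    c_in = ((a & low_mask) + (b & low_mask)) >> (n - 1)   # carry into the sign bit
    total = a + b
    c_out = total >> n                                    # carry out of the sign bit
    return total & mask, c_in, c_out, c_in != c_out


for x, y in [(5, 6), (5, -6), (-5, 6), (-5, -6)]:
    result, c_in, c_out, overflow = add_with_overflow(x, y)
    print(x, "+", y, "->", format(result, "04b"), "Cin =", c_in, "Cout =", c_out,
          "overflow" if overflow else "ok")
```

Only the two same-sign additions overflow, since 11 and -11 lie outside the 4-bit range of -8 to +7.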

Hardware Implementation: An n-bit adder can be built by concatenating n full adders.

This n-bit adder can also carry out subtraction A - B as well as addition A + B. A control signal is used to control a 2x1 MUX that selects either B, for addition, or the bit-wise complement of B, for subtraction. The subtraction is carried out by adding the 2's complement of operand B to A. (Recall that the 2's complement can be obtained by bit-wise complement and adding 1 to the LSB.)

ADDITION AND SUBTRACTION OF SIGNED NUMBERS

The figure shows the logic truth table for the sum and carry-out functions for adding equally weighted bits xi and yi in two numbers X and Y. The figure also shows logic expressions for these functions, along with an example of addition of the 4-bit unsigned numbers 7 and 6. Note that each stage of the addition process must accommodate a carry-in bit.

Logic specification for a stage of binary addition

We use ci to represent the carry-in to the ith stage, which is the same as the carry-out from the (i - 1)st stage. The logic expression for si in the above figure can be implemented with a 3-input XOR gate, used in part (a) of the following figure as part of the logic required for a single stage of binary addition.

Logic for addition of binary vectors

The carry-out function, ci+1, is implemented with a two-level AND-OR logic circuit. A convenient symbol for the complete circuit for a single stage of addition, called a full adder (FA), is also shown in the figure. A cascaded connection of n full adder blocks, as shown in Figure b, can be used to add two n-bit numbers. Since the carries must propagate, or ripple, through this cascade, the configuration is called an n-bit ripple-carry adder.

The carry-in, c0, into the least-significant-bit (LSB) position provides a convenient means of adding 1 to a number. For instance, forming the 2's-complement of a number involves adding 1 to the 1's-complement of the number. The carry signals are also useful for interconnecting k adders to form an adder capable of handling input numbers that are kn bits long, as shown in Figure c.

The n-bit adder in Figure b can be used to add 2's-complement numbers X and Y, where the xn-1 and yn-1 bits are the sign bits. In this case, the carry-out bit, cn, is not part of the answer. Overflow can only occur when the signs of the two operands are the same. In this case, overflow obviously occurs if the sign of the result is different. Therefore, a circuit to detect overflow can be added to the n-bit adder by implementing the logic expression

Overflow = xn-1 yn-1 sn-1' + xn-1' yn-1' sn-1

where ' denotes the complement of a bit.

It can also be shown that overflow occurs when the carry bits cn and cn-1 are different. Therefore, a simpler alternative circuit for detecting overflow can be obtained by implementing the expression cn XOR cn-1 with an XOR gate.

In order to perform the subtraction operation X - Y on 2's-complement numbers X and Y, we form the 2's-complement of Y and add it to X. The logic circuit network shown in the following figure can be used to perform either addition or subtraction based on the value applied to the Add/Sub input control line. This line is set to 0 for addition, applying the Y vector unchanged to one of the adder inputs along with a carry-in signal, c0, of 0. When the Add/Sub control line is set to 1, the Y vector is 1's-complemented (that is, bit complemented) by the XOR gates and c0 is set to 1 to complete the 2's-complementation of Y. Remember that 2's-complementing a negative number is done in exactly the same manner as for a positive number. An XOR gate can be added to the following figure to detect the overflow condition cn XOR cn-1.
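The addition/subtraction network can be sketched at the bit level. This is a software model of the logic described above, not a hardware description: bits are Python integers, vectors are lists with the LSB first, and the helper names are hypothetical.

```python
def full_adder(x, y, c):
    """One stage: s = x XOR y XOR c, carry-out from two-level AND-OR logic."""
    s = x ^ y ^ c
    c_out = (x & y) | (x & c) | (y & c)
    return s, c_out


def add_sub(x_bits, y_bits, sub):
    """Ripple-carry adder/subtractor. sub is the Add/Sub control line (0 or 1)."""
    carry = sub                        # c0 = 1 completes the 2's complement of Y
    carries = [carry]
    s_bits = []
    for x, y in zip(x_bits, y_bits):
        s, carry = full_adder(x, y ^ sub, carry)   # XOR gates complement Y when sub = 1
        s_bits.append(s)
        carries.append(carry)
    overflow = carries[-1] ^ carries[-2]           # cn XOR cn-1
    return s_bits, carries[-1], overflow


def to_bits(value, n):
    """n-bit 2's complement bit list of value, LSB first."""
    return [(value >> i) & 1 for i in range(n)]


def from_bits_signed(bits):
    """Signed value of a 2's complement bit list given LSB first."""
    raw = sum(b << i for i, b in enumerate(bits))
    return raw - (1 << len(bits)) if bits[-1] else raw


s, c_out, overflow = add_sub(to_bits(5, 4), to_bits(7, 4), 1)
print(from_bits_signed(s), c_out, overflow)        # 5 - 7 in 4 bits
```

Adding the 4-bit patterns for 7 and 6 with this model gives 1101, which is 13 as an unsigned sum but an overflow if the operands are read as signed numbers.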

Binary addition-subtraction logic network

Binary Coded Decimal (BCD)

Introduction: Binary data is the most efficient storage scheme, since every bit pattern represents a unique, valid value. However, for some applications it may not be desirable to work with binary data. For instance, if the internal components of a digital clock keep track of the time in binary, the binary value must be converted to decimal before it can be displayed. For a digital clock it is therefore preferable to store the value as a series of decimal digits, where each digit is separately represented as its binary equivalent. The most common format used to represent decimal data is called binary coded decimal, or BCD.

BCD Numeric Format
- Every four bits represent one decimal digit.
- Decimal values from 0 to 9 are used.
- 4-bit values above 9 are not used in BCD.

The unused 4-bit values are:

BCD    Decimal
1010   10
1011   11
1100   12
1101   13
1110   14
1111   15

Multi-digit decimal numbers are stored as multiple groups of 4 bits per digit.

BCD is a signed notation: a number can be positive or negative. For example, +27 is stored as 0(sign) 0010 0111 and -27 as 1(sign) 0010 0111. BCD does not store negative numbers in two's complement.

Values represented

b3b2b1b0   Sign and magnitude   1's complement   2's complement
0111       +7                   +7               +7
0110       +6                   +6               +6
0101       +5                   +5               +5
0100       +4                   +4               +4
0011       +3                   +3               +3
0010       +2                   +2               +2
0001       +1                   +1               +1
0000       +0                   +0               +0
1000       -0                   -7               -8
1001       -1                   -6               -7
1010       -2                   -5               -6
1011       -3                   -4               -5
1100       -4                   -3               -4
1101       -5                   -2               -3
1110       -6                   -1               -2
1111       -7                   -0               -1

Algorithms for Addition

  0101        5
+ 1001      + 9
------
  1110      Incorrect BCD digit
+ 0110      Add 6
------
1 0100      Correct answer

BCD adder

-If the result, S3 S2 S1 S0, is not a valid BCD digit, the multiplexer causes 6 to be added to the result.
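The add-6 correction can be sketched for one digit and then chained for multi-digit numbers. This is a behavioural sketch, not a gate-level model of the multiplexer circuit; the function names are hypothetical and digit lists are given least significant digit first.

```python
def bcd_digit_add(a: int, b: int, carry_in: int = 0):
    """Add two BCD digits (0..9). Returns (BCD digit, carry to the next digit)."""
    s = a + b + carry_in          # plain binary sum of the two 4-bit digits
    if s > 9:                     # not a valid BCD digit (or a carry was produced)
        s += 6                    # correction: add 0110
        return s & 0xF, 1
    return s, 0


def bcd_add(x_digits, y_digits):
    """Add two equal-length BCD numbers given as digit lists, LSD first."""
    carry, out = 0, []
    for a, b in zip(x_digits, y_digits):
        digit, carry = bcd_digit_add(a, b, carry)
        out.append(digit)
    if carry:
        out.append(1)
    return out


print(bcd_digit_add(5, 9))        # the 0101 + 1001 example
print(bcd_add([7, 2], [5, 4]))    # 27 + 45, digits listed LSD first
```

Adding 6 skips over the six unused 4-bit codes, so the low four bits become the correct decimal digit and the fifth bit becomes the decimal carry.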

Carry-Look-ahead Adder
There are several factors that contribute to the delay in digital adders. One is the propagation delay due to the internal structure of the gates, another is the loading of the output buffers (due to fanout and net delays), and a third is the logic circuit itself. The propagation delay (or gate delay) of a gate is the time difference between the change of the input and output signals.

Ripple-carry vs. Carry-look-ahead Adders

One type of circuit where the effect of gate delays is particularly clear is an adder. In a 4-bit ripple-carry adder, the result of an addition of two bits depends on the carry generated by the addition of the previous two bits. Thus, the Sum of the most significant bit is only available after the carry signal has rippled through the adder from the least significant stage to the most significant stage. This can be easily understood if one considers the addition of the two 4-bit words 1111 (base 2) + 0001 (base 2), as shown in Figure 3.

Figure 3: Addition of two 4-bit numbers illustrating the generation of the carry-out bit

In this case, the addition of 1 + 1 = 10 (base 2) in the least significant stage causes a carry bit to be generated. This carry bit will consequently generate another carry bit in the next stage, and so on, until the final carry-out bit appears at the output. This requires the signal to travel (ripple) through all the stages of the adder as illustrated in Figure 4 below. As a result, the final Sum and Carry bits will be valid after a considerable delay. The carry-out bit of the first stage will be valid after 4 gate delays (2 associated with the XOR gate and 1 each associated with the AND and OR gates). From the schematic of Figure 4, one finds that the next carry-out (C2) will be valid after an additional 2 gate delays (associated with the AND and OR gates) for a total of 6 gate delays. In general the carry-out of an N-bit adder will be valid after 2N + 2 gate delays. The Sum bit will be valid an additional 2 gate delays after the carry-in signal. Thus the sum of the most significant bit SN-1 will be valid after 2(N-1) + 2 + 2 = 2N + 2 gate delays. This delay may be in addition to any delays associated with interconnections. It should be mentioned that if one implements the circuit in an FPGA, the delays may differ from the above expression depending on how the logic has been placed in the look-up tables and how it has been divided among different CLBs.

Figure 4: Ripple-carry adder, illustrating the delay of the carry bit.

Features of the ripple-carry adder:
- Multiple full adders with carry-ins and carry-outs chained together
- Small layout area
- Large delay time

The disadvantage of the ripple-carry adder is that it can get very slow when one needs to add many bits. For instance, for a 32-bit adder, the delay would be about 66 ns if one assumes a gate delay of 1 ns. That would imply that the maximum frequency at which one can operate this adder would be only 15 MHz! For fast applications, a better design is required. The carry-look-ahead adder solves this problem by calculating the carry signals in advance, based on the input signals. It is based on the fact that a carry signal will be generated in two cases: (1) when both bits Ai and Bi are 1, or (2) when one of the two bits is 1 and the carry-in (carry of the previous stage) is 1. Thus, one can write,

COUT = Ci+1 = Ai.Bi + (Ai ⊕ Bi).Ci     (1)

The "⊕" stands for exclusive OR, or XOR. One can also write this expression as

Ci+1 = Gi + Pi.Ci     (2)

in which

Gi = Ai.Bi            (3)
Pi = (Ai ⊕ Bi)        (4)

are called the Generate and Propagate terms, respectively.

Let's assume that the delay through an AND gate is one gate delay and through an XOR gate is two gate delays. Notice that the Propagate and Generate terms only depend on the input bits and thus will be valid after two and one gate delay, respectively. If one uses the above expression to calculate the carry signals, one does not need to wait for the carry to ripple through all the previous stages to find its proper value. Let's apply this to a 4-bit adder to make it clear.

C1 = G0 + P0.C0     (5)
C2 = G1 + P1.C1 = G1 + P1.G0 + P1.P0.C0     (6)
C3 = G2 + P2.G1 + P2.P1.G0 + P2.P1.P0.C0     (7)
C4 = G3 + P3.G2 + P3.P2.G1 + P3.P2.P1.G0 + P3.P2.P1.P0.C0     (8)

Notice that the carry-out bit, Ci+1, of the last stage will be available after four delays (two gate delays to calculate the Propagate signal and two delays as a result of the AND and OR gates). The Sum signal can be calculated as follows,

Si = Ai ⊕ Bi ⊕ Ci = Pi ⊕ Ci     (9)

The Sum bit will thus be available after two additional gate delays (due to the XOR gate) or a total of six gate delays after the input signals Ai and Bi have been applied. The advantage is that these delays will be the same independent of the number of bits one needs to add, in contrast to the ripple-carry adder. The carry-lookahead adder can be broken up into two modules: (1) the Partial Full Adder, PFA, which generates Si, Pi and Gi as defined by equations 3, 4 and 9 above; and (2) the Carry Look-ahead Logic, which generates the carry-out bits according to equations 5 to 8. The 4-bit adder can then be built by using 4 PFAs and the Carry Look-ahead logic block as shown in Figure 5.
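Equations 3 to 9 can be checked with a bit-level sketch of the 4-bit carry-look-ahead adder. It models only the logic, not the gate delays, and the function name cla_4bit is hypothetical.

```python
def cla_4bit(a: int, b: int, c0: int = 0):
    """4-bit carry-look-ahead addition. Returns (4-bit sum, carry-out C4)."""
    A = [(a >> i) & 1 for i in range(4)]
    B = [(b >> i) & 1 for i in range(4)]
    G = [A[i] & B[i] for i in range(4)]      # Generate terms, equation (3)
    P = [A[i] ^ B[i] for i in range(4)]      # Propagate terms, equation (4)

    # Carry look-ahead logic, equations (5) to (8): every carry is a
    # two-level function of G, P and C0 only.
    c1 = G[0] | (P[0] & c0)
    c2 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & c0)
    c3 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]) | (P[2] & P[1] & P[0] & c0)
    c4 = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
          | (P[3] & P[2] & P[1] & G[0]) | (P[3] & P[2] & P[1] & P[0] & c0))

    C = [c0, c1, c2, c3]
    S = [P[i] ^ C[i] for i in range(4)]      # Sum bits, equation (9)
    return sum(bit << i for i, bit in enumerate(S)), c4


print(cla_4bit(0b1111, 0b0001))              # the 1111 + 0001 example
```

Because no carry depends on another carry, all four can be evaluated in parallel, which is where the constant delay comes from.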

Figure 5: Block diagram of a 4-bit CLA.

The disadvantage of the carry-lookahead adder is that the carry logic gets quite complicated for more than 4 bits. For that reason, carry-look-ahead adders are usually implemented as 4-bit modules and are used in a hierarchical structure to realize adders that have multiples of 4 bits. Figure 6 shows the block diagram for a 16-bit CLA adder. The circuit makes use of the same CLA Logic block as the one used in the 4-bit adder. Notice that each 4-bit adder provides a group Propagate and Generate signal, which is used by the CLA Logic block. The group Propagate PG and group Generate GG of a 4-bit adder will have the following expressions,

PG = P3.P2.P1.P0     (10)
GG = G3 + P3.G2 + P3.P2.G1 + P3.P2.P1.G0     (11)

The group Propagate PG and Generate GG will be available after 3 and 4 gate delays, respectively (one and two gate delays after the Pi signals are valid).

Figure 6: Block diagram of a 16-bit CLA Adder

MULTIPLICATION
Algorithms for Multiplication

        1101    Multiplicand M
      x 1011    Multiplier Q
      ------
        1101
       1101
      0000
     1101
    --------
    10001111    Product P

Array multiplier
- Combinational circuit
- Product generated in one micro-operation
- Requires a large number of gates
- Became feasible after integrated circuits were developed
- Needed for j multiplier bits and k multiplicand bits:
  o j x k AND gates
  o (j - 1) k-bit adders to produce a product of j + k bits
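The structure of the array multiplier can be sketched by forming the AND-gate array of partial products and then summing the rows. This models the arithmetic and the component counts only; the function name is hypothetical and the row additions use ordinary integer addition rather than explicit k-bit adder blocks.

```python
def array_multiply(multiplicand: int, multiplier: int, k: int, j: int):
    """Unsigned multiply of a k-bit multiplicand by a j-bit multiplier.

    Returns (product, number of AND gates, number of adder rows).
    """
    # AND-gate array: entry (r, c) = multiplier bit r AND multiplicand bit c.
    rows = [[((multiplier >> r) & 1) & ((multiplicand >> c) & 1) for c in range(k)]
            for r in range(j)]
    and_gates = j * k

    product = sum(bit << c for c, bit in enumerate(rows[0]))
    adders = 0
    for r in range(1, j):
        row_value = sum(bit << c for c, bit in enumerate(rows[r]))
        product += row_value << r       # each row is shifted one place further left
        adders += 1                     # one adder per additional row
    return product, and_gates, adders


print(array_multiply(0b1101, 0b1011, k=4, j=4))
```

For the 1101 x 1011 example this uses 16 AND gates and 3 adder rows and gives 10001111.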

Multiply Signed-2's Complement

Booth algorithm: This algorithm serves two purposes:
- Fast multiplication when there are consecutive 0s or 1s in the multiplier.
- Can be used for signed multiplication.

Registers:
QR     multiplier
Qn     least significant bit of QR
Qn+1   previous least significant bit of QR
BR     multiplicand
AC     initially 0
SC     number of bits in the multiplier

Algorithm:
1. Repeat steps 2 to 5 until SC = 0.
2. If QnQn+1 = 10, AC <- AC + BR' + 1 (subtract BR by adding its 2's complement).
3. If QnQn+1 = 01, AC <- AC + BR.
4. Arithmetic shift right AC & QR.
5. SC <- SC - 1.

Explanation:
1. Depending on the current and previous bits, do one of the following:
   00: Middle of a string of 0s, so no arithmetic operation.
   01: End of a string of 1s, so add the multiplicand to the left half of the product.
   10: Beginning of a string of 1s, so subtract the multiplicand from the left half of the product.
   11: Middle of a string of 1s, so no arithmetic operation.
2. As in the previous algorithm, shift the Product register right (arithmetic) 1 bit.
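The register-level steps can be sketched as follows, with AC, QR and BR modelled as n-bit integers. It is a sketch under one stated assumption: the multiplicand must not be the most negative n-bit value, because its negation does not fit in an n-bit AC. The function name and the variable q_extra (standing for Qn+1) are hypothetical.

```python
def booth_multiply(multiplicand: int, multiplier: int, n: int) -> int:
    """Booth multiplication of two n-bit signed 2's complement numbers."""
    mask = (1 << n) - 1
    br = multiplicand & mask
    ac, qr, q_extra = 0, multiplier & mask, 0
    for _ in range(n):                                # SC counts down from n to 0
        pair = (qr & 1, q_extra)                      # (Qn, Qn+1)
        if pair == (1, 0):
            ac = (ac + ((~br) & mask) + 1) & mask     # AC <- AC + BR' + 1
        elif pair == (0, 1):
            ac = (ac + br) & mask                     # AC <- AC + BR
        # Arithmetic shift right of AC, QR and Qn+1 taken as one register.
        q_extra = qr & 1
        qr = (qr >> 1) | ((ac & 1) << (n - 1))
        ac = (ac >> 1) | (ac & (1 << (n - 1)))        # the sign bit is replicated
    product = (ac << n) | qr                          # 2n-bit result in AC:QR
    if product >> (2 * n - 1):
        product -= 1 << (2 * n)                       # interpret as a signed value
    return product


print(booth_multiply(-9, -13, 5))
```

With 5-bit registers BR = 10111 and QR = 10011, the sketch ends with AC:QR = 00011 10101, which is 117.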

Hardware

Example: -9 x -13 = 117

DIVISION
Algorithms for Division
Division can be implemented using either a restoring or a non-restoring algorithm. An inner loop to perform multiple subtractions must be incorporated into the algorithm.

       10
11 ) 1000
      11
     ----
       10

A logic circuit arrangement implements the restoring-division technique.

The restoring-division algorithm:
Do the following n times:
S1: Shift A and Q left one binary position.
S2: Subtract M from A, placing the answer back in A.
S3: If the sign of A is 1, set q0 to 0 and add M back to A (restore A); otherwise, set q0 to 1.
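The restoring-division steps can be sketched with A, Q and M held as Python integers. This is a minimal sketch for unsigned operands; a negative value of a stands in for "the sign of A is 1", and the function name is hypothetical.

```python
def restoring_divide(dividend: int, divisor: int, n: int):
    """Restoring division of an n-bit unsigned dividend. Returns (quotient, remainder)."""
    a, q, m = 0, dividend, divisor
    for _ in range(n):
        # Shift A and Q left one position: the MSB of Q moves into A.
        a = (a << 1) | ((q >> (n - 1)) & 1)
        q = (q << 1) & ((1 << n) - 1)
        a -= m                      # subtract M from A
        if a < 0:                   # sign of A is 1
            a += m                  # restore A; q0 stays 0
        else:
            q |= 1                  # set q0 to 1
    return q, a


print(restoring_divide(0b1000, 0b11, 4))   # 8 / 3
```

At the end Q holds the quotient and A holds the remainder.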

A restoring-division example (dividend 1000, divisor 11, M = 00011)

Cycle          Step        A        Q
               Initially   00000    1000
First cycle    Shift       00001    000_
               Subtract    11110
               Set q0               0000
               Restore     00001
Second cycle   Shift       00010    000_
               Subtract    11111
               Set q0               0000
               Restore     00010
Third cycle    Shift       00100    000_
               Subtract    00001
               Set q0               0001
Fourth cycle   Shift       00010    001_
               Subtract    11111
               Set q0               0010
               Restore     00010

Quotient: 0010 (in Q)    Remainder: 00010 (in A)

The non-restoring division algorithm:
S1: Do the following n times:
    1. If the sign of A is 0, shift A and Q left one binary position and subtract M from A; otherwise, shift A and Q left and add M to A.
    2. If the sign of A is now 0, set q0 to 1; otherwise, set q0 to 0.
S2: If the sign of A is 1, add M to A.
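The non-restoring steps can be sketched in the same style, again for unsigned operands with a negative Python integer standing in for "the sign of A is 1". The function name is hypothetical; 124 divided by 7 with n = 8 is used as a check.

```python
def nonrestoring_divide(dividend: int, divisor: int, n: int):
    """Non-restoring division of an n-bit unsigned dividend. Returns (quotient, remainder)."""
    a, q, m = 0, dividend, divisor
    for _ in range(n):
        msb = (q >> (n - 1)) & 1            # bit shifted from Q into A
        q = (q << 1) & ((1 << n) - 1)
        if a >= 0:
            a = 2 * a + msb - m             # shift left, then subtract M
        else:
            a = 2 * a + msb + m             # shift left, then add M
        if a >= 0:
            q |= 1                          # set q0 to 1; otherwise q0 stays 0
    if a < 0:
        a += m                              # final correction of the remainder
    return q, a


print(nonrestoring_divide(124, 7, 8))
```

Unlike the restoring method, A is never restored inside the loop; a single corrective addition at the end is enough.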

Assume the dividend and the divisor are 124 and 7, respectively. The non-restoring division scheme would proceed as follows (124 decimal = 01111100 binary). The M register contains the divisor 7 (M = 00000111).

A few comparisons

Restoring division
- Most efficient for floating-point division, and for integer division when the divisor is not small.
- Easy to implement.

Non-restoring division
- The main advantage is the compatibility with 2's complement notation for dividend and divisor.

Note that a single precision floating point number is normalized only if it can be expressed in binary in the form 1.M x 2^E, where M is the 23-bit mantissa and the exponent E is such that -126 <= E <= 127. A de-normalized number requires an exponent less than -126, in which case the number is represented using the special pattern 0 for the characteristic to denote an exponent of -126, and the significand is expressed as a pure fraction 0.M. Thus the value is 0.M x 2^-126.
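The normalized and de-normalized forms can be inspected with Python's standard struct module. This sketch assumes the usual IEEE 754 single precision layout (1 sign bit, 8-bit characteristic with a bias of 127, 23-bit mantissa); the bias is not stated in the passage above. The function name is hypothetical, and the all-ones characteristic (infinity and NaN) is not handled.

```python
import struct


def decode_single(x: float):
    """Split a value, stored as single precision, into (sign, exponent, significand, value)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]   # raw 32-bit pattern
    sign = bits >> 31
    characteristic = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if characteristic == 0:
        exponent = -126                                   # de-normalized: 0.M x 2^-126
        significand = mantissa / 2 ** 23
    else:
        exponent = characteristic - 127                   # normalized: 1.M x 2^E
        significand = 1 + mantissa / 2 ** 23
    value = (-1) ** sign * significand * 2.0 ** exponent
    return sign, exponent, significand, value


print(decode_single(-6.5))          # 6.5 = 1.625 x 2^2
print(decode_single(2.0 ** -130))   # too small to normalize
```

The second call shows a de-normalized number: the exponent is fixed at -126 and the significand drops below 1.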


The difference in exponents determines which of the significands is shifted.

The result is then normalized and the exponent is adjusted if necessary.

The rounding hardware then creates the final result.
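The three steps above (align, normalize, round) can be sketched for the simplest case of two positive operands. This is a toy model, not the hardware described in the notes: each operand is assumed to be a pair (significand, exponent) with a p-bit integer significand, and rounding is replaced by plain truncation to keep the example short. The name `fp_add` is hypothetical.

```python
def fp_add(a, b, p=24):
    """Add two positive values given as (significand, exponent) pairs.

    Each pair stands for significand * 2**exponent, where significand is a
    p-bit integer. Bits shifted out are dropped (truncation).
    """
    (sa, ea), (sb, eb) = a, b
    # The exponent difference decides which significand is shifted right.
    if ea < eb:
        sa, ea, sb, eb = sb, eb, sa, ea
    sb >>= ea - eb
    # Add the aligned significands.
    total, exponent = sa + sb, ea
    # Normalize: a carry out of the p-bit field needs one right shift.
    if total >> p:
        total >>= 1
        exponent += 1
    return total, exponent


print(fp_add((0b1000, 0), (0b1100, -1), p=4))   # 8 + 6
print(fp_add((0b1100, 0), (0b1100, 0), p=4))    # 12 + 12
```

A fuller model would keep guard bits during the alignment shift so that the final result can be rounded instead of truncated.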


ALU Design

A One-Bit ALU

This 1-bit ALU will perform AND, OR, and ADD.

A One-bit Full Adder


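This is also called a (3, 2) adder, because it has three inputs and two outputs. A half adder, by contrast, has no CarryIn.

Truth Table:

A  B  CarryIn | CarryOut  Sum
0  0     0    |    0       0
0  0     1    |    0       1
0  1     0    |    0       1
0  1     1    |    1       0
1  0     0    |    0       1
1  0     1    |    1       0
1  1     0    |    1       0
1  1     1    |    1       1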

Logic Equation for CarryOut


CarryOut = (!A & B & CarryIn) | (A & !B & CarryIn) | (A & B & !CarryIn) | (A & B & CarryIn)
CarryOut = B & CarryIn | A & CarryIn | A & B

Logic Equation for Sum

Sum = (!A & !B & CarryIn) | (!A & B & !CarryIn) | (A & !B & !CarryIn) | (A & B & CarryIn)
Sum = A XOR B XOR CarryIn

Truth Table for XOR:

X  Y | X XOR Y
0  0 |    0
0  1 |    1
1  0 |    1
1  1 |    0

Logic Diagrams for CarryOut and Sum

CarryOut = B & CarryIn | A & CarryIn | A & B


Sum = A XOR B XOR CarryIn
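The simplified equations for CarryOut and Sum can be checked against ordinary addition for all eight input combinations. The following Python sketch does this; the function name `full_adder` is only illustrative.

```python
from itertools import product


def full_adder(a, b, carry_in):
    """One-bit full adder built from the simplified logic equations."""
    carry_out = (b & carry_in) | (a & carry_in) | (a & b)
    total = a ^ b ^ carry_in
    return carry_out, total


# Exhaustive check: (CarryOut, Sum) read as a 2-bit number equals A + B + CarryIn.
for a, b, c in product((0, 1), repeat=3):
    carry_out, total = full_adder(a, b, c)
    assert 2 * carry_out + total == a + b + c
    print(a, b, c, "->", carry_out, total)
```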

A 4-bit ALU
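One way to picture a 4-bit ALU is as four 1-bit ALUs connected so that the CarryOut of each stage feeds the CarryIn of the next. The sketch below assumes this ripple-carry arrangement and an operation selector given as a string; both the selector encoding and the function names are hypothetical.

```python
def alu_1bit(a, b, carry_in, op):
    """1-bit ALU performing AND, OR or ADD. Returns (result, carry_out)."""
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)
    results = {"AND": a & b, "OR": a | b, "ADD": a ^ b ^ carry_in}
    return results[op], carry_out


def alu_4bit(a, b, op, carry_in=0):
    """Chain four 1-bit ALUs, least significant bit first."""
    result, carry = 0, carry_in
    for i in range(4):
        bit, carry = alu_1bit((a >> i) & 1, (b >> i) & 1, carry, op)
        result |= bit << i
    # The carry is only meaningful for ADD.
    return result, carry


print(alu_4bit(0b0110, 0b0011, "ADD"))
print(alu_4bit(0b0110, 0b0011, "AND")[0])
```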

CONTROL UNIT ORGANIZATION

Instruction Execution

The CPU executes a sequence of instructions. The execution of an instruction is organized as an instruction cycle: it is performed as a succession of several steps.

Each step is executed as a set of several microoperations. The task performed by any microoperation falls in one of the following categories:
- Transfer data from one register to another;
- Transfer data from a register to an external interface (system bus);
- Transfer data from an external interface to a register;
- Perform an arithmetic or logic operation, using registers for input and output.

Microoperations and Control Signals

In order to allow the execution of a microoperation, one or several control signals have to be issued; they allow the corresponding data transfer and/or computation to be performed.

Examples:
a) signals for transferring content of register R0 to R1: R0out, R1in
b) signals for adding content of Y to that of R0 (result in Z): R0out, Add, Zin
c) signals for reading a memory location; address in R3: R3out, MARin, Read

The CPU executes an instruction as a sequence of control steps. In each control step one or several microoperations are executed. One clock pulse triggers the activities corresponding to one control step: for each clock pulse the control unit generates the control signals corresponding to the microoperations to be executed in the respective control step.

Microoperations and Control Signals (contd)

Instruction: ADD R1, R3    R1 ← R1 + R3
control steps and control signals:

Instruction: ADD R1, (R3)    R1 ← R1 + [R3]

control steps and control signals:


Instruction: BR target    unconditional branch (with relative addressing)
control steps and control signals:

The first (three) control steps are identical for each instruction; they perform instruction fetch and increment the PC. The following steps depend on the actual instruction (stored in the IR). If a control step issues a read, the value will be available in the MBR after one additional step. Several microoperations can be performed in the same control step if they don't conflict (for example, only one of them is allowed to output on the bus).

Control Unit

The basic task of the control unit:
- For each instruction the control unit causes the CPU to go through a sequence of control steps;
- in each control step the control unit issues a set of signals which cause the corresponding microoperations to be executed.

The control unit is driven by the processor clock. The signals to be generated at a certain moment depend on:
- the actual step to be executed;
- the condition and status flags of the processor;
- the actual instruction executed;
- external signals received on the system bus (e.g. interrupt signal).

Techniques for implementation of the control unit:
1. Hardwired control
2. Microprogrammed control

HARDWIRED CONTROL

To execute instructions, the processor must have some means of generating the control signals needed in the proper sequence. Computer designers use a wide variety of techniques to solve this problem. The approaches used fall into one of two categories: hardwired control and microprogrammed control. We discuss each of these techniques in detail, starting with hardwired control in this section.

Consider the sequence of control signals given in Figure 1.6. Each step in this sequence is completed in one clock period. A counter may be used to keep track of the control steps, as shown in Figure 1.10. Each state, or count, of this counter corresponds to one control step. The required control signals are determined by the following information:
- Contents of the control step counter
- Contents of the instruction register
- Contents of the condition code flags
- External input signals, such as MFC and interrupt requests


Fig. 1.10 Control unit organization

To gain insight into the structure of the control unit, we start with a simplified view of the hardware involved. The decoder/encoder block in Figure 1.10 is a combinational circuit that generates the required control outputs, depending on the state of all its inputs. By separating the decoding and encoding functions, we obtain the more detailed block diagram in Figure 1.11. The step decoder provides a separate signal line for each step, or time slot, in the control sequence. Similarly, the output of the instruction decoder consists of a separate line for each machine instruction. For any instruction loaded in the IR, one of the output lines INS1 through INSm is set to 1, and all other lines are set to 0. The input signals to the encoder block in Figure 1.11 are combined to generate the individual control signals Yin, PCout, Add, End, and so on. An example of how the encoder generates the Zin control signal for the processor organization in Figure 1.1 is given in Figure 1.12. This circuit implements the logic function

Zin = T1 + T6 · ADD + T4 · BR + ...

This signal is asserted during time slot T1 for all instructions, during T6 for an Add instruction, during T4 for an unconditional branch instruction, and so on. The logic function for Zin is derived from the control sequences in Figures 1.6 and 1.7. As another example, Figure 1.13 gives a circuit that generates the End control signal from its logic function.

The End signal starts a new instruction fetch cycle by resetting the control step counter to its starting value. Figure 1.11 contains another control signal called RUN.

Fig.1.11 Separation of the decoding and encoding functions

Fig.1.12 Generation of the Zin control signal for the processor in Fig 1.1


Fig. 1.13 Generation of the End control signal.

When set to 1, RUN causes the counter to be incremented by one at the end of every clock cycle. When RUN is equal to 0, the counter stops counting. This is needed whenever the WMFC signal is issued, to cause the processor to wait for the reply from the memory.

The control hardware shown in Figure 1.10 or 1.11 can be viewed as a state machine that changes from one state to another in every clock cycle, depending on the contents of the instruction register, the condition codes, and the external inputs. The outputs of the state machine are the control signals. The sequence of operations carried out by this machine is determined by the wiring of the logic elements, hence the name "hardwired." A controller that uses this approach can operate at high speed. However, it has little flexibility, and the complexity of the instruction set it can implement is limited.

In the case of hardwired control, the control unit is a combinatorial circuit; it gets a set of inputs (from the IR, flags, clock, and system bus) and transforms them into a set of control signals.


Generation of signal Zin:
- first step of all instructions (fetch instruction)
- step 5 of ADD with register addressing
- step 5 of BR
- step 6 of ADD with register-indirect addressing

Zin = T1 + T5 (ADDreg + BR) + T6 ADDreg_ind + ...

Generation of signal End:
- step 6 of ADD with register addressing
- step 7 of ADD with register-indirect addressing
- step 6 of BR

End = T6 (ADDreg + BR) + T7 ADDreg_ind + ...

Advantages: Hardwired control provides the highest speed. RISCs are implemented with hardwired control. If the instruction set becomes very complex (CISCs), implementing hardwired control is very difficult. In this case microprogrammed control units are used.
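The two logic functions for Zin and End can be written directly as Boolean expressions over the current step and the decoded instruction. The Python sketch below covers only the three instructions named in the equations; the terms hidden behind the "..." are not modelled, and the function names and string labels are illustrative.

```python
def z_in(step, instruction):
    """Zin = T1 + T5 (ADDreg + BR) + T6 ADDreg_ind (remaining terms omitted)."""
    return (step == 1
            or (step == 5 and instruction in ("ADDreg", "BR"))
            or (step == 6 and instruction == "ADDreg_ind"))


def end(step, instruction):
    """End = T6 (ADDreg + BR) + T7 ADDreg_ind (remaining terms omitted)."""
    return ((step == 6 and instruction in ("ADDreg", "BR"))
            or (step == 7 and instruction == "ADDreg_ind"))


# Steps in which each signal is asserted for a register-indirect ADD.
print([t for t in range(1, 8) if z_in(t, "ADDreg_ind")])
print([t for t in range(1, 8) if end(t, "ADDreg_ind")])
```

In hardware each such function corresponds to a small AND-OR network fed by the step decoder and the instruction decoder.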


In order to allow execution of register-to-register operations in a single clock cycle, RISCs (and other modern processors) use three-bus CPU structures (see following slide).

MICROPROGRAMMED CONTROL

Microprogram
- Program stored in memory that generates all the control signals required to execute the instruction set correctly
- Consists of microinstructions

Microinstruction
- Contains a control word and a sequencing word
- Vocabulary to write a microprogram

Control Word
- All the control information required for one clock cycle
  o a sequence of Nsig bits, where Nsig is the total number of control signals; each bit in a CW corresponds to one control signal
  o each control step during execution of an instruction defines a certain CW; it represents a combination of 1s and 0s corresponding to the active and non-active control signals

Sequencing Word
- Information needed to decide the next microinstruction address
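The idea of a control word as a string of Nsig bits, one per control signal, can be sketched as follows. The signal list and its bit ordering are hypothetical (a real processor has many more signals), and the function names are only illustrative.

```python
# Hypothetical, shortened list of control signals; the position in the list
# is the bit position of the signal inside the control word.
SIGNALS = ["PCout", "MARin", "Read", "Zin", "Add", "R0out", "R1in", "End"]


def control_word(active):
    """Build the CW for one control step from the names of the active signals."""
    word = 0
    for name in active:
        word |= 1 << SIGNALS.index(name)
    return word


def active_signals(word):
    """Recover the active signal names from a CW."""
    return [name for i, name in enumerate(SIGNALS) if (word >> i) & 1]


cw = control_word(["R0out", "R1in"])          # transfer R0 to R1
print(format(cw, "0{}b".format(len(SIGNALS))), active_signals(cw))
```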

Microprogrammed control - basic idea:
All microroutines corresponding to the machine instructions are stored in the control store. The control unit generates the sequence of control signals for a certain machine instruction by reading from the control store the CWs of the microroutine corresponding to the respective instruction. The control unit is implemented just like another very simple CPU, inside the CPU, executing microroutines stored in the control store.

Control Memory (Control Storage: CS)
- Storage in the microprogrammed control unit to store the microprogram

Writeable Control Memory (Writeable Control Storage: WCS)
- CS whose contents can be modified
  -> Allows the microprogram to be changed
  -> Instruction set can be changed or modified

Dynamic Microprogramming
- Computer system whose control unit is implemented with a microprogram in WCS
- Microprogram can be changed by a systems programmer or a user

MICROPROGRAMMED CONTROL

In hardwired control, we saw how the control signals required inside the processor can be generated using a control step counter and a decoder/encoder circuit. Now we discuss an alternative scheme, called microprogrammed control, in which control signals are generated by a program similar to machine language programs.


Fig.1.15 An example of microinstructions for Fig. 1.6

First, we introduce some common terms. A control word (CW) is a word whose individual bits represent the various control signals in Figure 1.11. Each of the control steps in the control sequence of an instruction defines a unique combination of 1s and 0s in the CW. The CWs corresponding to the 7 steps of Figure 1.6 are shown in Figure 1.15. We have assumed that SelectY is represented by Select = 0 and Select4 by Select = 1. A sequence of CWs corresponding to the control sequence of a machine instruction constitutes the microroutine for that instruction, and the individual control words in this microroutine are referred to as microinstructions.

The microroutines for all instructions in the instruction set of a computer are stored in a special memory called the control store. The control unit can generate the control signals for any instruction by sequentially reading the CWs of the corresponding microroutine from the control store. This suggests organizing the control unit as shown in Figure 1.16. To read the control words sequentially from the control store, a microprogram counter (µPC) is used. Every time a new instruction is loaded into the IR, the output of the block labeled "starting address generator" is loaded into the µPC. The µPC is then automatically incremented by the clock, causing successive microinstructions to be read from the control store. Hence, the control signals are delivered to various parts of the processor in the correct sequence.

One important function of the control unit cannot be implemented by the simple organization in Figure 1.16. This is the situation that arises when the control unit is required to check the status of the condition codes or external inputs to choose between alternative courses of action. In the case of hardwired control, this situation is handled by including an appropriate logic function, as in Equation 1.2, in the encoder circuitry. In microprogrammed control, an alternative approach is to use conditional branch microinstructions. In addition to the branch address, these microinstructions specify which of the external inputs, condition codes, or, possibly, bits of the instruction register, should be checked as a condition for branching to take place.


The instruction Branch<0 may now be implemented by a microroutine such as that shown in Figure 1.17. After loading this instruction into the IR, a branch microinstruction transfers control to the corresponding microroutine, which is assumed to start at location 25 in the control store. This address is the output of the starting address generator block in Figure 1.16. The microinstruction at location 25 tests the N bit of the condition codes. If this bit is equal to 0, a branch takes place to location 0 to fetch a new machine instruction. Otherwise, the microinstruction at location 26 is executed to put the branch target address into register Z, as in step 4 in Figure 1.7. The microinstruction in location 27 loads this address into the PC.

Fig.1.16 Basic organization of a microprogrammed control unit

Fig.1.17 Micro routine for the instruction BRANCH<0


Fig. 1.18 Organization of the control unit to allow conditional branching in the microprogram.

To support microprogram branching, the organization of the control unit should be modified as shown in Figure 1.18. The starting address generator block of Figure 1.16 becomes the starting and branch address generator. This block loads a new address into the µPC when a microinstruction instructs it to do so. To allow implementation of a conditional branch, inputs to this block consist of the external inputs and condition codes as well as the contents of the instruction register. In this control unit, the µPC is incremented every time a new microinstruction is fetched from the microprogram memory, except in the following situations:
1. When a new instruction is loaded into the IR, the µPC is loaded with the starting address of the microroutine for that instruction.
2. When a Branch microinstruction is encountered and the branch condition is satisfied, the µPC is loaded with the branch address.
3. When an End microinstruction is encountered, the µPC is loaded with the address of the first CW in the microroutine for the instruction fetch cycle (this address is 0 in Figure 1.17).
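The rule for updating the microprogram counter, increment unless one of the three listed situations applies, can be sketched as a single function. The parameter names are hypothetical; the address 0 for the instruction fetch microroutine and the location 25 used in the usage lines follow the Branch<0 example of Figure 1.17.

```python
def next_upc(upc, *, ir_loaded=False, start_address=None,
             branch=False, condition=False, branch_address=None, end=False):
    """Return the next microprogram counter value."""
    if ir_loaded:                  # 1. new instruction loaded into the IR
        return start_address
    if branch and condition:       # 2. Branch microinstruction, condition satisfied
        return branch_address
    if end:                        # 3. End microinstruction: back to the fetch routine
        return 0
    return upc + 1                 # default: step to the next microinstruction


# Microroutine for Branch<0, assumed to start at location 25.
n_flag = 1
upc = next_upc(3, ir_loaded=True, start_address=25)
upc = next_upc(upc, branch=True, condition=(n_flag == 0), branch_address=0)
print(upc)
```

With N = 1 the branch at location 25 is not taken, so execution falls through to location 26; with N = 0 the same call returns 0.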


Control Store Organization

The control store contains the microprogram (sometimes called firmware).

Microroutine Executed for Conditional Branch

The microroutines contain, besides CWs, also branches which have to be interpreted by the microprogrammed controller. The sequencer controls the correct execution sequence of microinstructions. The sequencer is a small control unit within the control unit.

The greater ease and speed of designing a microprogrammed control unit, compared with a control unit based on a random logic implementation of the finite state machine and next state function, resulted in a significant reduction in design costs. It was also much easier to correct errors in the microprogrammed system than in the hardwired system. With the lower cost and higher availability of fast RAM, some systems stored the microcode in RAM, producing what is sometimes called a writable control store (WCS) machine. This allowed corrections or changes to the microcode even after the machine had been delivered. It also made possible the loading of completely different instruction sets on the same machine for different applications.

The main advantage of hardwired systems is their greater speed. This greater speed, coupled with the much higher cost of hardwired systems, tended to restrict their use to high performance computers. With the trend toward simpler instructions and control, and the advent of computer aided design (CAD) tools, the design of hardwired control units has become much easier and less prone to errors. RISC machines, with their goal of executing one or more instructions per cycle, are becoming much more prevalent. These developments are tending to lead away from the use of microprogrammed control.


PLA CONTROL


SEQUENCER (MICROPROGRAM SEQUENCER)

A Microprogram Control Unit that determines the Microinstruction Address to be executed in the next clock cycle:
- In-line Sequencing
- Branch
- Conditional Branch
- Subroutine
- Loop
- Instruction OP-code mapping

MICROINSTRUCTION SEQUENCING

[Figure: microinstruction sequencing. Blocks: instruction code, mapping logic, status bits, MUX with select input, branch logic, multiplexers, subroutine register (SBR), incrementer, control address register (CAR), control memory (ROM) with a field that selects a status bit, microoperations.]

Sequencing Capabilities Required in a Control Storage
- Incrementing of the control address register
- Unconditional and conditional branches
- A mapping process from the bits of the machine instruction to an address for control memory
- A facility for subroutine call and return

Conditional Branch
If Condition is true, then Branch (address from the next address field of the current microinstruction), else Fall Through.
Conditions to Test: O (overflow), N (negative), Z (zero), C (carry), etc.

Unconditional Branch
Fixing the value of one status bit at the input of the multiplexer to 1.

[Figure: conditional branch logic. Blocks: control address register with load address and increment inputs, MUX, control memory, status bits (condition), condition select, next address, microoperations.]
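The condition multiplexer can be sketched as a list of inputs indexed by the condition select field. In this illustrative model, input 0 is assumed to be the one fixed at 1, so that selecting it gives an unconditional branch; the input ordering and the function name are hypothetical, and a microinstruction that should simply fall through is not modelled separately.

```python
def next_address(car, status_bits, condition_select, branch_address):
    """Choose between the branch address and CAR + 1.

    status_bits is a tuple such as (O, N, Z, C). MUX input 0 is tied to 1.
    """
    mux_inputs = [1] + list(status_bits)
    if mux_inputs[condition_select]:
        return branch_address      # load the next address field into the CAR
    return car + 1                 # fall through: increment the CAR


status = (0, 1, 0, 0)              # O = 0, N = 1, Z = 0, C = 0
print(next_address(40, status, 2, 64))   # test N: branch taken
print(next_address(40, status, 3, 64))   # test Z: fall through
print(next_address(40, status, 0, 64))   # input fixed at 1: unconditional
```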


MICROINSTRUCTION FORMAT

Information in a Microinstruction
- Control Information
- Sequencing Information
- Constant: information which is useful when feeding into the system

This information needs to be organized in some way for
- Efficient use of the microinstruction bits
- Fast decoding

Field Encoding
- Encoding the microinstruction bits
- Encoding slows down the execution speed due to the decoding delay
- Encoding also reduces the flexibility due to the decoding hardware

Horizontal Microinstructions
Each bit directly controls each micro-operation or each control point.
Horizontal implies a long microinstruction word.
Advantages: Can control a variety of components operating in parallel. --> Advantage of efficient hardware utilization
Disadvantages: Control word bits are not fully utilized --> CS becomes large --> Costly

In general, the number of bits in a microinstruction ranges from around a dozen to over a hundred. The exact number depends on the complexity of the datapath and on the number and types of instructions as well as the number of allowed instruction operands and their addressing modes.

A horizontal microcode system uses minimal encoding to specify the control information. For example, if there are 32 registers that might be used as an operand, then a separate bit would signal whether the corresponding register is selected. Or if there were 128 different operations that could be specified, then a separate bit would be used for each. A disadvantage of this approach is that relatively few of the actions specified by bits in the microinstruction can occur in parallel and only one register at a time can be selected as a source or destination operand. This leads to the presence of many zeros in the microinstruction, and creates a lot of wasted space in the memory.

Vertical Microinstructions
A microinstruction format that is not horizontal.
Vertical implies a short microinstruction word.
Encoded microinstruction fields --> Needs decoding circuits for one or two levels of decoding

One-level decoding: Field A (2 bits) feeds a 2x4 decoder (1 of 4) and Field B (3 bits) feeds a 3x8 decoder (1 of 8).
Two-level decoding: Field A (2 bits) feeds a 2x4 decoder and Field B (6 bits) feeds a 6x64 decoder, followed by decoder and selection logic.

In a vertical microcode system, the widths of fields such as the register number and ALU operation are reduced by encoding the information in a shorter form. For example, any one of 32 registers can be specified using a 5-bit field, or 7 bits could be used to encode up to 128 different operations. The main disadvantage of this approach when compared to the horizontal microcode system is the slower operation due to the need to decode fields.
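The saving in microinstruction width can be made concrete with the figures used above (32 registers, 128 operations). The sketch below compares one bit per alternative with an encoded field; the helper name `encoded_width` is hypothetical.

```python
def encoded_width(choices):
    """Bits needed to name one of `choices` alternatives in an encoded field."""
    return max(1, (choices - 1).bit_length())


registers, operations = 32, 128

# Horizontal: one bit per register and one bit per operation.
horizontal_bits = registers + operations
# Vertical: one encoded field for the register, one for the operation.
vertical_bits = encoded_width(registers) + encoded_width(operations)

print(horizontal_bits, vertical_bits)
```

The shorter vertical word is paid for by the decoders needed to expand each field back into individual control signals.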

Nanostorage and Nanoinstruction

Nanoinstructions are used to drive a lookup table of microinstructions in a machine where a nanostore is used. This is appropriate where many of the microinstructions occur several times through the microprogram. In this case, the distinct microinstructions are placed in a small control store. The nanostore then contains (in order) the index in the microcontrol store of the appropriate microinstruction. Usually, the microprogram consists of a large number of short microinstructions, while the nanoprogram contains fewer words with longer nanoinstructions.

The decoder circuits in a vertical microprogram storage organization can be replaced by a ROM => Two levels of control storage
- First level: Control Storage
- Second level: Nano Storage

Two-level microprogram
- First level: vertical format microprogram
- Second level: horizontal format nanoprogram; it interprets the microinstruction fields, thus converting a vertical microinstruction format into a horizontal nanoinstruction format

Two-Level Microprogramming - Example
Microprogram: 2048 microinstructions of 200 bits each
With 1-Level Control Storage: 2048 x 200 = 409,600 bits
Assumption: 256 distinct microinstructions among 2048
With 2-Level Control Storage:
  o Nano Storage: 256 x 200 bits to store 256 distinct nanoinstructions
  o Control storage: 2048 x 8 bits
  o To address 256 nano storage locations 8 bits are needed
Total 1-Level control storage: 409,600 bits
Total 2-Level control storage: 67,584 bits (256 x 200 + 2048 x 8)
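The storage figures of the two-level example can be reproduced with a few lines of Python. The function names are hypothetical; the numbers are those of the example (2048 microinstructions of 200 bits, 256 of them distinct).

```python
def one_level_bits(words, width):
    """Control storage size when every microinstruction is stored in full."""
    return words * width


def two_level_bits(words, width, distinct):
    """Nano storage for the distinct words plus one index per microinstruction."""
    index_bits = (distinct - 1).bit_length()   # bits to address the nano storage
    return distinct * width + words * index_bits


print(one_level_bits(2048, 200))
print(two_level_bits(2048, 200, 256))
```

The saving grows with the ratio between the total number of microinstructions and the number of distinct ones.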


[Figure: two-level control storage. An 11-bit control address register addresses the 2048 x 8 control memory; each 8-bit microinstruction serves as the nanomemory address into the 256 x 200 nanomemory, which supplies the 200-bit nanoinstructions.]


THE MEMORY SYSTEM


Programs and the data they operate on are held in the memory of the computer. In this chapter, we discuss how this vital part of the computer operates. By now, the reader appreciates that the execution speed of programs is highly dependent on the speed with which instructions and data can be transferred between the processor and the memory. It is also important to have a large memory to facilitate execution of programs that are large and deal with huge amounts of data. Ideally, the memory would be fast, large, and inexpensive. Unfortunately, it is impossible to meet all three of these requirements simultaneously. Increased speed and size are achieved at increased cost. To solve this problem, much work has gone into developing clever structures that improve the apparent speed and size of the memory, yet keep the cost reasonable. First, we describe the most common components and organizations used to implement the memory. Then we examine memory speed and discuss how the apparent speed of the memory can be increased by means of caches. Next, we present the virtual memory concept, which increases the apparent size of the memory. Finally, we discuss the secondary storage devices, which provide much larger storage capability.

SOME BASIC CONCEPTS


The maximum size of the memory that can be used in any computer is determined by the addressing scheme. For example, a 16-bit computer that generates 16-bit addresses is capable of addressing up to $2^{16}$ = 64K memory locations. Similarly, machines whose instructions generate 32-bit addresses can utilize a memory that contains up to $2^{32}$ = 4G (giga) memory locations, whereas machines with 40-bit addresses can access up to $2^{40}$ = 1T (tera) locations. The number of locations represents the size of the address space of the computer.

Most modern computers are byte addressable. Figure 2.7 shows the possible address assignments for a byte-addressable 32-bit computer. The big-endian arrangement is used in the 68000 processor. The little-endian arrangement is used in Intel processors. The ARM architecture can be configured to use either arrangement. As far as the memory structure is concerned, there is no substantial difference between the two schemes.

The memory is usually designed to store and retrieve data in word-length quantities. In fact, the number of bits actually stored or retrieved in one memory access is the most common definition of the word length of a computer. Consider, for example, a byte-addressable computer whose instructions generate 32-bit addresses. When a 32-bit address is sent from the processor to the memory unit, the high-order 30 bits determine which word will be accessed. If a byte quantity is specified, the low-order 2 bits of the address specify which byte location is involved. In a Read operation, other bytes may be fetched from the memory, but they are ignored by the processor. If the byte operation is a Write, however, the control circuitry of the memory must ensure that the contents of other bytes of the same word are not changed.

Modern implementations of computer memory are rather complex and difficult to understand on first encounter. To simplify our introduction to memory structures, we will first present a traditional architecture. Then, in later sections, we will discuss the latest approaches.
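The address arithmetic described above, the size of the address space and the split of a 32-bit byte address into a word number and a byte position, can be sketched as follows. The 4-byte word is implied by the 30/2 bit split in the text; the function names are hypothetical.

```python
def address_space(address_bits):
    """Number of locations reachable with the given number of address bits."""
    return 1 << address_bits


def split_byte_address(address):
    """Split a 32-bit byte address into (word number, byte within the word).

    The high-order 30 bits select the word; the low-order 2 bits select
    one of the 4 bytes in that word.
    """
    address &= 0xFFFFFFFF
    return address >> 2, address & 0b11


print(address_space(16), address_space(32), address_space(40))
print(split_byte_address(0x1003))
```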


From the system standpoint, we can view the memory unit as a black box. Data transfer between the memory and the processor takes place through the use of two processor registers, usually called MAR (memory address register) and MDR (memory data register), as introduced in Section 1.2. If MAR is k bits long and MDR is n bits long, then the memory unit may contain up to $2^{k}$ addressable locations. During a memory cycle, n bits of data are transferred between the memory and the processor. This transfer takes place over the processor bus, which has k address lines and n data lines. The bus also includes the control lines Read/Write (R/W) and Memory Function Completed (MFC) for coordinating data transfers. Other control lines may be added to indicate the number of bytes to be transferred. The connection between the processor and the memory is shown schematically in Figure 4.1.

The processor reads data from the memory by loading the address of the required memory location into the MAR register and setting the R/W line to 1. The memory responds by placing the data from the addressed location onto the data lines, and confirms this action by asserting the MFC signal. Upon receipt of the MFC signal, the processor loads the data on the data lines into the MDR register. The processor writes data into a memory location by loading the address of this location into MAR and loading the data into MDR. It indicates that a write operation is involved by setting the R/W line to 0. If read or write operations involve consecutive address locations in the main memory, then a block transfer operation can be performed in which the only address sent to the memory is the one that identifies the first location. Memory accesses may be synchronized using a clock, or they may be controlled using special signals that control transfers on the bus, using the bus signaling schemes. Memory read and write operations are controlled as input and output bus transfers, respectively. A useful measure of the speed of memory units is the time that elapses between the initiation of an operation and the completion of that operation, for example, the time between the Read and the MFC signals. This is referred to as the memory access time, Another important measure is the memory cycle time, which is the minimum time delay required between the initiation of two successive memory operations, for example, the Computer Science & Engineering Dept. SJCET, Palai

R 402

Computer Organization

time between two successive Read operations. The cycle time is usually slightly longer than the access time, depending on the implementation details of the memory unit. A memory unit is called random-access memory (RAM) if any location can be accessed for a Read or Write operation in some fixed amount of time that is independent of the locations address. This distinguishes such memory units from serial, or partly serial, access storage devices such as magnetic disks and tapes. Access time on the latter devices depends on the address or position of the data. The basic technology for implementing the memory uses semiconductor integrated circuits. The sections that follow present some basic facts about the internal structure and operation of such memories. We then discuss some of the techniques used to increase the effective speed and size of the memory. The processor of a computer can usually process instructions and data faster than I they can be fetched from a reasonably priced memory unit. The memory cycle time, I then, is the bottleneck in the system. One way to reduce the memory access time is to use a cache memory. This is a small, fast memory that is inserted between the larger, slower main memory and the processor. It holds the currently active segments of a program and their data. Virtual memory is another important concept related to memory organization. So far, we have assumed that the addresses generated by the processor directly specify physical locations in the memory. This may not always be the case. For reasons that will become apparent later in this chapter, data may be stored in physical memory locations that have addresses different from those specified by the program. The memory control circuitry translates the address specified by the program into an address that can be used to access the physical memory. In such a case, an address generated by the processor is referred to as a virtual or logical address. 
The virtual address space is mapped onto the physical memory where data are actually stored. The mapping function is implemented by a special memory control circuit, often called the memory management unit. This mapping function can be changed during program execution according to system requirements.

Virtual memory is used to increase the apparent size of the physical memory. Data are addressed in a virtual address space that can be as large as the addressing capability of the processor. But at any given time, only the active portion of this space is mapped onto locations in the physical memory. The remaining virtual addresses are mapped onto the bulk storage devices used, which are usually magnetic disks. As the active portion of the virtual address space changes during program execution, the memory management unit changes the mapping function and transfers data between the disk and the memory.

Thus, during every memory cycle, an address-processing mechanism determines whether the addressed information is in the physical memory unit. If it is, then the proper word is accessed and execution proceeds. If it is not, a page of words containing the desired word is transferred from the disk to the memory. This page displaces some page in the memory that is currently inactive. Because of the time required to move pages between the disk and the memory, there is a speed degradation if pages are moved frequently. By judiciously choosing which page to replace in the memory, however, there may be reasonably long periods when the probability is high that the words accessed by the processor are in the physical memory unit.

This section has briefly introduced several organizational features of memory systems. These features have been developed to help provide a computer system with as large and as fast a memory as can be afforded in relation to the overall cost of the system. We do not expect the reader to grasp all the ideas or their implications now; more detail is given later. We introduce these terms together to establish that they are related; a study of their interrelationships is as important as a detailed study of their individual features.
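
The virtual-to-physical address translation introduced in this section can be illustrated with a minimal Python sketch. The page size, the dictionary used as a page table, and the names `translate` and `PageFault` are assumptions made for this example; the text does not prescribe any of them, and a real memory management unit performs this lookup in hardware.

```python
PAGE_SIZE = 4096  # assumed page size in words; the text does not fix one


class PageFault(Exception):
    """Raised when the addressed page is not in physical memory."""


def translate(virtual_address, page_table):
    """Map a virtual address to a physical address via a page table.

    page_table maps virtual page numbers to physical frame numbers and
    holds only the pages currently resident in physical memory.
    """
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    if page_number not in page_table:
        # The page would first have to be transferred from the disk.
        raise PageFault(page_number)
    return page_table[page_number] * PAGE_SIZE + offset


# Hypothetical mapping: only virtual pages 0 and 5 are resident.
page_table = {0: 3, 5: 1}
physical = translate(5 * PAGE_SIZE + 42, page_table)
```

An access to a page that is missing from `page_table` raises `PageFault`, which stands in for the transfer of a page from the disk described above.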

4.1 Memory Hierarchy


We have already stated that an ideal memory would be fast, large, and inexpensive. It is clear that a very fast memory can be implemented if SRAM chips are used. But these chips are expensive because their basic cells have six transistors, which precludes packing a very large number of cells onto a single chip. Thus, for cost reasons, it is impractical to build a large memory using SRAM chips. The alternative is to use Dynamic RAM chips, which have much simpler basic cells and thus are much less expensive. But such memories are significantly slower.

Although dynamic memory units in the range of hundreds of megabytes can be implemented at a reasonable cost, the affordable size is still small compared to the demands of large programs with voluminous data. A solution is provided by using secondary storage, mainly magnetic disks, to implement large memory spaces. Very large disks are available at a reasonable price, and they are used extensively in computer systems. However, they are much slower than the semiconductor memory units. So we conclude the following: A huge amount of cost-effective storage can be provided by magnetic disks. A large, yet affordable, main memory can be built with dynamic RAM technology. This leaves SRAMs to be used in smaller units where speed is of the essence, such as in cache memories.

All of these different types of memory units are employed effectively in a computer. The entire computer memory can be viewed as the hierarchy depicted in Figure 5.13. The fastest access is to data held in processor registers. Therefore, if we consider the registers to be part of the memory hierarchy, then the processor registers are at the top in terms of the speed of access. Of course, the registers provide only a minuscule portion of the required memory.

At the next level of the hierarchy is a relatively small amount of memory that can be implemented directly on the processor chip.
This memory, called a processor cache, holds copies of instructions and data stored in a much larger memory that is provided externally. There are often two levels of caches. A primary cache is always located on the processor chip. This cache is small because it competes for space on the processor chip, which must implement many other functions. The primary cache is referred to as level 1 (L1) cache. A larger, secondary cache is placed between the primary cache and the rest of the memory. It is referred to as level 2 (L2) cache. It is usually implemented using SRAM chips.

Including a primary cache on the processor chip and using a larger, off-chip, secondary cache is currently the most common way of designing computers. However, other arrangements can be found in practice. It is possible not to have a cache on the processor chip at all. Also, it is possible to have both L1 and L2 caches on the processor chip.

The next level in the hierarchy is called the main memory. This rather large memory is implemented using dynamic memory components, typically in the form of SIMMs, DIMMs, or RIMMs. The main memory is much larger but significantly slower than the
cache memory. In a typical computer, the access time for the main memory is about ten times longer than the access time for the L1 cache.

Disk devices provide a huge amount of inexpensive storage. They are very slow compared to the semiconductor devices used to implement the main memory. During program execution, the speed of memory access is of utmost importance. The key to managing the operation of the hierarchical memory system in Figure 5.13 is to bring the instructions and data that will be used in the near future as close to the processor as possible. This can be done by using the mechanisms presented in the sections that follow.
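
The benefit of keeping data close to the processor can be estimated with a small Python sketch. It assumes the simplest possible model, a single cache in front of the main memory, and a hypothetical hit rate (the fraction of accesses satisfied by the cache). The hit rate and the function name `average_access_time` are illustration values that do not come from the text; only the factor of ten between cache and main memory access times is taken from it.

```python
def average_access_time(hit_rate, cache_time, memory_time):
    """Average time per access for one cache in front of main memory.

    A fraction hit_rate of the accesses is served by the cache; the
    remaining accesses go to the main memory.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_time + (1.0 - hit_rate) * memory_time


# Assumed figures: cache access = 1 time unit, main memory ten times
# slower (the ratio quoted in the text), hypothetical 95% hit rate.
t_avg = average_access_time(0.95, 1.0, 10.0)
```

Under these assumptions the average access takes about 1.45 time units, much closer to the cache access time than to the main memory access time.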

4.2 SEMICONDUCTOR RAM MEMORIES


Semiconductor memories are available in a wide range of speeds. Their cycle times range from 100 ns to less than 10 ns. When first introduced in the late 1960s, they were much more expensive than the magnetic-core memories they replaced. Because of rapid advances in VLSI (Very Large Scale Integration) technology, the cost of semiconductor memories has dropped dramatically. As a result, they are now used almost exclusively in implementing memories. In this section, we discuss the main characteristics of semiconductor memories. We start by introducing the way that a number of memory cells are organized inside a chip.

4.2.1 INTERNAL ORGANIZATION OF MEMORY CHIPS



Memory cells are usually organized in the form of an array, in which each cell is capable of storing one bit of information. A possible organization is illustrated in Figure 5.2. Each row of cells constitutes a memory word, and all cells of a row are connected to a common line referred to as the word line, which is driven by the address decoder on the chip. The cells in each column are connected to a Sense/Write circuit by two bit lines. The Sense/Write circuits are connected to the data input/output lines of the chip. During a Read operation, these circuits sense, or read, the information stored in the cells selected by a word line and transmit this information to the output data lines. During a Write operation, the Sense/Write circuits receive input information and store it in the cells of the selected word. Figure 5.2 is an example of a very small memory chip consisting of 16 words of 8 bits each. This is referred to as a 16 x 8 organization. The data input and the data output of each Sense/Write circuit are connected to a single bidirectional data line that can be connected to the data bus of a computer. Two control lines, R/W and CS, are provided in addition to address and data lines. The R/W (Read/Write) input specifies the required operation, and the CS (Chip Select) input selects a given chip in a multichip memory system.

The memory circuit in Figure 5.2 stores 128 bits and requires 14 external connections for address, data, and control lines. Of course, it also needs two lines for power supply and ground connections. Consider now a slightly larger memory circuit, one that has 1K (1024) memory cells. This circuit can be organized as a 128 x 8 memory, requiring a total of 19 external connections. Alternatively, the same number of cells can be organized into a 1K x 1 format. In this case, a 10-bit address is needed, but there is only one data line, resulting in 15 external connections. Figure 5.3 shows such an organization. The required 10-bit address is divided into two groups of 5 bits each to form the row and column
addresses for the cell array. A row address selects a row of 32 cells, all of which are accessed in parallel. However, according to the column address, only one of these cells is connected to the external data line by the output multiplexer and input demultiplexer. Commercially available memory chips contain a much larger number of memory cells than the examples shown in Figures 5.2 and 5.3. We use small examples to make the figures easy to understand. Large chips have essentially the same organization as Figure 5.3 but use a larger memory cell array and have more external connections. For example, a 4M-bit chip may have a 512K x 8 organization, in which case 19 address and 8 data input/output pins are needed. Chips with a capacity of hundreds of megabits are now available.
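
The connection counts quoted above follow directly from the chip organization. The Python sketch below reproduces them; the function names are chosen for this example, and the two control lines are assumed to be R/W and CS as in Figure 5.2.

```python
def address_bits(words):
    """Number of address bits needed to select one of `words` locations."""
    if words < 1 or words & (words - 1):
        raise ValueError("words must be a power of two")
    return words.bit_length() - 1


def signal_pins(words, bits_per_word, control_lines=2):
    """Address + data + control pins (R/W and CS), without power/ground."""
    return address_bits(words) + bits_per_word + control_lines


def external_connections(words, bits_per_word):
    """Signal pins plus the power-supply and ground connections."""
    return signal_pins(words, bits_per_word) + 2


pins_16x8 = signal_pins(16, 8)              # 4 + 8 + 2
pins_128x8 = external_connections(128, 8)   # 7 + 8 + 2 + 2
pins_1kx1 = external_connections(1024, 1)   # 10 + 1 + 2 + 2
```

The same functions give 19 address bits for a 512K x 8 chip, which matches the 19 address pins mentioned for the 4M-bit example.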

STATIC MEMORIES

Memories that consist of circuits capable of retaining their state as long as power is applied are known as static memories. Figure 5.4 illustrates how a static RAM (SRAM) cell may be implemented. Two inverters are cross-connected to form a latch. The latch is connected to two bit lines by transistors T1 and T2. These transistors act as switches that can be opened or closed under control of the word line. When the word line is at ground level, the transistors are turned off and the latch retains its state. For example, let us assume that the cell is in state 1 if the logic value at point X is 1 and at point Y is 0. This state is maintained as long as the signal on the word line is at ground level.

Read Operation

In order to read the state of the SRAM cell, the word line is activated to close switches T1 and T2. If the cell is in state 1, the signal on bit line b is high and the signal on the complementary bit line, written here as b', is low. The opposite is true if the cell is in state 0. Thus, b and b' are complements of each other. Sense/Write circuits at the end of the bit lines monitor the state of b and b' and set the output accordingly.

Write Operation
The state of the cell is set by placing the appropriate value on bit line b and its complement on the complementary bit line b', and then activating the word line. This forces the cell into the corresponding state. The required signals on the bit lines are generated by the Sense/Write circuit.

4.2.2 ASYNCHRONOUS DRAMS

Static RAMs are fast, but they come at a high cost because their cells require several transistors. Less expensive RAMs can be implemented if simpler cells are used. However, such cells do not retain their state indefinitely; hence, they are called dynamic RAMs (DRAMs). Information is stored in a dynamic memory cell in the form of a charge on a capacitor, and this charge can be maintained for only tens of milliseconds. Since the cell is required to store information for a much longer time, its contents must be periodically refreshed by restoring the capacitor charge to its full value.

An example of a dynamic memory cell that consists of a capacitor, C, and a transistor, T, is shown in Figure 5.6. In order to store information in this cell, transistor
T is turned on and an appropriate voltage is applied to the bit line. This causes a known amount of charge to be stored in the capacitor. After the transistor is turned off, the capacitor begins to discharge. This is caused by the capacitor's own leakage resistance and by the fact that the transistor continues to conduct a tiny amount of current, measured in picoamperes, after it is turned off. Hence, the information stored in the cell can be retrieved correctly only if it is read before the charge on the capacitor drops below some threshold value. During a Read operation, the transistor in a selected cell is turned on. A sense amplifier connected to the bit line detects whether the charge stored on the
capacitor is above the threshold value. If so, it drives the bit line to a full voltage that represents logic value 1. This voltage recharges the capacitor to the full charge that corresponds to logic value 1. If the sense amplifier detects that the charge on the capacitor is below the threshold value, it pulls the bit line to ground level, which ensures that the capacitor will have no charge, representing logic value 0. Thus, reading the contents of the cell automatically refreshes its contents. All cells in a selected row are read at the same time, which refreshes the contents of the entire row.
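
The leak-and-refresh behaviour of a dynamic cell can be mimicked with a short Python model. This is only a sketch: the normalized charge values, the threshold, and the linear leakage rate are assumptions chosen for illustration, and the class name `DramCell` is hypothetical. A real capacitor does not discharge linearly.

```python
FULL_CHARGE = 1.0   # normalized charge for a stored 1
THRESHOLD = 0.5     # assumed sense-amplifier threshold
LEAK_PER_MS = 0.02  # assumed charge lost per millisecond (linear model)


class DramCell:
    def __init__(self):
        self.charge = 0.0

    def write(self, bit):
        """Store a bit by fully charging or discharging the capacitor."""
        self.charge = FULL_CHARGE if bit else 0.0

    def leak(self, milliseconds):
        """Let the capacitor discharge for the given time."""
        self.charge = max(0.0, self.charge - LEAK_PER_MS * milliseconds)

    def read(self):
        """Sense the stored bit and restore the charge to a full 1 or 0."""
        bit = 1 if self.charge > THRESHOLD else 0
        self.charge = FULL_CHARGE if bit else 0.0
        return bit


cell = DramCell()
cell.write(1)
cell.leak(16)               # read soon enough: charge still above threshold
first_read = cell.read()    # the read also restores the full charge
cell.leak(40)               # left too long without a refresh
second_read = cell.read()   # the stored 1 has been lost
```

With these assumed numbers the first read returns 1 and refreshes the cell, while the second read returns 0 because the charge fell below the threshold before the cell was accessed again.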

A 16-megabit DRAM chip, configured as 2M x 8, is shown in Figure 5.7. The cells are organized in the form of a 4K x 4K array. The 4096 cells in each row are divided into 512 groups of 8, so that a row can store 512 bytes of data. Therefore, 12 address bits are needed to select a row. Another 9 bits are needed to specify a group of 8 bits in the selected row. Thus, a 21-bit address is needed to access a byte in this memory. The high-order 12 bits and the low-order 9 bits of the address constitute the row and column addresses of a byte, respectively. To reduce the number of pins needed for external connections, the row and column addresses are multiplexed on 12 pins.

During a Read or a Write operation, the row address is applied first. It is loaded into the row address latch in response to a signal pulse on the Row Address Strobe (RAS) input of the chip. Then a Read operation is initiated, in which all cells on the selected row are read and refreshed. Shortly after the row address is loaded, the column address is applied to the address pins and loaded into the column address latch under control of the Column Address Strobe (CAS) signal. The information in this latch is decoded and the appropriate group of 8 Sense/Write circuits is selected. If the R/W control signal indicates a Read operation, the output values of the selected circuits are transferred to the data lines, D7-0. For a Write operation, the information on the D7-0 lines is transferred to the selected circuits. This information is then used to overwrite the contents of the selected cells in the corresponding 8 columns. We should note that in commercial DRAM chips, the RAS and
CAS control signals are active low so that they cause the latching of addresses when they change from high to low. To indicate this fact, these signals are shown on diagrams with an overbar above their names.

Applying a row address causes all cells on the corresponding row to be read and refreshed during both Read and Write operations. To ensure that the contents of a DRAM are maintained, each row of cells must be accessed periodically. A refresh circuit usually performs this function automatically. Many dynamic memory chips incorporate a refresh facility within the chips themselves. In this case, the dynamic nature of these memory chips is almost invisible to the user.

In the DRAM described in this section, the timing of the memory device is controlled asynchronously. A specialized memory controller circuit provides the necessary control signals, RAS and CAS, that govern the timing. The processor must take into account the delay in the response of the memory. Such memories are referred to as asynchronous DRAMs.

Because of their high density and low cost, DRAMs are widely used in the memory units of computers. Available chips range in size from 1M to 256M bits, and even larger chips are being developed. To reduce the number of memory chips needed in a given computer, a DRAM chip is organized to read or write a number of bits in parallel, as indicated in Figure 5.7. To provide flexibility in designing memory systems, these chips are manufactured in different organizations. For example, a 64-Mbit chip may be organized as 16M x 4, 8M x 8, or 4M x 16.
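
The address multiplexing used by the 2M x 8 chip of Figure 5.7 can be sketched in Python as follows. The function name `multiplexed_address` is hypothetical; the field widths are the ones given in the text (12 row bits, 9 column bits).

```python
ROW_BITS = 12     # selects one of 4096 rows
COLUMN_BITS = 9   # selects one of 512 byte groups within the row


def multiplexed_address(byte_address):
    """Return the (row, column) values presented on the 12 address pins.

    The row value is applied first and latched under RAS; the column
    value follows on the same pins and is latched under CAS.
    """
    if not 0 <= byte_address < 1 << (ROW_BITS + COLUMN_BITS):
        raise ValueError("address must fit in 21 bits")
    row = byte_address >> COLUMN_BITS                 # high-order 12 bits
    column = byte_address & ((1 << COLUMN_BITS) - 1)  # low-order 9 bits
    return row, column


row, column = multiplexed_address((5 << COLUMN_BITS) | 17)
```

Here the byte at row 5, column group 17 is addressed, so the pins carry the value 5 during the RAS pulse and 17 during the CAS pulse.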

4.2.3 Synchronous DRAMs


More recent developments in memory technology have resulted in DRAMs whose operation is directly synchronized with a clock signal. Such memories are known as synchronous DRAMs (SDRAMs). Figure 5.8 indicates the structure of an SDRAM. The cell array is the same as in asynchronous DRAMs. The address and data connections are buffered by means of registers. We should particularly note that the output of each sense amplifier is connected to a latch. A Read operation loads the contents of the cells in the selected row into these latches, whereas an access made only for refreshing purposes merely refreshes the contents of the cells. Data held in the latches that correspond to the selected column(s) are transferred into the data output register, thus becoming available on the data output pins.

SDRAMs have several different modes of operation, which can be selected by writing control information into a mode register. For example, burst operations of different lengths can be specified. The burst operations use the same block transfer capability as the fast page mode feature of asynchronous DRAMs. In SDRAMs, it is not necessary to provide externally generated pulses on the CAS line to select successive columns. The necessary control signals are provided internally using a column counter and the clock signal. New data can be placed on the data lines in each clock cycle. All actions are triggered by the rising edge of the clock.


STRUCTURE OF LARGER MEMORIES

We have discussed the basic organization of memory circuits as they may be implemented on a single chip. Next, we should examine how memory chips may be connected to form a much larger memory.

Static Memory Systems

Consider a memory consisting of 2M (2,097,152) words of 32 bits each. Figure 5.10 shows how we can implement this memory using 512K x 8 static memory chips. Each column in the figure consists of four chips, which implement one byte position. Four of these sets provide the required 2M x 32 memory. Each chip has a control input called Chip Select. When this input is set to 1, it enables the chip to accept data from or to place data on its data lines. The data output for each chip is of the three-state type. Only the selected chip places data on the data output line, while all other outputs are in the high-impedance state. Twenty-one address bits are needed to select a 32-bit word in this memory. The high-order 2 bits of the address are decoded to determine which of the four Chip Select control signals should be activated, and the remaining 19 address bits are used to access specific byte locations inside each chip of the selected row. The R/W inputs of all chips are tied together to provide a common Read/Write control (not shown in the figure).
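
The decoding of the 21-bit address for this 2M x 32 memory can be sketched in Python. The function name `decode` and the representation of the four Chip Select lines as a list are assumptions made for this example.

```python
CHIP_ADDRESS_BITS = 19   # 512K locations inside each chip
ROWS_OF_CHIPS = 4        # 4 rows x 512K words = 2M words


def decode(word_address):
    """Return (chip_select_lines, address_within_chip) for a 21-bit address."""
    if not 0 <= word_address < ROWS_OF_CHIPS << CHIP_ADDRESS_BITS:
        raise ValueError("address must fit in 21 bits")
    row = word_address >> CHIP_ADDRESS_BITS               # high-order 2 bits
    within_chip = word_address & ((1 << CHIP_ADDRESS_BITS) - 1)
    # Exactly one Chip Select line is set to 1; it enables the four
    # chips of the selected row, one per byte position.
    chip_select = [1 if i == row else 0 for i in range(ROWS_OF_CHIPS)]
    return chip_select, within_chip


select_lines, internal_address = decode((2 << CHIP_ADDRESS_BITS) | 5)
```

For this address the third Chip Select line is active, and all four chips in that row receive the same 19-bit internal address, 5.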

Dynamic Memory Systems

The organization of large dynamic memory systems is essentially the same as the memory shown in Figure 5.10. However, physical implementation is often done more conveniently in the form of memory modules.

Modern computers use very large memories; even a small personal computer is likely to have at least 32M bytes of memory. Typical workstations have at least 128M bytes of memory. A large memory leads to better performance because more of the programs and data used in processing can be held in the memory, thus reducing the frequency of accessing the information in secondary storage. However, if a large memory is built by placing DRAM chips directly on the main system printed-circuit board that contains the processor, often referred to as a motherboard, it will occupy an unacceptably large amount of space on the board. Also, it is awkward to provide for future expansion of the memory, because space must be allocated and wiring provided for the maximum expected size.

These packaging considerations have led to the development of larger memory units known as SIMMs (Single In-line Memory Modules) and DIMMs (Dual In-line Memory Modules). Such a module is an assembly of several memory chips on a separate small board that plugs vertically into a single socket on the motherboard. SIMMs and DIMMs of different sizes are designed to use the same size socket. For example, 4M
x 32, 16M x 32, and 32M x 32 bit DIMMs all use the same 100-pin socket. Similarly, 8M x 64, 16M x 64, 32M x 64, and 64M x 72 DIMMs use a 168-pin socket. Such modules occupy a smaller amount of space on a motherboard, and they allow easy expansion by replacement if a larger module uses the same socket as the smaller one.

4.3 READ-ONLY MEMORIES (ROMs)

Both SRAM and DRAM chips are volatile, which means that they lose the stored information if power is turned off. There are many applications that need memory devices which retain the stored information if power is turned off. For example, in a typical computer a hard disk drive is used to store a large amount of information, including the operating system software. When a computer is turned on, the operating system software has to be loaded from the disk into the memory. This requires execution of a program that boots the operating system. Since the boot program is quite large, most of it is stored on the disk. The processor must execute some instructions that load the boot program into the memory. If the entire memory consisted of only volatile memory chips, the processor would have no means of accessing these instructions. A practical solution is to provide a small amount of nonvolatile memory that holds the instructions whose execution results in loading the boot program from the disk.

Nonvolatile memory is used extensively in embedded systems. Such systems typically do not use disk storage devices. Their programs are stored in nonvolatile semiconductor memory devices.

Different types of nonvolatile memory have been developed. Generally, the contents of such memory can be read as if they were SRAM or DRAM memories. But a special writing process is needed to place the information into this memory. Since its normal operation involves only reading of stored data, a memory of this type is called read-only memory (ROM).

4.3.1 ROM

Figure 5.12 shows a possible configuration for a ROM cell.
A logic value 0 is stored in the cell if the transistor is connected to ground at point P; otherwise, a 1 is stored. The bit line is connected through a resistor to the power supply. To read the state of the cell, the word line is activated. Thus, the transistor switch is closed and the voltage on the bit line drops to near zero if there is a connection between the transistor and ground. If there is no connection to ground, the bit line remains at the high voltage, indicating a 1. A sense circuit at the end of the bit line generates the proper output value. Data are written into a ROM when it is manufactured.

4.3.2 PROM

Some ROM designs allow the data to be loaded by the user, thus providing a programmable ROM (PROM). Programmability is achieved by inserting a fuse at point P in Figure 5.12. Before it is programmed, the memory contains all 0s. The user can insert 1s at the required locations by burning out the fuses at these locations using high-current pulses. Of course, this process is irreversible.

PROMs provide flexibility and convenience not available with ROMs. The latter are economically attractive for storing fixed programs and data when high volumes of ROMs are produced. However, the cost of preparing the masks needed for storing a particular information pattern in ROMs makes them very expensive when only a small number are required. In this case, PROMs provide a faster and considerably less expensive approach because they can be programmed directly by the user.

4.3.3 EPROM

Another type of ROM chip allows the stored data to be erased and new data to be loaded. Such an erasable, reprogrammable ROM is usually called an EPROM. It provides considerable flexibility during the development phase of digital systems. Since EPROMs are capable of retaining stored information for a long time, they can be used in place of ROMs while software is being developed. In this way, memory changes and updates can be easily made.

An EPROM cell has a structure similar to the ROM cell in Figure 5.12. In an EPROM cell, however, the connection to ground is always made at point P and a special transistor is used, which has the ability to function either as a normal transistor or as a disabled transistor that is always turned off. This transistor can be programmed to behave as a permanently open switch, by injecting charge into it that becomes trapped inside. Thus, an EPROM cell can be used to construct a memory in the same way as the previously discussed ROM cell.

The important advantage of EPROM chips is that their contents can be erased and reprogrammed.
Erasure requires dissipating the charges trapped in the transistors of memory cells; this can be done by exposing the chip to ultraviolet light. For this reason, EPROM chips are mounted in packages that have transparent windows.

4.3.4 EEPROM

A significant disadvantage of EPROMs is that a chip must be physically removed from the circuit for reprogramming and that its entire contents are erased by the ultraviolet light. It is possible to implement another version of erasable PROMs that can be both programmed and erased electrically. Such chips, called EEPROMs, do not have to be removed for erasure. Moreover, it is possible to erase the cell contents selectively. The only disadvantage of EEPROMs is that different voltages are needed for erasing, writing, and reading the stored data.

FLASH MEMORY

An approach similar to EEPROM technology has more recently given rise to flash memory devices. A flash cell is based on a single transistor controlled by trapped charge, just like an EEPROM cell. While similar in some respects, there are also substantial differences between flash and EEPROM devices. In EEPROM it is possible to read and write the contents of a single cell. In a flash device it is possible to read the contents of a single cell, but it is only possible to write an entire block of cells. Prior to writing, the previous contents of the block are erased.

Flash devices have greater density, which leads to higher capacity and a lower cost per bit. They require a single power supply voltage, and consume less power in their operation. The low power consumption of flash memory makes it attractive for use in portable equipment that is battery driven. Typical applications include hand-held computers, cell phones, digital cameras, and MP3 music players. In hand-held computers and cell phones, flash memory holds the software needed to operate the equipment, thus obviating the need for a disk drive. In digital cameras, flash memory is used to store picture image data. In MP3 players, flash memory stores the data that represent sound. Cell phones, digital cameras, and MP3 players are good examples of embedded systems.
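
The difference between reading a single location and writing a whole block can be made concrete with a small Python model. This is a sketch under stated assumptions: the class name `FlashBlockDevice`, the block sizes, and the erased value 0xFF are illustrative and do not describe any particular device.

```python
ERASED = 0xFF  # assumed erased value of a location; device-specific


class FlashBlockDevice:
    def __init__(self, blocks, block_size):
        self.block_size = block_size
        self.blocks = [[ERASED] * block_size for _ in range(blocks)]

    def read(self, block, offset):
        """Individual locations can be read directly."""
        return self.blocks[block][offset]

    def write_block(self, block, data):
        """Writing replaces a whole block; old contents are erased first."""
        if len(data) != self.block_size:
            raise ValueError("data must fill exactly one block")
        self.blocks[block] = [ERASED] * self.block_size  # erase
        self.blocks[block] = list(data)                  # program

    def update_byte(self, block, offset, value):
        """Changing one location means rewriting the block that holds it."""
        data = list(self.blocks[block])
        data[offset] = value
        self.write_block(block, data)


device = FlashBlockDevice(blocks=2, block_size=4)
device.write_block(1, [1, 2, 3, 4])
device.update_byte(1, 2, 9)
```

Even the single-location update in the last line goes through `write_block`, so the entire block is erased and rewritten, which is the behaviour that distinguishes flash from EEPROM in the description above.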
Single flash chips do not provide sufficient storage capacity for the applications mentioned above. Larger memory modules consisting of a number of chips are needed. There are two popular choices for the implementation of such modules: flash cards and flash drives.

Flash Cards

One way of constructing a larger module is to mount flash chips on a small card. Such flash cards have a standard interface that makes them usable in a variety of products. A card is simply plugged into a conveniently accessible slot. Flash cards come in a variety of memory sizes. Typical sizes are 8, 32, and 64 Mbytes. A minute of music can be stored in about 1 Mbyte of memory, using the MP3 encoding format. Hence, a 64-MB flash card can store about an hour of music.

Flash Drives

Larger flash memory modules have been developed to replace hard disk drives. These flash drives are designed to fully emulate the hard disks, to the point that they can be fitted into standard disk drive bays. However, the storage capacity of flash drives is significantly lower. Currently, the capacity of flash drives is less than one gigabyte. In contrast, hard disks can store many gigabytes. The fact that flash drives are solid state electronic devices that have no movable parts provides some important advantages. They have shorter seek and access times, which results in faster response. They have lower power consumption, which makes them attractive for battery driven applications, and they are also insensitive to vibration.

The disadvantages of flash drives compared to hard disk drives are their smaller capacity and higher cost per bit. Disks provide an extremely low cost per bit. Another disadvantage is that the
flash memory will deteriorate after it has been written a number of times. Fortunately, this number is high, typically at least one million times.

4.4 MEMORY SYSTEM CONSIDERATIONS

The choice of a RAM chip for a given application depends on several factors. Foremost among these factors are the cost, speed, power dissipation, and size of the chip.

Static RAMs are generally used only when very fast operation is the primary requirement. Their cost and size are adversely affected by the complexity of the circuit that realizes the basic cell. They are used mostly in cache memories. Dynamic RAMs are the predominant choice for implementing computer main memories. The high densities achievable in these chips make large memories economically feasible.

Memory Controller

To reduce the number of pins, the dynamic memory chips use multiplexed address inputs. The address is divided into two parts. The high-order address bits, which select a row in the cell array, are provided first and latched into the memory chip under control of the RAS signal. Then, the low-order address bits, which select a column, are provided on the same address pins and latched using the CAS signal. A typical processor issues all bits of an address at the same time. The required multiplexing of address bits is usually performed by a memory controller circuit, which is interposed between the processor and the dynamic memory as shown in Figure 5.11.

The controller accepts a complete address and the R/W signal from the processor, under control of a Request signal which indicates that a memory access operation is needed. The controller then forwards the row and column portions of the address to the memory and generates the RAS and CAS signals. Thus, the controller provides the RAS-CAS timing, in addition to its address multiplexing function. It also sends the R/W and CS signals to the memory. The CS signal is usually active low, hence it is shown as CS in Figure 5.11. Data lines are connected directly between the processor and the memory. Note that the clock signal is needed in SDRAM chips. When used with DRAM chips, which do not have self-refreshing capability, the memory controller has to provide all the information needed to control the refreshing process. It contains a refresh counter that provides successive row addresses. Its function is to cause the refreshing of all rows to be done within the period specified for a particular device. Refresh Overhead Computer Science & Engineering Dept. SJCET, Palai

All dynamic memories have to be refreshed. In older DRAMs, a typical period for refreshing all rows was 16 ms. In SDRAMs, a typical period is 64 ms. Consider an SDRAM whose cells are arranged in 8K (= 8192) rows. Suppose that it takes four clock cycles to access (read) each row. Then, it takes 8192 x 4 = 32,768 cycles to refresh all rows. At a clock rate of 133 MHz, the time needed to refresh all rows is 32,768/(133 x 10^6) = 246 x 10^-6 seconds. Thus, the refreshing process occupies 0.246 ms in each 64-ms time interval. Therefore, the refresh overhead is 0.246/64 = 0.0038, which is less than 0.4 percent of the total time available for accessing the memory. 4.5 ASSOCIATIVE MEMORY An associative memory (content-addressable memory, CAM) is a memory that is capable of determining whether a given datum, the search word, is contained in one of its addresses or locations. This may be accomplished by a number of mechanisms. In some cases parallel combinational logic is applied at each word in the memory and a test is made simultaneously for coincidence with the search word. In other cases the search word and all of the words in the memory are shifted serially in synchronism; a single bit of the search word is then compared to the same bit of all of the memory words using as many single-bit coincidence circuits as there are words in the memory. Extensions of the associative memory technique allow for masking the search word or requiring only a close match as opposed to an exact match. Small parallel associative memories are used in cache memory and virtual memory mapping applications. Since parallel operations on many words are expensive (in hardware), a variety of stratagems are used to approximate associative memory operation without actually carrying out the full test described here. One of these uses hashing to generate a best guess for a conventional address followed by a test of the contents of that address. 
Some associative memories have been built to be accessed conventionally (by words in parallel) and as serial comparison associative memories; these have been called orthogonal memories.
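The masked search described above can be imitated in software. The following Python fragment is only an illustrative sketch: the function name cam_search, the 8-bit word width, and the sample memory contents are assumptions made for this example. The loop examines the words one after another, whereas a hardware CAM compares all words at the same time.

```python
def cam_search(words, search_word, mask=0xFF):
    """Return the locations whose contents match search_word.

    Only bit positions where mask holds a 1 take part in the comparison,
    which models masking of the search word. The default mask assumes
    8-bit words (an assumption made for this sketch).
    """
    return [loc for loc, word in enumerate(words)
            if ((word ^ search_word) & mask) == 0]


# Hypothetical memory contents, one 8-bit word per location.
memory = [0b10110010, 0b10111111, 0b00110010, 0b10110010]

exact = cam_search(memory, 0b10110010)                    # all 8 bits compared
high_nibble = cam_search(memory, 0b10110000, mask=0xF0)   # upper 4 bits only
print(exact, high_nibble)
```

The second call shows the effect of the mask: words that differ from the search word only in the low-order four bits still count as matches.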

4.6 VIRTUAL MEMORIES In most modern computer systems, the physical main memory is not as large as the address space spanned by an address issued by the processor. For example, a processor that issues 32-bit addresses has an addressable space of 4G bytes. The size of the main memory in a typical computer ranges from a few hundred megabytes to 1G bytes. When a program does not completely fit into the main memory, the parts of it not currently being executed are stored on secondary storage devices, such as magnetic disks. Of course, all parts of a program that are eventually executed are first brought into the main memory. When a new segment of a program is to be moved into a full memory, it
must replace another segment already in the memory. In modern computers, the operating system moves programs and data automatically between the main memory and secondary storage. Thus, the application programmer does not need to be aware of limitations imposed by the available main memory. Techniques that automatically move program and data blocks into the physical main memory when they are required for execution are called virtual-memory techniques. Programs, and hence the processor, reference an instruction and data space that is independent of the available physical main memory space. The binary addresses that the processor issues for either instructions or data are called virtual or logical addresses. These addresses are translated into physical addresses by a combination of hardware and software components. If a virtual address refers to a part of the program or data space that is currently in the physical memory, then the contents of the appropriate location in the main memory are accessed immediately. On the other hand, if the referenced address is not in the main memory, its contents must be brought into a suitable location in the memory before they can be used. Figure 5.26 shows a typical organization that implements virtual memory. A special hardware unit, called the Memory Management Unit (MMU), translates virtual addresses into physical addresses. When the desired data (or instructions) are in the main memory, these data are fetched as described in our presentation of the cache mechanism. If the data are not in the main memory, the MMU causes the operating system to bring the data into the memory from the disk. Transfer of data between the disk and the main memory is performed using the DMA scheme.

ADDRESS TRANSLATION A simple method for translating virtual addresses into physical addresses is to assume that all programs and data are composed of fixed-length units called pages, each of which consists of a block of words that occupy contiguous locations in the main memory. Pages commonly range from 2K to 16K bytes in length. They constitute the basic unit of information that is moved between the main memory and the disk whenever the translation mechanism determines that a move is required. Pages should not be too small, because the access time of a magnetic disk is much longer (several milliseconds) than the access time of the main memory. The reason for this is that it takes a considerable amount of time to locate the data on the disk, but once located, the data can be transferred at a rate of several megabytes per second. On the other hand, if pages are too large it is possible that a substantial portion of a page may not be used, yet this unnecessary data will occupy valuable space in the main memory. This discussion clearly parallels the concepts introduced in Section 5.5 on cache memory. The cache bridges the speed gap between the processor and the main memory and is implemented in hardware. The virtual-memory mechanism bridges the size and speed gaps between the main memory and secondary storage and is usually implemented in part by software techniques. Conceptually, cache techniques and virtual-memory techniques are very similar. They differ mainly in the details of their implementation. A virtual-memory address translation method based on the concept of fixed-length pages is shown schematically in Figure 5.27. Each virtual address generated by the processor, whether it is for an instruction fetch or an operand fetch/store operation, is interpreted as a virtual page number (high-order bits) followed by an offset (low-order bits) that specifies the location of a particular byte (or word) within a page. 
Information about the main memory location of each page is kept in a page table. This information includes the main memory address where the page is stored and the current status of the page. An area in the main memory that can hold one page is called a page frame. The starting address of the page table is kept in a page table base register. By adding the virtual page number to the contents of this register, the address of the corresponding entry in the page table is obtained. The contents of this location give the starting address of the page if that page currently resides in the main memory. Each entry in the page table also includes some control bits that describe the status of the page while it is in the main memory. One bit indicates the validity of the page, that is, whether the page is actually loaded in the main memory. This bit allows the operating system to invalidate the page without actually removing it. Another bit indicates whether the page has been modified during its residency in the memory. As in cache memories, this information is needed to determine whether the page should be written back to the disk before it is removed from the main memory to make room for another page. Other control bits indicate various restrictions that may be imposed on accessing the page. For example, a program may be given full read and write permission, or it may be restricted to read accesses only.
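The translation steps described above can be sketched in a few lines of Python. This is a simplified model and not the organization of any particular MMU: the 4K-byte page size, the names translate and PageFault, and the use of a Python list indexed by virtual page number (standing in for the page table base register plus the virtual page number) are all assumptions made for illustration. Each entry stores a page frame number rather than a full starting address.

```python
PAGE_SIZE = 4096                            # assumed 4K-byte pages
OFFSET_BITS = PAGE_SIZE.bit_length() - 1    # 12 low-order offset bits


class PageFault(Exception):
    """Raised when the referenced page is not in the main memory."""


def translate(virtual_address, page_table):
    """Translate a virtual address using a one-level page table."""
    vpn = virtual_address >> OFFSET_BITS          # virtual page number
    offset = virtual_address & (PAGE_SIZE - 1)    # position within the page
    entry = page_table[vpn]                       # entry at base + vpn
    if not entry["valid"]:                        # valid bit is 0
        raise PageFault(vpn)
    return (entry["frame"] << OFFSET_BITS) | offset


# Hypothetical page table: page 0 is resident in frame 5, page 1 is not loaded.
page_table = [
    {"frame": 5, "valid": True, "modified": False},
    {"frame": 0, "valid": False, "modified": False},
]
print(hex(translate(0x0ABC, page_table)))
```

A reference to page 1 in this example raises PageFault, which corresponds to the case where the operating system has to bring the page in from the disk.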

The page table information is used by the MMU for every read and write access, so ideally, the page table should be situated within the MMU. Unfortunately, the page table may be rather large, and since the MMU is normally implemented as part of the processor chip (along with the primary cache), it is impossible to include a complete page table on this chip. Therefore, the page table is kept in the main memory. However, a copy of a small portion of the page table can be accommodated within the MMU. This portion consists of the page table entries that correspond to the most recently accessed pages. A small cache, usually called the Translation Lookaside Buffer (TLB), is incorporated into the MMU for this purpose. The operation of the TLB with respect to the page table in the main memory is essentially the same as the operation we have discussed in conjunction with the cache memory. In addition to the information that constitutes a page table entry, the TLB must also include the virtual address of the entry. Figure 5.28 shows a possible organization of a TLB where the associative-mapping technique is used. Set-associative mapped TLBs are also found in commercial products.

An essential requirement is that the contents of the TLB be coherent with the contents of page tables in the memory. When the operating system changes the contents of page tables, it must simultaneously invalidate the corresponding entries in the TLB. One of the control bits in the TLB is provided for this purpose. When an entry is invalidated, the TLB will acquire the new information as part of the MMU's normal response to access misses. Address translation proceeds as follows. Given a virtual address, the MMU looks in the TLB for the referenced page. If the page table entry for this page is found in the TLB, the physical address is obtained immediately. If there is a miss in the TLB, then the required entry is obtained from the page table in the main memory and the TLB is updated. When a program generates an access request to a page that is not in the main memory, a page fault is said to have occurred. The whole page must be brought from the disk into the memory before access can proceed. When it detects a page fault, the MMU asks the operating system to intervene by raising an exception (interrupt). Processing of the active task is interrupted, and control is transferred to the operating system. The operating
system then copies the requested page from the disk into the main memory and returns control to the interrupted task. Because a long delay occurs while the page transfer takes place, the operating system may suspend execution of the task that caused the page fault and begin execution of another task whose pages are in the main memory. It is essential to ensure that the interrupted task can continue correctly when it resumes execution. A page fault occurs when some instruction accesses a memory operand that is not in the main memory, resulting in an interruption before the execution of this instruction is completed. Hence, when the task resumes, either the execution of the interrupted instruction must continue from the point of interruption, or the instruction must be restarted. The design of a particular processor dictates which of these options should be used. If a new page is brought from the disk when the main memory is full, it must replace one of the resident pages. The problem of choosing which page to remove is just as critical here as it is in a cache, and the idea that programs spend most of their time in a few localized areas also applies. Because main memories are considerably larger than cache memories, it should be possible to keep relatively larger portions of a program in the main memory. This will reduce the frequency of transfers to and from the disk. Concepts similar to the LRU replacement algorithm can be applied to page replacement, and the control bits in the page table entries can indicate usage. One simple scheme is based on a control bit that is set to 1 whenever the corresponding page is referenced (accessed). The operating system occasionally clears this bit in all page table entries, thus providing a simple way of determining which pages have not been used recently. A modified page has to be written back to the disk before it is removed from the main memory. 
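The simple usage-bit scheme described above can be sketched as follows. The dictionary layout, the field names, and the helper names (reference, clear_use_bits, choose_victim, evict) are hypothetical and chosen only for this illustration. The fallback used when every resident page has its bit set is an assumption, since the text does not specify one.

```python
def reference(page_table, vpn):
    """Set the usage bit to 1 whenever the page is referenced."""
    page_table[vpn]["used"] = True


def clear_use_bits(page_table):
    """Done occasionally by the operating system."""
    for entry in page_table.values():
        entry["used"] = False


def choose_victim(page_table):
    """Prefer a resident page that has not been used since the last clearing."""
    resident = [vpn for vpn, e in page_table.items() if e["valid"]]
    for vpn in resident:
        if not page_table[vpn]["used"]:
            return vpn
    # Assumed fallback: all resident pages were used recently, take the first.
    return resident[0] if resident else None


def evict(page_table, vpn, write_back):
    """Remove a page, writing it back to the disk first if it was modified."""
    entry = page_table[vpn]
    if entry["modified"]:
        write_back(vpn)
    entry["valid"] = False
    entry["modified"] = False


# Hypothetical state: pages 0, 1 and 2 are resident and page 1 has been modified.
table = {n: {"valid": True, "used": True, "modified": n == 1} for n in range(3)}
written = []                      # stands in for the disk write
clear_use_bits(table)
reference(table, 0)
reference(table, 2)
victim = choose_victim(table)
evict(table, victim, written.append)
print(victim, written)
```
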
It is important to note that the write-through protocol, which is useful in the framework of cache memories, is not suitable for virtual memory. The access time of the disk is so long that it does not make sense to access it frequently to write small amounts of data. The address translation process in the MMU requires some time to perform, mostly dependent on the time needed to look up entries in the TLB. Because of locality of reference, it is likely that many successive translations involve addresses on the same page. This is particularly evident in fetching instructions. Thus, we can reduce the average translation time by including one or more special registers that retain the virtual page number and the physical page frame of the most recently performed translations. The information in these registers can be accessed more quickly than the TLB. 4.7 CACHE MEMORIES The speed of the main memory is very low in comparison with the speed of modern processors. For good performance, the processor cannot spend much of its time waiting to access instructions and data in main memory. Hence, it is important to devise a scheme that reduces the time needed to access the necessary information. Since the speed of the main memory unit is limited by electronic and packaging constraints, the solution must be sought in a different architectural arrangement. An efficient solution is to use a fast cache memory which essentially makes the main memory appear to the processor to be faster than it really is. The effectiveness of the cache mechanism is based on a property of computer programs called locality of reference. Analysis of programs shows that most of their execution time is spent on routines in which many instructions are executed repeatedly. These instructions may constitute a simple loop, nested loops, or a few procedures that
repeatedly call each other. The actual detailed pattern of instruction sequencing is not important; the point is that many instructions in localized areas of the program are executed repeatedly during some time period, and the remainder of the program is accessed relatively infrequently. This is referred to as locality of reference. It manifests itself in two ways: temporal and spatial. The first means that a recently executed instruction is likely to be executed again very soon. The spatial aspect means that instructions in close proximity to a recently executed instruction (with respect to the instructions' addresses) are also likely to be executed soon. If the active segments of a program can be placed in a fast cache memory, then the total execution time can be reduced significantly. Conceptually, operation of a cache memory is very simple. The memory control circuitry is designed to take advantage of the property of locality of reference. The temporal aspect of the locality of reference suggests that whenever an information item (instruction or data) is first needed, this item should be brought into the cache where it will hopefully remain until it is needed again. The spatial aspect suggests that instead of fetching just one item from the main memory to the cache, it is useful to fetch several items that reside at adjacent addresses as well. We will use the term block to refer to a set of contiguous address locations of some size. Another term that is often used to refer to a cache block is cache line. Consider the simple arrangement in Figure 5.14. When a Read request is received from the processor, the contents of a block of memory words containing the location specified are transferred into the cache one word at a time. Subsequently, when the program references any of the locations in this block, the desired contents are read directly from the cache. 
Usually, the cache memory can store a reasonable number of blocks at any given time, but this number is small compared to the total number of blocks in the main memory. The correspondence between the main memory blocks and those in the cache is specified by a mapping function. When the cache is full and a memory word (instruction or data) that is not in the cache is referenced, the cache control hardware must decide which block should be removed to create space for the new block that contains the referenced word. The collection of rules for making this decision constitutes the replacement algorithm.

The processor does not need to know explicitly about the existence of the cache. It simply issues Read and Write requests using addresses that refer to locations in the memory. The cache control circuitry determines whether the requested word currently exists in the cache. If it does, the Read or Write operation is performed on the appropriate cache location. In this case, a read or write hit is said to have occurred. In a Read operation, the main memory is not involved. For a Write operation, the system can proceed in two ways. In the first technique, called the write-through protocol, the cache location and the
main memory location are updated simultaneously. The second technique is to update only the cache location and to mark it as updated with an associated flag bit, often called the dirty or modified bit. The main memory location of the word is updated later, when the block containing this marked word is to be removed from the cache to make room for a new block. This technique is known as the write-back, or copy-back, protocol. The write-through protocol is simpler, but it results in unnecessary Write operations in the main memory when a given cache word is updated several times during its cache residency. Note that the write-back protocol may also result in unnecessary Write operations because when a cache block is written back to the memory all words of the block are written back, even if only a single word has been changed while the block was in the cache. When the addressed word in a Read operation is not in the cache, a read miss occurs. The block of words that contains the requested word is copied from the main memory into the cache. After the entire block is loaded into the cache, the particular word requested is forwarded to the processor. Alternatively, this word may be sent to the processor as soon as it is read from the main memory. The latter approach, which is called load-through, or early restart, reduces the processor's waiting period somewhat, but at the expense of more complex circuitry. During a Write operation, if the addressed word is not in the cache, a write miss occurs. Then, if the write-through protocol is used, the information is written directly into the main memory. In the case of the write-back protocol, the block containing the addressed word is first brought into the cache, and then the desired word in the cache is overwritten with the new information. 4.8 MAPPING FUNCTIONS To discuss possible methods for specifying where memory blocks are placed in the cache, we use a specific small example. 
Consider a cache consisting of 128 blocks of 16 words each, for a total of 2048 (2K) words, and assume that the main memory is addressable by a 16-bit address. The main memory has 64K words, which we will view as 4K blocks of 16 words each. For simplicity, we will assume that consecutive addresses refer to consecutive words. 4.8.1 Direct Mapping The simplest way to determine cache locations in which to store memory blocks is the direct-mapping technique. In this technique, block j of the main memory maps onto block j modulo 128 of the cache, as depicted in Figure 5.15. Thus, whenever one of the main memory blocks 0, 128, 256, ... is loaded in the cache, it is stored in cache block 0. Blocks 1, 129, 257, ... are stored in cache block 1, and so on. Since more than one memory block is mapped onto a given cache block position, contention may arise for that position even when the cache is not full. For example, instructions of a program may start in block 1 and continue in block 129, possibly after a branch. As this program is executed, both of these blocks must be transferred to the block-1 position in the cache. Contention is resolved by allowing the new block to overwrite the currently resident block. In this case, the replacement algorithm is trivial. Placement of a block in the cache is determined from the memory address. The memory address can be divided into three fields, as shown in Figure 5.15. The low-order 4 bits select one of 16 words in a block. When a new block enters the cache, the 7-bit cache block field determines the cache position in which this block must be stored. The high-order 5 bits of the memory address of the block are stored in 5 tag bits associated with its location in the cache. They identify which of the 32 blocks that are mapped into this cache position is currently resident in the cache. As execution proceeds, the 7-bit cache block field of each address generated by the processor points to a particular block location in the cache. The high-order 5 bits of the address are compared with the tag bits associated with that cache location. If they match, then the desired word is in that block of the cache. If there is no match, then the block containing the required word must first be read from the main memory and loaded into the cache. The direct-mapping technique is easy to implement, but it is not very flexible.
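The field layout of the direct-mapped example (5 tag bits, 7 block bits, 4 word bits) can be checked with a small Python sketch. The names split_address and DirectMappedCache are assumptions made for this illustration, and the model tracks only the tags, not the data held in each block.

```python
WORD_BITS = 4     # 16 words per block
BLOCK_BITS = 7    # 128 cache blocks


def split_address(address):
    """Split a 16-bit address into (tag, cache block, word) fields."""
    word = address & ((1 << WORD_BITS) - 1)
    block = (address >> WORD_BITS) & ((1 << BLOCK_BITS) - 1)
    tag = address >> (WORD_BITS + BLOCK_BITS)
    return tag, block, word


class DirectMappedCache:
    """Tag store of a direct-mapped cache (data storage omitted)."""

    def __init__(self):
        self.tags = [None] * (1 << BLOCK_BITS)   # None means nothing loaded

    def access(self, address):
        tag, block, _ = split_address(address)
        if self.tags[block] == tag:
            return "hit"
        self.tags[block] = tag                   # new block overwrites the old one
        return "miss"


cache = DirectMappedCache()
# Word addresses 16 and 17 lie in memory block 1; address 2064 lies in block 129.
results = [cache.access(a) for a in (16, 17, 2064, 16)]
print(split_address(2064), results)
```

The last access misses even though the cache is almost empty, because memory blocks 1 and 129 contend for the same cache position.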

4.8.2 Associative Mapping Figure 5.16 shows a much more flexible mapping method, in which a main memory block can be placed into any cache block position. In this case, 12 tag bits are required to
identify a memory block when it is resident in the cache. The tag bits of an address received from the processor are compared to the tag bits of each block of the cache to see if the desired block is present. This is called the associative-mapping technique. It gives complete freedom in choosing the cache location in which to place the memory block. Thus, the space in the cache can be used more efficiently. A new block that has to be brought into the cache has to replace (eject) an existing block only if the cache is full. In this case, we need an algorithm to select the block to be replaced. Many replacement algorithms are possible. The cost of an associative cache is higher than the cost of a direct-mapped cache because of the need to search all 128 tag patterns to determine whether a given block is in the cache. A search of this kind is called an associative search. For performance reasons, the tags must be searched in parallel.

4.8.3 Set-Associative Mapping A combination of the direct- and associative-mapping techniques can be used. Blocks of the cache are grouped into sets, and the mapping allows a block of the main memory to reside in any block of a specific set. Hence, the contention problem of the direct method is eased by having a few choices for block placement. At the same time, the hardware cost is reduced by decreasing the size of the associative search. An example of this set-associative-mapping technique is shown in Figure 5.17 for a cache with two blocks per
set. In this case, memory blocks 0, 64, 128, ..., 4032 map into cache set 0, and they can occupy either of the two block positions within this set. Having 64 sets means that the 6-bit set field of the address determines which set of the cache might contain the desired block. The tag field of the address must then be associatively compared to the tags of the two blocks of the set to check if the desired block is present. This two-way associative search is simple to implement.

The number of blocks per set is a parameter that can be selected to suit the requirements of a particular computer. For the main memory and cache sizes in Figure 5.17, four blocks per set can be accommodated by a 5-bit set field, eight blocks per set by a 4-bit set field, and so on. The extreme condition of 128 blocks per set requires no set bits and corresponds to the fully associative technique, with 12 tag bits. The other extreme of one block per set is the direct-mapping method. A cache that has k blocks per set is referred to as a k-way set-associative cache.
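The field widths quoted above follow directly from the cache parameters. The short Python sketch below recomputes them for the example cache (128 blocks of 16 words, 16-bit addresses). The function name field_widths is hypothetical, and the calculation assumes that all sizes are powers of two.

```python
def field_widths(blocks_per_set, cache_blocks=128, words_per_block=16,
                 address_bits=16):
    """Return (tag bits, set bits, word bits) for a k-way set-associative cache."""
    sets = cache_blocks // blocks_per_set
    word_bits = (words_per_block - 1).bit_length()   # log2 for powers of two
    set_bits = (sets - 1).bit_length()               # 0 when there is one set
    tag_bits = address_bits - set_bits - word_bits
    return tag_bits, set_bits, word_bits


for k in (1, 2, 4, 8, 128):
    print(k, field_widths(k))
```

With one block per set the result is the direct-mapped layout (5, 7, 4), and with 128 blocks per set it is the fully associative layout with 12 tag bits and no set field.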

One more control bit, called the valid bit, must be provided for each block. This bit indicates whether the block contains valid data. It should not be confused with the modified, or dirty, bit mentioned earlier. The dirty bit, which indicates whether the block has been modified during its cache residency, is needed only in systems that do not use the write-through method. The valid bits are all set to 0 when power is initially applied to the system or when the main memory is loaded with new programs and data from the disk. Transfers from the disk to the main memory are carried out by a DMA mechanism. Normally, they bypass the cache for both cost and performance reasons. The valid bit of a particular cache block is set to 1 the first time this block is loaded from the main memory. Whenever a main memory block is updated by a source that bypasses the cache, a check is made to determine whether the block being loaded is currently in the cache. If it is, its valid bit is cleared to 0. This ensures that stale data will not exist in the cache. A similar difficulty arises when a DMA transfer is made from the main memory to the disk, and the cache uses the write-back protocol. In this case, the data in the memory might not reflect the changes that may have been made in the cached copy. One solution to this problem is to flush the cache by forcing the dirty data to be written back to the memory before the DMA transfer takes place. The operating system can do this easily, and it does not affect performance greatly, because such disk transfers do not occur often. This need to ensure that two different entities (the processor and DMA subsystems in this case) use the same copies of data is referred to as a cache-coherence problem. 4.9 REPLACEMENT ALGORITHMS In a direct-mapped cache, the position of each block is predetermined; hence, no replacement strategy exists. In associative and set-associative caches there exists some flexibility. 
When a new block is to be brought into the cache and all the positions that it may occupy are full, the cache controller must decide which of the old blocks to overwrite. This is an important issue because the decision can be a strong determining factor in system performance. In general, the objective is to keep blocks in the cache that are likely to be referenced in the near future. However, it is not easy to determine which blocks are about to be referenced. The property of locality of reference in programs gives a clue to a reasonable strategy. Because programs usually stay in localized areas for reasonable periods of time, there is a high probability that the blocks that have been referenced recently will be referenced again soon. Therefore, when a block is to be overwritten, it is sensible to overwrite the one that has gone the longest time without being referenced. This block is called the least recently used (LRU) block, and the technique is called the LRU replacement algorithm. To use the LRU algorithm, the cache controller must track references to all blocks as computation proceeds. Suppose it is required to track the LRU block of a four-block set in a set-associative cache. A 2-bit counter can be used for each block. When a hit occurs, the counter of the block that is referenced is set to 0. Counters with values originally lower than the referenced one are incremented by one, and all others remain unchanged. When a miss occurs and the set is not full, the counter associated with the new block loaded from the main memory is set to 0, and the values of all other counters are increased by one. When a miss occurs and the set is full, the block with the counter value 3 is removed, the new block is put in its place, and its counter is set to 0. The other three block counters are incremented by one. It can be easily verified that the counter values of occupied blocks are always distinct.
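The counter rules for a four-block set can be followed step by step in a short Python model. The class name LRUSet and the use of a dictionary from tag to counter value are assumptions made for this sketch; in hardware, each block of the set has its own 2-bit counter.

```python
class LRUSet:
    """One set of a set-associative cache with per-block LRU counters."""

    def __init__(self, ways=4):
        self.ways = ways
        self.counters = {}            # tag -> counter value (0 = most recent)

    def access(self, tag):
        if tag in self.counters:                      # hit
            referenced = self.counters[tag]
            for other, value in self.counters.items():
                if value < referenced:                # only lower counters move up
                    self.counters[other] = value + 1
            self.counters[tag] = 0
            return "hit"
        if len(self.counters) == self.ways:           # miss with a full set
            victim = next(t for t, v in self.counters.items()
                          if v == self.ways - 1)      # counter value 3
            del self.counters[victim]
        for other in self.counters:                   # miss: all others move up
            self.counters[other] += 1
        self.counters[tag] = 0
        return "miss"


lru = LRUSet()
trace = [lru.access(t) for t in ("A", "B", "C", "D", "A", "E")]
print(trace, lru.counters)
```

In this trace, block B is the least recently used block when E arrives, so B is the one removed, and the four counters remain distinct afterwards.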

The LRU algorithm has been used extensively. Although it performs well for many access patterns, it can lead to poor performance in some cases. For example, it produces disappointing results when accesses are made to sequential elements of an array that is slightly too large to fit into the cache (see Section 5.5.3 and Problem 5.12). Performance of the LRU algorithm can be improved by introducing a small amount of randomness in deciding which block to replace. Several other replacement algorithms are also used in practice. An intuitively reasonable rule would be to remove the oldest block from a full set when a new block must be brought in. However, because this algorithm does not take into account the recent pattern of access to blocks in the cache, it is generally not as effective as the LRU algorithm in choosing the best blocks to remove. The simplest algorithm is to randomly choose the block to be overwritten. Interestingly enough, this simple algorithm has been found to be quite effective in practice. 4.10 MEMORY INTERLEAVING If the main memory of a computer is structured as a collection of physically separate modules, each with its own address buffer register (ABR) and data buffer register (DBR), memory access operations may proceed in more than one module at the same time. Thus, the aggregate rate of transmission of words to and from the main memory system can be increased. How individual addresses are distributed over the modules is critical in determining the average number of modules that can be kept busy as computations proceed. Two methods of address layout are indicated in Figure 5.25. In the first case, the memory address generated by the processor is decoded as shown in Figure 5.25a. The high-order k bits name one of n modules, and the low-order m bits name a particular word in that module. When consecutive locations are accessed, as happens when a block of data is transferred to a cache, only one module is involved. 
At the same time, however, devices with direct memory access (DMA) ability may be accessing information in other memory modules. The second and more effective way to address the modules is shown in Figure 5.25b. It is called memory interleaving. The low-order k bits of the memory address select a module, and the high-order m bits name a location within that module. In this way, consecutive addresses are located in successive modules. Thus, any component of the system that generates requests for access to consecutive memory locations can keep several modules busy at any one time. This results in both faster accesses to a block of data and higher average utilization of the memory system as a whole. To implement the interleaved structure, there must be 2^k modules; otherwise, there will be gaps of nonexistent locations in the memory address space.
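The difference between the two address layouts can be seen by decoding a few consecutive addresses. In the Python sketch below, the function names and the sizes k = 2 (four modules) and m = 4 (16 words per module) are assumptions chosen only for illustration.

```python
def consecutive_layout(address, k, m):
    """High-order k bits select the module, low-order m bits the word."""
    return address >> m, address & ((1 << m) - 1)


def interleaved_layout(address, k, m):
    """Low-order k bits select the module, high-order m bits the word."""
    return address & ((1 << k) - 1), address >> k


K, M = 2, 4   # assumed sizes: 4 modules of 16 words each
block = range(4)                                   # four consecutive addresses
modules_plain = [consecutive_layout(a, K, M)[0] for a in block]
modules_interleaved = [interleaved_layout(a, K, M)[0] for a in block]
print(modules_plain, modules_interleaved)
```

With the first layout all four addresses fall in module 0, while with interleaving each address falls in a different module, so the four accesses can overlap.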


HIT RATE AND MISS PENALTY

An excellent indicator of the effectiveness of a particular implementation of the memory hierarchy is the success rate in accessing information at various levels of the hierarchy. Recall that a successful access to data in a cache is called a hit. The number of hits stated as a fraction of all attempted accesses is called the hit rate, and the miss rate is the number of misses stated as a fraction of attempted accesses.

Ideally, the entire memory hierarchy would appear to the processor as a single memory unit that has the access time of a cache on the processor chip and the size of a magnetic disk. How close we get to this ideal depends largely on the hit rate at different levels of the hierarchy. High hit rates, well over 0.9, are essential for high-performance computers.

Performance is adversely affected by the actions that must be taken after a miss. The extra time needed to bring the desired information into the cache is called the miss penalty. This penalty is ultimately reflected in the time that the processor is stalled because the required instructions or data are not available for execution. In general, the miss penalty is the time needed to bring a block of data from a slower unit in the memory hierarchy to a faster unit. The miss penalty is reduced if efficient mechanisms for transferring data between the various units of the hierarchy are implemented. The previous section shows how an interleaved memory can reduce the miss penalty substantially.
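To make the hit rate concrete, the following sketch counts hits for the access pattern mentioned in the discussion of replacement algorithms: repeated sequential passes over an array slightly too large for the cache. It is a simplified model under stated assumptions (a fully associative cache of 4 blocks, an array of 5 blocks, 100 passes); the function names and the fixed random seed are illustrative choices, not part of the text.

```python
import random
from collections import OrderedDict


def lru_hit_rate(accesses, capacity):
    """Fully associative cache with LRU replacement; returns hits / accesses."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return hits / len(accesses)


def random_hit_rate(accesses, capacity, seed=0):
    """Same cache, but a randomly chosen resident block is overwritten on a miss."""
    rng = random.Random(seed)
    cache = []
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1
        elif len(cache) < capacity:
            cache.append(block)
        else:
            cache[rng.randrange(capacity)] = block
    return hits / len(accesses)


# Assumed workload: 100 sequential passes over 5 blocks with a 4-block cache.
accesses = [b for _ in range(100) for b in range(5)]
lru_rate = lru_hit_rate(accesses, capacity=4)
random_rate = random_hit_rate(accesses, capacity=4)
print(f"LRU hit rate:    {lru_rate:.2f}")
print(f"Random hit rate: {random_rate:.2f}")
```

In this model LRU always evicts exactly the block that will be needed next, so its hit rate is zero, whereas random replacement keeps some of the needed blocks resident.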


Input Output

5.1 Printers


In computing, a printer is a peripheral which produces a hard copy (permanent human-readable text and/or graphics) of documents stored in electronic form, usually on physical print media such as paper or transparencies. Many printers are primarily used as local peripherals, and are attached by a printer cable or, in most newer printers, a USB cable to a computer which serves as a document source. Some printers, commonly known as network printers, have built-in network interfaces (typically wireless or Ethernet), and can serve as a hard-copy device for any user on the network. Individual printers are often designed to support both local and network-connected users at the same time.

In addition, a few modern printers can directly interface to electronic media such as memory sticks or memory cards, or to image capture devices such as digital cameras and scanners; some printers are combined with scanners and/or fax machines in a single unit, and can function as photocopiers. Printers that include non-printing features are sometimes called Multifunction Printers (MFP), Multi-Function Devices (MFD), or All-In-One (AIO) printers. Most MFPs include printing, scanning, and copying among their features.

A virtual printer is a piece of computer software whose user interface and API resemble those of a printer driver, but which is not connected with a physical computer printer.

Printers are designed for low-volume, short-turnaround print jobs, requiring virtually no setup time to produce a hard copy of a given document. However, printers are generally slow devices (30 pages per minute is considered fast, and many inexpensive consumer printers are far slower than that), and the cost per page is relatively high. This is offset by the on-demand convenience, and by project management costs being more controllable than with an outsourced solution. The printing press naturally remains the machine of choice for high-volume, professional publishing. However, as printers have improved in quality and performance, many jobs which used to be done by professional print shops are now done by users on local printers; see desktop publishing.


The world's first computer printer was a 19th century mechanically driven apparatus invented by Charles Babbage for his Difference Engine.

Printing technology
Printers are routinely classified by the underlying print technology they employ; numerous such technologies have been developed over the years. The choice of print engine has a substantial effect on what jobs a printer is suitable for, as different technologies offer different levels of image/text quality, print speed, cost, and noise; in addition, some technologies are inappropriate for certain types of physical media (such as carbon paper or transparencies).

Another aspect of printer technology that is often forgotten is resistance to alteration: liquid ink, such as from an inkjet head or fabric ribbon, is absorbed by the paper fibers, so documents printed with liquid ink are more difficult to alter than documents printed with toner or solid inks, which do not penetrate below the paper surface. Checks should either be printed with liquid ink or on special "check paper with toner anchorage".[2] For similar reasons, carbon film ribbons for IBM Selectric typewriters bore labels warning against using them to type negotiable instruments such as checks. The machine-readable lower portion of a check, however, must be printed using MICR toner or ink. Banks and other clearing houses employ automation equipment that relies on the magnetic flux from these specially printed characters to function properly.

Types of Printers:
1. Dot Matrix Printer
2. Inkjet Printer
3. Laser Printer


[Figures: dot matrix printer, inkjet printer, laser printer]

Modern print technology

The following printing technologies are routinely found in modern printers.

Toner-based printers

Toner-based printers work using the xerographic principle that is used in most photocopiers: toner adheres to a light-sensitive print drum, and static electricity then transfers the toner to the printing medium, to which it is fused with heat and pressure. The most common type of toner-based printer is the laser printer, which uses precision lasers to cause toner adherence. Laser printers are known for high-quality prints, good print speed, and a low (black-and-white) cost per copy. They are the most common printer for many general-purpose office applications, but are much less common as consumer printers due to their high initial cost, although this cost is dropping. Laser printers are available in both color and monochrome varieties. Another toner-based printer is the LED printer, which uses an array of LEDs instead of a laser to cause toner adhesion to the print drum.

Recent research has also indicated that laser printers emit potentially dangerous ultrafine particles, possibly causing health problems associated with respiration [1] and pollution equivalent to cigarettes.[3] The degree of particle emissions varies with the age, model and design of each printer, but is generally proportional to the amount of toner required. A well-ventilated workspace allows such ultrafine particles to disperse, reducing the health side effects.

Liquid inkjet printers

Inkjet printers operate by propelling variably sized droplets of liquid or molten material (ink) onto a page of almost any size. They are the most common type of computer printer for the general consumer due to their low cost, high quality of output, capability of printing in vivid color, and ease of use.

Solid ink printers

Solid ink printers, also known as phase-change printers, are a type of thermal transfer printer. They use solid sticks of CMYK colored ink (similar in consistency to candle wax), which are melted and fed into a piezo-crystal-operated print head. The print head sprays the ink on a rotating, oil-coated drum. The paper then passes over the print drum, at which time the image is transferred, or transfixed, to the page. Solid ink printers are most commonly used as color office printers, and are excellent at printing on transparencies and other non-porous media. Solid ink printers can produce excellent results. Acquisition and operating costs are similar to those of laser printers. Drawbacks of the technology include high power consumption and long warm-up times from a cold state. Also, some users complain that the resulting prints are difficult to write on (the wax tends to repel inks from pens) and are difficult to feed through automatic document feeders, but these traits have been significantly reduced in later models. In addition, this type of printer is only available from one manufacturer, Xerox, as part of its Xerox Phaser office printer line. Previously, solid ink printers were manufactured by Tektronix, but Tektronix sold the printing business to Xerox in 2001.

Dye-sublimation printers

A dye-sublimation printer (or dye-sub printer) is a printer which employs a printing process that uses heat to transfer dye to a medium such as a plastic card, paper or canvas. The process is usually to lay one color at a time using a ribbon that has color panels. Dye-sub printers are intended primarily for high-quality color applications, including color photography, and are less well suited for text. While once the province of high-end print shops, dye-sublimation printers are now increasingly used as dedicated consumer photo printers.

Inkless printers

Thermal printers

Thermal printers work by selectively heating regions of special heat-sensitive paper. Monochrome thermal printers are used in cash registers, ATMs, gasoline dispensers and some older inexpensive fax machines. Colors can be achieved with special papers and different temperatures and heating rates for different colors. One example is the ZINK technology.

UV printers

Xerox is working on an inkless printer which will use a special reusable paper coated with a few micrometres of UV-light-sensitive chemicals. The printer will use a special UV light bar which will be able to write and erase the paper. As of early 2007 this technology is still in development, and the text on the printed pages lasts only 16-24 hours before fading.

Printing speed

The speed of early printers was measured in units of characters per second. More modern printers are measured in pages per minute (PPM). These measures are used primarily as a marketing tool, and are not well standardised. Pages per minute usually refers to sparse monochrome office documents, rather than dense pictures, which usually print much more slowly. PPM figures usually refer to A4 paper in Europe and letter paper in the US, resulting in a 5-10% difference.

5.2 Plotters

A plotter is a vector graphics printing device, connected to a computer, that prints graphical plots. It draws images with ink pens, tracing point-to-point lines directly from vector graphics files. The plotter was the first computer output device that could print graphics as well as accommodate full-size engineering and architectural drawings. Using different colored pens, it was also able to print in color long before inkjet printers became an alternative. There are different types of plotters:

- Drum plotters
- Electrostatic plotters
- Flatbed plotters
- Inkjet plotters
- Pen plotters

Drum Plotters

A drum plotter is a type of pen plotter that wraps the paper around a drum with a pin feed attachment. The drum turns to produce one direction of the plot, and the pens move to provide the other.

Electrostatic Plotters

This plotter uses an electrostatic method of printing. Liquid toner models use a positively charged toner that is attracted to paper which has been negatively charged by passing a line of electrodes (tiny wires or nibs). Models print in black and white or color, and some handle paper up to six feet wide. Newer electrostatic plotters are really large-format laser printers and focus light onto a charged drum using lasers or LEDs.


Flatbed Plotters

This is a graphics plotter with a flat surface on which the paper is placed. The size of this surface (bed) determines the maximum size of the drawing.

Inkjet Plotters

This is a printer that propels droplets of ink directly onto the medium. Today, almost all inkjet printers produce color. Low-end inkjets use three ink colors (cyan, magenta and yellow), but produce a composite black that is often muddy. Four-color inkjets (CMYK) use black ink for pure black printing. Inkjet printers range in price from less than a hundred to a couple of hundred dollars for home use, up to tens of thousands of dollars for commercial poster printers.

Pen Plotters

Pen plotters print by moving a pen across the surface of a piece of paper. This means that plotters are restricted to line art, rather than raster graphics as with other printers. Pen plotters can draw complex line art, including text, but do so very slowly because of the mechanical movement of the pens. Pen plotters are incapable of creating a solid region of color, but can hatch an area by drawing a number of close, regular lines. When computer memory was very expensive and processors were slow, this was often the fastest way to produce color high-resolution vector-based artwork or very large drawings.
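The hatching idea can be sketched as a list of pen strokes. This is a minimal illustration only: the function name `hatch_rectangle`, the restriction to an axis-aligned rectangle with horizontal lines, and the spacing value are all assumptions made for the example, not a description of any particular plotter's command set.

```python
def hatch_rectangle(x0, y0, x1, y1, spacing):
    """Return pen strokes ((xa, y), (xb, y)) that fill a rectangle with parallel lines."""
    if spacing <= 0:
        raise ValueError("spacing must be positive")
    strokes = []
    # Number of lines that fit between the bottom and top edges, including the first edge.
    count = int((y1 - y0) / spacing) + 1
    for i in range(count):
        y = y0 + i * spacing
        strokes.append(((x0, y), (x1, y)))
    return strokes


# Assumed example: a 10 x 2 rectangle hatched with lines 0.5 units apart.
strokes = hatch_rectangle(0.0, 0.0, 10.0, 2.0, spacing=0.5)
for start, end in strokes:
    print("pen down at", start, "draw to", end)
```

A smaller spacing makes the area look more solid, at the cost of more strokes and therefore more plotting time.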

5.3 Displays
A display device is an output device for presentation of information for visual or tactile reception, acquired, stored, or transmitted in various forms. When the input information is supplied as an electrical signal, the display is called an electronic display. In a computer system, the display puts images on a screen so that the user gets visual confirmation of input and actions. The most common display is the monitor. By default, any monitor should work if it is connected to the system unit before the power is turned on. The controls on the monitor can make the display appear blank, so adjustments must sometimes be made on the display itself.

CRT MONITOR [cathode-ray tube]

Features
3. High Voltage Device
4. Two connections present in a CRT:
   i) To the AC power outlet
   ii) To the System Unit (DB-15, DVI)

Disadvantages of CRT
- They have a big back and take up space on the desk.
- The electromagnetic fields emitted by CRT monitors constitute a health hazard to the functioning of living cells.
- CRTs emit a small amount of X-ray band radiation, which can be a health hazard.
- Constant refreshing of CRT monitors can result in headaches.
- CRTs operate at very high voltage, which can overheat the system or result in an implosion.
- A strong vacuum exists inside a CRT, which can also result in an implosion.
- They are heavy to pick up and carry around.

Advantages of CRT
- The cathode ray tube can easily increase the monitor's brightness by reflecting the light.
- They produce more colours.
- CRT monitors are cheaper than LCD or plasma displays.
- The quality of the image displayed on a CRT is superior to that of LCD and plasma monitors.
- The contrast of CRT monitors is considered excellent.

How CRTs work and display

A CRT monitor contains millions of tiny red, green, and blue phosphor dots that glow when struck by an electron beam that travels across the screen to create a visible image. In a CRT monitor tube, the cathode is a heated filament, which sits in a vacuum created inside a glass tube. The electrons it emits are negatively charged and are attracted toward the positively charged screen, where they strike the phosphor dots and make them glow.

LCD Monitor (flat panel display)

Features
1) HVD in desktops and LVD in laptops
2) Two connections present in an HVD LCD monitor:
   i) To the AC power outlet
   ii) To the System Unit
3) Highly energy efficient

Flat panel displays encompass a growing number of technologies enabling video displays that are lighter and much thinner than traditional television and video displays that use cathode ray tubes, and are usually less than 4 inches (100 mm) thick. They can be divided into two general categories: volatile and static. Flat panel displays balance their smaller footprint and trendy modern look against high production costs and, in many cases, inferior images compared with traditional CRTs. In many applications, specifically modern portable devices such as laptops, cellular phones, and digital cameras, whatever disadvantages exist are outweighed by the portability requirements.

Volatile

Volatile displays require that pixels be periodically refreshed to retain their state, even for a static image. This refresh typically occurs many times a second. If this is not done, the pixels will gradually lose their coherent state, and the image will "fade" from the screen.

Examples of volatile flat panel displays
- Plasma displays
- Liquid crystal displays (LCDs)
- Organic light-emitting diode displays (OLEDs)
- Light-emitting diode displays (LEDs)
- Electroluminescent displays (ELDs)
- Surface-conduction electron-emitter displays (SEDs)
- Field emission displays (FEDs)
- Nano-emissive displays (NEDs)

Static

Static flat panel displays rely on materials whose color states are bistable. This means that the image they hold requires no energy to maintain, but instead requires energy to change. This results in a much more energy-efficient display, but with a tendency towards slow refresh rates, which are undesirable in an interactive display.

Examples of static flat panel displays
- Electrophoretic displays (e.g. E Ink's electrophoretic imaging film)
- Bichromal ball displays (e.g. Xerox's Gyricon)
- Interferometric modulator displays (e.g. Qualcomm's iMod, a MEMS display)
- Cholesteric displays (e.g. MagInk, Kent Displays)
- Bistable nematic liquid crystal displays (e.g. ZBD)

5.4 Keyboard
Keyboards are designed for the input of text and characters and also to control the operation of a computer.

Types of Keyboards
1. Based on layout
   -QWERTY layout
   -DVORAK layout
2. Ergonomic (based on comfort)

Standard Keyboards

The number of keys on a keyboard varies from the original standard of 101 keys to the 104-key Windows keyboards.

[Figures: Qwerty layout, Dvorak keyboard, ergonomic keyboard]

Differences between Dvorak and Qwerty

In typing a 62-word paragraph, Dvorak required between 20% and 35% less finger movement, saving almost 6 feet of the 16 feet of finger movement needed to type the same text with Qwerty. This is the minimum difference. In actual practice, nearly all text would show greater savings, with a range of up to about 50%.


LAYOUT OF A KEYBOARD

The layout of a keyboard can be divided into five sections:

i) Typing keys: These keys include the letter and number keys [1, 2, A, B etc.], which are generally laid out in the same style that was common for typewriters.
ii) Numeric keypad: Numeric keys are located on the right-hand side of the keyboard. Generally the keypad consists of a set of 17 keys laid out in the same configuration used by adding machines and calculators.
iii) Function keys: The function keys [F1, F2, F3 etc.] are arranged in a row along the top of the keyboard and can be assigned specific commands by the current application or the operating system.
iv) Control keys: These keys provide cursor and screen control. They include 4 directional arrow keys, arranged in an inverted T formation between the typing keys and the numeric keypad. Control keys also include Home, End, Insert, Delete, Page Up, Page Down, Control [Ctrl], Alternate [Alt] and Escape [Esc]. The Windows keyboard also has two Windows (Start) keys and an Application key.
v) Special purpose keys: Apart from the keys discussed above, a keyboard contains some special purpose keys such as Enter, Shift, Caps Lock, Num Lock, Space bar, Tab and Print Screen.

WORKING

When the user presses a key, a code corresponding to that key press (the scan code) is sent to the operating system. A copy of this code is also stored in the keyboard's memory. When the operating system has read the scan code, it informs the keyboard, and the copy stored in the keyboard's memory is then erased. The action corresponding to the code is then carried out. If the user holds down a key, the processor determines that the user wishes to send that character repeatedly to the computer. The rate at which the character repeats can normally be set in the operating system, and typically ranges from 2 to 30 characters per second (cps).
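The store-then-acknowledge behaviour described under WORKING can be modelled with a small sketch. It is a toy model under stated assumptions: the class `KeyboardBuffer`, its method names, the first-in-first-out ordering and the scan-code values are all hypothetical and are not taken from any keyboard controller specification.

```python
from collections import deque


class KeyboardBuffer:
    """Toy model: the keyboard keeps each scan code until the OS acknowledges it."""

    def __init__(self):
        self._pending = deque()  # scan codes sent but not yet acknowledged

    def key_pressed(self, scan_code):
        """Store a copy of the scan code and return it as if sent to the OS."""
        self._pending.append(scan_code)
        return scan_code

    def acknowledge(self):
        """The OS reports it has read the oldest code, so the stored copy is erased."""
        return self._pending.popleft()

    def pending(self):
        return list(self._pending)


def repeat_interval_ms(rate_cps):
    """Time between repeated characters for a given repeat rate in cps."""
    return 1000.0 / rate_cps


kb = KeyboardBuffer()
kb.key_pressed(0x1E)   # scan-code values are illustrative only
kb.key_pressed(0x30)
kb.acknowledge()       # the OS has read the first code
print(kb.pending())    # only the second code is still held
print(repeat_interval_ms(2), repeat_interval_ms(30))
```

At the two ends of the quoted 2 to 30 cps range, a held key repeats every 500 ms and roughly every 33 ms respectively.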


Changing the Keyboard Layout

Start > Control Panel > Regional & Language Options > Language > Details > Add > Enable Keyboard Layout > United States-Dvorak

5.5 Mouse
A mouse (plural mice or mouses) functions as a pointing device by detecting two-dimensional motion relative to its supporting surface. It is normally used with a GUI (graphical user interface) based OS, e.g. Windows.

Types of mouse based on mechanism
1. Mechanical Mouse
2. Optical Mouse

Mechanical Mouse

A mouse that uses a rubber ball that makes contact with wheels inside the unit when it is rolled on a pad or desktop.

Optical Mouse

A mouse that uses light to detect movement. It emits light and senses its reflection as it is moved. Early optical mice required a special mouse pad, but today's devices can be moved over traditional pads like a mechanical mouse, as well as over almost any surface other than glass or mirror.

5.6 Optical Mark Reader


The Optical Mark Reader (OMR) is a scanning device that "reads" marks, such as pencil marks, on NCS-compatible scan forms such as surveys, test answer forms and multiple-choice questionnaires. Think of it as the machine that checks multiple-choice computer forms. In this document the Optical Mark Reader is referred to as the scanner or OMR. Tests and surveys completed on these forms are read in by the scanner and checked, and the results are saved to a file. This data file can be converted into an output file in several different formats, depending on the type of output desired. The OMR is a powerful tool that has many features. While using casstat (grading tests), the OMR will print the number of correct answers and the percentage of correct answers at the bottom of each test. It will also record statistical data about each question. This data is recorded in the output file created when the forms are scanned.
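The grading step described above (number correct, percentage correct, and per-question statistics) can be sketched as follows. This is only an illustration of the arithmetic: the function names, the answer key and the responses are made up for the example, and the sketch does not reproduce the actual output format of casstat or of any scanner software.

```python
def grade_form(marks, answer_key):
    """Return (number correct, percentage correct) for one scanned form."""
    correct = sum(1 for mark, key in zip(marks, answer_key) if mark == key)
    return correct, 100.0 * correct / len(answer_key)


def question_statistics(forms, answer_key):
    """For each question, count how many forms marked the correct answer."""
    return [
        sum(1 for marks in forms if marks[i] == key)
        for i, key in enumerate(answer_key)
    ]


answer_key = ["A", "C", "B", "D"]   # made-up key for illustration
forms = [
    ["A", "C", "B", "A"],           # made-up responses read from three forms
    ["A", "B", "B", "D"],
    ["A", "C", "D", "D"],
]

for marks in forms:
    print(grade_form(marks, answer_key))       # (3, 75.0) for each of these forms
print(question_statistics(forms, answer_key))  # [3, 2, 2, 2]
```

The per-question counts show, for instance, that every form answered the first question correctly while each of the others was missed once.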


5.7 Optical character recognition


It is a device used for optical character recognition. Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of images of handwritten, typewritten or printed text (usually captured by a scanner) into machine-editable text. OCR is a field of research in pattern recognition, artificial intelligence and machine vision. Though academic research in the field continues, the focus on OCR has shifted to implementation of proven techniques. Optical character recognition (using optical techniques such as mirrors and lenses) and digital character recognition (using scanners and computer algorithms) were originally considered separate fields. Because very few applications survive that use true optical techniques, the OCR term has now been broadened to include digital image processing as well. Early systems required training (the provision of known samples of each character) to read a specific font. "Intelligent" systems with a high degree of recognition accuracy for most fonts are now common. Some systems are even capable of reproducing formatted output that closely approximates the original scanned page including images, columns and other non-textual components.

5.8 Device interface


An interface device (IDF) is a hardware component or system of components that allows a human being to interact with a computer, a telephone system, or other electronic information system. The term is often encountered in the mobile communication industry, where designers are challenged to build the proper combination of portability, capability, and ease of use into the interface device. The overall set of characteristics provided by an interface device is often referred to as the user interface (and, for computers, at least in more academic discussions, the human-computer interface or HCI). Today's desktop and notebook computers have what has come to be called a graphical user interface (GUI), to distinguish it from earlier, more limited interfaces such as the command line interface (CLI).


The Graphics Device Interface (GDI) is a Microsoft Windows application programming interface and core operating system component responsible for representing graphical objects and transmitting them to output devices such as monitors and printers. GDI is responsible for tasks such as drawing lines and curves, rendering fonts and handling palettes. It is not directly responsible for drawing windows, menus, etc.; that task is reserved for the user subsystem, which resides in user32.dll and is built atop GDI. GDI is similar to Macintosh's QuickDraw.

Perhaps the most significant advantage of GDI over more direct methods of accessing the hardware is its scaling capability and its abstraction of target devices. Using GDI, it is very easy to draw on multiple devices, such as a screen and a printer, and expect proper reproduction in each case. This capability is at the center of all What You See Is What You Get applications for Microsoft Windows.

A human interface device or HID is a type of computer device that interacts directly with, and most often takes input from, humans and may deliver output to humans. The term "HID" most commonly refers to the USB-HID specification.

5.9 I/O processor


An I/O processor (IOP) is a specialized computer that permits autonomous handling of data between I/O devices and a central computer or the central memory of the computer. It can be a programmable computer in its own right; in earlier forms, as a wired-program computer, it was called a channel controller (see also direct memory access). Many storage, networking, and embedded applications require fast I/O throughput for optimal performance. Intel I/O processors allow servers, workstations and storage subsystems to transfer data faster, reduce communication bottlenecks, and improve overall system performance by offloading I/O processing functions from the host CPU.

5.10 Standard I/O Interfaces


The interface circuitry designed for one computer does not work with another, so a separate interface would have to be designed for each combination of device and computer, resulting in a large number of interfaces. To avoid this, a number of standards have been developed for bus expansion, e.g. SCSI, PCI and USB.

Small Computer Systems Interface (SCSI)
- Parallel interface
- 8, 16, 32 bit data lines
- Daisy chained
- Devices are independent
- Devices can communicate with each other as well as with the host

SCSI-1
- Early 1980s
- 8 bit
- 5 MHz
- Data rate 5 MBytes/s
- Seven devices (eight including the host interface)

SCSI-2
- 1991
- 16 and 32 bit
- 10 MHz
- Data rate 20 or 40 MBytes/s
- See also Ultra/Wide SCSI

SCSI-3
- 1993
- 16 bits
- 40 MBytes/s over a 68-pin connector
- The number of devices is 16

PCI BUS

PCI Local Bus (usually shortened to PCI), or Conventional PCI, specifies a computer bus for attaching peripheral devices to a computer motherboard. These devices can take the form either of an integrated circuit fitted onto the motherboard itself, called a planar device in the PCI specification, or of an expansion card that fits into a socket. The name PCI is an initialism formed from Peripheral Component Interconnect. The PCI bus is common in modern PCs, where it has displaced ISA and VESA Local Bus as the standard expansion bus, and it also appears in many other computer types. Despite the availability of faster interfaces such as PCI-X and PCI Express, conventional PCI remains a very common interface.

The PCI specification covers the physical size of the bus (including wire spacing), electrical characteristics, bus timing, and protocols. The specification can be purchased from the PCI Special Interest Group (PCI-SIG). Typical PCI cards used in PCs include network cards, sound cards, modems, extra ports such as USB or serial, TV tuner cards and disk controllers. Historically, video cards were typically PCI devices, but growing bandwidth requirements soon outgrew the capabilities of PCI. PCI video cards remain available for supporting extra monitors and upgrading PCs that do not have any AGP or PCI Express slots.

USB

Universal Serial Bus (USB) is a serial bus standard to interface devices to a host computer. USB was designed to allow many peripherals to be connected using a single standardized interface socket and to improve plug-and-play capabilities by allowing hot swapping, that is, by allowing devices to be connected and disconnected without rebooting the computer or turning off the device. Other convenient features include providing power to low-consumption devices without the need for an external power supply, and allowing many devices to be used without requiring manufacturer-specific, individual device drivers to be installed.

USB is intended to replace many legacy varieties of serial and parallel ports. USB can connect computer peripherals such as mice, keyboards, PDAs, gamepads and joysticks, scanners, digital cameras, printers, personal media players, and flash drives. For many of those devices USB has become the standard connection method. USB was originally designed for personal computers, but it has become commonplace on other devices such as PDAs and video game consoles, and as a bridging power cord between a device and an AC adapter plugged into a wall socket for charging purposes. As of 2008, there are about 2 billion USB devices in the world.

RS 232 C
RS-232C is a long-established standard ("C" is the current version) that describes the physical interface and protocol for relatively low-speed serial data communication between computers and related devices. It was defined by an industry trade group, the Electronic Industries Association (EIA), originally for teletypewriter devices.


RS-232C is the interface that your computer uses to talk to and exchange data with your modem and other serial devices. Somewhere in your PC, typically on a Universal Asynchronous Receiver/Transmitter (UART) chip on your motherboard, the data from your computer is transmitted to an internal or external modem (or other serial device) from its Data Terminal Equipment (DTE) interface. Since data in your computer flows along parallel circuits and serial devices can handle only one bit at a time, the UART chip converts the groups of bits in parallel to a serial stream of bits. As your PC's DTE agent, it also communicates with the modem or other serial device, which, in accordance with the RS-232C standard, has a complementary interface called the Data Communications Equipment (DCE) interface. In telecommunications, RS-232 (Recommended Standard 232) is a standard for serial binary data signals connecting between a DTE (Data Terminal Equipment) and a DCE (Data Circuit-terminating Equipment). It is commonly used in computer serial ports. A similar ITU-T standard is V.24.
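The parallel-to-serial conversion performed by the UART can be sketched as below. The frame format used here (one start bit, eight data bits sent least-significant bit first, no parity, one stop bit) is a common asynchronous convention and is an assumption of this example; the function names are hypothetical, and the sketch works with logical bits only, not with RS-232 voltage levels.

```python
def uart_frame(byte, data_bits=8):
    """Serialize one parallel byte into an asynchronous serial frame (list of bits).

    Assumed format: 1 start bit (0), data bits least-significant first,
    no parity, 1 stop bit (1).
    """
    if not 0 <= byte < (1 << data_bits):
        raise ValueError("value does not fit in the data bits")
    data = [(byte >> i) & 1 for i in range(data_bits)]
    return [0] + data + [1]


def uart_unframe(bits, data_bits=8):
    """Recover the byte from a frame produced by uart_frame."""
    if len(bits) != data_bits + 2 or bits[0] != 0 or bits[-1] != 1:
        raise ValueError("framing error")
    return sum(bit << i for i, bit in enumerate(bits[1:-1]))


frame = uart_frame(ord("A"))     # 'A' is 0x41 = 0b01000001
print(frame)                     # [0, 1, 0, 0, 0, 0, 0, 1, 0, 1]
print(chr(uart_unframe(frame)))  # A
```

Each 8-bit character therefore occupies 10 bit times on the line in this assumed format, which is why the useful character rate is lower than the raw bit rate.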

RS-232C is short for Recommended Standard 232C, a standard interface approved by the Electronic Industries Alliance (EIA) for connecting serial devices. In 1987, the EIA released a new version of the standard and changed the name to EIA-232-D. In 1991, the EIA teamed up with the Telecommunications Industry Association (TIA) and issued a new version of the standard called EIA/TIA-232-E. Many people, however, still refer to the standard as RS-232C, or just RS-232. Almost all modems conform to the EIA-232 standard, and most personal computers have an EIA-232 port for connecting a modem or other device. In addition to modems, many display screens, mice, and serial printers are designed to connect to an EIA-232 port. In EIA-232 parlance, the device that connects to the interface is called Data Communications Equipment (DCE) and the device to which it connects (e.g., the computer) is called Data Terminal Equipment (DTE).

The EIA-232 standard supports two types of connectors: a 25-pin D-type connector (DB-25) and a 9-pin D-type connector (DB-9). The type of serial communication used by PCs requires only 9 pins, so either type of connector will work equally well. Although EIA-232 is still the most common standard for serial communication, the EIA has also defined the related standards RS-422 and RS-423. RS-423 uses single-ended signaling and can interoperate with RS-232 devices within limits, whereas RS-422 uses differential signaling and generally needs a level converter to connect to an RS-232 port.

Role in modern personal computers

Figure: PCI Express x1 card with one RS-232 port

In the book PC 97 Hardware Design Guide,[3] Microsoft deprecated support for the RS-232-compatible serial port of the original IBM PC design. Today, RS-232 is gradually being replaced in personal computers by USB for local communications. Compared with RS-232, USB is faster, uses lower voltages, and has connectors that are simpler to connect and use. Both standards have software support in popular operating systems. USB is designed to make it easy for device drivers to communicate with hardware. However, there is no direct analog to the terminal programs that let users communicate directly with serial ports.

USB is more complex than the RS-232 standard because it includes a protocol for transferring data to devices, and supporting that protocol requires more software. RS-232 standardizes only the voltage of signals and the functions of the physical interface pins. Serial ports of personal computers are also often used to directly control various hardware devices, such as relays or lamps, since the control lines of the interface can be easily manipulated by software. This is not feasible with USB, which requires some form of receiver to decode the serial data.

As an alternative, USB docking ports are available which can provide connectors for a keyboard, mouse, one or more serial ports, and one or more parallel ports. Corresponding device drivers are required for each USB-connected device to allow programs to access these USB-connected devices as if they were the original directly-connected peripherals. Devices that convert USB to RS-232 may not work with all software on all personal computers and may cause a reduction in bandwidth along with higher latency.

Personal computers may use the control pins of a serial port to interface to devices such as uninterruptible power supplies. In this case, serial data is not sent, but the control lines are used to signal conditions such as loss of power or low battery alarms.

Certain industries, in particular marine survey, provide a continued demand for RS-232 I/O due to sustained use of very expensive but aging equipment. It is far cheaper to continue to use RS-232 than it is to replace the equipment. Some manufacturers have responded to this demand: Toshiba re-introduced the DB9 male connector on the Tecra laptop. Companies such as Digi specialise in RS-232 I/O cards.

Standard details
In RS-232, user data is sent as a time-series of bits. Both synchronous and asynchronous transmissions are supported by the standard. In addition to the data circuits, the standard defines a number of control circuits used to manage the connection between the DTE and DCE. Each data or control circuit operates in only one direction, that is, signaling from a DTE to the attached DCE or the reverse. Since transmit data and receive data are separate circuits, the interface can operate in a full duplex manner, supporting concurrent data flow in both directions. The standard does not define character framing within the data stream, or character encoding.

IEEE 488.2 (GPIB)


IEEE-488 is a short-range, digital communications bus specification that has been in use for over 30 years. Originally created for use with automated test equipment, the standard is still in wide use for that purpose. IEEE-488 is also commonly known as HP-IB (Hewlett-Packard Interface Bus) and GPIB (General Purpose Interface Bus). IEEE-488 allows up to 15 devices to share a single eight-bit parallel electrical bus by daisy chaining connections. The slowest device participates in control and data transfer handshakes to determine the speed of the transaction. The maximum data rate is about one Mbyte/s in the original standard, and about 8 Mbyte/s with later extensions.
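Why the slowest device sets the pace can be shown with a small simulation. This is a simplified sketch, not the full bus protocol: it assumes only that the "data accepted" handshake line is shared in wired-OR fashion, so it stays asserted until every listener has released it. The function names and the delay values are hypothetical, and time is counted in arbitrary whole steps.

```python
def wired_or(asserted_flags):
    """A shared open-collector line reads as asserted if ANY device asserts it."""
    return any(asserted_flags)


def byte_transfer_time(accept_delays):
    """Return the number of time steps until all listeners have accepted one byte.

    accept_delays holds, for each listener, how many steps it needs before it
    releases the shared "not yet accepted" line.
    """
    t = 0
    # The talker may not move on while any listener still holds the line.
    while wired_or(t < delay for delay in accept_delays):
        t += 1
    return t


if __name__ == "__main__":
    # Three listeners with different speeds: the transfer takes as long as the slowest.
    print(byte_transfer_time([2, 5, 3]))
```

Adding a faster device to the bus therefore does not speed up a transfer, while adding a slower listener lengthens every byte that it takes part in.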

The IEEE-488 connector has 24 pins. The bus employs 16 signal lines (eight bidirectional lines used for data transfer, three for handshake, and five for bus management) plus eight ground return lines. In 1975 the bus was standardized by the Institute of Electrical and Electronics Engineers as the IEEE Standard Digital Interface for Programmable Instrumentation, IEEE 488-1975 (now 488.1). IEEE-488.1 formalized the mechanical, electrical, and basic protocol parameters of GPIB, but said nothing about the format of commands or data. The IEEE-488.2 standard, Codes, Formats, Protocols, and Common Commands for IEEE-488.1 (June 1987), provided for basic syntax and format conventions, as well as device-independent commands, data structures, error protocols, and the like. IEEE-488.2 built on -488.1 without superseding it; equipment can conform to -488.1 without following -488.2. While IEEE-488.1 defined the hardware and IEEE-488.2 defined the syntax, there was still no standard for instrument-specific commands.

Applications
At the outset, HP-IB's designers did not specifically plan for IEEE-488 to be a standard peripheral interface for general-purpose computers. By 1977 the Commodore PET/CBM range of educational/home/personal computers connected their disk drives, printers, modems, etc., by IEEE-488 bus. All of Commodore's post-PET/CBM 8-bit machines, from the VIC-20 to the C128, utilized a proprietary 'serial IEEE-488' for peripherals, with round DIN connectors instead of the heavy-duty HP-IB plugs or a card-edge connector plugging into the motherboard (for PET computers).

Hewlett-Packard and Tektronix also used IEEE-488 as a peripheral interface to connect disk drives, tape drives, printers, plotters, etc. to their workstation products and HP's HP 2100[4] and HP 3000[5] minicomputers. While the bus speed was increased to 10 MB/s for such applications, the lack of command protocol standards limited third-party offerings and interoperability, and later, faster, open standards such as SCSI eventually superseded IEEE-488 for peripheral access.

Additionally, some of HP's advanced pocket calculators/computers of the 1980s, such as the HP-41 and HP-71B series, could work with various instrumentation via an optional HP-IB interface. The interface would connect to the calculator via an optional HP-IL module.
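As an illustration of the format conventions that IEEE-488.2 adds on top of the -488.1 hardware, the sketch below encodes and decodes a definite-length arbitrary block, the 488.2 convention for sending binary data: a '#' character, one digit giving the number of length digits, the length in decimal, and then the data bytes. The function names encode_block and decode_block are hypothetical, and the sketch deliberately ignores the indefinite-length form ("#0") and message terminators.

```python
def encode_block(payload):
    """Wrap raw bytes as a definite-length block: '#', digit count, length, data."""
    length = str(len(payload))
    if len(length) > 9:
        raise ValueError("payload too long for a single-digit length prefix")
    header = "#" + str(len(length)) + length
    return header.encode("ascii") + payload


def decode_block(block):
    """Extract the payload from a definite-length block, validating the header."""
    if block[:1] != b"#":
        raise ValueError("block must start with '#'")
    ndigits = int(block[1:2].decode("ascii"))
    if ndigits == 0:
        raise ValueError("indefinite-length blocks are not handled in this sketch")
    length = int(block[2:2 + ndigits].decode("ascii"))
    payload = block[2 + ndigits:2 + ndigits + length]
    if len(payload) != length:
        raise ValueError("block is shorter than its declared length")
    return payload


if __name__ == "__main__":
    block = encode_block(b"hello")
    print(block)                 # header followed by the 5 data bytes
    print(decode_block(block))   # the original payload
```

Because the length is declared up front, the receiver knows exactly how many bytes to read, so the data may contain any byte values without being mistaken for a terminator.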
