Which Of The False Data Dependency Can Be Addressed By Register Renaming

In computer architecture, annals renaming is a technique that abstracts logical registers from physical registers. Every logical annals has a set of concrete registers associated with it. When a machine language instruction refers to a detail logical annals, the processor transposes this name to i specific physical register on the fly. The concrete registers are opaque and cannot exist referenced directly but only via the canonical names.

This technique is used to eliminate fake data dependencies arising from the reuse of registers by successive instructions that do not have whatever real information dependencies between them. The emptying of these false data dependencies reveals more educational activity-level parallelism in an instruction stream, which can be exploited by various and complementary techniques such as superscalar and out-of-order execution for better performance.

Trouble approach [edit]

In a register machine, programs are equanimous of instructions which operate on values. The instructions must name these values in order to distinguish them from one another. A typical instruction might say: "add together $x$ and $y$ and put the result in $z$ ". In this pedagogy, $x$ , $y$ and $z$ are the names of storage locations.

In guild to have a meaty instruction encoding, nigh processor instruction sets accept a small set of special locations which can be referred to by special names: registers. For example, the x86 instruction set compages has eight integer registers, x86-64 has sixteen, many RISCs have 32, and IA-64 has 128. In smaller processors, the names of these locations correspond straight to elements of a register file.

Different instructions may take different amounts of time; for example, a processor may be able to execute hundreds of instructions while a single load from the chief memory is in progress. Shorter instructions executed while the load is outstanding will finish showtime, thus the instructions are finishing out of the original plan order. Out-of-order execution has been used in most recent high-operation CPUs to attain some of their speed gains.

Consider this piece of code running on an out-of-order CPU:

                                    r1            ≔            m            [            1024            ]                        r1            ≔            r1            +            2                        m            [            1032            ]            ≔            r1                        r1            ≔            m            [            2048            ]                        r1            ≔            r1            +            4                        g            [            2056            ]            ≔            r1

The instructions in the final three lines are contained of the first three instructions, but the processor cannot finish r1≔grand [ 2048 ] until the preceding thousand [ 1032 ]≔r1 is done (doing so otherwise would write an incorrect value).

This brake is eliminated by irresolute the names of some of the registers:

                                    r1            ≔            m            [            1024            ]                        r1            ≔            r1            +            2                        m            [            1032            ]            ≔            r1                        r2            ≔            m            [            2048            ]                        r2            ≔            r2            +            4                        m            [            2056            ]            ≔            r2

Now the concluding three instructions can be executed in parallel with the beginning three. The program volition run faster than before obliterating any stalling due to a false data dependency.

Many high-functioning CPUs implement this renaming in hardware to achieve boosted parallelism. On targets without appropriate data-flow detection good compilers would observe independent teaching sequences and cull unlike registers during lawmaking generation.

Information hazards [edit]

When more than one teaching references a item location equally an operand, either by reading it (as an input) or past writing to it (as an output), executing those instructions in an social club different from the original program order can pb to three kinds of information hazards:

Read-after-write (RAW): a read from a annals or retentivity location must return the value placed at that place by the last write in program order, non some other write. This is referred to as a true dependency or period dependency, and requires the instructions to execute in program guild.
Write-after-write (WAW): successive writes to a particular register or memory location must leave that location containing the result of the 2nd write. This tin can exist resolved by squashing (as well known as cancelling, annulling, or mooting) the outset write if necessary. WAW dependencies are also known as output dependencies.
Write-after-read (WAR): a read from a register or memory location must return the final prior value written to that location, and not one written programmatically after the read. This is a sort of false dependency that tin exist resolved by renaming. WAR dependencies are also known as anti-dependencies.

Instead of delaying the write until all reads are completed, ii copies of the location can be maintained, the old value and the new value. Reads that precede, in programme society, the write of the new value tin exist provided with the former value, even while other reads that follow the write are provided with the new value. The false dependency is broken and additional opportunities for out-of-social club execution are created. When all reads that need the one-time value have been satisfied, it can be discarded. This is the essential concept behind register renaming.

Anything that is read and written can be renamed. While the full general-purpose and floating-point registers are discussed the most, flag and status registers or even private status bits are commonly renamed likewise.

Memory locations tin can too be renamed, although it is not normally done to the extent practiced in register renaming. The Transmeta Crusoe processor's gated store buffer is a form of memory renaming.

If programs refrained from reusing registers immediately, at that place would be no need for annals renaming. Some instruction sets (east.1000., IA-64) specify very large numbers of registers for specifically this reason. Nonetheless, at that place are limitations to this approach:

It is very difficult for the compiler to avoid reusing registers without large code size increases. In loops, for instance, successive iterations would have to employ different registers, which requires replicating the lawmaking in a process called loop unrolling, or utilising self-modifying lawmaking to change the operand targets in each iteration.
Big numbers of registers require more bits for specifying a register as an operand in an instruction, resulting in increased code size.
Many instruction sets historically specified smaller numbers of registers and cannot be inverse while still retaining backwards compatibility.

Code size increases are important because when the program code is larger, the instruction cache misses more than often and the processor stalls waiting for new instructions.

Architectural versus physical registers [edit]

Machine language programs specify reads and writes to a express set of registers specified by the education ready architecture (ISA). For instance, the Alpha ISA specifies 32 integer registers, each 64 bits wide, and 32 floating-point registers, each 64 $.25 wide. These are the architectural registers. Programs written for processors running the Alpha instruction set will specify operations reading and writing those 64 registers. If a developer stops the program in a debugger, they can discover the contents of these 64 registers (and a few condition registers) to determine the progress of the car.

One particular processor which implements this ISA, the Alpha 21264, has 80 integer and 72 floating-point physical registers. There are, on an Alpha 21264 chip, 80 physically separate locations which can store the results of integer operations, and 72 locations which can store the results of floating signal operations (In fact, at that place are fifty-fifty more locations than that, merely those extra locations are non germane to the register renaming operation.)

The post-obit text describes two styles of register renaming, which are distinguished past the excursion which holds the data set up for an execution unit.

In all renaming schemes, the machine converts the architectural registers referenced in the didactics stream into tags. Where the architectural registers might be specified by 3 to five bits, the tags are commonly a 6 to eight flake number. The rename file must have a read port for every input of every education renamed every cycle, and a write port for every output of every educational activity renamed every cycle. Considering the size of a annals file by and large grows as the square of the number of ports, the rename file is usually physically big and consumes significant power.

In the tag-indexed register file style, at that place is one large register file for data values, containing one register for every tag. For example, if the machine has eighty physical registers, then it would utilize 7 flake tags. 48 of the possible tag values in this example are unused.

In this style, when an instruction is issued to an execution unit, the tags of the source registers are sent to the physical register file, where the values respective to those tags are read and sent to the execution unit of measurement.

In the reservation station style, in that location are many pocket-size associative register files, ordinarily one at the inputs to each execution unit. Each operand of each instruction in an issue queue has a place for a value in one of these annals files.

In this style, when an instruction is issued to an execution unit of measurement, the annals file entries corresponding to the outcome queue entry are read and forwarded to the execution unit.

Architectural Register File or Retirement Register File (RRF): The committed annals state of the machine. RAM indexed by logical register number. Typically written into equally results are retired or committed out of a reorder buffer.
Time to come File: The most speculative register land of the machine. RAM indexed past logical register number.
Active Register File: The Intel P6 group'southward term for Time to come File.
History Buffer: Typically used in combination with a future file. Contains the "old" values of registers that accept been overwritten. If the producer is still in flying it may be RAM indexed by history buffer number. After a branch misprediction must use results from the history buffer—either they are copied, or the hereafter file lookup is disabled and the history buffer is content-addressable retention (CAM) indexed by logical register number.
Reorder Buffer (ROB): A structure that is sequentially (circularly) indexed on a per-operation basis, for instructions in flight. It differs from a history buffer because the reorder buffer typically comes later the future file (if it exists) and before the architectural register file.; Reorder buffers can exist data-less or data-ful.; In Willamette'southward ROB, the ROB entries signal to registers in the physical register file (PRF), and besides comprise other book keeping.; This was too the first Out of Order design done by Andy Glew, at Illinois with HaRRM.; P6's ROB, the ROB entries incorporate data; there is no separate PRF.; Data values from the ROB are copied from the ROB to the RRF at retirement.; I pocket-sized detail: if there is temporal locality in ROB entries (i.e., if instructions shut together in the von Neumann instruction sequence write back close together in time, it may be possible to perform write combining on ROB entries and then have fewer ports than a separate ROB/PRF would).; It is non articulate if it makes a difference, since a PRF should be banked.; ROBs usually don't accept associative logic, and certainly none of the ROBs designed past Andy Glew have CAMs.; Keith Diefendorff insisted that ROBs accept circuitous associative logic for many years.; The first ROB proposal may take had CAMs.

Tag-indexed register file [edit]

This is the renaming manner used in the MIPS R10000, the Alpha 21264, and in the FP department of the AMD Athlon.

In the renaming stage, every architectural annals referenced (for read or write) is looked up in an architecturally-indexed remap file. This file returns a tag and a prepare fleck. The tag is not-ready if there is a queued instruction which will write to it that has not notwithstanding executed. For read operands, this tag takes the place of the architectural annals in the education. For every register write, a new tag is pulled from a free tag FIFO, and a new mapping is written into the remap file, then that time to come instructions reading the architectural register volition refer to this new tag. The tag is marked as unready, considering the pedagogy has non yet executed. The previous physical register allocated for that architectural register is saved with the instruction in the reorder buffer, which is a FIFO that holds the instructions in program order between the decode and graduation stages.

The instructions are then placed in various result queues. As instructions are executed, the tags for their results are broadcast, and the issue queues lucifer these tags confronting the tags of their non-fix source operands. A match means that the operand is fix. The remap file too matches these tags, and then that it tin mark the corresponding physical registers every bit gear up. When all the operands of an didactics in an consequence queue are ready, that instruction is prepare to issue. The issue queues selection ready instructions to send to the diverse functional units each bike. Non-ready instructions stay in the issue queues. This unordered removal of instructions from the issue queues can make them big and power-consuming.

Issued instructions read from a tag-indexed concrete annals file (bypassing just-broadcast operands) and and then execute. Execution results are written to tag-indexed physical register file, too as circulate to the bypass network preceding each functional unit. Graduation puts the previous tag for the written architectural annals into the free queue so that information technology can be reused for a newly decoded instruction.

An exception or branch misprediction causes the remap file to back up to the remap land at last valid instruction via combination of state snapshots and cycling through the previous tags in the in-social club pre-graduation queue. Since this mechanism is required, and since it tin can recover any remap state (not but the state earlier the pedagogy currently being graduated), branch mispredictions can exist handled before the branch reaches graduation, potentially hiding the branch misprediction latency.

Reservation stations [edit]

This is the style used in the integer section of the AMD K7 and K8 designs.

In the renaming stage, every architectural register referenced for reads is looked up in both the architecturally-indexed time to come file and the rename file. The time to come file read gives the value of that register, if there is no outstanding instruction nevertheless to write to it (i.eastward., it'south fix). When the pedagogy is placed in an issue queue, the values read from the hereafter file are written into the corresponding entries in the reservation stations. Register writes in the educational activity crusade a new, non-ready tag to be written into the rename file. The tag number is usually serially allocated in education order—no complimentary tag FIFO is necessary.

Just as with the tag-indexed scheme, the effect queues look for not-ready operands to see matching tag broadcasts. Different the tag-indexed scheme, matching tags cause the respective broadcast value to be written into the issue queue entry's reservation station.

Issued instructions read their arguments from the reservation station, bypass just-broadcast operands, and then execute. As mentioned earlier, the reservation station annals files are commonly small-scale, with possibly eight entries.

Execution results are written to the reorder buffer, to the reservation stations (if the issue queue entry has a matching tag), and to the future file if this is the final instruction to target that architectural annals (in which case register is marked gear up).

Graduation copies the value from the reorder buffer into the architectural annals file. The sole employ of the architectural register file is to recover from exceptions and co-operative mispredictions.

Exceptions and branch mispredictions, recognized at graduation, crusade the architectural file to be copied to the time to come file, and all registers marked as fix in the rename file. There is usually no fashion to reconstruct the land of the future file for some instruction intermediate between decode and graduation, so there is usually no way to practise early recovery from branch mispredictions.

Comparison between the schemes [edit]

In both schemes, instructions are inserted in-lodge into the effect queues, just are removed out-of-order. If the queues do not collapse empty slots, and so they will either accept many unused entries, or require some sort of variable priority encoding for when multiple instructions are simultaneously ready to go. Queues that collapse holes accept simpler priority encoding, but crave unproblematic but big circuitry to advance instructions through the queue.

Reservation stations accept better latency from rename to execute, because the rename stage finds the register values straight, rather than finding the concrete register number, then using that to find the value. This latency shows up as a component of the branch misprediction latency.

Reservation stations also have better latency from instruction issue to execution, because each local register file is smaller than the large key file of the tag-indexed scheme. Tag generation and exception processing are also simpler in the reservation station scheme, as discussed below.

The concrete annals files used by reservation stations usually collapse unused entries in parallel with the issue queue they serve, which makes these register files larger in aggregate, and eat more power, and more complicated than the simpler annals files used in a tag-indexed scheme. Worse yet, every entry in each reservation station can exist written past every result omnibus, so that a reservation-station machine with, east.k., 8 issue queue entries per functional unit of measurement will typically have 9 times as many bypass networks as an equivalent tag-indexed machine. Consequently, result forwarding consumes much more than power and expanse than in a tag-indexed design.

Furthermore, the reservation station scheme has iv places (Future File, Reservation Station, Reorder Buffer and Architectural File) where a result value tin be stored, whereas the tag-indexed scheme has merely one (the physical register file). Considering the results from the functional units, broadcast to all these storage locations, must reach a much larger number of locations in the car than in the tag-indexed scheme, this function consumes more power, area, and fourth dimension. Still, in machines equipped with very authentic branch prediction schemes and if execute latencies are a major business concern, reservation stations can work remarkably well.

History [edit]

The IBM Organisation/360 Model 91 was an early on machine that supported out-of-guild execution of instructions; it used the Tomasulo algorithm, which uses register renaming.

The POWER1 is the first microprocessor that used register renaming and out-of-order execution in 1990.

The original R10000 design had neither collapsing effect queues nor variable priority encoding, and suffered starvation bug as a consequence—the oldest didactics in the queue would sometimes not be issued until both instruction decode stopped completely for lack of rename registers, and every other instruction had been issued. Later on revisions of the pattern starting with the R12000 used a partially variable priority encoder to mitigate this problem.

Early out-of-order machines did not separate the renaming and ROB/PRF storage functions. For that affair, some of the earliest, such as Sohi's RUU or the Metaflow DCAF, combined scheduling, renaming, and storage all in the same structure.

Near modern machines do renaming by RAM indexing a map table with the logical annals number. E.m., P6 did this; future files do this, and have data storage in the aforementioned structure.

Notwithstanding, earlier machines used content-addressable memory (CAM) in the renamer. Due east.g., the HPSM RAT, or Register Alias Table, substantially used a CAM on the logical annals number in combination with different versions of the register.

In many ways, the story of out-of-guild microarchitecture has been how these CAMs have been progressively eliminated. Small-scale CAMs are useful; big CAMs are impractical.^{[ citation needed ]}

The P6 microarchitecture was the outset microarchitecture by Intel to implement both out-of-order execution and register renaming. The P6 microarchitecture was used in Pentium Pro, Pentium Two, Pentium Three, Pentium M, Core, and Core 2 microprocessors. The Cyrix M1, released on October 2, 1995,^[1] was the first x86 processor to apply register renaming and out-of-society execution. Other x86 processors (such as NexGen Nx686 and AMD K5) released in 1996 also featured register renaming and out-of-society execution of RISC μ-operations (rather than native x86 instructions).^[two] ^[three]

References [edit]

^ "Cyrix 6x86 Processor".
^ "NexGen Nx686".
^ "PC Mag Dec six, 1994". Ziff Davis. 1994-12-06.

Smith, J. E.; Pleszkun, A. R. (June 1985). "Implementation of precise interrupts in pipelined processors". ACM SIGARCH Computer Architecture News. 13 (3): 36–44. doi:10.1145/327070.327125.
Smith, J. E.; Pleszkun, A. R. (May 1988). "Implementing precise interrupts in pipelined processors". IEEE Trans. Comput. 37 (five): 562–573. doi:ten.1109/12.4607.
Smith, J. Due east.; Pleszkun, A. R. (1998). "Implementation of precise interrupts in pipelined processors". 25 years of the international symposia on Reckoner architecture (selected papers) - ISCA '98. pp. 291–299. doi:10.1145/285930.285988. ISBN1581130589.