This application claims priority from U.S. Provisional Application No. 60/985,953 filed Nov. 6, 2007, which provisional application is incorporated herein by reference in its entirety.
BACKGROUND
1. Technical Field
This disclosure relates generally to a virtualized computer system and, in particular, to a method and system for a virtual machine to transition from replay mode to live execution mode.
2. Description of the Related Art
The advantages of virtual machine technology have become widely recognized. Among these advantages is the ability to run multiple virtual machines on a single host platform. This makes better use of the capacity of the hardware, while still ensuring that each user enjoys the features of a “complete” computer. Depending on how it is implemented, virtualization can also provide greater security, since the virtualization can isolate potentially unstable or unsafe software so that it cannot adversely affect the hardware state or system files required for running the physical (as opposed to virtual) hardware.
As is well known in the field of computer science, a virtual machine (VM) is an abstraction—a “virtualization”—of an actual physical computer system.
Each VM 200 will typically have both virtual system hardware 201 and guest system software 202. The virtual system hardware 201 typically includes at least one virtual CPU 210, virtual memory 230, at least one virtual disk 240, and one or more virtual devices 270. Note that a disk—virtual or physical—is also a “device,” but is usually considered separately because of the important role of the disk. All of the virtual hardware components of the VM may be implemented in software using known techniques to emulate the corresponding physical components. The guest system software includes a guest operating system (OS) 220 and drivers 224 as needed for the various virtual devices 270. Although
Referring back to
Yet another configuration is found in a so-called “multi-core” architecture, in which more than one physical CPU is fabricated on a single chip, each with its own set of functional units (such as a floating-point unit and an arithmetic/logic unit (ALU)), and each able to execute threads independently; multi-core processors typically share only very limited resources, such as some cache. Still another technique that provides for simultaneous execution of multiple threads is referred to as “simultaneous multi-threading,” in which more than one logical CPU (hardware thread) operates simultaneously on a single chip, but in which the logical CPUs flexibly share some resources such as caches, buffers, functional units, etc. This invention may be used regardless of the type (physical and/or logical) or number of processors included in a VM.
If the VM 200 is properly designed, applications 260 running on the VM will function as they would if run on a “real” computer, even though the applications are running at least partially indirectly, that is via the guest OS 220 and virtual processor(s). Executable files will be accessed by the guest OS from the virtual disk 240 or virtual memory 230, which will be portions of the actual physical disk 140 or memory 130 allocated to that VM. Once an application is installed within the VM, the guest OS retrieves files from the virtual disk just as if the files had been pre-stored as the result of a conventional installation of the application. The design and operation of virtual machines are well known in the field of computer science.
Some interface is generally required between the guest software within a VM and the various hardware components and devices in the underlying hardware platform. This interface—which may be referred to generally as “virtualization software” or “virtualization logic”—may include one or more software components and/or layers, possibly including one or more of the software components known in the field of virtual machine technology as “virtual machine monitors” (VMMs), “hypervisors,” or virtualization “kernels.” Because virtualization terminology has evolved over time and has not yet become fully standardized, these terms do not always provide clear distinctions between the software layers and components to which they refer. For example, “hypervisor” is often used to describe both a VMM and a kernel together, either as separate but cooperating components or with one or more VMMs incorporated wholly or partially into the kernel itself; however, “hypervisor” is sometimes used instead to mean some variant of a VMM alone, which interfaces with some other software layer(s) or component(s) to support the virtualization. Moreover, in some systems, some virtualization code is included in at least one “superior” VM to facilitate the operations of other VMs. Furthermore, specific software support for VMs may be included in the host OS itself. Unless otherwise indicated, the invention described below may be used in virtualized computer systems having any type or configuration of virtualization software. Also, as various virtualization functionalities may be implemented either in software or hardware, the invention described below may be used in virtualized computer systems having any type or configuration of virtualization logic. Although the invention is described below in terms of virtualization software, substantially the same description applies with respect to virtualization logic.
The various virtualized hardware components in the VM, such as the virtual CPU(s) 210-0, 210-1, . . . , 210-m, the virtual memory 230, the virtual disk 240, and the virtual device(s) 270, are shown as being part of the VM 200 for the sake of conceptual simplicity. In actuality, these “components” are usually implemented as software emulations 330 included in the VMM. One advantage of such an arrangement is that the VMM may (but need not) be set up to expose “generic” devices, which facilitate VM migration and hardware platform-independence.
Different systems may implement virtualization to different degrees—“virtualization” generally relates to a spectrum of definitions rather than to a bright line, and often reflects a design choice with respect to a trade-off between speed and efficiency on the one hand and isolation and universality on the other hand. For example, “full virtualization” is sometimes used to denote a system in which no software components of any form are included in the guest other than those that would be found in a non-virtualized computer; thus, the guest OS could be an off-the-shelf, commercially available OS with no components included specifically to support use in a virtualized environment.
In contrast, another concept, which has yet to achieve a universally accepted definition, is that of “para-virtualization.” As the name implies, a “para-virtualized” system is not “fully” virtualized, but rather the guest is configured in some way to provide certain features that facilitate virtualization. For example, the guest in some para-virtualized systems is designed to avoid hard-to-virtualize operations and configurations, such as by avoiding certain privileged instructions, certain memory address ranges, etc. As another example, many para-virtualized systems include an interface within the guest that enables explicit calls to other components of the virtualization software.
For some, para-virtualization implies that the guest OS (in particular, its kernel) is specifically designed to support such an interface. According to this view, having, for example, an off-the-shelf version of Microsoft Windows XP™ as the guest OS would not be consistent with the notion of para-virtualization. Others define para-virtualization more broadly to include any guest OS with any code that is specifically intended to provide information directly to any other component of the virtualization software. According to this view, loading a module such as a driver designed to communicate with other virtualization components renders the system para-virtualized, even if the guest OS as such is an off-the-shelf, commercially available OS not specifically designed to support a virtualized computer system. Unless otherwise indicated or apparent, this invention is not restricted to use in systems with any particular “degree” of virtualization and is not to be limited to any particular notion of full or partial (“para-”) virtualization.
In addition to the sometimes fuzzy distinction between full and partial (para-) virtualization, two arrangements of intermediate system-level software layer(s) are in general use—a “hosted” configuration and a non-hosted configuration (which is shown in
As illustrated in
Note that the kernel 600 (also referred to herein as the “VMkernel”) is not the same as the kernel that will be within the guest OS 220—as is well known, every operating system has its own kernel. Note also that the kernel 600 is part of the “host” platform of the VM/VMM as defined above even though the configuration shown in
The kernel 600 is responsible for initiating physical input/output (I/O) on behalf of the VMs 200 and communicating the I/O completion events back to the VMs 200. In fully virtualized systems, I/O completion events often take the form of a virtual interrupt delivered to one of the virtual processors (VCPUs) of the requesting VM.
Virtualized computer systems are often provided with fault tolerance capabilities, so that the virtualized computer system may continue to operate properly in the event of a failure of one of the VMs. One way of providing fault tolerance is to run two virtual machines (a “primary” virtual machine, and a “backup” or “secondary” virtual machine) in near lockstep. In some implementations, the backup VM replays log entries recorded by and received from the primary VM to mimic the operation of the primary VM (i.e., the primary VM records and the backup VM replays). When the primary VM faults, the backup VM stops replaying the log entries and transitions to live execution mode to resume interactive execution with the real world. The act of the backup VM resuming interactive execution with the external world is referred to herein as “going-live.” While the backup VM is replaying the recorded log entries, most external inputs, including network packets and interrupts, are obtained from the log entries received from the primary VM. In contrast, once the backup VM goes live and resumes interactive execution, it no longer depends on the recorded log entries; it interacts directly with I/O devices and, through them, with the external world.
This disclosure relates to limiting the execution points at which the backup VM can go live and resume interactive execution with the external world.
SUMMARY
Embodiments of the present disclosure include a method and system for allowing a backup VM to enter live execution mode at instruction boundaries but not in the middle of emulation of an instruction. This is accomplished by having the last log entry of the multiple log entries generated during emulation of an instruction include an indication of a “go-live” point and by having the backup VM not replay log entries provided by the primary VM beyond the log entry that indicates the “go-live” point.
BRIEF DESCRIPTION OF THE DRAWINGS
The teachings of the embodiments of the present disclosure can be readily understood by considering the following detailed description in conjunction with the accompanying drawings.
DETAILED DESCRIPTION OF EMBODIMENTS
The Figures (FIG.) and the following description relate to preferred embodiments of the present invention by way of illustration only. It should be noted that from the following discussion, alternative embodiments of the structures and methods disclosed herein will be readily recognized as viable alternatives that may be employed without departing from the principles of the claimed invention.
The present invention as described herein may be used to advantage in both a hosted and a non-hosted virtualized computer system, regardless of the degree of virtualization, in which the virtual machine(s) have any number of physical and/or logical virtualized processors. The present invention may also be implemented directly in a computer's primary operating system (OS), both where the OS is designed to support virtual machines and where it is not. Moreover, the invention may even be implemented wholly or partially in hardware, for example in processor architectures intended to provide hardware support for virtual machines. The present invention may be implemented as a computer program product including computer instructions configured to perform the methods of the present invention. The computer program can be stored on a computer readable storage medium to run on one or more processors of the virtualized computer system.
The features and advantages described in the specification are not all inclusive and, in particular, many additional features and advantages will be apparent to one of ordinary skill in the art in view of the drawings, specification, and claims. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter.
Reference will now be made in detail to several embodiments of the present invention(s), examples of which are illustrated in the accompanying figures. It is noted that wherever practicable similar or like reference numbers may be used in the figures and may indicate similar or like functionality. The figures depict embodiments of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.
The primary VM 200-1, which includes a VCPU 210-1 and a guest OS 220-1, is supported by a hypervisor 601-1, including a VMM 300-1 and a VMkernel 600-1, on the host system hardware 100-1. The primary VM 200-1 includes or accesses its own separate virtual disk 240-1, on a physical disk 140-1, as explained previously with reference to
One way of keeping the two VMs 200-1, 200-2 generally synchronized for fault tolerance is to record (log) all non-deterministic inputs or events encountered by the primary VM 200-1 in log entries 280 and send the log entries 280 to VMkernel 600-2 for the backup VM 200-2. In some embodiments, the backup VM 200-2 can be run in near lockstep with the primary VM 200-1. The VMkernel 600-1 corresponding to the primary VM 200-1 records such log entries and sends the log entries 280 to the VMkernel 600-2 corresponding to the backup VM 200-2. Non-deterministic inputs/events include, for example, (i) all inputs from the network external to the virtualized computer system, (ii) information regarding when virtual interrupts were delivered to the VCPU 210-1 due to external events, (iii) timer interrupts delivered to the VCPU 210-1, and (iv) timestamps delivered to the VCPU 210-1 when the VCPU 210-1 requires the current time via various hardware functionality. The hypervisor 601-2 then uses the log entries 280 to ensure that the backup VM 200-2 executes exactly the same instruction stream as the primary VM 200-1 (i.e., the backup VM 200-2 replays the log entries 280). More specifically, the backup VM 200-2 executes device emulation that does not require non-deterministic events, but inserts the non-deterministic events from the received log entries and replays such non-deterministic events from the recorded log entries 280. The VMkernel 600-2 sends acknowledgements 282 back to the VMkernel 600-1 indicating which log entries 280 have been received by the VMkernel 600-2 and which log entries 280 have been replayed on the backup VM 200-2.
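The record, replay, and acknowledge flow described above can be sketched as follows. This is a minimal illustration only, not the implementation of the disclosure: the names `EventKind`, `LogEntry`, `Recorder`, and `Replayer`, the sequence-number field, and the shape of the acknowledgement are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Tuple


class EventKind(Enum):
    """The four example categories of non-deterministic input listed above."""
    NETWORK_INPUT = auto()
    VIRTUAL_INTERRUPT = auto()
    TIMER_INTERRUPT = auto()
    TIMESTAMP = auto()


@dataclass(frozen=True)
class LogEntry:
    seq: int          # hypothetical sequence number, used only for ordering
    kind: EventKind
    payload: bytes


class Recorder:
    """Primary side: one log entry per non-deterministic event."""

    def __init__(self) -> None:
        self.entries: List[LogEntry] = []

    def record(self, kind: EventKind, payload: bytes) -> LogEntry:
        entry = LogEntry(len(self.entries), kind, payload)
        self.entries.append(entry)
        return entry


class Replayer:
    """Backup side: consumes entries in order and reports progress."""

    def __init__(self) -> None:
        self.received: List[LogEntry] = []
        self.replayed = 0

    def receive(self, entry: LogEntry) -> None:
        self.received.append(entry)

    def replay_next(self) -> LogEntry:
        # Inject the next logged event instead of sampling the real world.
        entry = self.received[self.replayed]
        self.replayed += 1
        return entry

    def acknowledgement(self) -> Tuple[int, int]:
        # (entries received, entries replayed), mirroring the two facts the
        # acknowledgements are described as carrying.
        return (len(self.received), self.replayed)
```

In this sketch the backup never consults a clock or a network device while replaying; everything non-deterministic arrives through `receive`.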
For record/replay to function properly, the virtual disks 240-1, 240-2 of the primary VM 200-1 and the backup VM 200-2 start in the same state. The primary and backup VMs 200-1, 200-2 both read from and write to their virtual disks 240-1, 240-2, respectively, while executing. Since the backup VM 200-2 executes the same way as the primary VM 200-1 through record/replay, the backup VM 200-2 will perform the same IO (Input/Output) to its virtual disks 240-2 as the primary VM 200-1 does to its virtual disks 240-1, and therefore the virtual disks 240-1, 240-2 will naturally stay in synchronization. The initiation of a disk IO is not logged in the log entries 280, because it is a deterministic result of the VM's behavior. The completion of a disk IO is logged in the log entries 280, since the exact instruction when a completion interrupt is delivered is non-deterministic. In addition, the completion status of each disk IO is also logged in the log entries 280, since the completion status is an additional non-deterministic result of the disk IO.
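The logging rule for disk IO in the paragraph above can be illustrated with a short sketch. The class names and the `instruction_count` field are hypothetical; the text states only that the point of interrupt delivery and the completion status are non-deterministic and are therefore logged, while initiation is not.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DiskCompletion:
    request_id: int
    instruction_count: int  # assumed way to identify where the interrupt landed
    status: str             # completion status, itself non-deterministic


class DiskIoLog:
    def __init__(self) -> None:
        self.entries: List[DiskCompletion] = []

    def on_issue(self, request_id: int) -> None:
        # Initiation is a deterministic result of the VM's behavior, so the
        # backup will issue the same IO on its own; nothing is logged.
        return None

    def on_complete(self, request_id: int, instruction_count: int,
                    status: str) -> DiskCompletion:
        # Both the delivery point and the status are recorded for replay.
        entry = DiskCompletion(request_id, instruction_count, status)
        self.entries.append(entry)
        return entry
```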
The primary VM 200-1, which includes the VCPU 210-1 and the guest OS 220-1, is supported by the hypervisor 601-1, including the VMM 300-1 and VMkernel 600-1, on the host system hardware 100-1. The primary VM 200-1 includes or accesses a shared virtual disk 240-1 on a shared physical disk 140-1. The backup VM 200-2, which includes the VCPU 210-2 and the guest OS 220-2, is supported by the hypervisor 601-2, including the VMM 300-2 and the VMkernel 600-2, on host system hardware 100-2. The backup VM 200-2 also includes or accesses the shared virtual disk 240-1 on the shared physical disk 140-1.
In order to keep the two VMs 200-1, 200-2 generally synchronized for fault tolerance, all non-deterministic inputs or events encountered by the primary VM 200-1 may be recorded (logged) in log entries 280 and provided to VMkernel 600-2 for the backup VM 200-2. In some embodiments, the backup VM 200-2 can be run in near lockstep with the primary VM 200-1. The VMkernel 600-1 corresponding to the primary VM 200-1 records such log entries and sends the log entries 280 to the VMkernel 600-2 corresponding to the backup VM 200-2. Non-deterministic inputs/events include, for example, (i) all inputs from the network external to the virtualized computer system, (ii) information regarding when virtual interrupts were delivered to the VCPU 210-1 due to external events, (iii) timer interrupts delivered to the VCPU 210-1, and (iv) timestamps delivered to the VCPU 210-1 when the VCPU 210-1 requires the current time via various hardware functionality. The VMM 300-2 (or VMkernel 600-2) then uses the log entries 280 to ensure that backup VM 200-2 executes exactly the same instruction stream as the primary VM 200-1 (i.e., the backup VM 200-2 replays the log entries). The VMkernel 600-2 sends acknowledgements 282 back to the VMkernel 600-1, indicating which log entries 280 have been received by the VMkernel 600-2 and which log entries 280 have been replayed on the backup VM 200-2.
According to the embodiment shown in
In the shared storage architecture of
In either the separate disk architecture of
In order to ensure that device states are consistent when the backup VM “goes live,” each device emulation by the VMM 300-2 is modified to recognize that a VM can be either in replay mode or live mode. In replay mode, inputs are obtained from the recorded execution log 280 and some outputs may be discarded (e.g., network packets) or reissued (e.g., some modes of SCSI disks). In “live” mode, I/O is dealt with by the backup VM 200-2 executing normally.
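A mode-aware device emulation of the kind described above might look like the following sketch for a network device. The class `NetworkDeviceEmulation`, its method names, and the list standing in for the physical network are assumptions made for illustration.

```python
from collections import deque
from enum import Enum
from typing import Deque, Iterable, List, Optional


class Mode(Enum):
    REPLAY = "replay"
    LIVE = "live"


class NetworkDeviceEmulation:
    def __init__(self, logged_inputs: Iterable[bytes], wire: List[bytes]) -> None:
        self.mode = Mode.REPLAY
        self._log: Deque[bytes] = deque(logged_inputs)
        self._wire = wire  # stand-in for the real network

    def receive(self, live_packet: Optional[bytes] = None) -> Optional[bytes]:
        if self.mode is Mode.REPLAY:
            # Inputs come from the recorded log, not from the outside world.
            return self._log.popleft()
        return live_packet

    def transmit(self, packet: bytes) -> bool:
        if self.mode is Mode.REPLAY:
            # Outgoing network packets are discarded during replay.
            return False
        self._wire.append(packet)
        return True
```

Switching `mode` to `Mode.LIVE` is the only change needed for the same emulation object to start exchanging real traffic, which is the property the paragraph relies on.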
However, even with these modifications, device emulation may not be consistent in certain situations. For example, an I/O device replaying a command completion might expect an I/O completion event followed by a series of log entries in the log file that contain the actual data. Allowing the backup VM 200-2 to go live at any point in the middle of the emulation of a single instruction would unnecessarily complicate the implementation of recording and replay. Likewise, making the granularity of going-live larger than a single instruction can complicate the implementation of recording and replay and add unacceptable latencies to certain I/O operations during replay. Rather than complicating the device emulation implementation to deal with the backup VM going live at any point in time, the backup VM 200-2 according to the present embodiment is allowed to go live only at instruction boundaries in the replay log 280.
Specifically, the emulation of any instruction can generate multiple log entries 280. For example, an OUT instruction to an I/O port can cause device emulation to run and generate multiple log entries. According to the present embodiment, the last log entry of that instruction is marked as the “go-live” point. This is because emulating an instruction can require many disparate portions of code to execute, each of which may generate a log entry, and because it is difficult to determine which log entry is the last one generated by instruction emulation until emulation of the instruction is completed. Thus, according to the embodiment, the last entry associated with an emulated instruction is marked as the go-live point before it is transmitted to the VMkernel 600-2 in the log file 280, and the backup VM 200-2 replays the log entries for that emulated instruction when the last log entry marked as the go-live point is received. Thus, at any moment, the backup VM 200-2 has replayed up to the go-live points in the log entries 280 at the instruction boundaries, and thus would be at a go-live point at any time when the backup VM 200-2 needs to enter live execution mode. This process is explained in more detail below with reference to
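The go-live marking scheme can be sketched as below. This is one possible arrangement under stated assumptions, not the disclosed implementation: the primary-side buffering until the instruction completes, and the names `PrimaryLogger`, `BackupReplayer`, `instruction_done`, and `at_go_live_point`, are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class LogEntry:
    data: str
    go_live: bool = False


class PrimaryLogger:
    """Holds the entries of the instruction being emulated (an assumption)."""

    def __init__(self) -> None:
        self._pending: List[LogEntry] = []
        self.sent: List[LogEntry] = []

    def log(self, data: str) -> None:
        self._pending.append(LogEntry(data))

    def instruction_done(self) -> None:
        # Only now is it known which entry was the last one generated.
        if not self._pending:
            return
        self._pending[-1] = replace(self._pending[-1], go_live=True)
        self.sent.extend(self._pending)
        self._pending.clear()


class BackupReplayer:
    def __init__(self) -> None:
        self._buffer: List[LogEntry] = []
        self.replayed: List[LogEntry] = []

    def receive(self, entry: LogEntry) -> None:
        self._buffer.append(entry)
        if entry.go_live:
            # Replay the whole instruction once its final entry has arrived.
            self.replayed.extend(self._buffer)
            self._buffer.clear()

    def at_go_live_point(self) -> bool:
        # True whenever nothing has been replayed past a go-live marker.
        return not self.replayed or self.replayed[-1].go_live
```

Because `receive` never replays an entry until the marked entry for its instruction is in hand, `at_go_live_point` holds at every moment, including when entries for a partially logged instruction are sitting in the buffer.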
This situation is illustrated in
Referring back to
Steps 460, 462, and 464 are also illustrated in
In other embodiments of the invention, however, the sequence of steps 452, 454, 456 and 458 may not always be performed precisely as illustrated in
This situation is illustrated in
Referring back to
Steps 556, 558, and 560 are also illustrated in
By use of the process illustrated in
Specifically, the VMM 300-2 for the backup VM 200-2 quiesces 628 devices so that each device is allowed to go into a state consistent with the state that the backup VM 200-2 assumes the devices to be in. Quiescing a device generally means allowing all pending IOs of that device to complete. For networking devices, quiescing is done by canceling all transmits. For disks, quiescing is handled differently depending upon whether the disks are shared as in
There are some devices for which the guest's assumption of the state after replaying will often differ from the actual external state when the backup VM 200-2 goes live. One example is the case where the guest OS 220-2 accesses host state, for example, a host-guest file system in which the guest OS 220-2 is allowed access to the host's file system. In order to refresh the devices' states when the backup VM 200-2 goes live, the virtualization software may cause 630 the guest OS 220-2 to reset its assumed states for the various devices by calling the devices to reset their state. Other examples of devices whose states are generally refreshed include a USB (Universal Serial Bus) interface, a physical CDROM drive, a sound card device, etc. Any device that has any state on the primary host generally cannot be replayed at the backup host unless the same device is present in the same state on the backup host. In situations where the states of the device are not the same between the primary host and the backup host, the state of the device is reset 630 prior to “go-live” at the backup host, usually by issuing a ‘device disconnected’ message up to the device at the backup VM 200-2.
With steps 628, 630 complete, the backup VM 200-2 is now ready to go-live and enter interactive execution mode. Thus, the backup VM 200-2 enters 632 live execution mode (and takes over the operation of the faulted primary VM 200-1 in case of a primary VM fault). Because the backup VM 200-2 has replayed the log entries 280 only up to the go-live points at instruction boundaries, the backup VM 200-2 would be at a go-live point at any time when the backup VM 200-2 enters 632 live execution mode. Thus, the backup VM 200-2 would not be able to go live in the middle of emulation of a single instruction.
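The ordering of the go-live steps (quiesce 628, reset 630, enter live mode 632) can be summarized in a short sketch. The `Device` fields and the `go_live` function are hypothetical; in particular, modeling quiescing as draining a pending-IO counter and modeling a reset as a disconnect flag are simplifications chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Device:
    name: str
    pending_ios: int = 0
    state_matches_backup_host: bool = True  # assumed per-device flag
    connected: bool = True

    def quiesce(self) -> None:
        # Stand-in for letting all pending IOs of the device complete.
        self.pending_ios = 0

    def reset(self) -> None:
        # Stand-in for delivering a 'device disconnected' message.
        self.connected = False


def go_live(devices: List[Device]) -> str:
    # Step 628: bring every device to a state consistent with what the
    # backup VM assumes.
    for device in devices:
        device.quiesce()
    # Step 630: reset devices whose state differs between the two hosts.
    for device in devices:
        if not device.state_matches_backup_host:
            device.reset()
    # Step 632: the VM may now enter live execution mode.
    return "live"
```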
Note that step 626 in
Upon reading this disclosure, those of skill in the art will appreciate still additional alternative structural and functional designs for allowing the backup VM to enter live execution when the primary VM faults, through the disclosed principles of the present disclosure. Thus, while particular embodiments and applications have been illustrated and described, it is to be understood that the disclosure is not limited to the precise construction and components disclosed herein and that various modifications, changes and variations which will be apparent to those skilled in the art may be made in the arrangement, operation and details of the method and apparatus of the present disclosure herein.