Validation of surgical simulators in the last two decades

While simulation and simulators have a long history in training programs in various domains, such as the military and aviation, their appearance on the scene of surgical training is more recent [1]. Simulators offer various important advantages over both didactic teaching and learning by performing procedures in patients. They have been shown to prevent harm and discomfort to patients and shorten learning curves, the latter implying that they also offer cost benefits [2–11]. They are tailored to individual learners, enabling them to progress at their own rate [6]. Additionally, learning on simulators in a skillslab environment allows learners to make mistakes. This is important considering that learning from one’s errors is a key component of skills development [4, 8, 11]. Apart from their worth as training instruments, simulators can also be valuable for formative and summative assessment [3, 6, 12] because they enable standardized training and repeated practice of procedures under standardized conditions [13].

These potential benefits are widely recognized and there is considerable interest in the implementation of simulators in training programs. It is also generally accepted, however, that simulators need to be validated before they can be effectively integrated into educational programs [5, 6, 14, 15]. Validation studies address different kinds of validity, such as “face,” “content,” “expert,” “referent,” “discriminative,” “construct,” “concurrent,” “criterion,” and/or “predictive” validity. There is no uniformity in how these types of validity are defined in different papers [15–18]. Additionally, a literature search failed to identify any guidelines on how to define and measure different types of validity. Nevertheless, most papers report positive results in respect of all kinds of validity of various simulators. However, what do these results actually reflect?

This paper reviews the literature on the validation of surgical simulators and the main experiences and efforts in this area during the last two decades. Building on this review, suggestions are made for future research into the use of simulators in surgical skills training.

Terminology of validation

What exactly is validation and what types of validity can be distinguished? There is general agreement in the literature that a distinction can be made between subjective and objective approaches to validation [15–18]. Subjective approaches examine novices’ (referents’) and/or experts’ opinions, while objective approaches are used in prospective experimental studies. Face, content, expert, and referent validity are addressed by subjective approaches. These types of validity studies generally require experts (usually specialists) and novices (usually residents or students) to perform a procedure on a simulator, after which both groups are asked to complete a questionnaire about their experience with the simulator. Construct, discriminative, concurrent, criterion, and predictive validity are addressed by objective approaches; these studies generally involve experiments to ascertain whether a simulator can discriminate between different levels of expertise or to evaluate the effects of simulator training (transfer) by measuring real-time performance, for example, on a patient, a cadaver, or a substitute real-time model.

Subjective approaches to validity (expert and novice views)

A literature search for guidelines on face and content validity yielded several definitions of validity [15–18] but no guidelines on how it should be established. As illustrated in Table 1, studies on face and content validity have used rather arbitrary cutoff points to determine the appropriateness and value of simulators [16, 19–24]. The variety in scales and interpretations in the literature suggests a lack of consensus regarding criteria for validity.

Table 1 Methods used to quantify and interpret face and content validity

It is not only important to decide how validity is to be determined; it is also important to decide who is best suited to undertake this task. The literature offers no detailed answers in this regard. It may be advisable to entrust this task to focus groups of specialists who are experts in the procedure in question and in judging simulators. Perhaps judges should also be required to possess good background knowledge on simulators and simulator development. Preferred settings of validation studies need to be considered as well. So far, most tests of face and content validity of surgical simulators have been conducted at conferences (Table 1), where participants are easily distracted by other people and events. Selection bias may also be inherent in this setting, because those who do not believe in simulator training are unlikely to volunteer to practice on a simulator, let alone participate in a validation study.
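
To make the use of such cutoff points concrete, the following is a minimal sketch, in Python, of how questionnaire-based face and content validity is often quantified in studies of the kind listed in Table 1. The item names, ratings, and the threshold of 3.5 on a five-point Likert scale are illustrative assumptions, not values taken from any particular study.

```python
# Illustrative sketch: scoring a face/content validity questionnaire.
# Items, ratings, and the cutoff are hypothetical examples, not data
# from any published validation study.

from statistics import mean

CUTOFF = 3.5  # assumed threshold on a 1-5 Likert scale

# Each item maps to the ratings given by the participating experts/novices.
ratings = {
    "realism of anatomy":         [4, 5, 3, 4, 4],
    "realism of instrument feel": [3, 3, 2, 4, 3],
    "usefulness for training":    [5, 4, 4, 5, 4],
}

for item, scores in ratings.items():
    avg = mean(scores)
    verdict = "acceptable" if avg >= CUTOFF else "needs improvement"
    print(f"{item}: mean {avg:.1f} -> {verdict}")
```

Whether 3.5, 4.0, or some other value is chosen as the cutoff is exactly the kind of arbitrary decision that the studies in Table 1 make in different ways, which underlines the need for consensus noted above.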

Objective approaches (experimental studies)

Experimental studies on the simulator

Several studies have examined the construct (discriminative) validity of simulators for endourological procedures [25, 26]. Although the concept of construct validity is somewhat clearer than the concepts underlying subjective validity, these studies showed substantial variation in methods, data analysis, participants, and outcome parameters. Between 1980 and 2008, several studies examined construct validity in relation to endourological simulators [25]. Table 2 presents the methods used in these studies, in which medical students and residents were the novices and specialists fulfilled the role of experts, unless mentioned otherwise. Time taken to complete a procedure was a parameter used in all the studies. Time is considered an important parameter, but it is not necessarily indicative of achievement of the desired outcome [27]. An exclusive focus on decreasing performance time may eventually result in decreased quality of outcome, suggesting that, besides time, other parameters should be taken into account in measuring validity.

Table 2 Methods and parameters used to assess construct validity

In general surgery there is a similar awareness of discrepancies in the usage and interpretation of construct validity and outcome parameters. Thijssen et al. conducted a systematic review of validation of virtual-reality (VR) laparoscopy metrics, searching two databases and including 40 publications out of 643 initial search results [28]. The data on construct validation were unequivocal for “time” in four simulators and for “score” in one simulator [28], but the results were contradictory for all the other VR metrics used. These findings led those authors to recommend that outcome parameters for measuring simulator validity should be reassessed and based on analysis of expert surgeons’ motions, decisive actions during procedures, and situational adaptation.
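
As an illustration of the kind of analysis construct validity studies typically report, the sketch below compares completion times of a hypothetical novice group and a hypothetical expert group with a Mann-Whitney U test. The group sizes, times, and the 0.05 significance threshold are assumptions made for the example; the actual tests and metrics used vary between the studies summarized in Table 2.

```python
# Illustrative construct validity check: do experts complete the simulated
# procedure faster than novices? All data below are invented for the example.

from scipy.stats import mannwhitneyu

novice_times = [412, 388, 455, 501, 367, 439, 480, 395]   # seconds, hypothetical
expert_times = [198, 240, 215, 187, 260, 205, 222, 231]   # seconds, hypothetical

# Non-parametric test, suitable for the small samples typical of these studies.
# alternative="greater" tests whether novices tend to take longer than experts.
stat, p_value = mannwhitneyu(novice_times, expert_times, alternative="greater")

print(f"U = {stat:.1f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Completion time discriminates between the groups "
          "(consistent with construct validity on this metric).")
else:
    print("No significant difference on this metric.")
```

As argued above, a significant difference in time alone says little about the quality of the outcome, so such an analysis would normally be combined with other performance parameters.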

Transfer of simulator-acquired skills to performance in patients

Only three studies have examined criterion validity of endourological simulators [25, 29–31]. Ogan et al. demonstrated that training on a VR ureterorenoscopy (URS) simulator improved performance on a male cadaver [31]. Knoll et al. trained five residents in the URS procedure on the URO Mentor and compared their performances on the simulator with performances in patients by five other residents by having unblinded supervisors rate the residents’ performances [30]. Brehmer et al. compared experts’ real-time performances with their performances on a simulator [29].

Transfer studies of laparoscopic and endoscopic simulators have shown very positive results regarding improvement of real-time performances [12, 29–37]. These results should be interpreted with caution, however, because of small sample sizes (frequently less than 30), lack of randomization, supervisors who were not blinded to type of training, groups with dissimilar backgrounds (e.g., surgical and nonsurgical residents), and/or studies limited to a comparison between experts’ performances on a simulator and in the operating room but not between experts’ and novices’ performances. Also, some of these studies did not use real patients but human cadavers or animal models to measure real-time performance [31, 33].

Ethical and legal concerns may hamper transfer studies where the ideal study protocol would involve groups of trained and untrained participants performing the procedure of interest in a patient. However, even though today many residents learn procedures in patients without prior training on a simulator, this type of study is unlikely to gain the approval of Medical Review Ethics Committees, especially if a study tests the hypothesis that trained participants will outperform controls, implying that the patient is at risk when procedures are performed by controls.

Definition of novices and experts

An important issue in validity research is defining participants’ levels of expertise. Generally, the term “novices” designates persons with no experience at all in performing the procedure under study, while the term “expert” refers to specialists with ample experience in performing the procedure in patients. However, some studies labeled participants with only some experience as “novices” while residents who had not yet completed the learning curve were considered “experts” (Tables 1 and 2). In the absence of clear standards for classifying experts and novices, researchers apparently use arbitrary cutoff points. With regard to urethrocystoscopy, for example, Gettman et al. classified those who had performed 100 procedures or more as experts [38], whereas Shah et al. required performance of > 1,000 procedures for qualification as an expert [39]. Apart from differences regarding the number of procedures used as the cutoff point between novice and expert, it is questionable whether it is at all defensible to use number of procedures performed as a measure of expertise. For one thing, self-estimated numbers are likely to be unreliable [40] and, furthermore, having performed more procedures does not automatically correlate with increased quality of performance. It might be better to focus on external assessment of expertise or a more objective standard to discriminate between experts and novices.

Recommendations for validation and implementation of surgical training models

It is inadvisable to use training models before their validity as an educational tool has been proven by research [5, 6, 14, 15]. However, there is as yet no consensus on appropriate methods and parameters to be used in such studies. So far, validity studies have mainly focused on technical skills. Although these skills are important, they are not the only aspect of operating on patients. The problems concerning transfer studies and the diversity of study outcomes demonstrate that it may be better to design and evaluate a comprehensive training program instead of validating only one aspect or part of a procedure that can be performed on a simulator. This requires an understanding of educational theories and backgrounds and a multidisciplinary approach in which specialists, residents, educationalists, and industrial designers collaborate. In addition, we should learn from experiences in other domains, such as the military and aviation, where similar difficulties with regard to the use of simulators in training are encountered.

Integration of training needs analysis and training program design in developing training facilities

“For a long time, simulator procurement for military training purposes has been mainly a technology-pushed process driven by what is offered on the market. In short, the more sophisticated the simulator’s capabilities, the more attractive it is to procure. Training programmes are later developed based upon the device procured, sometimes only for the training developers to conclude that the simulator “did not meet the requirements” or, even worse, that it was unusable because of a complete mismatch between the capabilities and limitations of the device on the one hand and the basic characteristics and needs of the trainees on the other” [41].

Nowadays, there is awareness of the mechanism described by Farmer et al. within surgical communities too, and there is also a growing realization of the need to reevaluate the methods and approaches used in developing surgical training programs. In military training in the 1990s, there was a generally acknowledged need for an integrated framework as well as research and development of simulations, based on the realization that the world was changing and conditions and constraints were evolving [41]. It was stated that “simulation by itself cannot teach,” and this insight led to the Military Applications of Simulator and Training concepts based on Empirical Research (MASTER) project in 1994, in which 23 research and industrial organizations in five countries combined their knowledge to develop generic concepts and common guidelines for the procurement, planning, and integration of simulators for use in training.

The MASTER project underlined the importance of three key phases of program development: training needs analysis (TNA), training program design (TPD), and training media (simulators, for example) specification (TMS) [41]. These phases have also been described in the medical education literature [2, 42]. TNA involves task analysis and listing the pitfalls of a procedure that need to be trained. When training needs and the place of a simulator in the curriculum are analyzed before a simulator is actually introduced, a major problem of validation studies can be avoided, namely the fact that some simulators train and measure different, not equally relevant, parameters [43]. TPD follows TNA, and is concerned with organizing the existing theoretical and practical knowledge about the use of simulators with a focus on outlining training program requirements. Following TPD, the TMS phase focuses on simulator requirements. Validation has its place in this phase. As Satava stated “Simulators are only of value within the context of a total educational curriculum” and “the technology must support the training goals” [44].

Figures 1 and 2 present a ten-step approach to developing surgical training programs. Figure 1 represents the preparation phase, consisting of training needs analyses. Figure 2 shows a recommended approach to evaluating and implementing surgical simulators in curricula. For every new training program it should be considered whether all the steps of the process are feasible and cost effective. New developments and improvements in education usually require financial investment. In order to keep costs to a minimum, however, the expected benefits should be weighed against possible drawbacks and the costs that go along with them.

Fig. 1 The training needs analysis phase of training program development

Fig. 2 Creating a training program, including training program design and training media (model) specification

Accreditation and certification are also very important aspects that need to be considered once the definitive training program has been designed. Because accreditation and certification follow program development, they are not included in Figs. 1 and 2.

Integration of nontechnical factors that influence practical skills performances

As early as 1978 Spencer et al. pointed out that a skillfully performed operation is 75% decision making and only 25% dexterity [45]. Nontechnical (human) factors strongly influence residents’ and students’ performances [3, 14, 46–57]. Moreover, research concerning safety in surgery has shown that adverse events are frequently preceded by individual errors, which are influenced by diverse (human) factors [9, 58].

Surgical training is still very much focused on technical skills, although a skillslab environment may be an ideal setting for integrating technical and nontechnical factors. There is still a gap between research into human factors and educational research [41]. Taking account of human factors expertise early in the development of training programs, and also in the specification of training media, can make a considerable contribution to improving the validity and cost-effectiveness of training [41].

Effective surgical training depends on programs that are realistic, structured, and grounded in authentic clinical contexts that recreate key components of the clinical experience [8, 9, 14, 56, 59, 60]. Ringsted et al. showed that factors involved in the acquisition of technical skills can be divided into three main groups: task, person, and context [53]. The model of the acquisition of surgical practical skills shown in Fig. 3 is based on these groups. It illustrates the complexity of a learning process that is affected by various factors.

Fig. 3 Factors that influence performance of practical skills of trainees

Collaboration of specialists, residents, educationalists, and industrial designers

Curriculum design is not a task that should be left to one individual. Preferably, it involves multidisciplinary consultations and research [41]. When specialists, residents, educationalists, and industrial designers collaborate and share their knowledge they will be able to make progress in developing and implementing simulator training in curricula [61].

Simulators can assist clinical teachers and relieve some of their burden. Not every specialist is a good teacher. Superior performance of procedures in patients does not automatically imply similar excellence in teaching others to do the same. Currently, training of medical skills during procedures on patients depends largely on the willingness of trainers to allow trainees to practice and improve their diagnostic and procedural skills. As a result, training is strongly teacher and patient centered [62]. A skillslab environment offers a much more learner-centered educational environment [8]. However, this can only be achieved if not only specialists (teachers), but also residents (learners), educationalists (teaching the teachers), and industrial designers (suppliers of teaching facilities) are allowed to contribute their expertise to developing the content of training programs.

Development and evaluation of assessment methods

Performance assessment tools are needed to evaluate and validate surgical simulators. Several methods that have been developed or are being developed involve the use of simulators not only to practice but also to assess skills. VR and augmented-reality (AR) simulators allow automatic gathering of objective data on performance [14, 17, 37]. However, the development of these metrics is itself an emerging field and, as described earlier, there is no uniform approach to measuring performance with VR or AR simulators. Motion analysis, which tracks how trainees move laparoscopic instruments, is a relatively new and important type of assessment [63]. Although this enables objective performance assessment, assessment methods based on data generated by VR/AR simulators and motion analysis offer limited possibilities because of their exclusive focus on technical skills and because many of these systems can only be used in training environments [63]. Another promising development in assessment is error analysis by means of video review [64–66].
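
To illustrate the kind of metric that motion analysis systems derive, the sketch below computes path length and economy of motion from a short series of tracked instrument tip positions. The coordinates are invented for the example, and real systems report richer metrics (e.g., smoothness and depth perception) at much higher sampling rates.

```python
# Illustrative motion analysis metrics from tracked instrument tip positions.
# Coordinates (in mm) are invented; real trackers sample continuously.

import numpy as np

# One row per sample: x, y, z position of the instrument tip.
tip_positions = np.array([
    [0.0, 0.0, 0.0],
    [2.0, 1.0, 0.5],
    [4.5, 1.5, 1.0],
    [6.0, 3.0, 1.2],
    [8.0, 3.5, 1.5],
])

# Path length: sum of distances between consecutive samples.
steps = np.diff(tip_positions, axis=0)
path_length = np.linalg.norm(steps, axis=1).sum()

# Straight-line distance from the start to the end of the movement.
direct_distance = np.linalg.norm(tip_positions[-1] - tip_positions[0])

# Economy of motion: values closer to 1.0 indicate less wasted movement.
economy = direct_distance / path_length

print(f"path length: {path_length:.1f} mm")
print(f"economy of motion: {economy:.2f}")
```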

Currently, the most commonly used and the only thoroughly validated method to assess technical as well as nontechnical skills is Objective Structured Assessment of Technical Skills (OSATS). OSATS can be used to assess performance on simulators as well as real-time performance in patients. Performance is usually scored by a supervisor on a five-point scale [67]. However, although OSATS has been thoroughly evaluated and validated, it has the disadvantage of being dependent on supervisors’ subjective opinions. As Miller stated in 1990, “No single assessment method can provide all the data required for judgment of anything so complex as the delivery of professional services by a successful physician” [68]. It seems eminently desirable to further develop and thoroughly evaluate and validate these assessment methods, especially for assessment of real-time performance.
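
Because OSATS depends on supervisors’ subjective ratings, studies using it typically report some measure of inter-rater agreement alongside the scores. The sketch below, with invented five-point ratings from two hypothetical raters, totals the scores and computes an unweighted Cohen’s kappa; the item list is abbreviated for illustration and does not reproduce the actual OSATS global rating form.

```python
# Illustrative OSATS-style scoring with an inter-rater agreement check.
# Ratings (1-5 per item) are invented; the item list is abbreviated and
# does not reproduce the actual OSATS global rating form.

from collections import Counter

items = ["respect for tissue", "time and motion", "instrument handling",
         "flow of operation", "knowledge of procedure"]

rater_a = [4, 3, 4, 3, 5]
rater_b = [4, 3, 3, 3, 5]

print(f"total score rater A: {sum(rater_a)} / {5 * len(items)}")
print(f"total score rater B: {sum(rater_b)} / {5 * len(items)}")

# Unweighted Cohen's kappa: observed agreement corrected for chance agreement.
n = len(rater_a)
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
freq_a, freq_b = Counter(rater_a), Counter(rater_b)
expected = sum(freq_a[c] * freq_b[c] for c in set(rater_a) | set(rater_b)) / n**2
kappa = (observed - expected) / (1 - expected)

print(f"Cohen's kappa between raters: {kappa:.2f}")
```

In practice, many more trainees and items would be rated before agreement statistics become meaningful; the point of the sketch is only to show why rater-dependent instruments need this extra layer of evaluation.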

Conclusion

Studies examining the validity of surgical simulators are needed to advance the implementation of simulators in surgical education programs. The absence in the literature of general guidelines for interpreting the results of subjective validity studies points to a need to seek consensus, if possible, and to perform research to identify appropriate methods for evaluating this type of validity and for interpreting results. A considerable number of studies have addressed objective construct (discriminative) validity of simulators. However, there is substantial variation in outcome parameters and it is questionable whether the measured parameters actually reflect those aspects that are most important for novices to learn on a simulator. Few objective studies have examined whether skills learned on a simulator can be transferred successfully to patient care, a lack that is partly due to ethical and legal issues restricting these types of studies.

Validation and integration of surgical simulators in training programs may be more efficient if training needs analysis (TNA) is performed first and program requirements are set in a training program design (TPD) phase by a multidisciplinary team consisting of specialists, residents, educationalists, and industrial designers. Furthermore, for successful transfer of skills from simulator to patient, it is important to consider and include the influence of contextual, (inter)personal, and task-related factors in training programs, rather than merely focusing on technical skills. Multiple validated assessment methods of practical performance are essential for evaluating training programs and individual performances. Current assessment methods are few, not yet thoroughly validated, and mostly focused on technical skills only. Educational and medical communities should join forces to promote further development and validation of the available assessment methods.