A Survey of Human-Centered Evaluations in Human-Centered Machine Learning

Fabian Sperrle, Mennatallah El-Assady, Grace Guo, Rita Borgo, Duen Horng Chau, Alex Endert, Daniel Keim

Abstract: Visual analytics systems integrate interactive visualizations and machine learning to enable expert users to solve complex analysis tasks. Applications combine techniques from various fields of research and are consequently not trivial to evaluate. The result is a lack of structure and comparability between evaluations. In this survey, we provide a comprehensive overview of evaluations in the field of human-centered machine learning. We particularly focus on human-related factors that influence trust, interpretability, and explainability. We analyze the evaluations presented in papers from top conferences and journals in information visualization and human-computer interaction to provide a systematic review of their setup and findings. From this survey, we distill design dimensions for structured evaluations, identify evaluation gaps, and derive future research opportunities.

Considered Dimensions: We surveyed how human-centered evaluations of human-centered machine learning are performed in practice, and whether there are differences and commonalities across different data types and application domains. We considered 42 relevant dimensions from four high-level aspects.

Talk @ EuroVis'21

This state-of-the-art report was presented at EuroVis'21.


Challenges and Observations from the Paper

Before we get to the detailed methodologies of coding and paper collection, here is a brief summary of our main findings:

Evaluation Focus

  • Bias towards algorithm-centered evaluations
  • Multiple, tailored evaluations rather than one broad one

Participants

  • On average, six participants in the surveyed evaluations
  • Simpler replacement tasks for evaluation at scale are not always a good fit
  • Personal characteristics and experiences are rarely considered

Lack of Ground Truth

  • Exploratory or personalized analysis results
  • Multi-stage evaluations pass results on to further participants for evaluation or ranking

Due to these challenges, we observed...
  • Very diverse set of methodologies and study protocols.
  • More detailed evaluation of model and explanation properties than of human factors.
  • Incomplete study reporting.
  • No significant differences between the evaluation of supervised and unsupervised machine learning.

Summary of Findings

Below, we summarize the findings for each of the four aspects: study setup, model and explanations, interactions, and reported results. Many more details can be found in the paper.

Study Setup

  • We collect eleven dimensions in the three categories: study setup, participants, and tasks and data.
  • Study protocols and methodologies for data collection
  • Participant training and performed tasks
  • Used data types
  • Demographics of study participants

Most studies evaluate understanding and refinement. Few studies consider hypothesis creation or model justification. Several studies let participants use models without model-specific tasks.

The most frequently used data type was multivariate data, followed by text data and images. Only a few systems deal with videos or geographical data.

Typically, participants are experts in the system domain or the dataset used during the study. Few papers evaluate the influence of this expertise on study outcomes.

While most participants are experts in machine learning, several papers rely on participants with intermediate knowledge. Few papers evaluate the influence of machine learning expertise on study outcomes.

Model and Explanation Properties

  • We distinguish between six XAI properties concerning models and four properties of explanations
  • Properties of models are evaluated much more frequently than properties of explanations
  • We found a frequent mismatch between motivated and evaluated properties
  • Properties of models and explanations are often evaluated and reported in more detail than human-related factors

Some papers evaluate the participant's perception of transparency with or without making it a study condition (measured condition and measured, respectively).

Some papers manipulate controllability as a study condition (measured condition) or evaluate the participant's perception of controllability (measured).

Interpretability is the model property most frequently measured. Still, even more papers motivate this property without evaluating it.

Interaction

  • We survey eight dimensions in the three categories: model manipulation, timing, and guidance.
  • Most studies focus on the training or post-training phases, with only four covering data selection and data preprocessing.
  • We found a relatively equal distribution of direct and indirect model manipulation
  • Guidance is rarely evaluated at all, and no study set out to specifically evaluate the provided guidance

Only a few papers evaluated user tasks from the data selection or data preprocessing phases of the machine learning pipeline. Note: Papers can evaluate tasks from multiple phases.

While more papers offer direct interaction possibilities with the model, a significant number employ indirect model manipulation.

While guidance is infrequently evaluated in detail, most papers that provide guidance rely on less intrusive orienting guidance. Note: Papers can employ multiple guidance degrees.

Reported Results

  • To capture the reported results, we ran three IHTM topic models: one each on the main findings, the interface feedback, and the interaction feedback (see the illustrative sketch below).
  • Due to the diverse reporting formats and varying levels of detail, the topic models remained relatively high-level.
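
The IHTM models used in the survey are not reproduced here. Purely as a hypothetical illustration of the overall workflow, the following sketch runs a standard scikit-learn LDA topic model (a stand-in, not the paper's method) over a placeholder corpus of findings texts; the corpus contents and the number of topics are assumptions.

```python
# Illustrative stand-in only: the survey uses IHTM topic models; this sketch
# substitutes scikit-learn's LDA to show the general workflow of extracting
# topics from free-text study findings. The corpus below is a hypothetical
# placeholder, not data from the surveyed papers.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# One document per surveyed paper's reported findings (placeholder texts).
findings = [
    "participants used the interface to refine and control the model",
    "explanations improved interpretability and trust in the predictions",
    "users customized clustering results through direct interaction",
    "guidance helped participants understand the model behavior",
]

# Build a bag-of-words representation of the findings texts.
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(findings)

# Fit a small topic model; the number of topics is an arbitrary assumption.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(doc_term)

# Characterize each topic by its highest-weighted terms.
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {idx}: {', '.join(top_terms)}")
```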

The interaction topics suggest that interaction is often used for customization and model control.

The interface feedback topics highlight the usefulness of explanations and interactions for interpretability.

The topics for main findings relate to core tasks, including understanding, interpretability and explainability.


Next Steps

To support, in particular, junior researchers in developing more targeted study designs, we provide both a checklist for study design and a checklist for study reporting in human-centered evaluation of human-centered machine learning.

Checklist for Study Design

This checklist aims to facilitate the creation of study designs and ensure that relevant factors are covered before participants are evaluated. While most points should be self-explanatory, here is a brief summary:
  • Ordered checklist to work through during the ideation and setup of human-centered evaluations
  • For some surveyed papers it was not clear what they aimed to evaluate; hence, we start with the definition of research questions and the general evaluation approach. Depending on the hypothesis, human-centered or algorithm-centered evaluation perspectives need to be balanced differently.
  • In addition to the tasks that participants perform, consider whether participants will require training before they can complete the tasks, and to what extent they will be trained before the study begins.
  • Participants are likely a diverse group of people differing in age, gender, cultural background, and expertise in both the evaluated domain and dataset as well as in machine learning in general. Ensure that this diversity has been considered appropriately and define requirements for study participation.

Template for Study Reporting

This detailed template can be filled out and included in a paper's supplementary material to ensure that the information required to judge the study's findings is provided. Furthermore, it provides an overview of aspects that should be considered when reporting results.
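
The actual template is part of the paper's supplementary material and is not reproduced here. Purely as an illustration of the kind of structure such a report could take, the sketch below groups hypothetical reporting fields by the four surveyed aspects; all field names are assumptions derived from the dimensions discussed in this summary.

```python
# Hypothetical sketch only: the paper's actual reporting template is provided
# in its supplementary material. This structure merely illustrates how a study
# report could be organized along the four surveyed aspects; the field names
# are assumptions based on the dimensions discussed in this summary.
study_report = {
    "study_setup": {
        "protocol_and_data_collection": None,   # methodology, study protocol
        "participants": None,                   # number, expertise, demographics
        "training": None,                       # training provided before the tasks
        "tasks_and_data": None,                 # performed tasks and used data types
    },
    "model_and_explanations": {
        "model_properties": None,               # e.g., interpretability, transparency
        "explanation_properties": None,
        "motivated_vs_evaluated": None,         # which properties were actually evaluated
    },
    "interaction": {
        "pipeline_phases": None,                # data selection, preprocessing, (post-)training
        "model_manipulation": None,             # direct vs. indirect
        "guidance": None,                       # degree of guidance provided
    },
    "reported_results": {
        "main_findings": None,
        "interface_and_interaction_feedback": None,
        "limitations": None,                    # factors limiting generalization
    },
}
```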

Opportunities and Calls for Action

Our survey highlights several opportunities for future work in human-centered evaluation of human-centered machine learning:

Structured Evaluation Framework

  • Need to survey algorithm-centered evaluations
  • Derive guidelines for balanced evaluations

Shared Vocabulary

  • Ensure matching understanding of common XAI terms
  • Consider providing definitions of used terms in your next paper

Open Access & Reproducibility

  • Open access not only for systems and results, but also for evaluation frameworks if you create them

Based on our survey results, we call on the community to consider dedicated evaluation papers as one way to tackle the issue that current human-centered machine learning systems are often too complex to be evaluated holistically in a single paper:

Tailored Publications

Consider splitting papers; do not expect complex HCML systems to be evaluated within a single paper.

Realistic Reporting

Value candid, realistic reporting of study results, especially with respect to limitations.

Evaluated Aspects

Make clear which aspects were evaluated and which were not.

State Limitations

Be clear on limitations of study results, including factors limiting generalization.