A Survey of Human-Centered Evaluations in Human-Centered Machine Learning

Fabian Sperrle, Mennatallah El-Assady, Grace Guo, Rita Borgo, Duen Horng Chau, Alex Endert, Daniel Keim

Abstract: Visual analytics systems integrate interactive visualizations and machine learning to enable expert users to solve complex analysis tasks. Applications combine techniques from various fields of research and are consequently not trivial to evaluate. The result is a lack of structure and comparability between evaluations. In this survey, we provide a comprehensive overview of evaluations in the field of human-centered machine learning. We particularly focus on human-related factors that influence trust, interpretability, and explainability. We analyze the evaluations presented in papers from top conferences and journals in information visualization and human-computer interaction to provide a systematic review of their setup and findings. From this survey, we distill design dimensions for structured evaluations, identify evaluation gaps, and derive future research opportunities.

Considered Dimensions: We surveyed how human-centered evaluations of human-centered machine learning are performed in practice, and whether there are differences and commonalities across different data types and application domains. We considered 42 relevant dimensions from four high-level aspects.

Talk @ EuroVis'21

This state-of-the-art report was presented at EuroVis'21.


Challenges and Observations from the Paper

Before we get to the detailed methodologies of coding and paper collection, here is a brief summary of our main findings:

Evaluation Focus

  • Bias towards algorithm-centered evaluations
  • Multiple, tailored evaluations rather than one broad one

Participants

  • On average, six participants in the surveyed evaluations
  • Simpler replacement tasks for evaluation at scale are not always a good fit
  • Personal characteristics and experiences are rarely considered

Lack of Ground Truth

  • Exploratory or personalized analysis results
  • Multi-stage evaluations pass results on to further participants for evaluation or ranking

Due to these challenges, we observed...
  • Very diverse set of methodologies and study protocols.
  • More detailed evaluation of model and explanation properties than of human factors.
  • Incomplete study reporting.
  • No significant differences between the evaluation of supervised and unsupervised machine learning.

Summary of Findings

Below, we summarize the findings for each of the four aspects: study setup, model and explanations, interactions, and reported results. Many more details can be found in the paper.

Study Setup

  • We collect eleven dimensions in the three categories: study setup, participants, and tasks and data.
  • Study protocols and methodologies for data collection
  • Participant training and performed tasks
  • Used data types
  • Demographics of study participants

Most studies evaluate understanding and refinement. Few studies consider hypothesis creation or model justification. Several studies let participants use models without model-specific tasks.

The most frequently used data type was multivariate data, followed by text data and images. Only a few systems deal with videos or geographical data.

Typically, participants are experts in the system domain or the dataset used during the study. Few papers evaluate the influence of this expertise on study outcomes.

While most participants are experts in machine learning, several papers rely on participants with intermediate knowledge. Few papers evaluate the influence of machine learning expertise on study outcomes.

Model and Explanation Properties

  • We distinguish between six XAI properties concerning models and four properties of explanations
  • Properties of models are evaluated much more frequently than properties of explanations
  • We found a frequent mismatch between motivated and evaluated properties
  • Properties of models and explanations are often evaluated and reported in more detail than human-related factors

Some papers evaluate the participant's perception of transparency with or without making it a study condition (measured condition and measured, respectively).

Some papers manipulate controllability as a study condition (measured condition) or evaluate the participant's perception of controllability (measured).

Interpretability is the model property most frequently measured. Still, even more papers motivate this property without evaluating it.

Interaction

  • We survey eight dimensions in the three categories: model manipulation, timing, and guidance.
  • Most studies focus on the training or post-training phases, with only four covering data selection and data preprocessing.
  • We found a relatively equal distribution of direct and indirect model manipulation
  • Guidance is rarely evaluated at all, and no study set out to specifically evaluate the provided guidance

Only a few papers evaluated user tasks from the data selection or data preprocessing phases of the machine learning pipeline. Note: Papers can evaluate tasks from multiple phases.

While more papers offer direct interaction possibilities with the model, a significant number employ indirect model manipulation.

While guidance is infrequently evaluated in detail, most papers that provide guidance rely on less intrusive orienting guidance. Note: Papers can employ multiple guidance degrees.

Reported Results

  • To capture the reported results, we ran three IHTM topic models: one each on the main findings, the interface feedback, and the interaction feedback (see the illustrative sketch below).
  • Due to the diverse reporting formats and varying levels of detail, the topic models remained relatively high-level.
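
The IHTM models used in the survey are not reproduced here. Purely as a hypothetical illustration of the overall workflow, the following sketch runs a standard scikit-learn LDA topic model (a stand-in, not the paper's method) over a placeholder corpus of findings texts; the corpus contents and the number of topics are assumptions.

```python
# Illustrative stand-in only: the survey uses IHTM topic models; this sketch
# substitutes scikit-learn's LDA to show the general workflow of extracting
# topics from free-text study findings. The corpus below is a hypothetical
# placeholder, not data from the surveyed papers.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# One document per surveyed paper's reported findings (placeholder texts).
findings = [
    "participants used the interface to refine and control the model",
    "explanations improved interpretability and trust in the predictions",
    "users customized clustering results through direct interaction",
    "guidance helped participants understand the model behavior",
]

# Build a bag-of-words representation of the findings texts.
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(findings)

# Fit a small topic model; the number of topics is an arbitrary assumption.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(doc_term)

# Characterize each topic by its highest-weighted terms.
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {idx}: {', '.join(top_terms)}")
```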

The interaction topics suggest that interaction is often used for customization and model control.

The interface feedback topics highlight the usefulness of explanations and interactions for interpretability.

The topics for main findings relate to core tasks, including understanding, interpretability and explainability.


Next Steps

To support, in particular, junior researchers in developing more targeted study designs, we provide both a checklist for study design and a checklist for study reporting in human-centered evaluation of human-centered machine learning.

Checklist for Study Design

This checklist aims to facilitate the creation of study designs and ensure that relevant factors are covered before participants are evaluated. While most points should be self-explanatory, here is a brief summary:
  • Ordered checklist to work through during the ideation and setup of human-centered evaluations
  • For some surveyed papers it was not clear what they aimed to evaluate; hence, we start with the definition of research questions and the general evaluation approach. Depending on the hypothesis, human-centered or algorithm-centered evaluation perspectives need to be balanced differently.
  • In addition to the tasks that participants perform, consider whether participants will require training before they can complete the tasks, and to what extent they will be trained before the study begins.
  • Participants are likely a diverse group of people differing in age, gender, cultural background, and expertise in both the evaluated domain and dataset as well as in machine learning in general. Ensure that this diversity has been considered appropriately and define requirements for study participation.

Template for Study Reporting

This detailed template can be filled out and included in a paper's supplementary material to ensure that the information required to judge the study's findings is provided. Furthermore, it provides an overview of aspects that should be considered when reporting results.
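
The actual template is part of the paper's supplementary material and is not reproduced here. Purely as an illustration of the kind of structure such a report could take, the sketch below groups hypothetical reporting fields by the four surveyed aspects; all field names are assumptions derived from the dimensions discussed in this summary.

```python
# Hypothetical sketch only: the paper's actual reporting template is provided
# in its supplementary material. This structure merely illustrates how a study
# report could be organized along the four surveyed aspects; the field names
# are assumptions based on the dimensions discussed in this summary.
study_report = {
    "study_setup": {
        "protocol_and_data_collection": None,   # methodology, study protocol
        "participants": None,                   # number, expertise, demographics
        "training": None,                       # training provided before the tasks
        "tasks_and_data": None,                 # performed tasks and used data types
    },
    "model_and_explanations": {
        "model_properties": None,               # e.g., interpretability, transparency
        "explanation_properties": None,
        "motivated_vs_evaluated": None,         # which properties were actually evaluated
    },
    "interaction": {
        "pipeline_phases": None,                # data selection, preprocessing, (post-)training
        "model_manipulation": None,             # direct vs. indirect
        "guidance": None,                       # degree of guidance provided
    },
    "reported_results": {
        "main_findings": None,
        "interface_and_interaction_feedback": None,
        "limitations": None,                    # factors limiting generalization
    },
}
```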

Opportunities and Calls for Action

Our survey highlights several opportunities for future work in human-centered evaluation of human-centered machine learning:

Structured Evaluation Framework

  • Need to survey algorithm-centered evaluations
  • Derive guidelines for balanced evaluations

Shared Vocabulary

  • Ensure matching understanding of common XAI terms
  • Consider providing definitions of used terms in your next paper

Open Access & Reproducibility

  • Open access not only for systems and results, but also for evaluation frameworks if you create them

Based on our survey results, we call on the community to consider dedicated evaluation papers as one way to tackle the issue that current human-centered machine learning systems are often too complex to be evaluated holistically in a single paper:

Tailored Publications

Consider splitting papers; do not expect complex HCML systems to be evaluated within a single paper.

Realistic Reporting

Value candid, realistic reporting of study results, especially with respect to limitations.

Evaluated Aspects

Make clear which aspects were evaluated and which were not.

State Limitations

Be clear on limitations of study results, including factors limiting generalization.