This workstream aims to give European regulators and market authorities their own means to cross-evaluate AI systems, in particular general-purpose AI systems (GPAIS), independently of industry. It also seeks to provide developers with facilities to measure key properties of AI systems (accuracy, robustness, generality, interpretability, etc.) in rigorous, standardized experiments. To that end, it aims to bring together the metrology capabilities of EU member states to establish benchmarks relevant to the EU AI Act. Thanks to our efforts, the AI Act includes clauses to ensure the creation of a pool of metrology labs, coordinated at the EU level, to fulfil that purpose.

In February 2020, the European Commission published a White Paper on Artificial Intelligence. In response, TFS published a memo titled “Experimentation, testing & audit as a cornerstone for trust and excellence,” advocating for numerous changes, including the development of means to benchmark AI technologies in a coherent way, through experiments, tests and evaluations.

When the European Commission’s proposed AI Act was unveiled in April 2021, benchmarking was not included, though relevant functions were implied by requirements pertaining to AI systems’ levels of “accuracy” and “robustness.” To help clarify this gap, in August 2021, TFS published a memo titled “Trust in Excellence & Excellence in Trust,” which put forward recommendations for testing and experimentation facilities, such as the development of test beds and benchmarking protocols.

In late 2021, under the banner of The Athens Roundtable, TFS launched the Working Group on Interoperable Benchmarking of AI Systems, bringing together members from major standards organizations and technical communities—including U.S. NIST, CEN-CENELEC, IEEE, LNE, VDE, and Greece’s National Centre for Scientific Research—with the goal of producing interoperable benchmarks that assess the extent to which AI-enabled systems do, in fact, meet their intended real-world purpose.

Throughout 2021, 2022 and early 2023, TFS continued to build capacity and understanding among policymakers about the critical role of independent benchmarking in AI governance, notably through our work on standard-setting and on the NIST AI Risk Management Framework. We welcome the European Parliament’s decision of June 2023 to take up our recommendations regarding the establishment of an EU-level pool of benchmarking authorities, in particular for addressing general-purpose AI systems. We remain available to contribute independent expertise and opinions to the development of EU benchmarking capabilities that operate independently of industry.

Team members

Related resources