ICFP 2020 - Artifact Evaluation

Authors with a paper accepted to ICFP 2020 are invited to submit an artifact that supports the conclusions of the paper. The Artifact Evaluation Committee will read the paper, explore the artifact, and provide feedback on how easy it would be for future researchers to build on. The ultimate goal of artifact evaluation is to support future researchers in their ability to reproduce and build on today’s work.

If you have a paper accepted at ICFP 2020, please see the Call for Artifacts for instructions on submitting an artifact for evaluation.

Call for Artifacts

The Artifact Evaluation Committee (AEC) invites authors of accepted papers to submit an artifact that supports the conclusions of the paper. The committee will read the paper, explore the artifact, and provide feedback on how easy it would be for future researchers to build on. The ultimate goal of artifact evaluation is to support future researchers in their ability to reproduce and build on today’s work.

The submission of artifacts for review is voluntary and will not influence the final decision of whether the paper itself is accepted. Papers with successfully reviewed artifacts will receive a seal of approval printed on the first page of the paper in the ICFP proceedings. Authors of papers with successfully reviewed artifacts are encouraged to make the artifact publicly available upon publication of the proceedings, by including them as “source materials” in the ACM Digital Library.

If the artifact review is successful then it will be awarded the “Artifact evaluated: functional” badge. For information on what we consider to be “functional” please see the page on Expected Forms of Artifacts.

Types of Artifacts

An artifact that supports the paper’s conclusions can take many forms, including any or all of the following:

a working copy of the software and its dependencies, including benchmarks, examples and/or case studies
experimental data sets
a mechanized proof

Paper proofs are not accepted as artifacts for evaluation.

Selection Criteria

The artifact will be evaluated in relation to the expectations set by the paper, and should be:

consistent with the paper,
as complete as possible,
well documented,
future-proof, and
easy to reuse, facilitating further research.

The community benefits most when an artifact facilitates future research. For example, future researchers may build on an artifact by extending it to cover new situations or augmenting it with new components to solve a different class of problems. Other researchers may try an alternative approach to solving the same problem. This new work can benefit by comparing new results directly with the ones produced by the artifact, and by understanding the various tradeoffs and engineering decisions that were taken in the past.

We expect that most artifacts submitted for review at ICFP will have a few common forms: compilers, interpreters, proof scripts and so on. We have codified the expected forms of artifacts on a separate page. If you are considering submitting an artifact that does not have one of these forms, please contact the Artifact Evaluation chairs before the submission deadline to discuss what is expected.

Submission Process

The evaluation process is single blind. Reviewers will know the names of the authors, but not the reverse. However, the artifact evaluation process also encourages free, single-blind communication between reviewers and authors, so that small technical problems can be overcome. Authors may also iteratively improve their artifacts during the process to overcome small technical problems, but may not submit new material that would require substantially more reviewing effort.

At this stage we intend for most artifact submissions to include BOTH:

Software installed into a QEmu Debian base Virtual Machine (VM) image that will be provided by the committee. The committee will verify that the provided base image works on the reviewers’ machines before distribution, to avoid incompatibility problems between VM formats and VM players.
A separate source tarball that includes just the source files.

In most cases, artifacts should include BOTH the extended VM image AND a separate source tarball. The intention is that reviewers who are familiar with certain tools (e.g. Agda or OCaml) can inspect the artifact sources directly, while reviewers that are less familiar can still execute the artifact without needing to install new software on their own machines, except for QEmu. The VM image will be archived so that future researchers, say in 5 years time, do not need to worry about version incompatibilities between old tool versions and new operating systems.

The detailed submission process is as follows:

Read the Forms of Artifacts page for details on artifact preparation.
Register your intent to submit an artifact on the separate artifact only HotCRP site before the end of 15th May.
Download the current base VM image and prepare your artifact consisting of BOTH the extended VM image and source tarball.
Retrieve the SFTP server name and password that will be posted on the HotCRP page for your artifact soon after the artifact registration deadline.
Upload TWO files one called artifactXX-image-MD5HASH.tgz, and one called artifactXX-source-MD5HASH.tgz to the SFTP server. Replace XX by the artifact number assigned by HotCRP, and MD5HASH by the md5 hash of the corresponding file.
The reviewers will also use SFTP to retrieve prepared artifacts.
During the review process, if you need to upload a new version to fix minor technical problems, you can use the same SFTP server.

Links to the base VM image, and the artifact evaluation HotCRP site, will be distributed to authors with accepted papers (including conditionally accepted) around the time of acceptance notification. We’re using QEmu and a standard Debian 10 installation.

For questions about the overall review process or specific reviews, please contact Brent Yorgey (yorgey@hendrix.edu).

For questions about the VM image or SFTP server, please contact Ben Lippmeier (benl@ouroborus.net).

Timeline

It takes time to produce a good artifact; thus we allot two weeks between conditional paper acceptance and artifact submission. These are the key dates (all dates are in the Anywhere on Earth (AOE / UTC-12) timezone):

Event	Date
ICFP Conditional Acceptance	Fri 8 May
Registration date	Fri 15 May
Artifact submission	Fri 22 May
Review and technical clarification	Wed 3 June - Wed 17 June
Preliminary reviews available	Wed 17 June
Further clarification if needed	Wed 17 June - Tue 23 June
Final decision sent to authors	Tue 23 June

More Information

For additional information, clarification, or answers to questions, please contact the ICFP Artifact Evaluation co-chairs:

Brent Yorgey (byorgey@gmail.com)
Ben Lippmeier (benl@ouroborus.net)

Most artifacts that are submitted for review at ICFP have one of a few common forms, and we have codified what we expect from each of these common forms. We also have advice to authors and reviewers about how to prepare and review them. This material should be taken as highly suggestive, but not prescriptive. If authors or reviewers have questions about what is expected, please contact the AEC co-chairs as early as possible. If you are an author and your artifact does not fit into one of the obvious categories then please contact the AEC co-chairs well before the submission deadline.

We also describe some common problems to avoid. This advice has been distilled from past experience at a variety of events, and does not describe specific papers, artifacts or authors.

Selection Criteria

As stated on the main page, the artifact should be:

consistent with the paper
as complete as possible
well documented
future-proof, and
easy to reuse, facilitating future research.

Consistent with the paper

The artifact should directly implement or support the technical content of the paper. If the paper describes a particular algorithm that does something in a certain way, then the artifact should also do it that same way. It is fine for the artifact to implement an extended version of what is in the paper, but the examples discussed in the paper should work with the artifact with minimal changes.

Complete as possible

All examples, benchmarking results and graphs described in the paper should be reproducible with the artifact. If the paper describes an interpreter that evaluates some example expressions, and the artifact includes an implementation of that interpreter, then all the examples should work with the interpreter. If the paper describes a program to compute numerical results and presents graphs of those results, then the artifact should be able to reproduce all of those graphs.

Common problem: A paper contains five graphs of benchmarking results, but the artifact only reproduces two of them. Reviewers are prone to reject such an artifact due to it being incomplete. In general all graphs should be reproducible by the artifact, else very clear reasons should be given why this is not possible. For example, if a benchmark run needs two weeks of compute time then reviewers are not expected to reproduce that during the review process, but if it only takes 30 minutes they will probably expect to.

Common problem: A paper describes a compiler or interpreter implementation, but its execution depends on commercially licensed tools (like MATLAB, or some commercial SMT solver). Reviewers are prone to reject such an artifact due to it being either incomplete or not easy to reuse. However, it is not unreasonable for authors to expect such tools to be available to the intended audience of the paper. One way to address this may be for the authors to provide a login environment for reviewers to use. A better way is to contact the AEC co-chairs before submission so that we can try to assign reviewers that already have the required tools.

Common problem: A paper describes the implementation of a particular algorithm and gives benchmark results, but the artifact includes only the results and not the implementation. This can happen if the implementation was built as proprietary software, or part of it cannot be released due to confidentiality issues. In this case please contact the AEC co-chairs to discuss whether the artifact is eligible for review. Perhaps some defined fragment of it can still be reviewed.

Well documented

The paper itself describes the technical content of the artifact, and all reviewers will read the paper. However, there must be sufficient documentation to be able to build, test and execute the implementation, as well as to debug minor problems. At an absolute minimum, for a compiler project there needs to be clear instructions about how to build it, execute the test suite, and run it on the examples provided by the paper. Reviewers will also expect to have enough documentation to adapt those examples and try out some of their own.

Common problem: A paper describes an new language and the artifact has an interpreter for that language, but the command line interpreter program does not have basic “–help” style functionality for the reviewers to work out what the flags mean. If the paper shows some form of intermediate representation (IR) produced by the compiler, then this sets the expectation that reviewers could inspect the IR version for some of their own examples. The flags to do this need to be documented.

Future proof

Artifacts should be reusable in 5-10 year time frames. For this reason they should not unnecessarily depend on specific operating system versions, kernel drivers, processor architectures and so on. If the paper describes a DSL for configuring a particular sort of FPGA device, then it is fine to require that particular device – and perhaps provide the reviewers with a simulator. However, artifacts should not depend on quirks of particular operating systems, such as requiring specific system linker or driver versions when this is not intrinsically necessary.

Common problem: An artifact contains source code written in a particular language, but it only builds with an old compiler version. This old compiler version does not work on the latest version of the popular operating system used by the reviewer. Most of these problems will be mitigated by basing the artifact around the standard VM image. However, if an author asks a reviewer to downgrade to an older version of their installed tools then the artifact will probably fail the “future proof” criteria.

Easy to reuse

Future researchers will have a limited attention span for debugging problems with archived artifacts. If an artifact requires excessive configuration or hand-holding to execute, then it is unlikely a future researcher will put time into comparing their new work with the old artifact.

Common problem: An artifact needs to perform numerical computation for 8 hours before producing a result, but it is not possible to pause and resume the computation. A reviewer may not be able to leave such computations running on their personal machines for extended periods, as they may be travelling or need to do other work. It is particularly irksome to both reviewers and authors if the artifact crashes when run on new examples, or if the host machine itself is unstable. Such artifacts should have clear checkpoints that allow intermediate results to be saved and resumed. When the artifact runs it should be clear to the reviewers how to resume computation from a particular checkpoint – such as by printing resumption commands to the console at regular intervals. Such resumption commands should not be swamped by debugging output from the tool.

Common problem: An artifact includes system software that only works with particular operating system driver versions, such as custom Linux networking drivers. Although it may be possible to install new driver versions, reviewers are unlikely to want to do so for risk of destabilizing their own machines. Networking code can also be difficult to review if the ability to reproduce the results depends on particular network bandwidths or transmission latencies. In such cases it may be appropriate to supply a simulator, so that the overall algorithm can be tested without needing a particular physical network configuration.

Advice per Artifact

All artifacts

All artifacts must contain a top-level Readme.md file that gives the name of the paper, the author’s names, and step-by-step instructions about how to execute the artifact.

In most cases the step-by-step instructions should be a list of commands to execute to build and test the artifact on the examples described in the paper, and to reproduce all the graphs and benchmarking results. The instructions should call out particular features of the results such as “this produces the graph in Fig 5 that shows our algorithm runs in linear time”. Try to keep the instructions clear enough that reviews can work through them in under 30 minutes.

If the build process emits warning messages, perhaps when building libraries that are not under the author’s control, then include a note in the instructions that this is the case. Without a note the reviewers may assume something is wrong with the artifact itself.

Separately from the step-by-step instructions, provide other details about what a reviewer should look at. For example, “our artifact extends existing system X and our extension is the code located in file Y”.

Try to avoid requiring graphical environments (X Windows) to be installed into the VM unless truly necessary. Graphical environments in VMs are sometimes slow and unstable. If possible, keep graphics rendering such as web browsing on the host.

Consider providing a top-level Makefile so that the commands to be executed are just make targets that automatically build their prerequisites.

Command-line tools

Unix command-line tools should have standard --help style command-line help pages. It is not acceptable for an executable to throw uninformative exceptions when executed with no flags, or with the wrong flags.

Compilers and interpreters

It should be obvious how to run the tool on new examples that the reviewers write themselves. Do not just hard-code the examples described in the paper.

If your tool consumes expressions in a custom DSL then we recommend supplying a grammar for the concrete syntax, so that reviews can try the tool on new examples. Papers that describe such languages often give just an abstract syntax, and it is often not clear what the full concrete syntax is from the paper alone.

Proof scripts

In most cases, the artifact VM should contain an installation of the proof checker and specify a single command (preferably “make”) to re-check the proof. It is fine to leave the VM itself command-line only, and require reviewers to browse the proof script locally on their own machines. It should not be necessary to have CoqIDE or Emacs/ProofGeneral installed into the VM, unless the paper is particularly about IDE functionality.

Include comments in the proof scripts that highlight the main theorems described in the paper. Use comments like “This is Theorem 1.2: Soundness described on page 6 of the paper”. Proof scripts written in “apply style” are typically unreadable without loading them into an IDE, but reviewers will still want to find the main lemmas and understand how they relate.

Reviewers almost always complain about lack of comments in proof scripts. To authors, the logical statements of the lemmas themselves are likely quite readable, but reviewers typically want English prose that repeats the same information.

Before submission, scan through the script and erase TODO and FIXME style comments. Reviewers will expect proved statements to be true, so there should be not be TODOs in submitted proofs.

Programming environments presented via web interfaces

Try to get the server running locally inside the VM, and allow the reviewer to connect to it via a web browser running natively on their host machine. Graphical environments installed into VMs are sometimes laggy and unstable, and standard web protcols are stable enough that such artifacts should be usable with new browsers.

Programs that generate images

If the artifact produces a .bmp or .png file then expect the reviewer to use “scp” or some such to copy it out to the host machine and view it. Authors should test that the connection to the VM works, so that this is possible.

Long running artifacts

If the artifact needs to run for more than 10 minutes then this must be highlighted in the instructions, and there should be a way to stop and resume the computation.

Artifacts that run on GPUs

If the artifact needs standard GPU hardware then the authors must specify this very clearly when the artifact is submitted. It should not be a problem to find reviewers with standard GPU hardware, but this needs to be called out so that the AEC co-chairs can assign reviewers that do have it.

Artifacts that need many resources

If the artifact needs CPU, disk or memory resources that are larger than are found on a typical laptop then please contact the AEC co-chairs before submission. At the time of writing, if the artifact runs with < 8GB RAM and < 16GB disk space then this should not be a problem.

Advice for Reviewers

Expect to spend about a day reviewing each artifact. Budget about 4 hours for reading the paper, and 4 hours for experimenting with the artifact itself. You should be able to get the basic artifact functionality to work in about 1 hour, spend 2 hours inspecting the implementation, and 1 hour writing up your report.

If you find yourself debugging problems for more then 1 hour then set the artifact to a preliminary ‘Reject’ and discuss what to do about it with the other committee members.

If an artifact runs for more than 10 minutes and crashes or fails 3 times in a row, then set the review to a preliminary ‘Reject’ and discuss what to do about it with the other committee members. It may be that other reviewers have had more success in slightly different environments.

Artifact EvaluationICFP 2020

Call for Artifacts

Forms of Artifacts

Brent YorgeyCo-chair

Hendrix College

United States

Ben LippmeierCo-chair

Ghost Locomotion

Australia

Justus Adam

University of Kent, UK

United Kingdom

Alejandro Aguirre

IMDEA Software Institute and T.U. of Madrid (UPM)

Michael Buch

Harvard University

Zilin Chen

Data61, CSIRO and UNSW

Australia

Mistral Contrastin

University of Cambridge, UK

United Kingdom

Aymeric Fromherz

Carnegie Mellon University

April Gonçalves

Roskilde University, Denmark

Zheng Guo

University of California, San Diego

United States

Matthias Güdemann

University of Applied Sciences Munich

Germany

Jonas Kastberg Hinrichsen

IT University of Copenhagen

Denmark

Jason Z.S. Hu

McGill University

Michael B. James

University of California, San Diego

United States

Wen Kokke

University of Edinburgh

United Kingdom

Lukas Lazarek

Northwestern University

Georgy Lukyanov

Newcastle University, UK

United Kingdom

Tran Ma

Ghost Locomotion

Guido Martínez

CIFASIS-CONICET, Argentina

Argentina

Darius Mercadier

Sorbonne Universités —UPMC Univ Paris 06

Denis Merigoux

INRIA

France

Agustín Mista

Chalmers University of Technology

Sweden

Philip Munksgaard

DIKU, University of Copenhagen

Denmark

Yuki Nishida

Kyoto University

Arjen Rouvoet

Delft University of Technology

Netherlands

Alex Sanchez-Stern

University of California, San Diego

United States

John Sarracino

University of California, San Diego

Taro Sekiyama

National Institute of Informatics

Japan

Mallku Soldevila

FAMAF, UNC / CONICET

Argentina