An Empirical Study of Container Image Configurations and Their Impact on Start Times (Container Image Data)
Creators
- 1. University of Würzburg
- 2. University of Chicago
Description
Dataset with the container image metadata used for our IEEE/ACM CCGRID 2023 paper "An Empirical Study of Container Image Configurations and Their Impact on Start Times".
Abstract of the paper: A core selling point of application containers is their fast start times compared to other virtualization approaches like virtual machines. Predictable and fast container start times are crucial for improving and guaranteeing the performance of containerized cloud, serverless, and edge applications. While previous work has investigated container starts, there remains a lack of understanding of how start times may vary across container configurations. We address this shortcoming by presenting and analyzing a dataset of approximately 200,000 open-source Docker Hub images featuring different image configurations (e.g., image size and exposed ports). Leveraging this dataset, we investigate the start times of containers in two environments and identify the most influential features. Our experiments show that container start times can vary between hundreds of milliseconds and tens of seconds in the same environment. Moreover, we conclude that no single dominant configuration feature determines a container's start time and that hardware and software parameters must be considered together for an accurate assessment.
Dataset description: Our image dataset contains 200,986 entries with 21 features associated with each container image. In the following, we describe the meaning of each feature. Further information is available in the OCI Image Specification and the Docker Run Documentation. Besides the 20 features grouped into the five categories below, each dataset entry has an image_id, which uniquely identifies the entry.
Features
Metadata features (prefix: meta)
- meta_repo_digest : The repo digest is a SHA-256 hash which is used to uniquely identify and pull the image from Docker Hub
- meta_architecture : The CPU architecture which the binaries in the image are built to run on
- meta_os : The name of the operating system which the image is built to run on
- meta_docker_version : The Docker version used to build this image
I/O stream features (prefix: io)
- io_attach_stdin : boolean setting to determine whether the console should be attached to the process stdin stream
- io_attach_stdout : boolean setting to determine whether the console should be attached to the process stdout stream
- io_attach_stderr : boolean setting to determine whether the console should be attached to the process stderr stream
- io_tty : boolean setting to determine whether the console should pretend to be a TTY when attached
- io_open_std_in : boolean setting to determine whether the process stdin stream should be kept open even if the console is not attached
- io_std_in_once : boolean setting to determine whether the process retrieved input from the stdin stream at least once
Start command features (prefix: cmd)
- cmd_args : Length of list of arguments to use as the command to execute when the container starts
- cmd_envvars : Environment variables set by default when the container starts
- cmd_additional_args : Length of list for additional arguments to the container's entrypoint
File system features (prefix: fs)
- fs_volumes : Number of volumes to create/use by default
- fs_size : Size of this image in bytes
- fs_virtual_size : Virtual size of this image in bytes (equals size)
- fs_graph_driver_name : Name of the image's graph driver
- fs_root_fs_type : Name of the file system type used in the image
- fs_layers : Number of root file system layers
Networking features (prefix: net)
- net_ports : Number of ports to expose by default
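The feature list above can be turned into a small schema check. The following is a minimal sketch, not part of the published scripts: it assumes that the column names in the data file match the feature names listed here exactly (the file format and header are not stated on this page), and the helper names `group_by_prefix` and `missing_columns` are hypothetical.

```python
from collections import defaultdict

# Feature names copied from the list above. Assumption: the data file
# uses exactly these names as column headers.
FEATURES = [
    "meta_repo_digest", "meta_architecture", "meta_os", "meta_docker_version",
    "io_attach_stdin", "io_attach_stdout", "io_attach_stderr", "io_tty",
    "io_open_std_in", "io_std_in_once",
    "cmd_args", "cmd_envvars", "cmd_additional_args",
    "fs_volumes", "fs_size", "fs_virtual_size", "fs_graph_driver_name",
    "fs_root_fs_type", "fs_layers",
    "net_ports",
]


def group_by_prefix(columns):
    """Group feature names by the prefix before the first underscore."""
    groups = defaultdict(list)
    for name in columns:
        prefix = name.split("_", 1)[0]
        groups[prefix].append(name)
    return dict(groups)


def missing_columns(header):
    """Return the expected columns (image_id plus features) absent from header."""
    expected = ["image_id"] + FEATURES
    return [column for column in expected if column not in header]


groups = group_by_prefix(FEATURES)
for prefix, names in groups.items():
    print(prefix, len(names))
```

Calling `missing_columns` on the header row of the downloaded file gives a quick indication of whether the file matches the description on this page.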
Dataset acquisition: The dataset was acquired from Docker Hub using a web crawler. We used substring matches with the Docker Hub Explore function. As search strings, we used all letter combinations of length 1 to 3, meaning that our first search string was 'a' and our last was 'zzz'. We included results from both the 'recently updated' and the 'most popular' selections. This yielded an initial list of 286,294 image names. We then tested whether we could pull and start each of these images once. These tests were conducted from April to June 2022. We removed all images that could not be pulled or could not be started, which left a total of 200,986 valid images. In the following, we describe the error types that we encountered and that led to the removal of the affected image from the dataset:
- The image manifest was unknown when we tried to download it, meaning that the image had been renamed or deleted since our web crawler ran
- The entrypoint command required a dependency that was missing in the image and therefore the container could not be started
- The image did not specify an entrypoint command and could therefore not be started
- The image declared an invalid root file system type
- The image had a malformed root file system
- The image configuration was incomplete and therefore not all required data could be obtained
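The search strings used for the crawl can be enumerated as follows. This is a minimal sketch under stated assumptions, not the crawler itself: it assumes the lowercase Latin letters a to z and an ordering by length and then alphabetically, which is consistent with 'a' being the first and 'zzz' the last string. The function name `search_strings` is hypothetical.

```python
from itertools import product
from string import ascii_lowercase


def search_strings(max_len=3):
    """Yield all lowercase letter combinations of length 1 to max_len."""
    for length in range(1, max_len + 1):
        for letters in product(ascii_lowercase, repeat=length):
            yield "".join(letters)


queries = list(search_strings())
print(queries[0], queries[-1], len(queries))
```

Under these assumptions the crawl uses 26 + 26^2 + 26^3 = 18,278 distinct search strings.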
See also our CodeOcean capsule with the processing scripts for our paper: https://doi.org/10.24433/CO.4595026.v2
Files
Checksum | Size |
---|---|
md5:f01e3901ef0744586ecf8ad55bf0a294 | 13.0 MB |
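The MD5 checksum listed for the file can be used to verify a download. The following is a minimal sketch using Python's standard `hashlib` module; the file name is not shown on this page, so the path passed to the functions is up to the reader, and the helper names `md5_of_file` and `matches_published_checksum` are hypothetical.

```python
import hashlib

# Checksum as published on this page.
EXPECTED_MD5 = "f01e3901ef0744586ecf8ad55bf0a294"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path):
    """Return True if the file at path has the published MD5 checksum."""
    return md5_of_file(path) == EXPECTED_MD5
```

Reading in chunks keeps memory use constant, although at 13.0 MB the file would also fit in memory in one read.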
Additional details
References
- Straesser, Martin et al. (2023) An Empirical Study of Container Image Configurations and Their Impact on Start Times. In Proceedings of the 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing.
- Straesser, Martin et al. (2023) An Extensive Analysis of Container Image Configurations and Their Impact on Start Times (Supplementary Materials). CodeOcean Capsule. Available online: https://doi.org/10.24433/CO.4595026.v2