Description
Opened on Mar 13, 2023
Story
Problem Statement
Handling large files in k6, whether binary or structured formats such as CSV, leads to high memory usage. As a result, our users' experience, especially in the cloud, degrades as soon as they need to handle large data sets, such as lists of user IDs.
This issue affects our users in various situations and stems from many design and implementation decisions in the current state of the k6 open-source tool.
Objectives
Product-oriented
The product-oriented objective of this story, and its definition of success, is to land a new streaming CSV parser in k6, allowing users to parse and use large CSV files (> 500 MB) that would not fit in memory and would likely cause their k6 scripts to crash. Along the way, we are keen to land any technical improvements to k6 that make this possible.
Technology-oriented
From a technological standpoint, the objective of this story is to make all the design and technical changes necessary to complete the product story. Our primary objective is to let users work with large data set files in their k6 scripts without running into out-of-memory errors. As we pursue this goal, we aim to pick the solutions with the least overhead, and we are willing to take on technical debt where necessary.
Resolution
Through internal workshops with @sniku and @mstoykov, we surfaced various topics and issues that must be addressed to fulfill the objective.
- Add a streaming-based CSV parser to k6 #2976: our product-level end goal
Must-have
The bare minimum items needed even to start tackling the top-level product objective are:
- Support finer-grained and richer access to tar archives' content #2975: we currently end up caching files in memory because we cannot read them directly from a tar archive without decompressing it first.
- Reduce the caching of files inside k6: as a consequence of 👆, k6 caches the files users open in memory and duplicates them per VU, and this behavior is inconsistent. If a more convenient tar library allowing direct access to files within archives becomes available, we may want to revisit this behavior.
Nice to have
While we're at it, another set of features and refactors would benefit the larger story of handling large files in k6:
- Design a File API for k6 #2977: currently, the `open()` method of k6 is somewhat misnamed, as it actually performs a `readFile()` operation. This is also a consequence of k6 archiving users' content in a single tar archive and having to access resources through it. With more efficient and flexible access to k6's tar archive content, we believe k6 would also benefit from a more "standard" file API to open, read, and seek through files conveniently. This would support streaming use cases by allowing more flexible navigation within a file's content, while also benefiting from OS-level optimizations such as the buffer cache.
- Add Streams API support to k6 #2978: another key aspect of handling files more efficiently in k6 is how we access them. As illustrated 👆, we currently can only load the whole content of a file into memory. To support the specific product goal, as well as other endeavors such as #2273 and our work towards a new HTTP API, we believe adding even partial (read operations only) support for the Streams API to k6 would be beneficial. It would establish a healthy baseline API for streaming I/O in k6.
Problem Space
This issue approaches the problem at hand with a pragmatic, product-oriented objective. However, this specific set of issues has already been approached from various angles in the past and is connected to longer-term plans, as demonstrated in this list:
Status: Mid term