Performance comparison of the Vaex and Dask libraries

Abstract

The purpose of the study was to compare the performance of the Vaex and Dask libraries, both designed to improve data processing efficiency. To this end, experiments were conducted to measure the time taken by various classes of operations. The research included dataset preparation, data sampling, environment configuration, installation and setup of both libraries, Python script development, performance testing, and subsequent analysis of the results. Vaex exhibited high performance when processing large datasets comprising millions of objects on a single local machine, while Dask's performance metrics were inferior to those of Vaex. This indicates that Vaex is the more efficient tool for processing large datasets under conditions similar to those used in this study. The results and conclusions emphasize the importance of choosing an appropriate library when processing large volumes of data and confirm the advantages of Vaex in this context.
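The time measurements described above can be illustrated with a minimal sketch of a timing harness. This is not the authors' test script: the function names `time_operation` and `compare`, the repeat count, and the stand-in workloads are all assumptions made for illustration, and only the Python standard library is used. In a real comparison, each callable would wrap one Vaex or Dask operation; since Dask evaluates lazily, its callable would need to end with `.compute()` so that the work is actually performed inside the timed region.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_operation(operation: Callable[[], object], repeats: int = 5) -> List[float]:
    """Run `operation` `repeats` times; return wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        durations.append(time.perf_counter() - start)
    return durations


def compare(candidates: Dict[str, Callable[[], object]], repeats: int = 5) -> Dict[str, float]:
    """Return the median duration (seconds) for each named candidate."""
    return {
        name: statistics.median(time_operation(op, repeats))
        for name, op in candidates.items()
    }


if __name__ == "__main__":
    # Stand-in workloads (hypothetical); replace with library calls under test.
    data = list(range(100_000))
    results = compare({
        "builtin_sum": lambda: sum(data),
        "generator_sum": lambda: sum(x for x in data),
    })
    # Print candidates from fastest to slowest.
    for name, seconds in sorted(results.items(), key=lambda kv: kv[1]):
        print(f"{name}: {seconds:.4f} s")
```

Reporting the median over several repeats, rather than a single run, reduces the influence of one-off effects such as disk caching or background load on the comparison.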

About the authors

S. Palmov

Povolzhskiy State University of Telecommunications and Informatics; Samara State Technical University

Author for correspondence.
Email: s.palmov@psuti.ru

Associate Professor of Information Systems and Technologies Department, PhD in Technical Sciences, Associate Professor of Technologies Department

Russian Federation, Samara; Samara

N. Shatalov

Povolzhskiy State University of Telecommunications and Informatics

Email: nickit.schatalow@yandex.ru

Student of Information Systems and Technologies Department

Russian Federation, Samara

Copyright © Palmov S.V., Shatalov N.V., 2025

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.