# Pyarrow Parquet management

## Introduction
### What is Parquet?

- Parquet is a columnar storage file format that provides efficient data
  compression and encoding schemes with enhanced performance to handle complex
  data in bulk
- It is designed to support complex data structures and is ideal for big data
  processing
- Core features of Parquet:
  - Efficient storage and compression: Parquet uses efficient encoding and
    compression techniques to store data in a columnar format, reducing
    storage space and improving read performance
  - Support for various compression algorithms: Parquet supports compression
    algorithms such as Snappy, Gzip, and LZO
  - Support for complex data types: Parquet supports complex data types such
    as nested fields, arrays, and maps, making it suitable for handling
    complex data structures
  - Efficient encoding and decoding schemes: Parquet provides efficient
    encoding and decoding schemes for complex data types, improving read and
    write performance
### Pyarrow

- Pyarrow is a cross-language development platform for in-memory data,
  designed to support complex data structures and ideal for big data
  processing
- It provides efficient data interchange between Python and other languages,
  enabling seamless data exchange between different systems
- It supports various data types and complex data structures, making it
  suitable for handling complex data processing tasks
## Implementation details

- The `helpers.hparquet` module provides a set of helper functions to manage
  Parquet files using the `pyarrow` library
### Writing Parquet

- `to_parquet()` - Writes a Pandas DataFrame to a Parquet file
- `to_partitioned_parquet()` - Writes a Pandas DataFrame to partitioned
  Parquet files
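Below is a minimal sketch of the raw pyarrow calls that such helpers typically
wrap; the file and directory names (`data.parquet`, `tmp.pq_dataset`) are
illustrative, and the exact signatures of the wrappers live in
`helpers/hparquet.py`:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"year": [2023, 2024], "value": [1.0, 2.0]})
table = pa.Table.from_pandas(df)
# Single-file write, the operation `to_parquet()` builds on.
pq.write_table(table, "data.parquet")
# Partitioned write, the operation `to_partitioned_parquet()` builds on: one
# subdirectory per value of each partition column.
pq.write_to_dataset(table, root_path="tmp.pq_dataset", partition_cols=["year"])
```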
### Reading Parquet

- `from_parquet()` - Reads a Parquet file or partitioned Parquet files into a
  Pandas DataFrame
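A matching read sketch in raw pyarrow, under the same illustrative naming
assumptions as above:

```python
import pyarrow.parquet as pq

# Single-file read, as in `from_parquet()`.
df = pq.read_table("data.parquet").to_pandas()
# Partitioned read: the partition column `year` is reconstructed from the
# subdirectory names.
df2 = pq.ParquetDataset("tmp.pq_dataset").read().to_pandas()
```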
## Change log

### 2024-02-26: cmamp-1.14.0

- Upgraded pyarrow from 14.0.2 to 15.0.0
- Updated outcomes in the tests because the new version of pyarrow changed the
  size of some Parquet files
- Deleted `partition_filename` from the `to_partitioned_parquet()` function
  - Deleted `partition_filename_cb` in the `pq.write_to_dataset()` call
  - Deleted the `partition_filename` argument from all the calls of the
    `to_partitioned_parquet()` function
- In `list_and_merge_pq_files()`:
  - In the `pq.ParquetDataset()` call, changed `use_legacy_dataset=True` to
    `partitioning=None` (see the sketch below)
- Introduced `purify_parquet_file_names()` in `helpers/hunit_test.py`
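A minimal sketch of the `pq.ParquetDataset()` migration, assuming an
illustrative dataset path; the actual call site is in
`list_and_merge_pq_files()`:

```python
import pyarrow.parquet as pq

# pyarrow <= 14 (legacy API, deprecated and later removed):
#   dataset = pq.ParquetDataset("tmp.pq_dataset", use_legacy_dataset=True)
# pyarrow >= 15: disable partition discovery explicitly, so the partition
# columns are not inferred from the directory names and added to the table.
dataset = pq.ParquetDataset("tmp.pq_dataset", partitioning=None)
df = dataset.read().to_pandas()
```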
#### Rationale

- The upgrade to pyarrow 15.0.0 is necessary to keep the library up-to-date
  and benefit from the latest features and improvements
- Since `partition_filename_cb` is deprecated in the new version of pyarrow,
  it is necessary to remove it from the `to_partitioned_parquet()` function
- After discussion with the team, we decided to remove the
  `partition_filename` from the `to_partitioned_parquet()` function
- The consequence of this change is that the Parquet files are saved with
  default names like `<guid>-<number>.parquet`, e.g.,
  `f3b3e3e33e3e3e3e3e3e3e3e3e3e3e3e-0.parquet`
- In pyarrow 15.0.0, `use_legacy_dataset` is deprecated and `partitioning`
  should be used instead
- When we pass `partitioning=None` to `pq.ParquetDataset()`, partitioning is
  not used and the partitioned columns are not added to the dataset
- Some tests expect Parquet files named `data.parquet`; the
  `purify_parquet_file_names()` function changes the names of the Parquet
  files to `data.parquet` in the goldens (see the sketch below)
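A hypothetical sketch of the idea behind `purify_parquet_file_names()`; the
actual implementation in `helpers/hunit_test.py` may differ:

```python
import re


def purify_parquet_file_names(txt: str) -> str:
    # Replace auto-generated file names such as
    # `f3b3e3e33e3e3e3e3e3e3e3e3e3e3e3e-0.parquet` with the stable name
    # `data.parquet`, so the goldens do not depend on the random GUID.
    return re.sub(r"[0-9a-f]{32}-\d+\.parquet", "data.parquet", txt)
```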
### 2024-03-11: CmampTask7331 Remove ns vs us hacks related to Pyarrow 14.0.2

- Removed the time unit casting to `us` in `to_parquet()`
- Kept the time unit casting to `ns` in `from_parquet()`
#### Time unit conversion when writing to Parquet

- Context: Before the upgrade to pyarrow 15.0.0, casting the time unit to
  `us` was necessary to avoid the `pyarrow.lib.ArrowInvalid` exception
- Problem: In pyarrow 15.0.0, this exception is no longer raised, and the
  time unit is preserved correctly
- Insight: Casting the time unit to `us` in the `to_parquet()` function is no
  longer necessary and can be removed. Moreover, casting to `us` in
  `to_parquet()` does not make sense, since the time unit is converted back
  to `ns` in the `from_parquet()` function
- Solution: Remove the casting of the time unit to `us` in the `to_parquet()`
  function (see the sketch below)
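A minimal round-trip sketch of why the cast is no longer needed with pyarrow
15.0.0; the file name is illustrative:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"ts": pd.to_datetime(["2024-01-01 00:00:00.123456789"])})
# Nanosecond timestamps round-trip as-is; no explicit cast to `us` is needed
# before writing.
pq.write_table(pa.Table.from_pandas(df), "ts.parquet")
out = pq.read_table("ts.parquet").to_pandas()
assert out["ts"].dtype == "datetime64[ns]"
```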
#### Time unit conversion when reading from Parquet

- Context: The pyarrow versions prior to 15.0.0 did not correctly preserve
  the time unit information when reading data back from Parquet files. That
  is why casting the time unit to `ns` was necessary in the `from_parquet()`
  function
- Problem: Since the upgrade to pyarrow 15.0.0, casting the time unit to `ns`
  is no longer necessary, as the new version of pyarrow correctly preserves
  the time unit. See the Pyarrow issue for details:
  https://github.com/apache/arrow/issues/33321
  - When reading Parquet files with a time unit that is not in
    `['us', 'ns']`, the `pyarrow.lib.ArrowInvalid` exception can be raised.
    This can occur when Pyarrow attempts to cast the time unit to a lower
    resolution. This behavior is tested in the
    `test_parquet_files_with_mixed_time_units_2` test. In this case, the
    alphabetical order of the files is important: the data from the first
    file is cast to the time unit of the rest of the files
- Insight: The general approach is to preserve the time unit information
  after reading data back from Parquet files. Currently, resolving this issue
  is challenging because Parquet data is mixed with data from CSV files,
  which convert the time unit to `ns` by default. Refer to CmampTask7331 for
  details: https://github.com/cryptokaizen/cmamp/issues/7331
- Solution: Retain the casting of the time unit to `ns` in the
  `from_parquet()` function (see the sketch below)
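A sketch of the `ns` normalization retained in `from_parquet()`, under the
assumption that it is applied per datetime column after reading; the actual
implementation may differ:

```python
import pandas as pd
import pyarrow.parquet as pq

df = pq.read_table("ts.parquet").to_pandas()
# Normalize every tz-naive datetime column to `ns` so Parquet data matches
# data coming from CSV files, which pandas parses as `datetime64[ns]` by
# default.
for col in df.columns:
    if pd.api.types.is_datetime64_dtype(df[col]):
        df[col] = df[col].astype("datetime64[ns]")
```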