GH-45937: [C++][Parquet] support to encode, write and validate variant #50252

Draft
HuaHuaY wants to merge 3 commits into apache:main from HuaHuaY:variant

HuaHuaY (Contributor) commented Jun 25, 2026

Rationale for this change

This PR supports:

  • Mapping between arrow extension variant type and Parquet variant type
  • Encoding unshredded variant arrays
  • Assembling and verifying shredded variant arrays given a typed_value array

This PR does not support:

  • Inferring typed_value shredded data from the value array
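
For readers unfamiliar with the unshredded encoding listed above, here is a minimal sketch of how a single variant value might be laid out. It follows the Parquet Variant encoding spec as I understand it, not this PR's code: every name below (`MakeValueHeader`, `EncodeInt32`, `EncodeShortString`, the constants) is hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical constants, not the PR's API. Values follow the Variant spec.
constexpr uint8_t kBasicTypePrimitive = 0;
constexpr uint8_t kBasicTypeShortString = 1;
constexpr uint8_t kPrimitiveInt32 = 5;  // primitive type id of a 32-bit integer

// The first byte of a value packs the basic type into the low 2 bits and a
// type-specific header into the high 6 bits.
inline uint8_t MakeValueHeader(uint8_t basic_type, uint8_t header) {
  assert(basic_type < 4 && header < 64);
  return static_cast<uint8_t>((header << 2) | basic_type);
}

// int32 primitive: one header byte followed by 4 little-endian bytes.
inline std::vector<uint8_t> EncodeInt32(int32_t v) {
  std::vector<uint8_t> out{MakeValueHeader(kBasicTypePrimitive, kPrimitiveInt32)};
  const auto u = static_cast<uint32_t>(v);
  for (int i = 0; i < 4; ++i) {
    out.push_back(static_cast<uint8_t>((u >> (8 * i)) & 0xFF));
  }
  return out;
}

// Short string: the header holds the byte length (< 64), then the raw bytes.
inline std::vector<uint8_t> EncodeShortString(std::string_view s) {
  assert(s.size() < 64);
  std::vector<uint8_t> out{
      MakeValueHeader(kBasicTypeShortString, static_cast<uint8_t>(s.size()))};
  out.insert(out.end(), s.begin(), s.end());
  return out;
}
```

Under these assumptions, `EncodeInt32(1)` yields the bytes `14 01 00 00 00` (hex) and `EncodeShortString("hi")` yields `09 68 69`. Objects, arrays and the metadata dictionary are left out of the sketch.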

What changes are included in this PR?

Are these changes tested?

Yes.

Are there any user-facing changes?

  1. There is a new property, variant_validation_enabled_, in ArrowWriterProperties.
  2. There are new variant builders under cpp/src/arrow/extension/variant/.

::arrow::internal::Executor* executor_;

bool write_time_adjusted_to_utc_;
bool variant_validation_enabled_;

Member

Why does this belong to ArrowWriterProperties rather than WriterProperties? If users use the low-level Parquet writer without the Arrow API, can serialized variant values no longer be validated?

PREFIX
"arrow-canonical-extensions")

add_subdirectory(variant)

Member

Should these files be relocated to cpp/src/parquet/arrow/variant instead (just like what we did for geospatial types)? I think cpp/src/arrow/extension is for the metadata of extension types and arrays. If ARROW_PARQUET is OFF, we don't even need these files to be compiled.

return field->type()->storage_id() == Type::BINARY ||
field->type()->storage_id() == Type::LARGE_BINARY;

bool IsSupportedPrimitiveTypedValue(const std::shared_ptr<DataType>& type) {

Member

It seems that some Arrow primitive types are missing from here: https://arrow.apache.org/docs/format/CanonicalExtensions.html#primitive-type-mappings

namespace parquet::arrow::internal {

PARQUET_EXPORT
::arrow::Status ValidateVariants(const ::arrow::ChunkedArray& data,

Member

Why make them internal? They seem useful if users want to validate values on their side.
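
To illustrate the kind of check a user might want to run on their side, here is a minimal sketch that validates one short-string value against the Variant encoding spec. `IsWellFormedShortString` is a hypothetical helper, not a function from this PR, and it only checks the header and length, not UTF-8 validity.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of the PR: returns true if `value` holds
// exactly one short-string value (header byte plus the declared byte count).
inline bool IsWellFormedShortString(const std::vector<uint8_t>& value) {
  if (value.empty()) return false;
  const uint8_t first = value[0];
  // The low 2 bits carry the basic type; 1 means "short string".
  if ((first & 0x03) != 1) return false;
  // The high 6 bits carry the string length in bytes.
  const std::size_t length = static_cast<std::size_t>(first >> 2);
  return value.size() == 1 + length;
}
```

A full validator would also have to cover primitives, objects, arrays and the metadata dictionary. The sketch only shows that such checks do not depend on the Arrow writer path.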

#include "arrow/status.h"
#include "arrow/util/visibility.h"

namespace arrow::extension {

Member

It seems that none of the types defined in this file are related to Arrow, so perhaps we can move them to the cpp/src/parquet/variant folder and use namespace parquet or namespace parquet::variant?

namespace arrow::extension {

enum class VariantBasicType : uint8_t {
Primitive = 0,

Member

Why not add a k prefix, like kPrimitive?
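
A sketch of what the suggested naming could look like. Only the first enumerator is visible in the diff; the other three values are assumptions taken from the Variant encoding spec, and `BasicTypeOf` and `ToString` are hypothetical helpers added for illustration.

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical renaming of the enum under review, using the k prefix.
enum class VariantBasicType : uint8_t {
  kPrimitive = 0,
  kShortString = 1,  // assumed from the spec, not shown in the diff
  kObject = 2,       // assumed from the spec, not shown in the diff
  kArray = 3,        // assumed from the spec, not shown in the diff
};

// The basic type occupies the low 2 bits of a value's first byte.
constexpr VariantBasicType BasicTypeOf(uint8_t first_byte) {
  return static_cast<VariantBasicType>(first_byte & 0x03);
}

constexpr std::string_view ToString(VariantBasicType type) {
  switch (type) {
    case VariantBasicType::kPrimitive:
      return "primitive";
    case VariantBasicType::kShortString:
      return "short_string";
    case VariantBasicType::kObject:
      return "object";
    case VariantBasicType::kArray:
      return "array";
  }
  return "unknown";
}
```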

github-actions (bot) added the "awaiting committer review" label and removed the "awaiting review" label on Jun 26, 2026