-
Notifications
You must be signed in to change notification settings - Fork 275
Rewrite manifests #1661
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Rewrite manifests #1661
Conversation
2. write tests for rewrite manifests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the many comments, most of them are easy to resolve. Thanks @amitgilad3 for working on this, I like it how you reuse the _ManifestMergeManager
👍
@@ -861,6 +865,7 @@ def existing(self, entry: ManifestEntry) -> ManifestWriter: | |||
entry.snapshot_id, entry.sequence_number, entry.file_sequence_number, entry.data_file | |||
) | |||
) | |||
self._existing_files += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we already do this here:
iceberg-python/pyiceberg/manifest.py
Line 823 in 7648803
self._existing_files += 1 |
rewritten_manifests: List[ManifestFile] | ||
added_manifests: List[ManifestFile] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about slapping on some default factories: https://docs.python.org/3/library/dataclasses.html#dataclasses.field
rewritten_manifests: List[ManifestFile] | |
added_manifests: List[ManifestFile] | |
rewritten_manifests: List[ManifestFile] = field(default_factory=list) | |
added_manifests: List[ManifestFile] = field(default_factory=list) |
@@ -477,6 +478,20 @@ def _deleted_entries(self) -> List[ManifestEntry]: | |||
return [] | |||
|
|||
|
|||
@dataclass(init=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of making this Frozen
? This gives some nice benefits like being hashable: https://docs.python.org/3/library/dataclasses.html#frozen-instances
Using the default_factory'
s, we can also drop the init.
@dataclass(init=False) | |
@dataclass(frozen=True) |
self._table = table | ||
self._spec_id = spec_id or table.spec().spec_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also add these at the class level? Just below 546.
_table: Table | ||
_spec: PartitionSpec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think these are used.
def _deleted_entries(self) -> List[ManifestEntry]: | ||
"""To determine if we need to record any deleted manifest entries. | ||
|
||
In case of an append, nothing is deleted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copy paste :)
self.rewritten_manifests.extend(deletes_result.rewritten_manifests) | ||
self.added_manifests.extend(deletes_result.added_manifests) | ||
|
||
if not self.rewritten_manifests: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, just for clarity that it is a list:
if not self.rewritten_manifests: | |
if len(self.rewritten_manifests) == 0: |
min_count_to_merge=2, | ||
merge_enabled=True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in Java we set the properties from the table properties:
|
||
tbl.append(arrow_table_with_null) | ||
|
||
assert tbl.format_version == 2, f"Expected v2, got: v{tbl.format_version}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are we testing here? I would love to see if we could rewrite the V1 and V2 manifest into a V2 one.
assert len(result.rewritten_manifests) == 4, "Action should rewrite 4 manifests" | ||
assert len(result.added_manifests) == 2, "Action should add 2 manifests" | ||
|
||
new_manifests = tbl.current_snapshot().manifests(tbl.io) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing wrong with this, but tbl.inspect
might also be helpful to assert the manifests.
@amitgilad3 gentle ping, are you still interested in working on this? |
Hey @Fokko , yes i am very interested in finishing this, must of missed this (sorry) , will look at this later today :) |
This is an initial implementation of rewrite manifests, aiming to mimic the Java implementation as closely as possible. I’ve tried to follow the same structure and logic, but there are still some areas that might need refinement.
I’m looking for feedback and suggestions on:
• Whether the approach aligns well with the existing design.
• Any gaps or optimizations that could improve performance.
• How best to proceed with completing this feature.
Would love any insights or guidance on the next steps! Thanks in advance for the review! 🙌