Skip to content

Avoid global umask for setting file mode. #7547

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 6, 2025

Conversation

ryan-clancy
Copy link
Contributor

This PR updates the method for setting the permissions on cache_path after calling shutil.move. The call to shutil.move may not preserve permissions if the source and destination are on different filesystems. Reading and resetting umask can cause race conditions, so directly read what permissions were set for the temp_file instead.

This fixes #7536.

@@ -411,11 +412,14 @@ def temp_file_manager(mode="w+b"):
# GET file object
fsspec_get(url, temp_file, storage_options=storage_options, desc=download_desc, disable_tqdm=disable_tqdm)

# get the permissions of the temp file
temp_file_mode = stat.S_IMODE(os.stat(temp_file.name).st_mode)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't necessarily return the user's umask no ?

we need the user's umask since some people use datasets with shared disks and use group permissions (see #2157 #5799)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When looking more at the history of this code, I am not sure the umask block below is needed at all anymore and could just be deleted.

with FileLock(lock_path):
if resume_download:
incomplete_path = cache_path + ".incomplete"
@contextmanager
def _resumable_file_manager():
with open(incomplete_path, "a+b") as f:
yield f
temp_file_manager = _resumable_file_manager
if os.path.exists(incomplete_path):
resume_size = os.stat(incomplete_path).st_size
else:
resume_size = 0
else:
temp_file_manager = partial(tempfile.NamedTemporaryFile, dir=cache_dir, delete=False)
resume_size = 0
# Download to temporary file, then copy to cache dir once finished.
# Otherwise you get corrupt cache entries if the download gets interrupted.
with temp_file_manager() as temp_file:
logger.info(f"{url} not found in cache or force_download set to True, downloading to {temp_file.name}")
# GET file object
if scheme == "ftp":
ftp_get(url, temp_file)
elif scheme not in ("http", "https"):
fsspec_get(url, temp_file, storage_options=storage_options, desc=download_desc)
else:
http_get(
url,
temp_file,
proxies=proxies,
resume_size=resume_size,
headers=headers,
cookies=cookies,
max_retries=max_retries,
desc=download_desc,
)
logger.info(f"storing {url} in cache at {cache_path}")
shutil.move(temp_file.name, cache_path)
umask = os.umask(0o666)
os.umask(umask)
os.chmod(cache_path, 0o666 & ~umask)
is the code just after the umask was added and NamedTemporaryFile was used to create the files. NamedTemporaryFile always set 600 permissions and didn't respect umask so umask code was added to fix the permissions.

d536e37 removed the NamedTemporaryFile and moved to open(), which respects the umask.

We should just be able to remove this temp_file_mode + updated chmod lines and then remove:

umask = os.umask(0o666)
os.umask(umask)
os.chmod(cache_path, 0o666 & ~umask)

Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lhoestq - I updated the PR to remove this outdated code, seems to work as expected and umask is respected (see output below from looking at files produced from tests/test_file_utils.py).

If this looks good to you, can we get this merged for 3.5.2?

Default umask:

$ ls -la | grep a6ecb420baa10e46bf84ed0960dee4e9093fee1aa58dc0
-rw-r--r--@ 1 ryan  staff   39 May  5 13:14 a6ecb420baa10e46bf84ed0960dee4e9093fee1aa58dc0035db883c7050ae847
-rw-r--r--@ 1 ryan  staff   35 May  5 13:14 a6ecb420baa10e46bf84ed0960dee4e9093fee1aa58dc0035db883c7050ae847.json
-rw-r--r--@ 1 ryan  staff    0 May  5 13:14 a6ecb420baa10e46bf84ed0960dee4e9093fee1aa58dc0035db883c7050ae847.lock

Modified to umask 002:

$ ls -la | grep a6ecb420baa10e46bf84ed0960dee4e9093fee1aa58dc0
-rw-rw-r--@ 1 ryan  staff   39 May  5 13:15 a6ecb420baa10e46bf84ed0960dee4e9093fee1aa58dc0035db883c7050ae847
-rw-rw-r--@ 1 ryan  staff   35 May  5 13:15 a6ecb420baa10e46bf84ed0960dee4e9093fee1aa58dc0035db883c7050ae847.json
-rw-rw-r--@ 1 ryan  staff    0 May  5 13:15 a6ecb420baa10e46bf84ed0960dee4e9093fee1aa58dc0035db883c7050ae847.lock

@ryan-clancy ryan-clancy requested a review from lhoestq May 5, 2025 17:19
Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great !

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq lhoestq merged commit 67907b1 into huggingface:main May 6, 2025
8 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Errno 13] Permission denied: on .incomplete file
3 participants