BUG: neptune sync is not fault-tolerant #1224

Open
cemde opened this issue Feb 14, 2023 · 6 comments

cemde commented Feb 14, 2023

Describe the bug

When an error occurs for a single run during neptune sync, the script stops. It should instead skip that run and print the error.

Reproduction

  1. Write a Neptune log from inside a Docker container, such that the resulting files trigger permission errors outside the container.
  2. Try to sync from outside the Docker container.

The same failure occurs with other kinds of file corruption as well.

Expected behavior

When Neptune encounters a run it can't sync, it should skip it, continue with the next one, and at the end list all runs it couldn't sync.

Traceback

cornelius@pssr2:~/PCJax/logs$ neptune sync -p user/Project
Traceback (most recent call last):
  File "/users-2/cornelius/.conda/envs/pcjax/bin/neptune", line 8, in <module>
    sys.exit(main())
  File "/users-2/cornelius/.conda/envs/pcjax/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/users-2/cornelius/.conda/envs/pcjax/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/users-2/cornelius/.conda/envs/pcjax/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/users-2/cornelius/.conda/envs/pcjax/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/users-2/cornelius/.conda/envs/pcjax/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/users-2/cornelius/.conda/envs/pcjax/lib/python3.10/site-packages/neptune/new/cli/commands.py", line 173, in sync
    sync_runner.sync_all_containers(path, project_name)
  File "/users-2/cornelius/.conda/envs/pcjax/lib/python3.10/site-packages/neptune/new/cli/sync.py", line 242, in sync_all_containers
    self.sync_all_offline_containers(base_path, project_name)
  File "/users-2/cornelius/.conda/envs/pcjax/lib/python3.10/site-packages/neptune/new/cli/sync.py", line 220, in sync_all_offline_containers
    self.sync_offline_containers(base_path, project_name, offline_dirs)
  File "/users-2/cornelius/.conda/envs/pcjax/lib/python3.10/site-packages/neptune/new/cli/sync.py", line 213, in sync_offline_containers
    registered_containers = self.register_offline_containers(base_path, project, offline_dirs)
  File "/users-2/cornelius/.conda/envs/pcjax/lib/python3.10/site-packages/neptune/new/cli/sync.py", line 191, in register_offline_containers
    self._move_offline_container(
  File "/users-2/cornelius/.conda/envs/pcjax/lib/python3.10/site-packages/neptune/new/cli/sync.py", line 177, in _move_offline_container
    (base_path / OFFLINE_DIRECTORY / offline_dir).rename(
  File "/users-2/cornelius/.conda/envs/pcjax/lib/python3.10/pathlib.py", line 1234, in rename
    self._accessor.rename(self, target)
PermissionError: [Errno 13] Permission denied: '/users-2/cornelius/PCJax/logs/.neptune/offline/run__ba1c7901-881f-4af6-820e-014e0a698319' -> '/users-2/cornelius/PCJax/logs/.neptune/async/run__9877526b-8f3d-4c95-a813-a66b26e926cd/exec-0-offline'
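
The traceback ends in the `rename` call inside `_move_offline_container`. As a minimal sketch of the requested behavior, assuming a simplified `offline`/`async` directory layout, a guarded move could look like the following. `move_offline_dirs` and `demo` are hypothetical names used for illustration and are not part of neptune-client; the missing `run__b` directory stands in for any run whose move fails.

```python
import tempfile
from pathlib import Path


def move_offline_dirs(base_path, offline_dirs):
    """Move each offline directory to the async area, skipping ones that fail.

    Returns (moved, failed); failed holds (directory name, error) pairs.
    Hypothetical helper with a simplified layout, not neptune-client code.
    """
    moved, failed = [], []
    target_root = base_path / "async"
    target_root.mkdir(parents=True, exist_ok=True)
    for name in offline_dirs:
        source = base_path / "offline" / name
        try:
            source.rename(target_root / name)
        except OSError as error:
            # PermissionError and FileNotFoundError are both OSError subclasses.
            failed.append((name, error))
            continue
        moved.append(name)
    return moved, failed


def demo():
    """Create two offline directories, then try to move three."""
    with tempfile.TemporaryDirectory() as tmp:
        base_path = Path(tmp)
        for name in ("run__a", "run__c"):
            (base_path / "offline" / name).mkdir(parents=True)
        # "run__b" was never created, so its rename fails and is recorded.
        moved, failed = move_offline_dirs(
            base_path, ["run__a", "run__b", "run__c"]
        )
    return moved, [name for name, _ in failed]


if __name__ == "__main__":
    moved, failed = demo()
    print("Moved:", moved)
    print("Failed:", failed)
```

Catching `OSError` rather than every exception keeps the skip limited to file system problems such as the `PermissionError` above, while unrelated bugs still surface.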

Neptune Version

neptune-client            0.16.16                  pypi_0    pypi

Blaizzy self-assigned this Feb 15, 2023

Contributor

Blaizzy commented Feb 16, 2023

Hey @cemde

Could you try updating to the latest release and let me know if the issue persists?

Author

cemde commented Feb 16, 2023

@Blaizzy The issue still exists.

Contributor

Blaizzy commented Feb 16, 2023

That's odd.

Has it worked in the past?

Author

cemde commented Feb 16, 2023

I never noticed it before, but I had also never logged from inside a Docker container. The PermissionError itself is justified; it should just be caught properly and then logged. In pseudo-Python:

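import traceback

from tqdm import tqdm

objects2sync = [obj1, obj2, ...]  # placeholder: the runs to sync
failed_objs = []  # (id, short id, traceback) for each run that failed
for obj in tqdm(objects2sync):
    try:
        sync_object(obj)  # placeholder for the per-run sync call
    except Exception:
        # record the failure and continue with the next run
        failed_objs.append((obj._id, obj.short_id, traceback.format_exc()))
failed_ids = {obj_id for obj_id, _, _ in failed_objs}
print("Successful:", [obj for obj in objects2sync if obj._id not in failed_ids])
print("Failed:", failed_objs)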
Contributor

Blaizzy commented Feb 16, 2023

Let me check with the team and come back to you

Contributor

Blaizzy commented Feb 17, 2023

Hey @cemde

I've discussed it with the team, and we've decided to forward your issue to our product team as a feature request. They will take it from here and explore how to incorporate it into our future plans.

While I don't have an ETA for this feature, I do want to keep you in the loop.
You can stay up-to-date with our product roadmap by checking out our portal at https://portal.neptune.ai/tabs/15-planned.

Thanks for sharing your feedback! Really appreciate it :)
