[Summarization] Don't close on receiving summary ack for which there is no snapshot in storage #21761
Conversation
packages/test/test-end-to-end-tests/src/test/summarization/summaries.spec.ts
⯅ @fluid-example/bundle-size-tests: +469 Bytes
Baseline commit: 65e6a0b
* - If that snapshot is older than the one tracked by this client, ignore the ack because only the latest
*   snapshot is tracked.
* - If that snapshot is newer, attempt to fetch the latest snapshot and do one of the following:
*   - If the fetched snapshot is same or newer than the one for which ack was received, close this client. The
nit: I'd number these sub-bullets instead of having `-`, to make it more readable
Improved the indentation. It wasn't allowing it until I disabled it.
Bug / root cause
There can be scenarios, such as a DB rollback in storage, where the latest snapshot is lost but the corresponding summary ack is still present because ops are maintained by another service. Currently, if a summarizer receives an ack that is newer than the summary it is tracking, it fetches the latest snapshot and closes. So, in the above scenario, the summarizer would get the ack, attempt to fetch the snapshot, and close. When a new summarizer loads, it will do the same thing because the snapshot corresponding to the ack doesn't exist. The document will be stuck in this loop and reach 10k+ unsummarized ops, after which it won't be able to make any changes.
Fix
When the client gets a newer ack, it fetches the latest snapshot. If the fetched snapshot is older than the one for the ack, it ignores the ack and does not close. The idea is that the server sends a summary ack only after it has a snapshot for it, so a snapshot should exist; the only reason it wouldn't exist is that it was deleted later. So, the client assumes this is the case and moves on.
Let's say this is not the case: the snapshot was not deleted, but the server did not return it during the snapshot fetch for some reason. The next summary upload by this client will then fail because the "parent handle" (the last summary id) will be incorrect. The summarizer will close, and the next summarizer will most likely not run into this issue because this is a very rare scenario.
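The decision logic described above can be sketched roughly as follows. This is an illustrative sketch only; the function and parameter names (`handleSummaryAck`, `fetchLatestSnapshotInfo`, etc.) are hypothetical and do not match the actual Fluid Framework APIs.

```typescript
interface SnapshotInfo {
    referenceSequenceNumber: number;
}

// Hypothetical sketch of how a summarizer could react to a summary ack,
// per the fix described above: ignore acks whose snapshot is missing
// from storage instead of closing.
async function handleSummaryAck(
    ackRefSeq: number,
    trackedRefSeq: number,
    fetchLatestSnapshotInfo: () => Promise<SnapshotInfo>,
): Promise<"ignored" | "closed"> {
    // Acks no newer than the tracked summary carry no new information.
    if (ackRefSeq <= trackedRefSeq) {
        return "ignored";
    }
    const fetched = await fetchLatestSnapshotInfo();
    if (fetched.referenceSequenceNumber < ackRefSeq) {
        // The snapshot for this ack is missing (e.g. deleted after a DB
        // rollback). Ignore the ack rather than closing, so the document
        // doesn't get stuck in a close/reload loop.
        return "ignored";
    }
    // A snapshot at least as new as the ack exists; this client is behind,
    // so close and let a new client load from the latest snapshot.
    return "closed";
}
```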
The PR has the following changes:
- Container runtime now tracks the details of the last summary ack it has processed and uses this information as the "parent" summary's handle when uploading a summary. Currently, it gets this information from the `SummaryCollection`, but in the scenario described above, `SummaryCollection` has already updated the last ack to the one for which there is no snapshot, so using that would fail the next summary upload.
- If the fetched snapshot's reference seq# is lower than the reference seq# in the ack, simply return.
- Added a property `newerSnapshotPresent` to the `RefreshLatestSummaryAckFetch` telemetry event, which tells us whether a newer snapshot was present. Another telemetry event could be added when this case is hit, but this should be enough.
Testing
- Added a container runtime unit test that mocks the storage layer. The mock has only one snapshot and always returns it. The test calls `ContainerRuntime::refreshLatestSummaryAck` with a newer ref seq# and then calls `ContainerRuntime::submitSummary`, validating that both succeed and call storage with the right properties.
- Added an end-to-end test. In the test, there is only one snapshot (the initial one from the detached container). The test creates a summarizer and calls its `SummaryCollection` (which handles summary ops / acks) with a fake summary op and summary ack whose ref seq# is greater than the initial snapshot's. This simulates the scenario because, from the client's perspective, it got a summary ack for a snapshot that doesn't exist.
[AB#8334](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/8334)
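As a rough illustration of the mocked-storage unit test's setup: the real test uses Fluid Framework's container runtime and storage interfaces, so every name below (`MockStorage`, `MiniRuntime`, etc.) is a simplified, hypothetical stand-in.

```typescript
interface Snapshot {
    handle: string;
    referenceSequenceNumber: number;
}

// Mock storage that holds exactly one snapshot and always returns it,
// recording the parent handle of every summary upload.
class MockStorage {
    public uploads: string[] = [];
    constructor(private readonly snapshot: Snapshot) {}
    public getLatestSnapshot(): Snapshot {
        return this.snapshot;
    }
    public uploadSummary(parentHandle: string): void {
        this.uploads.push(parentHandle);
    }
}

// Minimal runtime that tracks the last *processed* ack (not merely the
// latest ack seen), as the PR describes.
class MiniRuntime {
    private lastProcessedAckHandle: string;
    constructor(private readonly storage: MockStorage) {
        this.lastProcessedAckHandle = storage.getLatestSnapshot().handle;
    }
    // Returns false (ack ignored) when storage has no snapshot for the ack.
    public refreshLatestSummaryAck(ackHandle: string, ackRefSeq: number): boolean {
        const latest = this.storage.getLatestSnapshot();
        if (latest.referenceSequenceNumber < ackRefSeq) {
            return false; // ignore; don't update the parent handle or close
        }
        this.lastProcessedAckHandle = ackHandle;
        return true;
    }
    public submitSummary(): void {
        // Uses the last processed ack's handle as the parent handle.
        this.storage.uploadSummary(this.lastProcessedAckHandle);
    }
}
```

In this shape, the test asserts that an ack with a ref seq# beyond the only stored snapshot is ignored, and that a subsequent summary upload still uses the existing snapshot's handle as its parent.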