Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Summarization] Don't close on receiving summary ack for which there is no snapshot in storage #21761

Merged
merged 8 commits into from
Jul 11, 2024

Conversation

agarwal-navin
Copy link
Contributor

@agarwal-navin agarwal-navin commented Jul 2, 2024

Bug / root cause

There can be scenarios such as DB rollback in storage, where the latest snapshot is lost but the corresponding summary ack is still present because ops are maintained by another service. Currentlty, if a summarizer receives an ack that is newer that the summary it's tracking, it will fetch latest snapshot and close. So, in the above scenario, the summarizer would get the ack, attempt to fetch the snapshot and close. When a new summarizer loads, it will do the same thing because the ack corresponding to the snapshot doesn't exist. The document will be stuck in this loop and reach 10k+ unsummarizer ops after which it won't be able to make any changes.

Fix

When the client gets a newer ack, it fetches the snapshot and if the fetched snapshot is older than the one for the ack, it will ignore this ack and not close. The idea being that the server will send a summary ack only after it has a snapshot present for it, so a snapshot should exist. The only reason it wouldn't exist is that it was deleted later. So, the client assumes this is the case and moves on.

Let's say this is not the case and the snapshot was not deleted but the server did not return it during snapshot fetch for some reason. The next summary upload by this client will fail because the "parent handle" (the last summary id) will be incorrect. The summarizer will close and the next summarizer will most likely not run into this issue because this is a very rare scenario.

The PR has the following changes:

  • Container runtime now tracks the details of the last summary ack it has processed. It uses the information from this as the "parent" summary's handle when uploading a summary. Currently, it gets this information from the "SummaryCollection" but in the scenario described above, "SummaryCollection" has already updated the last ack to the one for which there is no snapshot. So, using that will fail the next summary upload.
  • If the fetched snapshot's reference seq# is lower than the reference seq# in the ack, simply return.
  • Added a property newerSnapshotPresent to the RefreshLatestSummaryAckFetch telemetry which will tell us if there was a newer snapshot present or not. I could add another telemetry when this case is hit, but I think this should be enough.

Testing

  • Added a container runtime unit test that mocks the storage layer. It only has one snapshot and it always returns that. The test calls ContainerRuntime::refreshLatestSummaryAck with a newer ref seq# and then calls ContainerRuntime::submitSummary. It validates that both succeed and call storage with right properties.
  • Added an end-to-end test. In the test, there is only one snapshot (initial one from detached container). The test then creates a summarizer and calls its SummaryCollection (that handles summary op / ack) with a fake summary op and summary ack whose ref seq# is greater than the initial snapshot. This simulates the scenario because from the client perspective, it got a summary ack for a snapshot that doesn't exist.

AB#8334

@github-actions github-actions bot added area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc base: main PRs targeted against main branch labels Jul 2, 2024
@agarwal-navin agarwal-navin requested review from a team, pragya91, markfields, jatgarg, tyler-cai-microsoft, kian-thompson, rajatch-ff and vladsud and removed request for a team July 2, 2024 22:42
@agarwal-navin agarwal-navin changed the title [Summarization] Don't close on receiving summary ack without corresponding snapshot Jul 2, 2024
@msfluid-bot
Copy link
Collaborator

msfluid-bot commented Jul 8, 2024

@fluid-example/bundle-size-tests: +469 Bytes
Metric NameBaseline SizeCompare SizeSize Diff
aqueduct.js 457.07 KB 457.16 KB +91 Bytes
azureClient.js 555.04 KB 555.14 KB +105 Bytes
connectionState.js 680 Bytes 680 Bytes No change
containerRuntime.js 258.28 KB 258.35 KB +70 Bytes
fluidFramework.js 391.41 KB 391.43 KB +14 Bytes
loader.js 134.1 KB 134.11 KB +14 Bytes
map.js 42.17 KB 42.17 KB +7 Bytes
matrix.js 145.43 KB 145.44 KB +7 Bytes
odspClient.js 522.81 KB 522.91 KB +105 Bytes
odspDriver.js 97.17 KB 97.19 KB +21 Bytes
odspPrefetchSnapshot.js 42.27 KB 42.29 KB +14 Bytes
sharedString.js 162.51 KB 162.52 KB +7 Bytes
sharedTree.js 381.87 KB 381.88 KB +7 Bytes
Total Size 3.26 MB 3.27 MB +469 Bytes

Baseline commit: 65e6a0b

Generated by 🚫 dangerJS against 1eb9a61

* - If that snapshot is older than the one tracked by this client, ignore the ack because only the latest
* snapshot is tracked.
* - If that snapshot is newer, attempt to fetch the latest snapshot and do one of the following:
* - If the fetched snapshot is same or newer than the one for which ack was received, close this client. The
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd number these sub-bullets instead of having - to make it more readable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improved the indentation. It wasn't allowing it until I disabled it.

@agarwal-navin agarwal-navin merged commit 5f898d5 into microsoft:main Jul 11, 2024
30 checks passed
@agarwal-navin agarwal-navin deleted the dontFailUnknownAck branch July 15, 2024 22:31
RishhiB pushed a commit to RishhiB/FluidFramework-1 that referenced this pull request Jul 18, 2024
…is no snapshot in storage (microsoft#21761)

## Bug / root cause
There can be scenarios such as DB rollback in storage, where the latest
snapshot is lost but the corresponding summary ack is still present
because ops are maintained by another service. Currentlty, if a
summarizer receives an ack that is newer that the summary it's tracking,
it will fetch latest snapshot and close. So, in the above scenario, the
summarizer would get the ack, attempt to fetch the snapshot and close.
When a new summarizer loads, it will do the same thing because the ack
corresponding to the snapshot doesn't exist. The document will be stuck
in this loop and reach 10k+ unsummarizer ops after which it won't be
able to make any changes.

## Fix
When the client gets a newer ack, it fetches the snapshot and if the
fetched snapshot is older than the one for the ack, it will ignore this
ack and not close. The idea being that the server will send a summary
ack only after it has a snapshot present for it, so a snapshot should
exist. The only reason it wouldn't exist is that it was deleted
**_later_**. So, the client assumes this is the case and moves on.

Let's say this is not the case and the snapshot was not deleted but the
server did not return it during snapshot fetch for some reason. The next
summary upload by this client will fail because the "parent handle" (the
last summary id) will be incorrect. The summarizer will close and the
next summarizer will most likely not run into this issue because this is
a very rare scenario.

The PR has the following changes:
- Container runtime now tracks the details of the last summary ack it
has processed. It uses the information from this as the "parent"
summary's handle when uploading a summary. Currently, it gets this
information from the "SummaryCollection" but in the scenario described
above, "SummaryCollection" has already updated the last ack to the one
for which there is no snapshot. So, using that will fail the next
summary upload.
- If the fetched snapshot's reference seq# is lower than the reference
seq# in the ack, simply return.
- Added a property `newerSnapshotPresent` to the
`RefreshLatestSummaryAckFetch` telemetry which will tell us if there was
a newer snapshot present or not. I could add another telemetry when this
case is hit, but I think this should be enough.

## Testing
- Added a container runtime unit test that mocks the storage layer. It
only has one snapshot and it always returns that. The test calls
`ContainerRuntime::refreshLatestSummaryAck` with a newer ref seq# and
then calls `ContainerRuntime::submitSummary`. It validates that both
succeed and call storage with right properties.
- Added an end-to-end test. In the test, there is only one snapshot
(initial one from detached container). The test then creates a
summarizer and calls its `SummaryCollection` (that handles summary op /
ack) with a fake summary op and summary ack whose ref seq# is greater
than the initial snapshot. This simulates the scenario because from the
client perspective, it got a summary ack for a snapshot that doesn't
exist.



[AB#8334](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/8334)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc base: main PRs targeted against main branch
4 participants