[Summarization] Don't close on receiving summary ack for which there is no snapshot in storage #21761

agarwal-navin · 2024-07-02T22:41:57Z

Bug / root cause

There can be scenarios such as DB rollback in storage, where the latest snapshot is lost but the corresponding summary ack is still present because ops are maintained by another service. Currentlty, if a summarizer receives an ack that is newer that the summary it's tracking, it will fetch latest snapshot and close. So, in the above scenario, the summarizer would get the ack, attempt to fetch the snapshot and close. When a new summarizer loads, it will do the same thing because the ack corresponding to the snapshot doesn't exist. The document will be stuck in this loop and reach 10k+ unsummarizer ops after which it won't be able to make any changes.

Fix

When the client gets a newer ack, it fetches the snapshot and if the fetched snapshot is older than the one for the ack, it will ignore this ack and not close. The idea being that the server will send a summary ack only after it has a snapshot present for it, so a snapshot should exist. The only reason it wouldn't exist is that it was deleted later. So, the client assumes this is the case and moves on.

Let's say this is not the case and the snapshot was not deleted but the server did not return it during snapshot fetch for some reason. The next summary upload by this client will fail because the "parent handle" (the last summary id) will be incorrect. The summarizer will close and the next summarizer will most likely not run into this issue because this is a very rare scenario.

The PR has the following changes:

Container runtime now tracks the details of the last summary ack it has processed. It uses the information from this as the "parent" summary's handle when uploading a summary. Currently, it gets this information from the "SummaryCollection" but in the scenario described above, "SummaryCollection" has already updated the last ack to the one for which there is no snapshot. So, using that will fail the next summary upload.
If the fetched snapshot's reference seq# is lower than the reference seq# in the ack, simply return.
Added a property newerSnapshotPresent to the RefreshLatestSummaryAckFetch telemetry which will tell us if there was a newer snapshot present or not. I could add another telemetry when this case is hit, but I think this should be enough.

Testing

Added a container runtime unit test that mocks the storage layer. It only has one snapshot and it always returns that. The test calls ContainerRuntime::refreshLatestSummaryAck with a newer ref seq# and then calls ContainerRuntime::submitSummary. It validates that both succeed and call storage with right properties.
Added an end-to-end test. In the test, there is only one snapshot (initial one from detached container). The test then creates a summarizer and calls its SummaryCollection (that handles summary op / ack) with a fake summary op and summary ack whose ref seq# is greater than the initial snapshot. This simulates the scenario because from the client perspective, it got a summary ack for a snapshot that doesn't exist.

AB#8334

packages/runtime/container-runtime/src/containerRuntime.ts

packages/test/test-end-to-end-tests/src/test/summarization/summaries.spec.ts

msfluid-bot · 2024-07-08T20:00:06Z

⯅ @fluid-example/bundle-size-tests: +469 Bytes

Metric Name	Baseline Size	Compare Size	Size Diff
aqueduct.js	457.07 KB	457.16 KB	⯅ +91 Bytes
azureClient.js	555.04 KB	555.14 KB	⯅ +105 Bytes
connectionState.js	680 Bytes	680 Bytes	■ No change
containerRuntime.js	258.28 KB	258.35 KB	⯅ +70 Bytes
fluidFramework.js	391.41 KB	391.43 KB	⯅ +14 Bytes
loader.js	134.1 KB	134.11 KB	⯅ +14 Bytes
map.js	42.17 KB	42.17 KB	⯅ +7 Bytes
matrix.js	145.43 KB	145.44 KB	⯅ +7 Bytes
odspClient.js	522.81 KB	522.91 KB	⯅ +105 Bytes
odspDriver.js	97.17 KB	97.19 KB	⯅ +21 Bytes
odspPrefetchSnapshot.js	42.27 KB	42.29 KB	⯅ +14 Bytes
sharedString.js	162.51 KB	162.52 KB	⯅ +7 Bytes
sharedTree.js	381.87 KB	381.88 KB	⯅ +7 Bytes
Total Size	3.26 MB	3.27 MB	⯅ +469 Bytes

Baseline commit: 65e6a0b

Generated by 🚫 dangerJS against 1eb9a61

packages/runtime/container-runtime/src/containerRuntime.ts

kian-thompson · 2024-07-11T00:16:22Z

packages/runtime/container-runtime/src/containerRuntime.ts

+		 * - If that snapshot is older than the one tracked by this client, ignore the ack because only the latest
+		 * snapshot is tracked.
+		 * - If that snapshot is newer, attempt to fetch the latest snapshot and do one of the following:
+		 * - If the fetched snapshot is same or newer than the one for which ack was received, close this client. The


nit: I'd number these sub-bullets instead of having - to make it more readable

Improved the indentation. It wasn't allowing it until I disabled it.

…nding snapshot

…is no snapshot in storage (microsoft#21761) ## Bug / root cause There can be scenarios such as DB rollback in storage, where the latest snapshot is lost but the corresponding summary ack is still present because ops are maintained by another service. Currentlty, if a summarizer receives an ack that is newer that the summary it's tracking, it will fetch latest snapshot and close. So, in the above scenario, the summarizer would get the ack, attempt to fetch the snapshot and close. When a new summarizer loads, it will do the same thing because the ack corresponding to the snapshot doesn't exist. The document will be stuck in this loop and reach 10k+ unsummarizer ops after which it won't be able to make any changes. ## Fix When the client gets a newer ack, it fetches the snapshot and if the fetched snapshot is older than the one for the ack, it will ignore this ack and not close. The idea being that the server will send a summary ack only after it has a snapshot present for it, so a snapshot should exist. The only reason it wouldn't exist is that it was deleted **_later_**. So, the client assumes this is the case and moves on. Let's say this is not the case and the snapshot was not deleted but the server did not return it during snapshot fetch for some reason. The next summary upload by this client will fail because the "parent handle" (the last summary id) will be incorrect. The summarizer will close and the next summarizer will most likely not run into this issue because this is a very rare scenario. The PR has the following changes: - Container runtime now tracks the details of the last summary ack it has processed. It uses the information from this as the "parent" summary's handle when uploading a summary. Currently, it gets this information from the "SummaryCollection" but in the scenario described above, "SummaryCollection" has already updated the last ack to the one for which there is no snapshot. So, using that will fail the next summary upload. - If the fetched snapshot's reference seq# is lower than the reference seq# in the ack, simply return. - Added a property `newerSnapshotPresent` to the `RefreshLatestSummaryAckFetch` telemetry which will tell us if there was a newer snapshot present or not. I could add another telemetry when this case is hit, but I think this should be enough. ## Testing - Added a container runtime unit test that mocks the storage layer. It only has one snapshot and it always returns that. The test calls `ContainerRuntime::refreshLatestSummaryAck` with a newer ref seq# and then calls `ContainerRuntime::submitSummary`. It validates that both succeed and call storage with right properties. - Added an end-to-end test. In the test, there is only one snapshot (initial one from detached container). The test then creates a summarizer and calls its `SummaryCollection` (that handles summary op / ack) with a fake summary op and summary ack whose ref seq# is greater than the initial snapshot. This simulates the scenario because from the client perspective, it got a summary ack for a snapshot that doesn't exist. [AB#8334](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/8334)

github-actions bot added area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc base: main PRs targeted against main branch labels Jul 2, 2024

agarwal-navin requested review from a team, pragya91, markfields, jatgarg, tyler-cai-microsoft, kian-thompson, rajatch-ff and vladsud and removed request for a team July 2, 2024 22:42

agarwal-navin changed the title ~~[Summarization] Don't close on receiving summary ack without corresponding snapshot~~ Jul 2, 2024

vladsud reviewed Jul 3, 2024

View reviewed changes

packages/runtime/container-runtime/src/containerRuntime.ts Show resolved Hide resolved

vladsud reviewed Jul 3, 2024

View reviewed changes

packages/runtime/container-runtime/src/containerRuntime.ts Outdated Show resolved Hide resolved

agarwal-navin mentioned this pull request Jul 8, 2024

Create repro newest ack with missing snapshot on the service #21766

Closed

agarwal-navin commented Jul 8, 2024

View reviewed changes

packages/test/test-end-to-end-tests/src/test/summarization/summaries.spec.ts Show resolved Hide resolved

agarwal-navin force-pushed the dontFailUnknownAck branch from b06f90a to 238c291 Compare July 9, 2024 22:01

vladsud reviewed Jul 10, 2024

View reviewed changes

packages/runtime/container-runtime/src/containerRuntime.ts Show resolved Hide resolved

vladsud approved these changes Jul 10, 2024

View reviewed changes

kian-thompson approved these changes Jul 11, 2024

View reviewed changes

agarwal-navin added 5 commits July 10, 2024 19:42

[Summarization] Don't close on receiving summary ack without correspo…

ca129de

…nding snapshot

Tests

ace9907

PR comments and test fixes

e3806fd

Small simplification

5330424

Improve indentation

58cdd8f

agarwal-navin force-pushed the dontFailUnknownAck branch from 6c94afe to 58cdd8f Compare July 11, 2024 02:42

agarwal-navin added 2 commits July 11, 2024 09:50

Build break

636b370

Remove build fix

9f02a17

Merge branch 'main' into dontFailUnknownAck

1eb9a61

agarwal-navin merged commit 5f898d5 into microsoft:main Jul 11, 2024
30 checks passed

agarwal-navin deleted the dontFailUnknownAck branch July 15, 2024 22:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Summarization] Don't close on receiving summary ack for which there is no snapshot in storage #21761

[Summarization] Don't close on receiving summary ack for which there is no snapshot in storage #21761

agarwal-navin commented Jul 2, 2024 •

edited by azure-boards bot

Loading

msfluid-bot commented Jul 8, 2024 •

edited

Loading

kian-thompson Jul 11, 2024

agarwal-navin Jul 11, 2024

[Summarization] Don't close on receiving summary ack for which there is no snapshot in storage #21761

[Summarization] Don't close on receiving summary ack for which there is no snapshot in storage #21761

Conversation

agarwal-navin commented Jul 2, 2024 • edited by azure-boards bot Loading

Bug / root cause

Fix

Testing

msfluid-bot commented Jul 8, 2024 • edited Loading

kian-thompson Jul 11, 2024

Choose a reason for hiding this comment

agarwal-navin Jul 11, 2024

Choose a reason for hiding this comment

agarwal-navin commented Jul 2, 2024 •

edited by azure-boards bot

Loading

msfluid-bot commented Jul 8, 2024 •

edited

Loading