Misbehaving driver can cause Fluid to hang on container open #18430
Comments
FYI, @rajatch-ff
There is no way for the container to progress until it fills the gap with the ops, because otherwise the state would not be consistent. We cannot skip ops and just proceed.
Agreed on not "just proceeding", but there should be some sort of additional check that results in the load erroring out. If the driver returned success but no messages, should that be an immediate error? Should there be a limit on how many attempts are made for the same gap?
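To make the "limit on attempts for the same gap" idea concrete, here is a minimal sketch of such a check. It is only an illustration: `GapFetchGuard`, `recordFetch`, and the attempt limit are hypothetical names and do not exist in Fluid. The caller is assumed to invoke `recordFetch` after each gap fetch completes.

```typescript
// Hypothetical helper, not part of the Fluid Framework API.
// Tracks consecutive empty fetches for one gap and throws when the limit is hit.
class GapFetchGuard {
    private readonly maxEmptyAttempts: number;
    private gapKey = "";
    private emptyAttempts = 0;

    constructor(maxEmptyAttempts: number) {
        this.maxEmptyAttempts = maxEmptyAttempts;
    }

    /**
     * Record the outcome of one fetch for the gap between `from` and `to`.
     * Throws once the same gap has come back empty `maxEmptyAttempts` times in a row.
     */
    recordFetch(from: number, to: number, messagesReturned: number): void {
        const key = `${from}:${to}`;
        if (key !== this.gapKey) {
            // A different gap: start counting from scratch.
            this.gapKey = key;
            this.emptyAttempts = 0;
        }
        if (messagesReturned > 0) {
            // Forward progress was made, so reset the counter.
            this.emptyAttempts = 0;
            return;
        }
        this.emptyAttempts += 1;
        if (this.emptyAttempts >= this.maxEmptyAttempts) {
            throw new Error(
                `Gap ${from}..${to} returned no messages after ${this.emptyAttempts} attempts`,
            );
        }
    }
}
```

With a limit of 1, this behaves like the "immediate error" option; a larger limit tolerates a service that is briefly behind.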
@zagriswo we'll improve the checks here; the work is backlogged.
Question: In your original description, why did the service keep returning 0 messages and cause the container to get stuck? Did it ever proceed, or did the service lose the messages somehow?
@jatgarg it was a bug uncovered by fuzzing. Basically, a hole was made in the op stream, so our driver returned 0 messages in perpetuity because those messages just didn't exist anymore.
In the ODSP driver, we already handle this issue: if we don't make progress fetching ops from the delta storage service, we give up after 30 seconds and the container closes.
You can see the usage of it in the ODSP driver here:
Let me know if you have more questions. You should be able to use it with your driver easily. In the future, we will consider whether to move this higher up the stack, into the loader/deltastream layer.
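For readers without the ODSP code at hand, the "give up after 30 seconds without progress" behavior could be sketched roughly as below. This is not the ODSP driver's implementation: `NoProgressWatchdog` and its methods are hypothetical names, and the injectable clock is there only to keep the sketch easy to test.

```typescript
// Hypothetical sketch, not the ODSP driver's actual code.
// Reports a timeout when no forward progress has been noted for `timeoutMs`.
class NoProgressWatchdog {
    private readonly timeoutMs: number;
    private readonly now: () => number;
    private lastProgressMs: number;

    constructor(timeoutMs: number, now: () => number = () => Date.now()) {
        this.timeoutMs = timeoutMs;
        this.now = now;
        this.lastProgressMs = now();
    }

    /** Call whenever a fetch returns at least one new op. */
    noteProgress(): void {
        this.lastProgressMs = this.now();
    }

    /** True once `timeoutMs` has elapsed since the last noted progress. */
    hasTimedOut(): boolean {
        return this.now() - this.lastProgressMs >= this.timeoutMs;
    }
}
```

A driver's fetch loop might create `new NoProgressWatchdog(30_000)`, call `noteProgress()` after every non-empty batch, and fail the fetch (so the container closes) as soon as `hasTimedOut()` returns true after an empty batch.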
Describe the bug
We found a bug in our driver that resulted in Fluid effectively busy-looping and causing an app hang. We can fix the driver bug, but it would be good to have the container loading code be a bit more defensive too.
Our driver returned all the messages the service had via the `IDocumentDeltaConnection.initialMessages` property, but this set of messages erroneously had a gap in the middle. `DeltaManager` would go through its `fetchMissingDeltas` path to try to retrieve the messages in the gap, but our implementation of `IDocumentDeltaStorageService.fetchMessages` would successfully return an empty stream (that is, no messages and `done: true`) when asked about that gap. This caused `DeltaManager` to try to keep fetching the gap over, and over, and over, without making forward progress as it would get back an empty stream each time it tried to fill in the gap.
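As an illustration of what "a gap in the middle" means, a driver could check a batch of messages for contiguous sequence numbers before handing it over. The sketch below assumes messages are ordered by ascending `sequenceNumber`; `SequencedMessage`, `Gap`, and `findFirstGap` are hypothetical names defined only for this example and are not Fluid types.

```typescript
// Hypothetical minimal shape; real Fluid messages carry many more fields.
interface SequencedMessage {
    sequenceNumber: number;
}

// The sequence numbers on either side of the first discontinuity.
interface Gap {
    after: number;
    before: number;
}

/**
 * Returns the first place where sequence numbers are not contiguous,
 * or undefined if every message follows its predecessor by exactly 1.
 */
function findFirstGap(messages: readonly SequencedMessage[]): Gap | undefined {
    let prev: number | undefined;
    for (const message of messages) {
        if (prev !== undefined && message.sequenceNumber !== prev + 1) {
            return { after: prev, before: message.sequenceNumber };
        }
        prev = message.sequenceNumber;
    }
    return undefined;
}
```

For a batch with sequence numbers 1, 2, 5, 6, this reports a gap after 2 and before 5, which is the range the fetch described above kept asking for and never received.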