Review WikiLambda system’s criteria for updating a function’s implementation list
Open, Low, Public, Feature Request

Description

Steps to replicate the issue (include links if applicable):

What happens?:
Two implementations with relatively poor performance were promoted above three implementations with better performance.

What should have happened instead?:
Disconnected test cases should be disregarded or have a lower weighting than connected test cases.

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

The function had six implementations and seven test cases. Two test cases were not connected at the time, since they test with invalid Natural number objects (negative value or leading zero). The implementations promoted by WikiLambda system were successful for all seven tests. Three of the less favoured implementations were successful for six out of seven tests, failing for the disconnected case with a leading zero. The preferred implementations rely on Python (==) and have significantly worse performance than the other three, which rely on built-in Z803 and Z866. (The sixth implementation relies on JavaScript (===) and had a timeout failure or two.)
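To make the ranking concern concrete, here is a minimal sketch of the kind of connected-versus-disconnected weighting suggested above. The data shapes, weights and scoring rule are assumptions for discussion, not WikiLambda system's actual algorithm.

```lang=python
# Illustrative sketch only: rank implementations by a pass score that ignores
# (or down-weights) disconnected testers, then break ties by average runtime.
# The weights, field names and scoring rule are assumptions, not the actual
# WikiLambda algorithm.

CONNECTED_WEIGHT = 1.0
DISCONNECTED_WEIGHT = 0.0   # use e.g. 0.25 for "lower weighting" rather than "disregarded"

def pass_score(results):
    """results: list of dicts like {"passed": bool, "connected": bool}."""
    return sum(
        (CONNECTED_WEIGHT if r["connected"] else DISCONNECTED_WEIGHT) * r["passed"]
        for r in results
    )

def rank_implementations(implementations):
    """Each implementation: {"id": str, "avg_runtime_ms": float, "results": [...]}."""
    return sorted(
        implementations,
        key=lambda impl: (-pass_score(impl["results"]), impl["avg_runtime_ms"]),
    )

# Under a rule like this, a fast implementation that fails only a disconnected
# tester (the leading-zero case above) would rank above a slower one that
# also passes that tester.
```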

Because disconnecting Z13524 (on April 18th: https://www.wikifunctions.org/w/index.php?title=Z13522&diff=99728&oldid=99559 ) seems to have made no difference to WikiLambda system’s preferences, I considered it appropriate to disconnect the preferred implementations for the time being.

Please see T360385 for issues that I suspect are related to this problem, at least in part.

Disconnections: https://www.wikifunctions.org/w/index.php?search=T363908&title=Special:Search&profile=advanced&fulltext=1&ns1=1

Event Timeline

Let's timebox the investigation and fix for now so we can figure out what is going on: say two days, and then write down the status. We should work out why seemingly slower implementations are ranked higher. Another example where this happens is Z16137: https://www.wikifunctions.org/wiki/Z16137

All of the testers run considerably faster against Z16142 than against Z16143, and yet WikiLambda system ranked the latter higher than the former: https://www.wikifunctions.org/wiki/Z16137?uselang=en&diff=prev&oldid=103958

This led to the community disconnecting a perfectly valid implementation. We should figure out why the ranking is off here.

Please see https://www.wikifunctions.org/view/en/Z16688

Python == (Z16690) is preferred, with an average runtime of 6397.5 ms, compared with JavaScript == (Z17360), which has an average runtime of 2355.75 ms (all tests pass with both implementations).

Hi @GrounderUK - Thanks for making this ticket and providing all the details, and apologies for not commenting sooner here. I did spend some time looking at the initial example above (Z13522), and I formed a hypothesis about the behavior in that case, but I didn't have certainty so I didn't provide info at that time. (It's hard to have complete certainty about these orderings at present, without turning on more detailed logging.) Also, I wrote a doc page outlining how the implementation list ordering happens.

I've looked at your example from June 26 (Z16688). What I'm seeing today is that of the 2 implementations that are passing all tests, the speedier implementation is actually in the first position of the implementation list. Specifically:

Z16692, using JS === for Integer, is in the first position
Z16690, using Python ==, is in the second position

Of course, it could have been different on June 26 when you looked; I didn't try to check whether it has changed since then. However, I am wondering if you are going by the ordering of implementations that appears on the page. That does not actually reflect the ordering in the Z8K4 / implementations property. In fact, I see that Python == appears first on the page (under Implementations).

I can see how that could be a big source of confusion, and apologies if it has been. I'm going to add that point to the doc page mentioned above. Also, the team has floated a few ideas about how to visually indicate which implementation is currently in the first position. I will create a ticket to explore this if one doesn't already exist.

Note: to see the actual order of elements in the Z8K4 / implementations property, I look at the JSON representation, by visiting https://www.wikifunctions.org/wiki/Z16688?action=raw.
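If it helps, the same check can be scripted. This is only a sketch: it assumes the canonical JSON form in which the persisted object's Z2K2 holds the function and Z8K4 is a canonical list whose first element names the item type; adjust if the stored form differs.

```lang=python
# Sketch: fetch the raw JSON of a function page and print the Z8K4 ordering.
# Assumes the canonical form where Z2K2 is the function and Z8K4 is a list
# whose first element is the item type.
import json
import urllib.request

url = "https://www.wikifunctions.org/wiki/Z16688?action=raw"
with urllib.request.urlopen(url) as response:
    persisted = json.load(response)

function = persisted["Z2K2"]         # the inner function object
implementations = function["Z8K4"]   # canonical list: [item type, item, item, ...]
print(implementations[1:])           # skip the leading type element
```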

@DMartin-WMF "of the 2 implementations that are passing all tests, the speedier implementation is actually in the first position of the implementation list" isn't that because all the other implementations are disabled?

Should we re-enable them and see what the WikiLambda bot promotes?

… I wrote a doc page outlining how the implementation list ordering happens.

Thank you for your comment and the document, which is excellent. I was getting the feeling on Saturday that the problem had improved, so I was looking at the list of disconnections linked in the Description and got as far as reconnecting https://www.wikifunctions.org/view/en/Z10050 when it became apparent that all was not well more generally (T368531).

To answer your question directly, no, I don’t think I have been confused about which implementation has been put at the top of the list. Of course, I can only see the most recent timings but I “know” that Python timings generally exceed 5000 ms whereas JavaScript typically comes in under 3000 ms. We have good visibility of the current ordering when viewing or adding a test. I also keep an eye on contributions by WikiLambda system and the re-orderings are plain enough in the diffs. Z16690 (Python) was promoted early on the 26th and Z16692 was promoted early on the 27th. In this case, then, well… at least it got there in the end!

In the original example, the problem seemed to be a doubtful test case which “fails” for the optimal implementation. This just needs to be ignored, but disconnecting it did not seem to have that effect. Your document refers to “currently connected” tests, so I shall look at this again when stability returns.

Picking up on Toby’s suggestion, please see https://www.wikifunctions.org/w/index.php?title=Z866&action=history where there were reconnections on June 20th and Python has since occupied the top position (all tests are currently failing for all the other implementations, but that was (surely?) not the case a week or two ago).

Clearly, we would rather see Python implementations execute in similar (shorter!) times to JavaScript, but my sense is that the divergence is growing rather than shrinking. I can only surmise that Python is currently rather more reliable than JavaScript, since it is always significantly slower when the implementations are similar, typically by a factor of two or three. See, for example, Z16143 (disconnected), which takes over 6000 ms for ==, while JavaScript (Z16142) delivers === in around 1500 ms (with an outlier of 2311 ms): https://www.wikifunctions.org/view/en/Z16137.

Thanks again for your efforts. Given the 80% rule you have outlined, I think we (the functioneers) might explore a more temporary (repeated) disconnection strategy and report back on how manageable that is. As a start, I have connected all implementations to Z13522, leaving the doubtful test disconnected.

https://www.wikifunctions.org/view/en/Z13533 (Python) jumped to the top of the list and got disconnected. [Now re-connected by @99of9.]

@DMartin-WMF I note it says in the document (towards the end) “…the Z8K4 property is only updated if the average runtime for the new best performer is significantly better than the average runtime for the previous best performer (and the new best performer has no more test failures than the previous best performer).” Can you perhaps clarify whether “…no more test failures…” means “…no more failures of connected tests…”? Thank you!

Hi @GrounderUK - Yes, that refers to connected tests. The algorithm considers only the currently connected implementations and tests.
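To make the quoted criterion concrete, here is one reading of the update rule restricted to connected tests. This is a sketch only: the names, data shapes and the "significantly better" threshold are placeholders, not the extension's actual code.

```lang=python
# Sketch of the quoted Z8K4 update criterion, counting failures of connected
# tests only. The 20% threshold and all names are illustrative assumptions.

SIGNIFICANCE_FACTOR = 0.8  # hypothetical: "significantly better" = at least 20% faster

def connected_failures(impl, connected_test_ids):
    return sum(
        1
        for test_id, passed in impl["results"].items()
        if test_id in connected_test_ids and not passed
    )

def should_update_z8k4(new_best, previous_best, connected_test_ids):
    faster_enough = (
        new_best["avg_runtime_ms"]
        < SIGNIFICANCE_FACTOR * previous_best["avg_runtime_ms"]
    )
    no_extra_failures = (
        connected_failures(new_best, connected_test_ids)
        <= connected_failures(previous_best, connected_test_ids)
    )
    return faster_enough and no_extra_failures
```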

@DMartin-WMF So is the current fully-connected situation where python got prioritized sufficient to demonstrate that the issue is real? As we await a fix, we would like to set it back to the faster JS.

Hi @99of9 - Yes, after studying the Z13522 example, and also having some discussion with other team members, it's clear the issue is real. In fact, it looks like there are two questions that need attention: (1) is the ordering strategy (as described at the doc page) what's needed, and (2) is the implementation of the ordering strategy correct? Until today I thought the answer to (2) was probably "yes", but there is indeed something out of joint with the Z13522 example (from the data displayed in the metadata dialogs, I expect the ordering to be different). I will continue to investigate question (2).

But for the moment, let's return to question (1). You and @GrounderUK have been mostly citing runtime (duration; elapsed time) numbers (e.g. 6000 ms mentioned above). The ordering strategy actually relies on CPU time, which was originally thought to be sensible. The CPU time differences between Python and JavaScript implementations are a lot smaller than the duration differences that you have been citing. That might (sometimes) help to explain why the actual orderings don't correspond with the observed runtimes.

In any case, in team discussions we are considering whether the strategy should be changed from CPU time to runtime, at least while Python runtimes are consistently so much greater than JavaScript runtimes. The original focus on CPU time was oriented towards a concern with optimizing usage of computing resources. However, what I'm understanding here is that users are struggling with frustrating wait times and sometimes timeouts. If we change the strategy to rely on runtimes, it should become more likely that the JavaScript implementations will be selected and the wait times and timeouts will be reduced. (Hope I stated that clearly enough.)
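As a rough illustration of what the proposed change amounts to, the two orderings below use the Z16688 durations quoted earlier; the CPU figures and the metadata field names are invented placeholders, since the real CPU values aren't quoted in this thread.

```lang=python
# Rough illustration of the metric change under discussion. The duration
# figures are the ones quoted above for Z16688; the CPU figures and the key
# names ("cpu_ms", "duration_ms") are invented placeholders.

implementations = [
    {"id": "Z16690", "language": "Python",     "cpu_ms": 90, "duration_ms": 6397.5},
    {"id": "Z17360", "language": "JavaScript", "cpu_ms": 95, "duration_ms": 2355.75},
]

by_cpu_time = sorted(implementations, key=lambda i: i["cpu_ms"])       # current strategy
by_duration = sorted(implementations, key=lambda i: i["duration_ms"])  # proposed strategy

# With CPU times this close, the Python implementation can come out first even
# though its elapsed time is nearly three times longer; ordering by duration
# would put the JavaScript implementation first instead.
```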

If the team agrees with it, am I right in thinking that you would also approve of this idea of changing the strategy to rely on runtimes?

Short answer: yes. Whatever it takes to avoid timeouts, which are most problematic in compositions which have to run and pass data through a chain of underlying functions.

If the team agrees with it, am I right in thinking that you would also approve of this idea of changing the strategy to rely on runtimes?

Yes.

When I raised this ticket, I assumed that there was some consideration of CPU usage but that we should give duration greater weight. Whether we should switch to considering only duration, I’m not so sure, but in the absence of an existing algorithm for combining the different figures, it seems the simplest improvement to the current situation.

The most common visible problem we have is the overall limit of 10000 ms being exceeded. The second most visible (much less common) is the 9000 ms evaluation limit. I don’t think I have ever experienced a specific limit based on CPU usage, but that may be one of the less visible safeguards.

The practical effect of the current approach is counterproductive, in my view, because it will almost always be a code implementation that is selected (when available). Contributors have an incentive to create new implementations for more complex functions (rather than compositions that will time out), but there can be no explicit code re-use, so we end up with uncontrolled duplication of code. Furthermore, cached results are available only for the specific function, so we forgo the opportunity to rely on cached results of subordinate functions that compositions (presumably) allow.

I see no evidence, however, that re-using cached results in compositions translates into shorter durations or lower CPU usage in test evaluations. I suspect this may be an “accounting” issue, but it may also be a consequence of running tests concurrently, so that cached results from one test are not available for reuse within another (when there are subordinate functions in common). The “accounting” issue might be that such reuse of cached results does, in fact, occur but the historic resource utilisation is added into the total instead of (or as well as) the resources used in identifying and using the cached results.
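To spell out the suspected “accounting” issue, here is a purely hypothetical model of the bookkeeping, not the orchestrator's real behaviour: if a cached sub-result's historic runtime is folded into the composition's total, the reported duration exceeds the wall-clock time actually spent.

```lang=python
# Hypothetical model of the suspected "accounting" problem; all numbers and
# names are made up for illustration.

cached_subcall_historic_ms = 4000   # time the subordinate function took when first computed
cache_lookup_ms = 5                 # time actually spent retrieving the cached result
remaining_work_ms = 500             # the composition's own work on this run

actual_elapsed = cache_lookup_ms + remaining_work_ms                  # 505 ms
misreported_total = (
    cached_subcall_historic_ms + cache_lookup_ms + remaining_work_ms  # 4505 ms
)

# If rankings use the misreported total, a composition that reuses cached
# results looks no cheaper than one that recomputes everything.
```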

In closing, thank you for confirming that disconnected tests are completely ignored when ordering implementations. When all tests are evaluated for a function, is there any assurance that all the connected tests are evaluated before any disconnected tests are evaluated? Is the current ordering of implementations relevant here? Or is each evaluation assumed (or somehow guaranteed) to be immune from performance side effects from the others, particularly in its orchestration?

Change #1052836 had a related patch set uploaded (by David Martin; author: David Martin):

[mediawiki/extensions/WikiLambda@master] Set some implementation ranking logs to warning level

https://gerrit.wikimedia.org/r/1052836

Hi @GrounderUK and @99of9 - Thanks for weighing in. I've made a ticket (T369587) and implemented a patch for the change we discussed above: basing the ordering on runtime (i.e., duration, elapsed time) instead of CPU times. If everything goes smoothly it might get deployed this week; we'll see.

@GrounderUK , thanks for raising those questions. I'm out of time for today; will comment on them tomorrow.

…will comment on them tomorrow.

Thanks, there’s no rush. But I am a little concerned that the evaluation duration seems to be double-counted in the metadata.

IMG_0971.png (2×960 px, 155 KB)

Both orchestration and evaluation started 36 seconds ago and finished 33 seconds ago, but the total duration is more than six seconds. This may be just a presentational issue, but I think checking that out would be worthwhile. Thanks again.
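For what it's worth, the arithmetic is consistent with the evaluation span being summed on top of the orchestration span that already contains it. The numbers below are made up to match the shape of the screenshot, not taken from real metadata.

```lang=python
# Illustration of the suspected presentational double count: the evaluation
# phase runs inside the orchestration phase, but the UI appears to sum the
# two durations. Numbers are illustrative only.

orchestration_ms = 3200   # wall-clock span of the whole request (includes evaluation)
evaluation_ms = 3000      # wall-clock span of the nested evaluation step

reported_total = orchestration_ms + evaluation_ms  # 6200 ms, the misleading figure
actual_total = orchestration_ms                    # evaluation is already inside this span
```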

Jdforrester-WMF changed the subtype of this task from "Bug Report" to "Feature Request". Wed, Jul 10, 4:35 PM

@GrounderUK - Thanks for noticing the double-counting of the evaluation duration! Yes, this appears to be a mistake in the logic of our front-end code, and I created T369788 for this.

Returning to your previous questions, briefly for now: no, there is no code that ensures connected tests will be evaluated before disconnected tests. I believe it is accurate to say that the test-running code currently assumes that each run will be immune from performance side effects from the others. It should be noted, however, that the only time all tests of a function are rerun is when the function itself has been modified. If, say, just one particular implementation is modified, then only the tests for that implementation get rerun. In that case, the other test results are drawn from a cache, because there's no reason to believe a better result would be obtained by rerunning them.
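A small sketch of that rerun policy as described above; the structure and names are assumptions for illustration, not the test runner's actual code.

```lang=python
# Sketch of the rerun policy: rerun everything when the function itself
# changes; rerun only the affected implementation's tests when a single
# implementation changes; otherwise serve results from the cache.

def tests_to_rerun(change, implementations, tests):
    if change["kind"] == "function":
        # Every (implementation, test) pair is re-evaluated.
        return [(impl, test) for impl in implementations for test in tests]
    if change["kind"] == "implementation":
        # Only the modified implementation is re-evaluated; other results
        # are drawn from the cache.
        return [(change["id"], test) for test in tests]
    return []  # nothing rerun; cached results are used
```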

Nevertheless, you raise a good point. When all the tests are in fact rerun, or even if several are rerun at the same time, it's possible that we may get better results by isolating the connected test runs from performance side effects that might occur from the disconnected test runs. I will make a ticket for further consideration of this. Thanks again.