So, after much more log parsing and number crunching than I had anticipated, I have some results from the Greener Tinderbox effort. For those who don’t know, I’ve been running all the TUnit automated tests over and over on a set of knock-off Tinderboxen, trying to hammer out which failures we are seeing and which we aren’t. My hope is to keep doing this long term, debugging a failure or two every couple of weeks until we get the tinderboxes back to a sane state. But to get started, I first had to collect a bunch of data, parse it, and determine what it means. That takes a while, but now round 1 is done.
= Executive Summary =
The boxes displayed more noise than normal tinderboxes because of a couple of configuration issues described in the “Quality of Results” section below. For the most part, the tests that failed have been designated as “intermittent” at one time or another, or on different products (SeaMonkey, Fennec, etc.). You can grab the results in an OpenOffice.org spreadsheet:
== Next Steps ==
Looking over the top intermittent failures that were replicated from tinderbox, and comparing that to Ted’s new topfails reports, I’ve decided to attack three tests for debugging:
* browser chrome fuel test: browser_Browser.js (bug 458688)
* chrome test test_wheeltransaction.xul (bug 484853)
* mochitest jquery fx module tests (bug 484994)
My plan is to add debugging information to these tests and run these three harnesses in tight loops for the next 24 hours to see if I can get the failures to replicate and figure out what is happening.
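For the curious, the kind of “tight loop” I mean is nothing fancy. Here is a rough Python sketch (the harness command and log layout are placeholders, not my actual invocation): it reruns the harness for a fixed window, archives every log, and flags the runs that hit an unexpected failure.

```python
# tight_loop.py -- rerun a test harness for ~24 hours and archive each run's log.
# HARNESS_CMD is a placeholder; substitute the real mochitest/reftest invocation.
import os
import subprocess
import time

HARNESS_CMD = ["python", "runtests.py", "--browser-chrome"]  # placeholder command
LOG_DIR = "loop-logs"
HOURS = 24

if not os.path.isdir(LOG_DIR):
    os.makedirs(LOG_DIR)

deadline = time.time() + HOURS * 3600
run = 0
while time.time() < deadline:
    run += 1
    log_path = os.path.join(LOG_DIR, "run-%04d.log" % run)
    with open(log_path, "w") as log:
        subprocess.call(HARNESS_CMD, stdout=log, stderr=subprocess.STDOUT)
    # Flag runs that hit an unexpected result so they are easy to find later.
    with open(log_path) as log:
        if any("TEST-UNEXPECTED-" in line for line in log):
            print("run %d: unexpected failure, see %s" % (run, log_path))

print("completed %d runs" % run)
```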
On Friday, I want to update to a new changeset and re-run all the tests on the machines for another five days. And after getting that data, I’ll need to give the machines back, but I plan to keep the VMs. Rolling up results should be quite a bit quicker next time as I won’t have any scripts to write for log crunching.
== The VM versus Machine Question ==
There is no doubt that the high-end machines I’m using for this blow the VMs out of the water in terms of performance. However, in terms of intermittent test failures, I don’t see any quantitative difference between the VMs and the machines. For the second round, I hope to finally have my Linux machine ready so I can compare it with the Linux VM.
= Details Details Details =
== Results ==
The pretty results are here (OpenOffice.org spreadsheet). The raw logs are also available in that directory. And I’ll be checking in all my log-parsing/number-crunching/test-running code at http://hg.mozilla.org/qa/tunit-analysis once that repo is available.
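In the meantime, the heart of the log crunching is just tallying the TEST-UNEXPECTED lines the harnesses print. A minimal sketch (the directory layout and file names here are made up) looks something like this:

```python
# tally_failures.py -- count unexpected failures per test across a directory of logs.
# Assumes the usual "TEST-UNEXPECTED-FAIL | <test> | <message>" harness output format.
import glob
import re
from collections import defaultdict

FAIL_RE = re.compile(r"TEST-UNEXPECTED-\w+ \| ([^|]+) \|")

counts = defaultdict(int)
runs = 0
for log_path in glob.glob("loop-logs/*.log"):  # made-up layout
    runs += 1
    with open(log_path) as log:
        for line in log:
            match = FAIL_RE.search(line)
            if match:
                counts[match.group(1).strip()] += 1

# Noisiest tests first.
for test, n in sorted(counts.items(), key=lambda item: -item[1]):
    print("%4d failures in %d runs  %s" % (n, runs, test))
```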
== The Math ==
I was going to crunch the numbers, but once I started (you can see some on the “Total” sheet in the spreadsheet), the numbers didn’t look quite right to me, so I’m going to repost my “Counting Failures” blog post, this time with legible equations, to make sure I’m going about this correctly.
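To give a flavor of the arithmetic in question (this is my back-of-the-envelope version, not necessarily what the reposted equations will look like): if a test fails intermittently with per-run probability p, then the obvious estimate from k failures in n runs, and the chance of never seeing the failure at all in n runs, are:

```latex
\hat{p} = \frac{k}{n},
\qquad
P(\text{no failure in } n \text{ runs}) = (1 - p)^{n}
```

So a test that fails about 5% of the time still sneaks through 20 consecutive green runs roughly 36% of the time, since (0.95)^20 is about 0.36, which is why a handful of clean reruns doesn’t say much.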
== Quality of Results ==
There are two problems with the results.
1. The accessibility tests on Linux did not run, because I naively thought that --enable-accessibility was the default on Linux as it is on Windows. So all the failures in mochitest-a11y on Linux should be disregarded.
2. On Windows, I set the color depth policy as specified in the reference platform document; however, I could not change the resolution or color depth on those machines. I assumed that the policy would somehow take care of it (I should have known better). So a bunch of the color-testing reftests failed on the Windows machine.
Otherwise, the results seem to be good. I have queried every failure (except the reftest failures noted above, many of which seem to be machine configuration issues) against Bugzilla to find out whether each failure is noise from the box or a failure that has been reported as intermittent on Tinderbox, and I’ve noted those bugs. If the bug field is blank, then there is no mention of that failure in Bugzilla.
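(I did the Bugzilla cross-checking by hand with quicksearch. If anyone wants to script that step, here is a rough sketch against Bugzilla’s REST API; treat the endpoint and field names as my assumption about that API rather than something I actually ran for this round.)

```python
# bug_lookup.py -- search Bugzilla for existing bugs mentioning a failing test.
# The REST quicksearch endpoint and field names here are assumptions about
# Bugzilla's API, not what was actually used for this round of results.
import json
import urllib.parse
import urllib.request

BUGZILLA = "https://bugzilla.mozilla.org/rest/bug"

def find_bugs(test_name, limit=10):
    """Return (bug id, summary) pairs for bugs whose quicksearch matches test_name."""
    query = urllib.parse.urlencode({"quicksearch": test_name, "limit": limit})
    with urllib.request.urlopen("%s?%s" % (BUGZILLA, query)) as response:
        data = json.loads(response.read().decode("utf-8"))
    return [(bug["id"], bug["summary"]) for bug in data.get("bugs", [])]

if __name__ == "__main__":
    for bug_id, summary in find_bugs("browser_Browser.js"):
        print(bug_id, summary)
```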