Veridian 3

From KaiRoWiki
Revision as of 14:50, March 4, 2011 by KaiRo (talk | contribs) (→‎Examples)

On a planet called Veridian III, a decisive battle was fought to prevent a rocket from being fired into a star, which would have changed gravitational forces, made "the nexus" crash into the planet, and destroyed the planet with a shock wave. Preventing this catastrophe made the crash of the USS Enterprise on the planet controllable enough that no human casualties were suffered.

In the same spirit, the project I internally dub "Veridian 3" is about dealing with crashes to make the bad ones preventable and other ones more controllable, all through prioritizing Socorro work.

Project areas

Those areas have been specified in the contract:

  • Improving Crash Data Integrity
    • Identification and Removal of Duplicate Crash reports
  • Improving Search Capabilities
  • Improving Classification and Characterization of Crash Reports and Improved Signature Generation
  • Additional correlation reports to help identify circumstances around the crash and steps to reproduce
  • Improve Trend Reports to identify and alert teams about Explosive bugs

Bugzilla Tags

These tags are used to classify Socorro bugs. They all start with "V3" and are documented here. The numbers in brackets are bug counts as of 02/23.

  • V3-integrity (31): Affecting Crash Data Integrity, i.e. quality of the original data we have stored
  • V3-search (28): Search Capabilities
  • V3-classify (47): Classification and Characterization of Crash Reports and Signature Generation
  • V3-correlation (31): Correlation reports to help identify circumstances around the crash and steps to reproduce
  • V3-trends (17): Trend Reports, e.g. to identify and alert teams about Explosive bugs
  • V3-newreports (22): New reports (requests for generating new reports)
  • V3-UI (103): User Interface issues
  • V3-UItweaks (58): UI tweaks (probably easy to solve, small UI issues) - subgroup of V3-UI
  • V3-nonHTMLoutput (12): Non-HTML/web output (.csv, feeds, etc.)
  • V3-notify (14): Notifications (to be) sent out by the Socorro system
  • V3-infra (123): Infrastructure and backend issues (note: out of the direct focus of my project, subject to internal planning in the Socorro team)
  • V3-config (20): Configuration adaptations (skiplist additions, etc.)
  • V3-productization (14): Making Socorro a product that can be deployed and understood by others (documentation, etc.)
  • V3-datarequest (6): Data requests (bugs that request data through manual jobs)

Prioritization Comments From Socorro Users

<wsmwk> KaiRo: second tier needs might be bug 421119, bug 518823, bug 578376, bug 411354.  third tier: bug 527304, bug 512910, better workflow for updating skiplist)
<firebot> Bug https://bugzilla.mozilla.org/show_bug.cgi?id=421119 min, P3, 2.1, nobody, NEW, function for socorro to compare stacks of two or more crash reports
<firebot> Bug https://bugzilla.mozilla.org/show_bug.cgi?id=518823 enh, --, Future, nobody, NEW, indicate bug's status for bugzilla keyword topcrash
<firebot> Bug https://bugzilla.mozilla.org/show_bug.cgi?id=578376 nor, --, ---, nobody, NEW, multiple crashes from a single person should have less weight then many crashes from different peopl
<firebot> Bug https://bugzilla.mozilla.org/show_bug.cgi?id=411354 nor, P1, 2.0, nobody, REOP, Add ability to search by build ID
<firebot> Bug https://bugzilla.mozilla.org/show_bug.cgi?id=527304 enh, --, ---, nobody, REOP, provide smart analysis ala talkback
<firebot> Bug https://bugzilla.mozilla.org/show_bug.cgi?id=512910 enh, --, ---, nobody, NEW, Make it easier to analyze crashes that share a signature
https://bugzilla.mozilla.org/show_bug.cgi?id=551669 provide graphs by crash date too
Smokey Ardisson (way behind; no bugmail - do not email) <alqahira@ardisson.org> changed:
                 CC|                            |kairo@kairo.at
johnjbarton 
Of course *all* of my crashes involve Firebug. The number one question I have when I visit crash-stats site is:
How many other users who have this crash also running Firebug?
If the answer is 95%, then I better spend some time on it because no one else will. If the answer is 5%, I'm having lunch.
Jeff Muizelaar
It would be nice if it was possible to get more summary information about a crash.
For example: What build ids does this crash all occur with? What operating system versions does this all occur with? etc.
Josh Matthews
I, like Jeff, would appreciate summaries of the data available - most recent 10 unique build ids, list of unique OS versions, range of uptimes, etc.
I would also be really interested in data about spikes - seeing a graph of the number of crashes for a particular signature over time would be useful to track trends.

Explosive Crashes

Notes on the work on a set of criteria for finding explosive crash reports - bug 629049 is the tracker bug, bug 629062 is detection. The PRD doc has some surrounding info, but no criteria yet.

Personal Notes

  • Sharp/significant increase at certain wall-clock time across versions
  • Sharp/significant increase at certain build ID (date?) on single version/series (possibly ignoring everything in version string starting with first letter if the version ends in "pre", to have e.g. 5.0a3pre->5.0b1pre or 4.0b11pre->4.0b12pre not disturb the analysis)
  • Ignore (suspected) duplicates
  • Frequency weighted by ADU more important than bare count (from something chofmann has said)
  • I'm not fond of topcrash rank comparisons, as 20 crashes with similar frequency changing place looks overvalued there, while e.g. #1 having 10,000 crashes and #3 having 500 fully mask #2 exploding from 600 to 5,000 in a day.
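
The version-series idea from the notes above could be sketched roughly as follows. This is only an illustration: the function name `version_series` is hypothetical, and it assumes that the cut at the first letter is applied only to versions ending in "pre", as the note suggests.

```python
import re


def version_series(version):
    """Return a series key for grouping crash data by version.

    Assumption: for versions ending in "pre", everything from the first
    letter onward is dropped, so 5.0a3pre and 5.0b1pre share one series.
    All other version strings are kept unchanged.
    """
    if version.endswith("pre"):
        first_letter = re.search(r"[A-Za-z]", version)
        return version[:first_letter.start()]
    return version


# Both nightly versions collapse into the same series key "5.0".
print(version_series("5.0a3pre"), version_series("5.0b1pre"))
```

With such a key, a switch from 4.0b11pre to 4.0b12pre would not split the per-build-ID numbers into two separate series.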

Criteria Proposal

This is a quite rough proposal right now.

  1. Get two sets of numbers per signature:
    • non-duplicate crashes and total ADU per day, for the last 10 days
    • non-duplicate crashes and ADU per combination of version series (see personal notes) and date of build ID, for the last 10 available build ID dates in the version series
  2. For each set, calculate (if there are at least 4 values in the set):
    • average crashes per ADU over the (up to) 7 values before the most recent value ("base")
    • average ADU over those values ("avgADU")
    • distance of that average to the highest value in set ("dist"), clamped to a minimum of (50 crashes/avgADU)
    • recent value per ADU ("data")
    • (total|version)_explosiveness_1 = (data-base)/dist
  3. For each set, calculate (if there are at least 6 values in the set):
    • average crashes per ADU over the (up to) 7 values before the most recent 3 values ("base")
    • average ADU over those values ("avgADU")
    • standard deviation of that average ("dist"), clamped to a minimum of (20 crashes/avgADU)
    • average of recent 3 values per ADU ("data")
    • (total|version)_explosiveness_3 = (data-base)/dist
  4. Mark as explosive in UI if *_explosiveness_1 > 3 or *_explosiveness_3 > 2.
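
The steps above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the Socorro implementation: all function and parameter names are hypothetical, "highest value in set" is read as the highest of the base values, and the standard deviation is taken as the population standard deviation of the base values. Each series is a list with the oldest value first.

```python
from statistics import mean, pstdev


def explosiveness(crashes, adu, recent, min_values, min_crashes, spread):
    """Return (data - base) / dist, or None if there is too little data.

    recent:      number of newest values that form "data" (1 or 3)
    min_values:  minimum series length needed for a useful result
    min_crashes: clamp for "dist", in crashes (divided by avgADU below)
    spread:      function computing the unclamped "dist" from base rates
    """
    if len(crashes) < min_values:
        return None
    rates = [c / a for c, a in zip(crashes, adu)]   # crashes per ADU
    window = slice(-(recent + 7), -recent)          # up to 7 values before "data"
    base_rates = rates[window]
    base = mean(base_rates)
    avg_adu = mean(adu[window])
    dist = max(spread(base_rates, base), min_crashes / avg_adu)
    if dist == 0:
        return None          # only possible when clamping is disabled
    data = mean(rates[-recent:])
    return (data - base) / dist


def explosiveness_1(crashes, adu, min_crashes=50):
    # Step 2: "dist" is the distance from base to the highest base value.
    return explosiveness(crashes, adu, 1, 4, min_crashes,
                         lambda rates, base: max(rates) - base)


def explosiveness_3(crashes, adu, min_crashes=20):
    # Step 3: "dist" is the (population) standard deviation of the base values.
    return explosiveness(crashes, adu, 3, 6, min_crashes,
                         lambda rates, base: pstdev(rates))


def is_explosive(e1, e3):
    # Step 4: the thresholds 3 and 2 are the proposal's arbitrary limits.
    return (e1 is not None and e1 > 3) or (e3 is not None and e3 > 2)
```

The same two functions would be called once with the per-day series and once per version series, which yields the total_* and version_* values. Keeping the raw scores separate from `is_explosive` leaves the marking limits adjustable in the UI.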

Problems with this proposal

  • Completely arbitrary numbers for explosiveness marking limits and "dist"-clamping, need to see if they catch all explosives and/or catch too much.
  • If there's no large enough set of numbers to work with, there's no useful explosiveness.
  • It's unclear whether the version-based numbers add really useful value; they also create a multitude of explosiveness numbers to store (2 per version series).
  • There might be an argument for only calculating the second (*_explosiveness_3) measure, as it's fine-grained enough to catch highly explosive crashes on the first day of explosion.

Upsides of this proposal

  • Recognizes that dupes and ADU changes can make base values fluctuate and gets rid of those problems.
  • The clamping of "dist" doesn't just prohibit divisions by zero, but also deals with potential skew due to tiny fluctuations in small numbers.
  • Having explosiveness numbers available to UI enables flexibility in marking, sorting and changing limits.

Examples

  • Bug 554660 (see below) has an interesting example of numbers to look at for this: totals of 54, 72, 86, 83, 67, 46, 47, 123, 131 for 2010-03-08 through 2010-03-16. Here's a look at how this algorithm does, ignoring ADU, which are not given there, and therefore also the clamping:
    • On 2010-03-15, total_explosiveness_1 would have been 2.7, not yet triggering (?), and total_explosiveness_3 would have been slightly negative, also not triggering.
    • On 2010-03-16, total_explosiveness_1 would have been 1.2, not triggering, but total_explosiveness_3 would have been 2.2, triggering the warning.
    • On later days, the warning should have triggered easily on both values, even with clamping of "dist".
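
The arithmetic behind the 2010-03-15 and 2010-03-16 figures can be written out as below. This is a sketch under two assumptions not stated in the proposal: "highest value in set" is read as the highest of the base values, and the standard deviation is the population standard deviation of the base values. ADU and clamping are ignored, as in the example. With these readings the results agree with the figures quoted above to within rounding.

```latex
% 2010-03-15 (recent value 123)
\begin{align*}
\text{base}_1 &= \frac{54+72+86+83+67+46+47}{7} = 65 \\
\text{dist}_1 &= 86 - 65 = 21 \\
\text{total\_explosiveness\_1} &= \frac{123 - 65}{21} \approx 2.76 \\
\text{base}_3 &= \frac{54+72+86+83+67}{5} = 72.4 \\
\text{dist}_3 &= \sqrt{665.2 / 5} \approx 11.53 \\
\text{data}_3 &= \frac{46+47+123}{3} = 72 \\
\text{total\_explosiveness\_3} &= \frac{72 - 72.4}{11.53} \approx -0.03
\end{align*}

% 2010-03-16 (recent value 131)
\begin{align*}
\text{base}_1 &= \frac{72+86+83+67+46+47+123}{7} \approx 74.86 \\
\text{dist}_1 &= 123 - 74.86 = 48.14 \\
\text{total\_explosiveness\_1} &= \frac{131 - 74.86}{48.14} \approx 1.17 \\
\text{base}_3 &= \frac{54+72+86+83+67+46}{6} = 68 \\
\text{dist}_3 &= \sqrt{1246 / 6} \approx 14.41 \\
\text{data}_3 &= \frac{47+123+131}{3} \approx 100.33 \\
\text{total\_explosiveness\_3} &= \frac{100.33 - 68}{14.41} \approx 2.24
\end{align*}
```

Only the last value, 2.24, exceeds its limit (2), which is why the warning first triggers on 2010-03-16.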

A larger number of examples, including "dist" clamping (but no ADU), is available in File:Explosive calc.ods (File:Explosive calc.pdf).

User Comments

From https://wiki.mozilla.org/Socorro:PRD_Interviews

damon:
 * (initial) growth of more than 25 positions in the ranking
 * upwards change in rank and no related bugzilla id
 * time since startup < 1 minute
 * highlight these crashes in red or something

From https://bugzilla.mozilla.org/show_bug.cgi?id=525316

morgamic:
 My suggestion for a delta to watch is an increase in crash frequency of more
 than 50-75% and new crashes in the top 20 overall signatures by version.

Data From Previous Explosive Crash Bugs

I used the explosive bug query to find these bugs, trying to pull out information on how they were explosive.

Reports

Some tools and reports on crash data are currently outside the main Socorro systems: