Thursday, October 30, 2008

Play Action Fake

I host a MySQL server for use by the Falcon team, and this requires business class broadband service.

Comcast Business Services support has been consistently stellar.
Network or line problems are addressed within the same day, and I always get straight through to a knowledgeable person when I call support.

Recently, the coax connector on the cable modem/router snapped loose. I called Business Services, and within two hours a replacement router sat on my desk. I configured the device and called to have them enable it on their network. I got right through.
"This is Ron. How may I help you?"
Twentysomething, confident in tone. A good sign.
"Hi Ron. This is Chris. I just configured a replacement router and need to have it enabled on your network."
Hard stop.
"That is not a router. It is a mod-em. We call it a cable mod-em."
Like I was five, or something.
"Ok, well, I use it to manage the network, that's why I call it a router."
"That is a cable modem, not a router."
Strictly speaking, he was right: the Comcast Business Gateway (SMC 8014) modulates and demodulates the cable signal and is therefore a cable modem, albeit one with router-y capabilities.

I called it a router anyway. Another question:
"I have two Comcast devices. Next to the router is a Scientific Atlanta DPX2203. It's what we get our TV and phone with. What do you call that?"

"I have no idea what that is."

He should have. It was just a standard cable modem.
He probably doesn't do residential customer support. Let it go.
Ron logged on to the router. He did something and it rebooted.

"Try it now."
No joy. My carefully configured network seized up like an acrophobic mountain goat.
"Ron, the LAN can't see the WAN, local DHCP is busted, the NAT and DMZ configuration pages are disabled on the router and my server can't see the network."

"What's your server's IP?"

He shouldn't have to know, but I gave it to him. The router rebooted.
"Ok, try it now."
Better, but not right.
"The NAT and DMZ options are still disabled, and my local systems are assigned public IPs. Not good."
I explained, patiently and at some length, that I wanted my local network behind the firewall and that my server had to bypass the firewall. I also didn't want my local systems using the block of static IPs assigned to my account. Basic stuff, but Ron simply didn't grasp the concept.

Then he said this, I swear:
"I can't help you with that. Those features are disabled for your account."
Something had gone terribly, terribly wrong in my world. Subarachnoid hemorrhage? Radon?
Shake it off, Powers.
"Now, wait a minute. The router was shipped unlocked, just like my old router. I configured it, just like my old router. When you remotely configured it, those features became disabled. I NEED THEM. Did the firmware revision change or something?"

"Sometimes the firmware changes, but NAT is disabled. You do not need those features."
Clearly, I was being tested. Ron said this with complete conviction, and it threw me.
"Ron, I most certainly DO need those features to configure my network. I had a server in the DMZ and a LAN behind the firewall, and now you're saying that I can no longer do that? Has my account status changed?"

"No, but we don't support those features. I really can't help you with that."
Confusion yielded to irritation.
"Look, this is unacceptable. You replaced my router--sorry, cable modem--but disabled the features that I was using. It's useless now. Please let me speak to your supervisor or someone else who can help."
Ron stumbled, then recovered.
"I can transfer you, but those features are not enabled for your account. We don't support NAT."
My incredulity finally inspired a calm sense of purpose.
I will appeal to Ron's higher mind. He can be reasoned with. He is stubborn because he is afraid. I know he is wrong and he suspects he is wrong but wants to appear confident like his fellow techs who really do know their stuff.
I chose to de-escalate and reason with him instead.
"Ok, let's think this through. Assuming you are right, why is my laptop accessing the Comcast DHCP server? Are you saying I can't even turn on the router's DHCP?"
A pause.
"Let me look."
The router rebooted. Third time.
"Check it now."
All lights were green. Homepages blossomed.
"There! Great! What did you change?"

"I enabled NAT."
I thanked him. He said to call back if I needed anything else.

Saturday, October 25, 2008

The Weekly Falcon Index

Planned Falcon Blog posts: 4
Actual posts: 4
Hours resolving network outage due to physically damaged router: 3
Hours from first service call that a new router arrived via courier: 2
Hours haggling with Comcast tech over router configuration: 1
Hours resolving concurrent but unrelated disk corruption: 2.5
Hours locked out of building where I was supposed to speak at MySQL Meetup: 0.5

Falcon IRC non-system messages: 2661
IRC champ: Vlad (26%)
Runner-up: Lars-Erik (19%)
Highest percentage of smiley faces per lines of chat: 19%
Friendliest chatter: Olav
German/Norwegian message ratio: 1:1
References to reiserfs: 13
References to prison: 2
References to football (soccer): 182
Gratuitous references to Ramadan: 1
Best quote: "You obviously do not live in an apartment with a bunch of left-wing girls."
Source of quote: [redacted]

Thursday, October 23, 2008

Some Perspective on Recent Events, Part III

[Part I, Part II]

Perturbations in the Field


In the weeks leading up to the Dev meeting in Riga, the Falcon team dumped a record number of changes into the codebase, including a page cache optimization that contained a severe but undetectable bug.

The usual indicators offered no sign of trouble: the Pushbuild matrix was green, the Falcon regression tests passed, System QA reported nothing unusual. It wasn't until Philip modified the System QA stress tests that a problem emerged.

The modification was simple: kill mysqld at the end of the test, then restart the server and wait for Falcon recovery to complete. This unsophisticated, almost whimsical improvisation should have amounted to little more than a savage afterthought in the test plan. Instead, it not only exposed a severe bug, it also revealed a serious gap in our test coverage.

What else had we missed? Jim surfaced an old argument:
"The question is how something that broken could escape detection."

"What we really need is an internal application that actually uses Falcon for something we care about. If we were actually using it, we would have noticed the breakage in about a minute."
In fact, MySQL 4.1, 5.0 and 5.1 are used extensively throughout development. We last discussed dogfooding Falcon at the London meeting in May, though nothing came of it.

A Personal Vendetta

The discussion continued.


Philip:
"In an ideal world, we should have had a recovery test in PushBuild or in the Weekly Falcon Test starting sometime around Nixon's resignation."

"I wish this bug was caught in the course of the execution of some determined, comprehensive QA strategy for Falcon. Instead, it was caught because I happened to have a personal vendetta against Falcon recovery that drove me to create those tests."
Manyi:
1) We need better coverage on regression testing. Serious side-effects of a patch from September were not caught by regression testing until Philip introduced more recovery tests. We need to complete the remaining tasks on FalconQAIdeas and improve the regression testing we have today.

2) Falcon patches tend to be large and contain several bug fixes and worklogs, which adds complexity and makes reviews more difficult. One way to solve this is to have smaller patches, with each patch corresponding to a bug report or a worklog.
Ann:
"Neither worklogs nor small patches would have made this any easier to find. You have to understand the code, specifically that there can be several dirty page with the same page number and walk the hash duplicates chain to get all the pages written. I doubt that anybody but Jim would have caught that subtlety."
Hakan had seen the recovery failure, but hadn't realized it:
"Actually it was caught by the weekly DBT2 testing with 100 warehouses, where we load the data and restart the server. There were couple of 100 warehouses runs missing in September, but I ignored it because of Falcon being unstable in general."

"Todo: Add a hook into DBT2 to figure out why a 100 warehouse run was not started."
Vlad:
"I'm sorry to be annoying with my ideas of quality, but I really believe if we had good unit testing (and one for page cache of course) such regression would be caught there. SystemQA test is good to find regressions but it does not help localize the error..."
Jim:
"I'm the heretic about code reviews. An architectural discussion before the design is cast into code is a hundred times more useful than a code review afterward...a bad implementation of a good design is easier to fix than a good implementation of a bad design."
And on it went.

El Photo Grande

My take was that this bug did us a favor:

"I think this entire episode was a gift. We identified the cause of a serious, complex problem and consequently identified holes in our process."

"Strictly speaking, our QA system worked, albeit very inefficiently, because the bug didn't make its way into a telco switch or a 911 call center or Lehman Brothers financial system or even 6.0.7 for that matter (ok, maybe that was luck), so let's improve the process and move on."
I continued with my analysis, bordering on bureaucratic blather:
What Happened
There was a "perfect storm" of concurrent events:
  1. A complex, multi-part change to a core component (page cache), not fully reviewed
  2. Other core changes piled on top (online alter, memory manager, transaction manager)
  3. Disproportionate reliance on Pushbuild I results
  4. SystemQA stress tests were changed, thus no pass/fail history (was recovery bug new or old?)
  5. DBT2 failures were detected, but not correlated
What To Do
(I promise that these ideas are perfect and that there will never be another problem if we adopt them.)

Stricter Code Reviews
  1. One commit-one bug (mea culpa on this one!)
  2. Stability is KING
  3. Jim is still the reigning Falcon expert. USE HIM. Nag, if necessary.
  4. Changes to core components (page cache, memory manager) require a higher barrier of entry:
  • Clear, reproducible proof of benefit
  • Clear evidence of no regression
  • Assign a reviewer to understand EVERY LINE of the fix
I also think we should suspend all performance changes until we get results from PAE.

Assign Experts
Assign responsibility for subsystem expertise. I've seen this work on other teams ("Who's the DMA guy? Ralph? Ok, great.") Each subsystem (TransactionManager, Cache) has an expert, a go-to person whose sign-off is required for code reviews.

Cross-component expertise is important but we need more depth.

"El Photo Grande" for Falcon QA
We need the equivalent of Homeland Security for Falcon QA, an aggregation of QA intelligence such that we can, perhaps in a dashboard or something, see the overall state of QA: Pushbuilds I&II, SystemQA, DBT2, Performance, etc.

Once we establish a comprehensive QA baseline, then subtle-but-catastrophic bugs will manifest as perturbations in the field.
Ann thoughtfully countered many of my suggestions:
"It's not possible to do a visual code review of something as complex as Kelly's flush mechanism."

"The problem is that we have to develop QA in parallel with the code."


"This is all hard stuff, particularly recovery since it's not easy to make it deterministic, and if you do make it deterministic, you miss cases."

"People are better at analyzing and understanding designs than code."
I disagree with Jim and Ann on the value of code reviews. Yes, senior engineers should spend more time reviewing design than source code, but my experience has been that a second or third pair of eyes will always catch stuff--always--and that changes to core components must be reviewed by senior engineers. Less critical changes require less scrutiny.

But code reviews always miss stuff, too. Always.

The Mansfield Anomaly

One case in particular comes to mind, a pacemaker firmware defect so subtle and so pernicious that it was named in honor of the engineer who introduced it.

The "Mansfield Anomaly" evaded in-depth peer reviews, design analysis tests, unit tests, s
ystem verification and validation, pre-clinical testing and even clinical trials. It never manifested as a device failure in the field, fortunately, and the bug was eventually found in-house and a firmware patch released.

I was introduced to the Mansfield Anomaly by an advisory-level engineer during an informal training session. He flashed a rather innocuous 'switch' statement on the screen and challenged the ten or so engineers in the room to find the bug.

No one did.

After entertaining a few guesses, he pointed out a very subtle cut-and-paste error that could result in a device failure, but only during a 4ms hardware timing window. Anyone could (and did) miss it, especially in the context of a more comprehensive review with hundreds of lines of code.
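
I can't show the real code, but a purely invented fragment gives a sense of the kind of thing that slips through--every name, constant and timing value below is made up:

  // Purely hypothetical illustration -- not the actual pacemaker code.
  // The pattern: one case is pasted from another and a constant is never
  // updated, so the code compiles, looks plausible, and passes review.
  #include <cstdio>

  enum SenseEvent { ATRIAL_SENSE, VENTRICULAR_SENSE, NOISE };

  const int ATRIAL_REFRACTORY_MS      = 200;   // made-up values
  const int VENTRICULAR_REFRACTORY_MS = 250;

  void startTimer(int milliseconds)
  {
      std::printf("refractory timer: %d ms\n", milliseconds);
  }

  void handleSense(SenseEvent event)
  {
      switch (event)
      {
      case ATRIAL_SENSE:
          startTimer(ATRIAL_REFRACTORY_MS);
          break;
      case VENTRICULAR_SENSE:
          // BUG: pasted from the case above, constant never changed.
          // Harmless except within a narrow hardware timing window, which
          // is why every review and test walked right past it.
          startTimer(ATRIAL_REFRACTORY_MS);    // should be VENTRICULAR_REFRACTORY_MS
          break;
      case NOISE:
          break;
      }
  }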


A Savage Pleasure

So, back to the original question: How can the Falcon team better ensure that subtle-but-devastating bugs are caught? Code reviews? Design reviews? Dogfooding? Unit tests? More system tests? The MySQL community?

All of the above, of course, but what can a team do mid-project? What is the most effective change to make now?


Here's what we've done so far:
  1. More eyes on the code--all pushes must be code-reviewed
  2. Smaller pushes into the codebase
  3. Public intra-team email: falcon@lists.mysql.com
  4. Falcon QA is moving ahead with new tests
  5. Run the new tests on previous releases, compare with current or pending release
  6. Focus on stability in favor of performance--for now
  7. Save before and after file snapshots for recovery debugging
  8. Pushbuild2: a new array of servers dedicated to automated System QA stress tests
Item 8 is my personal favorite. Stress tests are irrational killers, but I find them to be a very satisfying challenge, which brings to mind one of my favorite quotes:
"Problem-solving is hunting; it is savage pleasure and we are born to it."

--Thomas Harris, The Silence of the Lambs

Tuesday, October 21, 2008

Bursty

That's always been my M.O., I guess, but last week was pretty busy:

Flyway

I went down to the Flyway Film Festival in Pepin, Wisconsin over the weekend to screen "Cottonwood", a short film I directed over the summer as part of the 48 Hour Film Project.


I gave a brief Q&A about the making of the film, then did some schmoozing at the filmmaker's reception. I had a great chat with the director of "Redneck Zombies"--the guy was hilarious.

"Cottonwood" is five minutes long--six with credits--and was written, shot, edited and scored in exactly forty-eight hours. And although "Cottonwood"
looks like a film that was made in an absolute, mad panic (it was), we had a blast doing it.

Minnesota MySQL+PHP Meetup

Last Wednesday, I had the great pleasure of attending the local MySQL+PHP meetup.

The attendance was excellent, the attendees were interesting, and I had the opportunity to talk about Falcon, give a rundown of the Riga meeting, and dispel (I think) some of the rumors making the rounds in the MySQL community.

There were many questions about the Sun acquisition and its effect on MySQL. I was happy to emphasize how positive the acquisition has been, most recently evidenced by the addition of several developers to Falcon and a commitment to give us access to the PAE performance team's expertise.

I also met three local MySQL customers who are interested in giving Falcon a test run with their web applications. When we're closer to beta, I'll arrange to visit them and help with the configuration.

Chill/Thaw Madness

I spent the rest of the week and weekend completely immersed in debugging several chill/thaw and concurrent online alter bugs emanating from the System QA stress tests. Falcon must be stable before we can hand it over to the PAE team for testing--that has been my number 1 priority since Riga.

I have a pile of fixes to push--some of which I will do tonight--but after wrangling with several nasty thaw bugs, I've come to realize that any mechanism that complicated should probably be scrapped (more on this as it develops.)

Ok, now I'm getting agitated. Back to it.

Some Perspective on Recent Events, Part II

"Why Falcon Doesn't Work Anymore"

Recap: After the MySQL Dev meeting in Riga, Philip "Troublemaker" Stoev reported a serious problem with Falcon recovery. Ann suspected that the problem was not a true recovery bug, which Jim confirmed.


The problem wasn't that we found a bug. The problem was that we didn't.

Run-up to Riga


The Falcon team was really rockin' in the last half of August and early September. With the Dev meeting looming, all changes for the 6.0.7 alpha had to be pushed by month's end, otherwise we'd lose weeks of QA testing.


Lots of stuff was dumped into the codebase during that short time. The Falcon weekly report for August 22 records pushes to the Transaction Manager, Memory Manager, Page Cache and Online Alter, just to name a few.

The build team finally cloned mysql-6.0-falcon on September 4th, but the heavy volume of source changes continued, peaking on the 10th, one week before the conference:

22:57:47 <wlad> Today is the absolute world-record in number of pushes/hour in the Falcon tree
22:57:47 <wlad> I think we already had 8 and seeing Chris' determination, there are more to come:)

22:58:56 <cpowers> Doing my best...iterative process of refinement, you know...
22:59:19 <cpowers> Got to keep them pushbuild servers happy
22:59:27 * wlad applauds:)


Pushbuild is the MySQL automated QA system. Every code push into the source tree triggers a wave of regression tests across twelve uniquely configured servers. Test results are displayed in a simple matrix, where each cell represents the status of a single code change on a single server. If the cell is green, all tests passed on that server. Red indicates a test failure, a build break or even a compiler warning.

A high number of concurrent pushes will cause Pushbuild tests to stack up, creating a multi-hour delay before results are available. Such was the case during the Great Week of Pushes as the Falcon team checked in change after change. The frantic pace also resulted in several preventable regressions that kicked the tree back into the red, necessitating repeat check-ins and further aggravating the Pushbuild backlog.

However, by the end of the week, Pushbuild returned to its refreshing, reassuringly green self, and we prepared for our journey to Latvia.

Worklog #4479

During the pre-Dev meeting melee, one item made it into the Falcon tree with relatively little fanfare: a two-part cache optimization provided by our performance expert, Kelly Long. From the associated worklog descriptions:

Falcon page cache has a findBuffer() routine that looks at the BDB list starting at the oldest and moving toward the youngest. This code is single threaded and can include time consuming disk IO. This change will increase the concurrency by executing most of the algorithm without locks.

Falcon page cache currently uses a single lock on the entire hash table. This change will create a lock per hash bucket allowing for higher concurrency.

The general idea was to push locking further down into the cache by replacing a single exclusive lock at the top with separate locks for each slot of the hash table.
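
In rough outline--this is a sketch of the idea, not Kelly's actual patch, and all of the names below are invented--the difference looks like this:

  // Sketch only: a single mutex over the whole hash table serializes every
  // lookup; a mutex per bucket lets lookups on different slots run in parallel.
  #include <cstddef>
  #include <mutex>
  #include <vector>

  struct Page { int pageNumber; Page* collision; };   // simplified cache node

  class SingleLockCache {
      std::mutex lock;                                 // one lock, whole table
      std::vector<Page*> buckets;
  public:
      explicit SingleLockCache(std::size_t slots) : buckets(slots, nullptr) {}
      Page* find(int pageNumber) {
          std::lock_guard<std::mutex> guard(lock);     // every thread queues here
          for (Page* p = buckets[pageNumber % buckets.size()]; p; p = p->collision)
              if (p->pageNumber == pageNumber) return p;
          return nullptr;
      }
  };

  class PerBucketLockCache {
      struct Bucket { std::mutex lock; Page* chain = nullptr; };
      std::vector<Bucket> buckets;
  public:
      explicit PerBucketLockCache(std::size_t slots) : buckets(slots) {}
      Page* find(int pageNumber) {
          Bucket& bucket = buckets[pageNumber % buckets.size()];
          std::lock_guard<std::mutex> guard(bucket.lock);  // contention per slot only
          for (Page* p = bucket.chain; p; p = p->collision)
              if (p->pageNumber == pageNumber) return p;
          return nullptr;
      }
  };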


Jim rejected the notion when it was first proposed back in May, pointing out that the algorithm would effectively triple the number of exclusive locks, even for cache misses.

Kelly was convinced, however, that the change would allow for better parallelism, and that a loss in performance at lower concurrency would be more than compensated for by a performance gain at higher concurrency. His own benchmarks appeared to support his contention.


The issue was revisited--contentiously--at the July meeting in Boston, but with no consensus. As the Dev meeting grew near, Kevin reconsidered the matter and eventually granted approval for the change, so Kelly's cache optimization finally made its way into the Falcon tree--along with twenty-five other changes in the same week.


Hard Landing


The Falcon team returned from Riga at the end of September, tanned, rested and ready for the drive towards Falcon beta. No sooner had we sat down at our dusty keyboards than Philip alerted us to new problems with Falcon recovery.


Recovery is a normal part of the engine initialization process that reestablishes the previous in-memory state and applies pending updates following an abrupt or abnormal shutdown. Recovery failures are serious and notoriously difficult to debug, so Philip's discovery of new failures stirred things up considerably.


Fortunately, Ann had a hunch as to what the problem might be. After some digging, Jim pinpointed the failure as a bug in the oft-debated cache optimization:

"The problem is simple and straight forward. Cache.cpp is totally broken. More specifically, the I/O thread (and architecture) is totally broken."

"When a checkpoint was required, a dirty page bitmap was produced. The bitmap, however, didn't contain the table space id, so ioThread had to loop through the collisions in the hash table to find the dirty buffer(s)."


"This only handles the first Bdb in a collision chain. In almost all circumstances, this will be the falcon_user tablespace. Consequently, dirty pages in falcon_master are never written."


"I'm sorry it has taken this long to discover that Philip's problems were page cache bugs masquerading as recovery bugs."


"I think we may need to reconsider some of our procedures."
The fix appeared to be simple enough--ten minutes by Kelly's estimation--but rather than spin more cycles on optimization, Kevin chose a reliable but somewhat heavy-handed course of action:
"In light of this analysis, I think it is prudent to back out Kelly's cache changes. We need to focus on delivering the most stable Falcon engine that we can, most immediately, to the [performance] team, who are our #1 customer right now."
The performance team was the Sun PAE group led by Allan Packer. During the Dev meeting, we agreed to get a stable version of Falcon to them ASAP so they could begin their own in-depth performance analysis.

But the issue at hand was much larger than a code regression--we'd certainly had those before--the issue was that a serious bug had escaped detection.

Vlad got right to the heart of the matter:

"I do not see any way to prevent 'totally broken' in the future. How should we should change our processes to detect an error before the push? It passed all the system tests."
Indeed, the problem wasn't with the Falcon code, it was with the Falcon process.

(The third and final installment of the story continues tomorrow, "Perturbations in the Field")

Tuesday, October 07, 2008

Sort of Out of Sorts

I promised Giuseppe that I'd post Part II of "Some Perspective on Recent Events" today, but I got hit w/ some kind of bug.

Last month, authorities at Schiphol duly confiscated my Black Balsam, aka "The Latvian Hammer", so I've had to resort to bland, marginally effective over-the-counter remedies.

Let's see how tomorrow goes.

Monday, October 06, 2008

Prolificity

I woke up today at 4:45 am, which is a completely worthless time to be awake. I lay there for twenty minutes tracking post-REM phosphenes and thinking random thoughts. Here's one of them.

The Planet MySQL aggregator ranks blogs according to 'activity', or number of posts in the last month. I typically care little for such things because unnecessary attention attracts unwanted expectation. However, were one to care about such things, this method of ranking favors those who tend to post like this:

Sunday, October 5
I broke my pencil but that's ok because I have another pencil.

Monday, October 6
I like Drizzle.

Tuesday, October 7
I like, drizzle.

Wednesday, October 8
I believe I lost my shoes, Clyde. I think the dog got 'em.

Free idea: Rank blogs using a Prolificity factor that incorporates two subfactors, Currency and Abundance.

Currency is the number of posts in the last month, and is a fair measure of how current the blog is.

Abundance is the moving weighted average of word count per post, which is a measure of the substance of the average blog post.

A moving weighted average will favor more recent posts, so that 100,000-word post of your PhD thesis back in 2002--though a fine contribution to All Things--would have little relevance today.


Expressed formulaically,
Prolificity = Currency*Abundance
Let's say Currency wins in the event of a tie, so maybe this:
Prolificity = Currency*Abundance + Currency
For example,
Blog Theta: 10 posts, 2000 words, 200 words/post MWA last 10 posts

Currency = 10, Abundance = 200
Prolificity = 10*200 + 10 = 2010

Blog Zeta: 2 posts, 2000 words, 1000 words/post MWA last 10 posts

Currency = 2, Abundance = 1000
Prolificity = 2*1000 + 2 = 2002
Blog Theta ranks higher because it's more current, but Blog Zeta isn't entirely out of the game. Perhaps the formula could be tweaked, but I'll leave that as an exercise for the reader.
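
And because no half-baked metric is complete without a half-baked implementation, here's a minimal sketch using the numbers above (Abundance is passed in directly; a real version would compute the moving weighted average itself):

  // Minimal sketch of the proposed ranking.
  #include <cstdio>

  struct Blog {
      const char* name;
      int currency;      // posts in the last month
      int abundance;     // words per post, moving weighted average
  };

  int prolificity(const Blog& blog)
  {
      // Prolificity = Currency*Abundance + Currency (Currency breaks ties)
      return blog.currency * blog.abundance + blog.currency;
  }

  int main()
  {
      Blog theta = { "Blog Theta", 10, 200 };
      Blog zeta  = { "Blog Zeta",   2, 1000 };
      std::printf("%s: %d\n", theta.name, prolificity(theta));   // 2010
      std::printf("%s: %d\n", zeta.name,  prolificity(zeta));    // 2002
  }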

Sunday, October 05, 2008

Some Perspective on Recent Events

[This has nothing to do with the financial crisis or the McCain/FeyPalin ticket. Sorry.]

Last Wednesday began innocently enough. Philip's email to the Falcon team, "Regarding Falcon Recovery", lamented the lack of progress in fixing recovery bugs. He detailed specific failures that he was seeing, many of them new, and pointed out that the number of recovery bugs was increasing. He closed with
"All of this means that the recovery problems must be tackled immediately and head-on. A database without functioning recovery can be at most alpha-quality software."
He was, of course, absolutely right. The lack of reliable database recovery is like flying with non-locking landing gear, so we took a hard look at the problem.

Then all hell broke loose.

Falcon Recovery

Falcon recovery is the equivalent of waking up after a night of partying and trying to figure out where you are, what you did and who you did it with (I have never experienced such a thing. Ever.)

When Falcon starts, it determines whether it was shut down cleanly. If not, it invokes recovery to apply the remaining active parts of the serial log, beginning with the last serial log entry, then backing up to its recovery start point.

This multi-phase procedure must work without fail, every time, even following an aborted recovery attempt (this is called double recovery.)

The recovery process applies changes in the serial log to page images in the page cache, until they are all done, then flushes the cache by forcing a checkpoint. Falcon does this in three passes through the serial log:

Phase I: Take Inventory, Establish State

- Determine transaction states
- Determine object states (track state transitions, record final state)

- Determine the last checkpointed record (prior objects guaranteed on disk)

Phase II: Physical Allocation
- Allocate and release required pages and sections
- Track object "incarnation" and state--update active objects with last incarnation

Phase III: Logical Application

- Apply data and index changes (avoid reallocating pages in use)

(For details, have a look at SerialLog::recover() in SerialLog.cpp.)
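
Very roughly, the whole thing has this shape--a stripped-down skeleton for orientation only, not the actual SerialLog::recover(), with every name invented:

  // Skeleton of the three-pass structure described above. Real recovery is
  // vastly more involved (it also backs up to a recovery start point and
  // must survive a second, interrupted run).
  #include <vector>

  struct LogRecord { int type; /* ...payload... */ };

  class RecoverySketch {
      std::vector<LogRecord> log;          // active portion of the serial log

      void takeInventory(const LogRecord&) { /* transaction and object states */ }
      void applyPhysical(const LogRecord&) { /* allocate/release pages, sections */ }
      void applyLogical(const LogRecord&)  { /* data and index changes */ }
      void forceCheckpoint()               { /* flush rebuilt cache to disk */ }

  public:
      void recover()
      {
          // Phase I: take inventory -- transaction states, object states,
          // and the last checkpointed record (earlier objects are on disk).
          for (const LogRecord& rec : log)
              takeInventory(rec);

          // Phase II: physical allocation -- pages and sections, keeping only
          // the last "incarnation" of each object.
          for (const LogRecord& rec : log)
              applyPhysical(rec);

          // Phase III: logical application -- data and index changes, taking
          // care not to reallocate pages already in use.
          for (const LogRecord& rec : log)
              applyLogical(rec);

          forceCheckpoint();
      }
  };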


Recovery failures are notoriously difficult to debug. Sometimes you have to determine not only what did happen, but also what didn't happen, such as a missing page update. And quite apart from tracking various types of recovery objects, recovery must also track the state or "incarnation" of each recovery object. For instance, if a record was updated five times, we only care about the most recent version--the fifth incarnation.

(Free idea: "The Fifth Incarnation", an epic tale of lives lived, lives lost, loves lived and lost loves. At quality book stores near you.)

"Expected 7, got 2"

Philip modified the SystemQA stress tests to force a recovery and found that the engine crashed every time. Up to this point, Falcon recovery was regarded as generally reliable, though not flawless, but now we had a reproducible failure that always manifested the same way:
Fatal assertion in Cache::fetchPage(), "Error: page 103/0 wrong page type, expected 7 got 2".
Cache::fetchPage() lies within the very nucleus of the Falcon database engine. It is the cosmological core through which every essential database action must flow.


Getting a "wrong page type" assertion in Cache::fetchPage() is like asking your HAL 9000 to compute "4 + 4 and getting "green" for an answer. This is not the type of error you ever want to encounter in a database, or in Jupiter orbit for that matter, because it puts everything into question. Nothing can be trusted--data pages, index pages, orbital trajectories--nothing.

Given that the cache error appeared to occur only during recovery, and given the complexity of the recovery process, it was only natural to assume that we had somehow failed to properly record a page update within the serial log. (We later discovered that the error had indeed occurred elsewhere, but only occasionally. Hakan saw it during DBT2 testing, and I had also seen it testing online alter.)

"This used to work. Always"

Ninety minutes after Philip's email, Ann responded:
"This appears to be a recently introduced problem that (probably) has nothing to do with recovery. It appears that the PageInventoryPage is not being written often enough. I don't know why. As a result, pages that are actually in use are recorded as being free. This isn't a problem for a running Falcon because the page is up-to-date in the cache. After a crash, the out-of-date version is read from disk and pages are reallocated."
Falcon uses PageInventoryPages (PIPs) to track used and available pages, so an out-of-date PIP implies a very serious corruption.
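
To see why that's so nasty, consider a toy inventory--purely illustrative, nothing like Falcon's actual PIP code:

  // Toy page inventory. The in-memory copy is always right; the trouble starts
  // when the on-disk copy isn't written often enough and the server dies.
  #include <bitset>

  class PageInventorySketch {
      std::bitset<1024> inUseInMemory;    // current truth, lives in the cache
      std::bitset<1024> inUseOnDisk;      // what recovery sees after a crash

  public:
      int allocatePage()
      {
          for (int pageNumber = 0; pageNumber < 1024; ++pageNumber)
              if (!inUseInMemory.test(pageNumber)) {
                  inUseInMemory.set(pageNumber);  // updated in memory only...
                  return pageNumber;
              }
          return -1;                              // no free pages
      }

      void flush() { inUseOnDisk = inUseInMemory; }  // ...the step that lagged

      // After an abrupt shutdown the stale on-disk copy is all that's left, so
      // a page that was allocated but never flushed looks free -- and recovery
      // happily hands it out again, on top of live data.
      bool looksFreeAfterCrash(int pageNumber) const
      {
          return !inUseOnDisk.test(pageNumber);
      }
  };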

Jim provided his assessment shortly thereafter:
"It doesn't work at all. There has been a huge regression. The simplest of all possible test -- start with a clean database, create a single table, kill the server, and restart -- fails first time, every time."

"This used to work. Always."

"We don't have a recovery bug, we have a non-functional code base."
And so it began.

(Part 2 of the story continues tomorrow, "Why Falcon Doesn't Work Anymore")

The Weekly Falcon Index

Planned Falcon Blog posts: 4
Actual posts: 1

Team emails debating the cost of changing core components: 44
Post-debate emails suggesting a major change to a core component: 1
Days advance notice to change my LDAP password: 9
Nag emails to change my LDAP password: 9
Days of delay before changing my LDAP password: 9
Hours unable to access email after changing my LDAP password: 6
Calls to support after changing my LDAP password: 1
Falcon IRC non-system messages: 2022
IRC champ: Hakan (23%)
Runner-up: Vlad (21%)
Messages in German: 35
References to "Ramadan": 10
Best quote: "I try never to argue against clarity"
Source of quote: Ann
Intrusions by senior staff: 1
Number of times this week I typed a password into IRC: 1
Number of times this year: 3