Wednesday, February 17, 2010

Revival

April 20, 2009 was one hell of a day. In the morning, Sun announced the Oracle acquisition. In the afternoon, my mother was diagnosed with terminal lung cancer. She died three months later.

Within hours, three foundational aspects of my life were rocked to the core: the project I worked on, the company I worked for, and one of the two people directly responsible for my very existence. (However, I did receive a decent tax refund, so the news that week wasn't all bad.)

Now that I'm solidly on the other side of an extraordinarily disruptive year, I can afford to reflect and offer some perspective.

The burning question is, of course: what became of Falcon? The short answer is that the project was shelved and the team disbanded. But let me offer a somewhat more considered answer.

Falcon's very existence was predicated upon the fact that MySQL needed a transactional storage engine alternative to InnoDB. To give you a sense of the project's importance and the urgency back then, on my first day on the job with MySQL, Mårten Mickos sat down with the Falcon team and said, "If you deliver Falcon, I will take MySQL public." That's pretty motivating, and the Falcon team was, if anything, motivated.

Yet, two years later, the Oracle acquisition instantly rendered Falcon strategically unnecessary. Would there be a role for Falcon? It seemed doubtful, but no one in MySQL was certain, and if they were, they weren't talking.

Falcon Reimagined

In response to the Oracle announcement, and quite apart from the pithy "Whither MySQL?" post, I assembled an internal wiki page describing my vision of a post-acquisition Falcon. Entitled Falcon Reimagined: The Falcon Performance Engine, it suggested that by reducing the engine to its purest elements--highly concurrent transactions and in-memory speed--and by optimizing the engine for screaming fast Sun hardware, we could turn Falcon into a performance beast.

Of course, all of this depended upon the vision of our new owners, but I figured that given

1) Falcon would no longer be positioned as a competitor of InnoDB, and

2) Falcon had neither a legacy tail nor an installed base to contend with,

then we might have an opportunity to inhabit a new niche with the Falcon equivalent of a concept car. (I'll post Falcon Reimagined once I determine that it's within the corporate social media guidelines.)

Golden Slumbers

Weeks prior to the Oracle announcement, Sun management challenged the Falcon team to finally stabilize the engine once and for all. We'd been chasing performance and memory targets for months, and although progress on those fronts was excellent, the bug trend was discouraging and the project seemed stalled. Management was restless, irritable and demanding changes.

Kevin, Ann and I scrambled to devise a detailed and realistic plan with which we would finally drive Falcon home, but by then it was too late: Sun announced the acquisition and the Falcon team began to dissipate. By the end of the summer, each member of the Falcon team had either been reassigned or had chosen to leave Sun altogether.

In mid-October, after being granted a three-week stay, Kevin, Ann and I wrapped up the last remaining issues and put Falcon to sleep.

And In The End...


I could write a book reflecting on Falcon and the experience of trying to deliver a high-visibility project in the face of constant deadline pressure. Who knows? Perhaps some day I will. For now, I'll keep it simple:

What we did wrong:

Chase InnoDB at the expense of stability. For example, our performance goals should have remained fixed until the engine was stable. Instead, we reprioritized performance whenever InnoDB improved theirs. Yes, extremely poor performance is a critical flaw, but 70% of InnoDB's performance should've been good enough until GA.

What we did right:


Assemble a top-notch team. By the March '09 team meeting in Athens, the Falcon team comprised a solid mix of senior engineers with a variety of technical skills. We had really come together in terms of collaboration, enthusiasm and technical innovation. We returned from Athens recharged and ready to kick ass. In some respects, the Athens meeting was like our Abbey Road album--the best and the last.

Falcon was fun. Falcon was interesting. Falcon was intense. Falcon was also frustrating and, at times, Falcon was insane. But that's why we do this, isn't it?

So...yeah.

What's next? For now, I am part of the MySQL Search Team, a self-directed, cross-team SIG within MySQL. Our mission:
  • Improve the ease of implementing native or third-party fulltext search with MySQL.
  • Improve the quality of fulltext search results.
  • Improve the performance of fulltext search response.
Fulltext search is a fascinating subject in its own right. It is an aspect of MySQL worthy of the considerable attention it's received over the years, and it is a feature with plenty of room for improvement.
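
To ground that in something concrete, here's a minimal sketch of what native fulltext search looks like today. The table, data and connection details are hypothetical, and I'm assuming the MySQL Connector/Python driver and a MyISAM table (currently the only engine with native FULLTEXT indexes):

    import mysql.connector  # MySQL Connector/Python; any client library would do

    conn = mysql.connector.connect(user="me", password="secret", database="test")
    cur = conn.cursor()

    # A FULLTEXT index is required, which today means MyISAM.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS articles (
            id    INT AUTO_INCREMENT PRIMARY KEY,
            title VARCHAR(200),
            body  TEXT,
            FULLTEXT (title, body)
        ) ENGINE=MyISAM
    """)

    # Natural-language search, ranked by relevance.
    query = "falcon storage engine"
    cur.execute(
        "SELECT id, title, MATCH(title, body) AGAINST (%s) AS score "
        "FROM articles WHERE MATCH(title, body) AGAINST (%s) "
        "ORDER BY score DESC",
        (query, query),
    )
    for row in cur.fetchall():
        print(row)

Ease, quality and performance all show up even in this toy example: the index definition, the relevance ranking, and how the query behaves as the table grows.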

Now that the acquisition dust has settled, the MySQL Search Team will continue to gain momentum, and I will use The Falcon Blog to reflect upon our progress.




Monday, February 09, 2009

Minnesota MySQL Users Group


Last month, the local Sun office was gracious enough to host the local MySQL Meetup. My attendance has been sporadic the last couple of years (57%, apparently), largely due to travel, but this time I committed to giving a talk on MySQL 6.0 and Falcon.

It was a great meeting with excellent turnout--thirty-one, I think, including three from Sun/MySQL.

Benjamin Wood, a systems engineer from Dallas, called in to give a MySQL 5.1 overview, including a nice drilldown on the new features from which I learned a few things myself.

Prior to joining Sun last May, Benjamin had 12 years' experience as an Oracle DBA, which gives him substantial database street cred and the ability to speak with some technical authority to Sun's database customers about MySQL.

Following Benjamin's talk, I gave an overview of MySQL 6.0 with emphasis (naturally) on Falcon and the Falcon architecture. I find that there is genuine interest in MySQL 6.0, especially with regard to scaling on newer platforms. As the drama of the MySQL 5.1 launch recedes and the product matures, 6.0's time will arrive.

Several folks expressed interest in trying out Falcon, and I offered to lend my services, including providing dev builds of the engine, after we get the 6.10 alpha out the door.

I had a chance to chat briefly with Marc Grabanski, a local web developer, who gave a quick demo of a sweet little UI that he developed (link tbd). Marc was later kind enough to forward an introduction to Garrett Woodworth and Nate Abele, the project manager and lead developer, respectively, of CakePHP.

Also in attendance were Erik from FISDAP, Jim from Schawk and, of course, the usual suspects, Chris Barber from CB1 and Charlie from Carol.com.

The real strength of a user group is, well, users, and I look forward to hearing presentations from other members of the group, real-world stories, problems encountered, problems solved, lessons learned, etc.

Sunday, December 07, 2008

Bluster, Blather and B.S.

Many have published their Very Important Opinions on the MySQL 5.1 GA. Now that the issue has been forced, I must take a stand.

Falcon is strictly a MySQL 6.0 project. I did not participate directly in the development of MySQL 5.1 and cannot render an informed technical opinion as to its quality except to say that I absolutely trust the judgment of MySQL management and the ability of my fellow engineers, and I fully stand behind their decision to release MySQL 5.1 GA.

In twenty-two years, I have worked at companies ranging from mom-and-pop software shops to IBM, CSC, Harris, Boston Scientific and everything in between. I have developed operating systems, telecommunication software, database microkernels, medical device firmware and, most importantly, applications for the wholesale distribution of beer.

Every single one of these products shipped with known bugs--serious bugs--and every single one of these products shipped with at least someone strongly questioning the decision to ship. Every single one.

But we did ship, and the operating system ran millions of PCs and the telephone switch connected billions of phone calls and the database stored petabytes of data and the heart device saved thousands of lives and the warehouse delivered millions of gallons of beer.

And the bugs got fixed and then we moved on. We moved on. Never once did I question our ability to do so, and never once did we fail to do so.


So, in my experience, "Disagree and Commit" isn't just a feel-good platitude; it is a proven, practical approach that works. It is the hallmark of a mature organization, and it is the core principle of smart, successful teams that actually deliver.

Yet, despite all of the bluster, blather, and b.s. (mine included), the issue is really quite simple: If you can't be a team player, find another team.

Thursday, October 30, 2008

Play Action Fake

I host a MySQL server for use by the Falcon team, and this requires business class broadband service.

Comcast Business Services support has been consistently stellar. Network or line problems are addressed within the same day, and I always get straight through to a knowledgeable person when I call support.

Recently, the coax connector on the cable modem/router snapped loose. I called Business Services, and within two hours a replacement router sat on my desk. I configured the device and called to have them enable it on their network. I got right through.
"This is Ron. How may I help you?"
Twentysomething, confident in tone. A good sign.
"Hi Ron. This is Chris. I just configured a replacement router and need to have it enabled on your network."
Hard stop.
"That is not a router. It is a mod-em. We call it a cable mod-em."
Like I was five, or something.
"Ok, well, I use it to manage the network, that's why I call it a router."
"That is a cable modem, not a router."
Strictly speaking, he was right: the Comcast Business Gateway (SMC 8014) modulates and demodulates the cable signal and is therefore a cable modem, albeit one with router-y capabilities.

I called it a router anyway. Another question:
"I have two Comcast devices. Next to the router is a Scientific Atlanta DPX2203. It's what we get our TV and phone with. What do you call that?"

"I have no idea what that is."

He should have. It was just a standard cable modem.
He probably doesn't do residential customer support. Let it go.
Ron logged on to the router. He did something and it rebooted.

"Try it now."
No joy. My carefully configured network seized up like an acrophobic mountain goat.
"Ron, the LAN can't see the WAN, local DHCP is busted, the NAT and DMZ configuration pages are disabled on the router and my server can't see the network."

"What's your server's IP?"

He shouldn't have to know, but I gave it to him. The router rebooted.
"Ok, try it now."
Better, but not right.
"The NAT and DMZ options are still disabled, and my local systems are assigned public IPs. Not good."
I explained, patiently and at some length, that I wanted my local network behind the firewall and that my server had to bypass the firewall. I also didn't want my local systems using the block of static IPs assigned to my account. Basic stuff, but Ron simply didn't grasp the concept.

Then he said this, I swear:
"I can't help you with that. Those features are disabled for your account."
Something had gone terribly, terribly wrong in my world. Subarachnoid hemorrhage? Radon?
Shake it off, Powers.
"Now, wait a minute. The router was shipped unlocked, just like my old router. I configured it, just like my old router. When you remotely configured it, those features became disabled. I NEED THEM. Did the firmware revision change or something?"

"Sometimes the firmware changes, but NAT is disabled. You do not need those features."
Clearly, I was being tested. Ron said this with complete conviction, and it threw me.
"Ron, I most certainly DO need those features to configure my network. I had a server in the DMZ and a LAN behind the firewall, and now you're saying that I can no longer do that? Has my account status changed?"

"No, but we don't support those features. I really can't help you with that."
Confusion yielded to irritation.
"Look, this is unacceptable. You replaced my router--sorry, cable modem--but disabled the features that I was using. It's useless now. Please let me speak to your supervisor or someone else who can help."
Ron stumbled, then recovered.
"I can transfer you, but those features are not enabled for your account. We don't support NAT."
My incredulity finally inspired a calm sense of purpose.
I will appeal to Ron's higher mind. He can be reasoned with. He is stubborn because he is afraid. I know he is wrong and he suspects he is wrong but wants to appear confident like his fellow techs who really do know their stuff.
I chose to de-escalate and reason with him instead.
"Ok, let's think this through. Assuming you are right, why is my laptop accessing the Comcast DHCP server? Are you saying I can't even turn on the router's DHCP?"
A pause.
"Let me look."
The router rebooted. Third time.
"Check it now."
All lights were green. Homepages blossomed.
"There! Great! What did you change?"

"I enabled NAT."
I thanked him. He said to call back if I needed anything else.

Saturday, October 25, 2008

The Weekly Falcon Index

Planned Falcon Blog posts: 4
Actual posts: 4
Hours resolving network outage due to physically damaged router: 3
Hours from first service call that a new router arrived via courier: 2
Hours haggling with Comcast tech over router configuration: 1
Hours resolving concurrent but unrelated disk corruption: 2.5
Hours locked out of the building where I was supposed to speak at a MySQL Meetup: 0.5

Falcon IRC non-system messages: 2661
IRC champ: Vlad (26%)
Runner-up: Lars-Erik (19%)
Highest percentage of smiley faces per lines of chat: 19%
Friendliest chatter: Olav
German/Norwegian message ratio: 1:1
References to reiserfs: 13
References to prison: 2
References to football (soccer): 182
Gratuitous references to Ramadan: 1
Best quote: "You obviously do not live in an apartment with a bunch of left-wing girls."
Source of quote: [redacted]

Thursday, October 23, 2008

Some Perspective on Recent Events, Part III

[Part I, Part II]

Perturbations in the Field


In the weeks leading up to the Dev meeting in Riga, the Falcon team dumped a record number of changes into the codebase, including a page cache optimization that contained a severe but undetectable bug.

The usual indicators offered no sign of trouble: the Pushbuild matrix was green, the Falcon regression tests passed, System QA reported nothing unusual. It wasn't until Philip modified the System QA stress tests that a problem emerged.

The modification was simple: kill mysqld at the end of the test, then restart the server and wait for Falcon recovery to complete. This unsophisticated, almost whimsical improvisation should have amounted to little more than a savage afterthought in the test plan; instead, not only did it expose a severe bug, it also revealed a serious gap in our test coverage.
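
For the curious, that step amounts to something like the sketch below. This is my own reconstruction, not the actual System QA harness, and the paths, options and timeouts are placeholders:

    import subprocess
    import time

    def kill_and_recover(datadir="/var/lib/mysql", timeout=600):
        """Simulate a crash, then verify the server comes back after recovery."""
        # Kill mysqld hard (SIGKILL) to simulate a crash, not a clean shutdown.
        subprocess.run(["pkill", "-9", "mysqld"], check=False)
        time.sleep(2)

        # Restart the server. Falcon recovery runs during startup, so a server
        # that answers ping (and a clean error log) is the pass signal.
        server = subprocess.Popen(["mysqld", "--datadir=" + datadir])
        deadline = time.time() + timeout
        while time.time() < deadline:
            ping = subprocess.run(["mysqladmin", "ping"], capture_output=True)
            if ping.returncode == 0:
                return True   # recovery completed
            time.sleep(5)
        server.kill()
        return False          # recovery hung or the server never came up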

What else had we missed? Jim surfaced an old argument:
"The question is how something that broken could escape detection."

"What we really need is an internal application that actually uses Falcon for something we care about. If we were actually using it, we would have noticed the breakage in about a minute."
In fact, MySQL 4.1, 5.0 and 5.1 are used extensively throughout development. We last discussed dogfooding Falcon at the London meeting in May, though nothing came of it.

A Personal Vendetta

The discussion continued.


Philip:
"In an ideal world, we should have had a recovery test in PushBuild or in the Weekly Falcon Test starting sometime around Nixon's resignation."

"I wish this bug was caught in the course of the execution of some determined, comprehensive QA strategy for Falcon. Instead, it was caught because I happened to have a personal vendetta against Falcon recovery that drove me to create those tests."
Manyi:
1) We need better coverage on regression testing. Serious side-effects of a patch from September were not caught by regression testing until Philip introduced more recovery tests. We need to complete the remaining tasks on FalconQAIdeas and improve the regression testing we have today.

2) Falcon patches tend to be large and contain several bug fixes and worklogs, which adds complexity and makes reviews more difficult. One way to solve this is to have smaller patches, with each patch corresponding to a bug report or a worklog.
Ann:
"Neither worklogs nor small patches would have made this any easier to find. You have to understand the code, specifically that there can be several dirty page with the same page number and walk the hash duplicates chain to get all the pages written. I doubt that anybody but Jim would have caught that subtlety."
Hakan had seen the recovery failure, but hadn't realized it:
"Actually it was caught by the weekly DBT2 testing with 100 warehouses, where we load the data and restart the server. There were couple of 100 warehouses runs missing in September, but I ignored it because of Falcon being unstable in general."

"Todo: Add a hook into DBT2 to figure out why a 100 warehouse run was not started."
Vlad:
"I'm sorry to be annoying with my ideas of quality, but I really believe if we had good unit testing (and one for page cache of course) such regression would be caught there. SystemQA test is good to find regressions but it does not help localize the error..."
Jim:
"I'm the heretic about code reviews. An architectural discussion before the design is cast into code is a hundred times more useful than a code review afterward...a bad implementation of a good design is easier to fix than a good implementation of a bad design."
And on it went.
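
Ann's point about hash duplicates is worth picturing. The sketch below is hypothetical (it is not Falcon's page cache), but it shows the trap: several buffers can share a page number, so a flush that stops at the first dirty hit is exactly the kind of omission a reviewer won't catch without knowing that duplicates exist:

    from collections import defaultdict

    class Buffer:
        """Hypothetical buffer descriptor: one in-memory copy of a page."""
        def __init__(self, page_number):
            self.page_number = page_number
            self.dirty = False

    # Buffers hashed by page number; duplicates chain off the same key.
    page_table = defaultdict(list)

    def flush_page(page_number, write_page):
        # Walk the ENTIRE duplicates chain. Flushing only the first dirty
        # buffer leaves stale versions behind and quietly breaks recovery.
        for buf in page_table[page_number]:
            if buf.dirty:
                write_page(buf)
                buf.dirty = False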

El Photo Grande

My take was that this bug did us a favor:

"I think this entire episode was a gift. We identified the cause of a serious, complex problem and consequently identified holes in our process."

"Strictly speaking, our QA system worked, albeit very inefficiently, because the bug didn't make its way into a telco switch or a 911 call center or Lehman Brothers financial system or even 6.0.7 for that matter (ok, maybe that was luck), so let's improve the process and move on."
I continued with my analysis, bordering on bureaucratic blather:
What Happened
There was "perfect storm" of concurrent events:
  1. A complex, multi-part change to a core component (page cache), not fully reviewed
  2. Other core changes piled on top (online alter, memory manager, transaction manager)
  3. Disproportionate reliance on Pushbuild I results
  4. SystemQA stress tests were changed, thus no pass/fail history (was recovery bug new or old?)
  5. DBT2 failures were detected, but not correlated
What To Do
(I promise that these ideas are perfect and that there will never be another problem if we adopt them.)

Stricter Code Reviews
  1. One commit-one bug (mea culpa on this one!)
  2. Stability is KING
  3. Jim is still the reigning Falcon expert. USE HIM. Nag, if necessary.
  4. Changes to core components (page cache, memory manager) require a higher barrier to entry:
  • Clear, reproducible proof of benefit
  • Clear evidence of no regression
  • Assign a reviewer to understand EVERY LINE of the fix
I also think we should suspend all performance changes until we get results from PAE.

Assign Experts
Assign responsibility for subsystem expertise. I've seen this work on other teams ("Who's the DMA guy? Ralph? Ok, great.") Each subsystem (TransactionManager, Cache) has an expert, a go-to person whose sign-off is required for code reviews.

Cross-component expertise is important but we need more depth.

"El Photo Grande" for Falcon QA
We need the equivalent of Homeland Security for Falcon QA, an aggregation of QA intelligence such that we can, perhaps in a dashboard or something, see the overall state of QA: Pushbuilds I&II, SystemQA, DBT2, Performance, etc.

Once we establish a comprehensive QA baseline, subtle-but-catastrophic bugs will manifest as perturbations in the field.
Ann thoughtfully countered many of my suggestions:
"It's not possible to do a visual code review of something as complex as Kelly's flush mechanism."

"The problem is that we have to develop QA in parallel with the code."


"This is all hard stuff, particularly recovery since it's not easy to make it deterministic, and if you do make it deterministic, you miss cases."

"People are better at analyzing and understanding designs than code."
I disagree with Jim and Ann on the value of code reviews. Yes, senior engineers should spend more time reviewing design than source code, but my experience has been that a second or third pair of eyes will always catch stuff--always--and that changes to core components must be reviewed by senior engineers. Less critical changes require less scrutiny.

But code reviews always miss stuff, too. Always.

The Mansfield Anomaly

One case in particular comes to mind, a pacemaker firmware defect so subtle and so pernicious that it was named in honor of the engineer who introduced it.

The "Mansfield Anomaly" evaded in-depth peer reviews, design analysis tests, unit tests, s
ystem verification and validation, pre-clinical testing and even clinical trials. It never manifested as a device failure in the field, fortunately, and the bug was eventually found in-house and a firmware patch released.

I was introduced to the Mansfield Anomaly by an advisory-level engineer during an informal training session. He flashed a rather innocuous 'switch' statement on the screen and challenged the ten or so engineers in the room to find the bug.

No one did.

After entertaining a few guesses, he pointed out a very subtle cut-and-paste error that could result in a device failure, but only during a 4ms hardware timing window. Anyone could (and did) miss it, especially in the context of a more comprehensive review with hundreds of lines of code.
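
I obviously can't reproduce the actual code, but the general shape of such a bug is easy to fake. The sketch below is purely illustrative, rendered in Python rather than the original C-style switch: one branch was copied from another and a single identifier was never updated, so the code parses cleanly and reads plausibly; in the real case, the mistake only mattered inside a narrow timing window:

    def start_timer(chamber, delay_ms):
        # Stub for illustration; pretend this arms a hardware pacing timer.
        print(f"pace {chamber} in {delay_ms} ms")

    def schedule_pace(chamber, atrial_delay_ms, ventricular_delay_ms):
        if chamber == "atrium":
            start_timer("atrium", atrial_delay_ms)
        elif chamber == "ventricle":
            # Copied from the branch above; should be ventricular_delay_ms.
            start_timer("ventricle", atrial_delay_ms)

Ten engineers reading hundreds of lines will skim right past something like that, which was exactly the instructor's point.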


A Savage Pleasure

So, back to the original question: How can the Falcon team better ensure that subtle-but-devastating bugs are caught? Code reviews? Design reviews? Dogfooding? Unit tests? More system tests? The MySQL community?

All of the above, of course, but what can a team do mid-project? What is the most effective change to make now?


Here's what we've done so far:
  1. More eyes on the code--all pushes must be code-reviewed
  2. Smaller pushes into the codebase
  3. Public intra-team email: falcon@lists.mysql.com
  4. Falcon QA is moving ahead with new tests
  5. Run the new tests on previous releases, compare with current or pending release
  6. Focus on stability at the expense of performance--for now
  7. Save before and after file snapshots for recovery debugging
  8. Pushbuild2: a new array of servers dedicated to automated System QA stress tests
Item 8 is my personal favorite. Stress tests are irrational killers, but I find them to be a very satisfying challenge, which brings to mind one of my favorite quotes:
"Problem-solving is hunting; it is savage pleasure and we are born to it."

--Thomas Harris, The Silence of the Lambs