Strategy in Principle

The Leadership / Management Blog of Kevin Crenshaw 

Can Rackspace (Please) Deliver "Fanatical Support?"

Kevin Crenshaw
 
We really, really hope Rackspace will soon deliver the "fanatical customer support" they advertise for all their clients, all the time. It's a great slogan, it's a superb goal, and it is achievable.

Unfortunately, Rackspace isn't even close yet. This business case study analyzes why Rackspace falls short and—more importantly—what they must change to deliver the promise.

If you yearn for world-class support in any industry, read on. These principles are universal.
 
The Story

 
My company, Priacta, uses Rackspace Cloud Sites. We really liked the idea of agile, infinite scalability and multiple-site hosting packaged in a single managed account. But we especially liked the idea of fanatical support promised by Rackspace, seeing as that's part of our mission, too. Your suppliers need to share your company values, or they could work against you.
 
Furthermore, to be productive (and we are productivity experts) you can't waste your time pulling weeds. Hire someone else to do that—someone fanatical about weeds.
 
Lately, however, Rackspace seriously let Priacta down in the reliability department over and over ... and over. Small groups of emails started intermittently "bouncing" and then disappeared, accompanied by only occasional, cryptic error messages. Then an entire day's worth of emails disappeared, with Rackspace's knowledge but without a word of warning until we asked. Worst of all, our shopping cart kept going offline through no fault of our own. Mysterious "No suitable nodes" errors kept flashing before our eyes (and our customers' eyes) as prospects left our store with nothing to show but abandoned carts.
 
Not to worry! Mistakes happen. Cloud technology is new, Rackspace is a pioneer, and we signed on for that ride, so we expected a few arrows along the trail. We didn't worry, at first, because "fanatical support" would make it right. If the person on the phone really, really cares, a solution is always within reach.
 
Hacking Fanatical Support
 
Unfortunately, chat sessions became long email chains followed by frustrating phone calls. Support ticket after ticket was opened. Rackspace techs said our email provider was to blame. Our email provider blamed Rackspace. Neither party pressed the matter once they figured that it wasn't their fault. End of discussion. Both parties lost sight of the simple fact that their customer still had a problem.
 
Far from fanatical support.
 
Finally, in desperation, our top developer dusted off his social engineering skills and hacker instincts, poked around the web, and found the email addresses of both CEOs. Then he crafted a beautiful email and cc:ed everyone, "Escalating the Issue to the CEO's." (Anyone else want those email addresses? Sorry, it's a Priacta trade secret for now. We may need them again to get our remaining problems resolved.)
 
We also did a little tweeting and were contacted by a fanatical (but not all-powerful) PR rep at Rackspace. He tried pulling some strings for us too. Finally, with both CEOs and others looking out for us, the ball began to roll. It took more than a week and two dozen more emails, but that issue was finally identified and fixed.
 
Should a customer have to go through all that?

No. But we can learn some things from it.
 
The Plot Thickens

 
More recently, we started down a similar path on our "No suitable nodes are available to serve your request" errors. (As of this writing, Google gives about 43,300 hits when you search that message!) Rackspace explained that this error comes from their load balancers, but Rackspace cannot track down the cause. What?! Their message was clear: in essence, "Our error, not our responsibility. Find some way to handle this yourself. Call us if you see it happen again." Never mind that we had already reported it several times, and by the time their admins checked, they could not reproduce the problem. Our issue was unresolved.
 
Not fanatical.
 
What's the Real Problem Here?

 
You might think we got a bad support operator or two. Maybe, but it's definitely bigger than that since we got poor support repeatedly, on different issues, regardless of the technician. Something was wrong with the Rackspace support process itself.
 
Any company the size of Rackspace needs reliable, repeatable processes to support its mission. You just can't count on always hiring the few exceptional people with initiative and then emphasizing the need for fanatical service. Your basic delivery process has to be right, regardless of who is on the phone, or you will fail.
 
So what, exactly, is currently missing from the Rackspace support model?
 
Clue #1:  Social Media (Accountability)
 
Remember United Breaks Guitars? 6 million views and counting. That catchy YouTube video worked for Dave Carroll and hurt United's reputation big-time. But why?
 
Social media creates a fast channel of accountability: public accountability, driven by the consumer. When you have darkness, social media lets you dispel it by shining a bright light on it. YouTube can be a very big light, and so is Twitter. Just make sure you are very careful with your facts when you do it.
 
The truth is, however, that companies are always accountable to their customers, whether they realize it or not.  Social media just accelerates and magnifies that accountability and makes it stick. That YouTube video will be out there for years.
 
Public accountability isn't ideal for the company, however. It can hurt big time. That's why Rackspace was smart to have a sharp PR rep monitoring tweets and offering help pronto. That's also why United Airlines was downright stupid. When Mr. Carroll popularized their blunder, United should have jumped up and quickly countered with a video of their own: "United LOVES Guitars." I'm picturing reformed, teary-eyed, penitent baggage handlers handing off guitars with gloves, as if they were made of fine glass. Make it funny, and try to turn huge negative publicity into a new, positive reputation. Give Mr. Carroll lifetime free airline tickets, so he can unofficially test the "new" United every time he flies. Help him post a public report card on all his flights. They could turn Mr. Carroll into a very popular, very influential, unofficial company spokesperson. Think Verizon guy, but so much better.
 
That's called "damage control." But how much better is damage prevention? And as good as damage prevention is, there is still something better.
 
Clue #2: Kenny the Printer (Ownership)

 
Years ago, my father told me about an amazing company called Kenny the Printer. When he walked into the store with a project, someone at the counter took the job, asked questions, and then ended with these critical words: "I will personally manage your job from start to finish. Here is my card. Call me if you have any questions or issues."

Any fear that your job won't go as planned? Why not?

Kenny the Printer delivered personal accountability to the customer that created a sense of ownership in the service representative. No need to create YouTube videos or blog articles to get their attention. No need to wonder if they care about your problem. You have a personal rep who's taken charge of your job; the staff member's reputation is tied to the service they give; the company ties its reputation to its staff's reputation; you are in the driver's seat as the consumer.

Accountability and Ownership at Rackspace?
 
When we first called Rackspace with our bounced email problem, who owned it? We did, not the technician. The customer owns the problem with their current system. No wonder support reps feel comfortable passing the work back to us. "Let us know if it happens again." And again, and again.

And who was measuring if the service was excellent? The customer service rep, not the customer. The simple customer satisfaction survey afterwards didn't fix that problem. Customer surveys usually fail to provide proper accountability because they are implemented or used badly. If they only provide semi-anonymous feedback that is reviewed internally against standards that the customer does not see (lack of transparency), then the survey doesn't really exist for the customer, and the rep feels no accountability to the client. No real customer accountability, no real staff ownership.
 
Fixing the System
 
Rackspace support became fanatical when the CEO was brought into the loop. Why? 1) We were given a named contact who felt real ownership of the issue, because 2) that contact knew that they had to satisfy us to satisfy the CEO. The support operator was suddenly accountable to the customer, and his reputation depended on our satisfaction.
 
So, for Rackspace to deliver consistent, fanatical support, they must re-engineer their standard process to work like the one their CEO created. Make it the rule, not the exception, and yes, make it economical at the same time. It can be done.

Apply the lessons of Kenny the Printer and social media. Create personal staff ownership by assigning specific reps to specific problems or clients, and let the consumer be the judge of "fanatical." 
 
Real, personal accountability to the customer. Personal ownership within the organization based on that accountability. Taking ownership of the problem from the customer.

Now that's a fanatical idea!

Kevin Crenshaw is a productivity expert, business consultant, and executive coach. He is also CEO of Priacta, Inc., a time management company that helps you get an extra two hours out of your day—for life.

Comments (29)

Dec 07, 2009
Kevin Crenshaw said...
Just ended another sad support session. My frustration builds each time I try to resolve this ongoing issue. Sadly, that means I'm not interacting as politely as I want to. (Shame on me for that, not them...)

Each operator is new (or acts new, if I've chatted with them before). Each one seems to repeat the same mistakes, checking the wrong (easy) things instead of going straight for the jugular while the error is reproducible, then leaves without owning the problem personally. Lack of history, lack of accountability, lack of ownership, lack of resolution. Not fanatical.

This time, the operator ended the session before verifying that I was satisfied (I wasn't); the chat transcript I requested was not sent (I would have posted it here for review), and no satisfaction survey was sent out either. Can their operators cherry pick their surveys or cancel transcripts? I doubt it, but it bothers me that the question even enters my mind....

The Rackspace support process needs fixing (simple, read above), or their "fanatical support" claim could be challenged. That would be a powerful marketing move--by a Rackspace competitor...

Dec 07, 2009
Kevin Crenshaw said...
Angela,

Thanks for the offer of help. We will certainly take you up on that and report back here. I trust that we'll get great results with your help (see reasons below).

HOWEVER, please hear me when I say that there IS still a problem with your support process; it is not just our special situation. Our most recent pains ("No suitable nodes") are nothing unusual; many others are getting those errors too. We are following the standard Rackspace Cloud support channels, and we are never, ever, not once, getting "fanatical" support that way. To me, "fanatical support" means that the standard channels give me fanatical support, not that I have to be fanatical to get it (escalating it to GM, etc).

Your goal of "fanatical support" is awesome, I laud it, I want to see it happen, I want to tweet about it over and over. That can only happen if your basic support processes are changed to support it. Your human processes must support your goals. The standard support process has to incorporate 1) individual technician ownership of a problem and 2) technician accountability to the customer. You and the other contacts you're now involving will feel that, so this round of special support will almost certainly succeed. Why not ensure that your front-line technicians feel the same thing, all the time?

Kevin

Dec 23, 2009
Sean Daily said...
We are experiencing the same issues having moved our entire network over to RackSpace Cloud Sites recently. "No Suitable Nodes" errors all over the place with our various WordPress blogs and no help whatsoever from support. "Applications are the customer's problem" is what we get. VERY disappointed, and it is affecting our business.
Jan 03, 2010
Ben said...
I began experiencing this error last night on an xmlhttp call that takes a long time to load. After googling the error and finding this post, I decided to avoid rackspace support if possible. From what I was able to piece together, the error in my case was because of no response at all from the page I was calling for quite a long period of time. I added a few response.flush() lines in the ASP code I was calling so the server at rackspace received a response sooner and the error stopped.

Not sure if this helps, but with the limited info out there, this may help someone.
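
For readers who are not on ASP, the same idea (send a few bytes early so the proxy in front of the site sees a response before it gives up) can be sketched in Python. This is only an illustration under assumptions: the function names, the padding chunk, and the two placeholder steps are hypothetical, and nothing here is confirmed to match how Rackspace's load balancers decide that a node is unresponsive.

```python
from typing import Callable, Iterable, Iterator


def stream_with_early_output(steps: Iterable[Callable[[], str]]) -> Iterator[bytes]:
    """Yield a small chunk right away, then one chunk per finished step."""
    # Hypothetical padding chunk: a newline sent before any slow work starts.
    yield b"\n"
    for step in steps:
        result = step()  # the slow part of the request
        yield result.encode("utf-8") + b"\n"


def app(environ, start_response):
    """Minimal WSGI app: returning a generator lets the server send chunks as they are produced."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    slow_steps = [lambda: "part 1", lambda: "part 2"]  # placeholders for real work
    return stream_with_early_output(slow_steps)
```

Whether the early bytes actually reach the client depends on the server and any buffering in between, so treat this as a starting point for experiments, not a guaranteed fix.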

Jan 13, 2010
Chris said...
We have been getting this error quite often too, from roughly the beginning of December 2009 through today (Jan 13, 2010). You can see each time below when it is getting that error or otherwise not responding. These are today's statistics. It will go for a while without a problem, then get the "no suitable nodes" error, then be OK. 34 times today we've gotten it:
00:00 - 06:29 371 Ok 1.129
06:32 - 06:32 1 Failed n/a
06:32 - 10:43 242 Ok 1.305
10:45 - 10:45 1 Failed n/a
10:46 - 11:22 34 Ok 2.383
11:24 - 11:24 1 Ok 4.429
11:27 - 11:27 1 Failed n/a
11:27 - 11:44 17 Ok 3.083
11:46 - 11:46 1 Failed n/a
11:46 - 11:48 3 Ok 2.389
11:51 - 11:51 1 Failed n/a
11:51 - 11:59 8 Ok 3.942
12:02 - 12:02 1 Failed n/a
12:02 - 12:28 24 Ok 3.082
12:31 - 12:31 1 Failed n/a
12:31 - 12:41 10 Ok 4.373
12:43 - 12:43 1 Failed n/a
12:44 - 13:13 27 Ok 4.894
13:16 - 13:26 7 Failed n/a
13:27 - 13:29 2 Ok 10.913
13:30 - 13:30 1 Failed n/a
13:31 - 13:31 1 Ok 10.029
13:33 - 13:33 1 Failed n/a
13:33 - 14:01 24 Ok 6.143
14:02 - 14:02 1 Failed n/a
14:02 - 14:19 14 Ok 5.688
14:20 - 14:20 1 Ok 6.707
14:23 - 14:23 1 Failed n/a
14:23 - 14:32 9 Ok 3.279
14:34 - 14:34 1 Failed n/a
14:34 - 14:53 18 Ok 2.7
14:56 - 14:56 1 Failed n/a
14:57 - 15:01 5 Ok 3.15
15:04 - 15:04 1 Ok 8.859
15:06 - 15:10 3 Failed n/a
15:10 - 15:10 1 Ok 8.623
15:12 - 15:12 1 Failed n/a
15:13 - 15:17 5 Ok 2.913
15:19 - 15:19 1 Failed n/a
15:21 - 15:42 20 Ok 4.46
15:43 - 15:43 1 Ok 2.134
15:46 - 15:46 1 Ok 2.063
15:49 - 15:50 2 Failed n/a
15:50 - 15:58 7 Ok 3.841
16:00 - 16:00 1 Failed n/a
16:01 - 16:16 16 Ok 3.824
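
A log in this shape is easy to tally with a short script. The sketch below is a hedged example, not the monitoring service's own tooling: it assumes the columns are start time, end time, number of checks, status, and average response time in seconds, and it runs on a four-line excerpt copied from the log above.

```python
import re
from collections import Counter

# Excerpt copied from the monitor log above. The column meanings
# (start, end, number of checks, status, average response in seconds)
# are an assumption based on the commenter's description.
LOG_EXCERPT = """\
00:00 - 06:29 371 Ok 1.129
06:32 - 06:32 1 Failed n/a
06:32 - 10:43 242 Ok 1.305
13:16 - 13:26 7 Failed n/a
"""

LINE_RE = re.compile(r"^(\d{2}:\d{2}) - (\d{2}:\d{2}) (\d+) (Ok|Failed) (\S+)$")


def summarize(log_text):
    """Return (checks per status, periods per status) for the log text."""
    checks = Counter()
    periods = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a log row
        count, status = int(match.group(3)), match.group(4)
        checks[status] += count
        periods[status] += 1
    return checks, periods


checks, periods = summarize(LOG_EXCERPT)
print(dict(checks))   # {'Ok': 613, 'Failed': 8}
print(dict(periods))  # {'Ok': 2, 'Failed': 2}
```

Pasting the full day's log into `LOG_EXCERPT` gives the same two tallies for the whole day.
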
Jan 13, 2010
Chris said...
P.S. As you can see, the number after the "Ok/Failed" column is the average response time in seconds. I have two system monitors on it from two different services in two different locations and both are seeing terrible response times.
Jan 13, 2010
Jackson said...
Angela -

I have been a customer since MOSSO was in beta. And I can tell you from experience and spending time in your forums that this experience is becoming the norm for many many of your customers.

You must scream at the top of your lungs, and if you make enough fuss and noise you will get the support you're expecting. Otherwise you will chase your tail listening to canned responses.

Your no suitable nodes problem has become laughable at this point. I'm going on 6.5 hours today of up down up down availability. Sigh. Every time I think things are going smoothly - you guys run out of nodes and angry clients start calling.

Jan 13, 2010
Kevin Crenshaw liked this post.
Jan 13, 2010
Sean Daily said...

Rackspace still needs to deal with this issue, but one thing that we’re finding is helping somewhat is disabling plugins that might take longer than 30 seconds to get a response, which is one of the things that apparently triggers the error. For example, disabling WordPress stats and Gravatar plugins for us on one site alleviated some of the issue. Still, we’ve never seen anything like this with the same sites and plugins on ANY other host, including low-end $6/month shared hosting plans.

Jan 14, 2010
Chris said...
I too haven't seen anything like it, from a dedicated server at ServerBeach to a super-cheap shared host at DreamHost (for other reasons, DreamHost is not good, but not that).

The downtime they've had recently hasn't helped either, nor has the repeated "degraded" status where everything operates super slowly.

Anyway, getting errors on a plan that costs 20 times as much as DreamHost or 1/2 as much as a pretty good server at ServerBeach is not great support.

Jan 14, 2010
rahulgupta said...
We've been having the same surprising problem since moving Popdose.com over to Rackspace sites...

Same troubling support loop with chat and tickets, finally getting some sense of fanatical customer care when I got the ear of an individual who finally seems to be taking the problem seriously and giving us some real help.

At least it feels like real help since it's _different_ help. There's still a lot of throwing ideas over the fence: try this and let us know... but I get the sense that there are some real heads being put together to narrow down the problem for us.

I've been a customer of their dedicated hosting for a number of years and have *no* complaints about the fanatical support there. I just don't get the sense that the culture of accountability has trickled down into Mosso yet.

I will report back here to let you folks know how this support journey fares, if nodes will ever be found, and if we end up switching hosts.

Jan 14, 2010
Chris said...
It is funny, I got a call from one of the account managers just minutes ago and he said they were aware of the problem and were working on it. He said that "This issue of downtime is extremely pivotal and our solution to this matter needs to happen sooner than later." Of course I'd expect them to say that, but with Rackspace's rep on the line, not just Mosso's, I think they'll deliver. It is just a question of the time frame for them to do so. I hope sooner.

Of course right now, load times are > 10 seconds for the front page, kind of ironic. :-)

Jan 14, 2010
Jackson said...
Well, if there's one benefit to this rollercoaster of uptime and downtime - I finally set up message filters for Pingdom to clean up my inbox.

Last day's stats from Pingdom

php5-2.dfw1-1

01/13/2010 - 01/14/2010

Uptime
96.63%

Downtime
1h 17m 21s

The average downtime length is 1m 27s

Number of downtimes 53

The longest downtime was 6m on 01/13/2010 12:38:15PM and the shortest was 33s on 01/13/2010 4:52:51PM
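
As a quick sanity check, the reported average is consistent with the other two figures in the report: total downtime divided by the number of downtimes. The helper name below is just for illustration.

```python
def to_seconds(hours, minutes, seconds):
    """Convert an h/m/s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


total_downtime = to_seconds(1, 17, 21)  # "1h 17m 21s" from the report
outages = 53                            # "Number of downtimes 53"

average = total_downtime / outages      # seconds per outage
minutes, seconds = divmod(int(average), 60)
print(f"{minutes}m {seconds}s")         # 1m 27s
```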

I think I prefer my downtime in big chunks. This up down up down is utter BS. I want consistency and reliability - and that's missing on many fronts here.

It's offensive when the "answer" is - it's your code and not our problem. While that's true to some extent - why does the same code run fine one day and not the next? The performance of the system is not consistent - which leads one to wonder if this really is capable of scaling under load as advertised. If we're running out of nodes it seems not.

Jan 14, 2010
Chris said...
They seem to be addressing it. I certainly hope so because it has a ton of potential. From status.rackspacecloud.com earlier this afternoon:

JANUARY 14, 2010
Ongoing Issues with Cloud Sites in DFW

Cloud Sites customers in our DFW facility continue to experience intermittent issues during certain peak times that started yesterday. We apologize for these problems and the impact to your business and want to assure you we are working tirelessly to ensure these issues are resolved permanently.

We want to give you some more detail on the causes of these issues and the plan we are pursuing.

Starting yesterday we began experiencing very high loads on our storage devices for cluster WC1 in DFW. In order to reduce load we have shut down processes like CRON to ensure core site content continue to load. While load spikes are common in our cloud infrastructure, we have not been able to fully identify the root cause of these unusual issues.

We are working with engineers from inside and outside the company with the best expertise on these issues to resolve them and develop a plan of action to ensure we do not repeat this state. We have a series of changes that are being implemented in real time. We are being careful to minimize issues as we proceed.

As we have news we will share it with you on this post. Please note: Due to this issue, you may experience longer than normal wait times for live chat and phone calls. Since all our support reps will have exactly the information detailed in this post, checking here first will be the fastest way to stay updated.

UPDATE: We have been seeing improved performance on our Cloud Sites WC1 storage cluster for the last few hours. Assuming stability continues we will resume CRON operations this evening. At this time, we cannot declare victory on this issue, but we have many plans in place to continue to increase headroom and ensure stability under all conditions.

Jan 14, 2010
Chris said...
p.s. I agree, I prefer it in one lump (or no lumps).

I think the real test will be tomorrow - I've seen fewer of the issues in the evenings than during the days this week, so hopefully they are fixing the issues.

My list for them:
Quadruple the memory in all the servers
Make sure they are running gigabit Ethernet
Quadruple (or more) the cores: make all the machines dual quad-core (if they are single-core or dual-core now) with the highest clock speed available
Increase the number of file servers and increase the performance of each

Then look at whatever the bottlenecks are now if they are in the software and remedy them.

:-)

Jan 15, 2010
Chris said...
I hope they are still working on it because last night as compared to the night before still saw lots of spikes of > 6000 ms for page load times. Things are definitely not fixed as of yet.
Jan 15, 2010
Kevin Crenshaw said...
Thanks for continuing to comment and bring these problems to light, everyone.

Good news: there *is* light at the end of the tunnel. If you want information on how to coax better, "fanatical" support from Rackspace, see our latest update here: http://j.mp/8bBsRz

Mar 11, 2010
Roman said...
Hi,

Does anyone know of, or has anyone solved, the issue with email bouncing? It is happening a lot. I have many customers, and email has to be running at all times.

Mar 11, 2010
Kevin Crenshaw said...
Roman,

We have not seen the Rackspace email bouncing problem recently.

This is something you should take up with Rackspace support. If you aren't getting fast progress, escalate to Rackspace management using the instructions here: http://j.mp/8bBsRz. They will listen and work with you on it until it is resolved.

Quick update on Priacta "No Suitable Nodes" errors: We worked with Rackspace management per link above. It was a very difficult technical problem. They threw an amazing array of resources at it, however, including people from the data storage vendor living on site. Through it all, we got regular updates and could talk directly with techs and management to resolve it.

At one point they offered to move us to a Cloud Server solution (instead of Cloud Sites). That involves moving departments, so it shows different departments in Rackspace communicate and cooperate well for the customer. We declined to move for tech reasons only.

Finally, it looks like they have it under control. Failures on our site are now unusual. However, Rackspace says they still aren't satisfied, and they are continuing to monitor our site, using it as a gauge for further improving response times. I'm hopeful that for us and everyone else, those NSN errors are pretty much gone.

Jul 05, 2010
franz maruna said...
Very interesting as we have heard similar complaints about our software running on Rackspace Cloud.. ... sometimes... ;)

As we also host our own sites, and have plenty of experience being both a development and infrastructure partner, I can tell you the only way we can even shoot for this type of top tier support is by owning the whole process.

Angela from Rackspace says in a comment here: "Unfortunately, your case was out of the ordinary and involved multiple groups and was somewhat complicated." Wait, what??

How is it out of the ordinary for multiple groups to be involved when hosting web applications? It strikes me as far more out of the ordinary for them not to be. This answer is complete nonsense. Does Rackspace provide any development services? No. So how can there be ANY scenario when there WOULDN'T be multiple groups involved?

It's easy to find someone to blame in the complicated mess of web apps. You've got the hosting provider, the middleware framework (hey maybe it's PHP's fault?), the web developer (he's probably off snowboarding now and certainly doesn't have Rackspace's clout, so blaming him is easy, no?), the client (they're always wrong), the client site's users (they're the dumbest!) on and on and on.. To hear that "gee this is complicated because it involved a lot of people" is akin to hearing from a mechanic that your car doesn't start because "There's SOOOO many parts!"

I'm afraid I have a more dismal view of how this goes together:

1) The author is correct: support is always better with dedicated account managers. Actually that is something you can get from Rackspace if you choose to get a dedicated server and pay for the service. The person you're talking to might not be a brilliant engineer, and they might change every 6 months, but they will at least give you a single point of contact for what it's worth.

2) That single point of contact will never be able to afford to "own the problem." I know it sucks to hear that because whatever service you're paying for out of Rackspace chances are you're paying 4-10x what you'd pay somewhere else. Unfortunately the challenge of debugging a web app from soup to nuts is no joke and they would easily lose weeks of digging around to get some of these issues resolved, if it's even possible. This may not resonate with folks who don't have development experience, but I speak the truth. Design me a business model where people pay you a fixed price to maintain a car that they built from scratch, themselves, using stuff they found from all over, and the neighbor kid to put it together. Show me what monthly fee it takes to make THAT profitable and scale well. The only way to put Rackspace in a position where they "own the problem" as the author suggests would be to pay them hourly to fix it, and even then you're not getting a promise that a solution will be found, just an awareness of the work that went into trying. I don't see any fixed monthly hosting cost that can cover the unknowns of debugging so many different systems.

3) There's a famous, if somewhat dated, line in IT: "no one ever got in trouble for buying IBM." This is the role that Rackspace plays on the web. Sure there's some marketese about being fanatical, but the reality is unless you're willing to hire your own 24-hour sys admin staff and manage your own servers, there's no one with a stronger enterprise brand than Rackspace when it comes to hosting. Case in point, back in 2007 we were doing a lot of application development for funded startups and suggested one of them put an array of servers powering their site on Rackspace. One evening as they pitched VC, all the servers went down. Not good. A quick bit of Google research and it turns out we're in the midst of an operator controlled shut down because a truck has backed into Rackspace's power supply. Yup. An actual semi-truck backed into something it shouldn't have and Rackspace realized their backup generator was going to die because of the A/C load and decided to shut everything down. Kudos to them for shutting things down before the room melted, but not so great when you're pitching VC's about your awesome stable web app that you're paying 4x market rates to host. Here's the kicker though. It's Rackspace. Okay so that's an embarrassing mess, but all you have to say is "well where should we be hosting this Mr. VC?" and there's no answer beyond "in your own infrastructure like Google or Amazon." Meanwhile everyone in the room starts keying away at their blackberries to see how their other projects are faring.

So this is a very long way of saying that "fanatical support" is trash. What Rackspace is, and what makes me think that they'll do just fine regardless of these issues, is they're the biggest. They certainly aren't perfect, but there's really nowhere else you can go that will do any better for a fixed monthly fee. When our client's site went down, everyone was pissed. But then again, half the internet went down including some big chunks of very famous sites... So yeah, you might be getting screwed, but you're getting screwed in very good company.

-Franz Maruna
CEO
http://concrete5.org

PS: The ability of cloud computing to deliver massive storage, or massive processing power to repetitive tasks is pretty impressive. Hosting regular ol' websites in the cloud is not as cool as the marketing dollars want you to think. Unix based servers work great and when something goes wrong you can actually see where the load is being distributed and do stuff about it. Relying on someone's mysterious proprietary resource sharing seems like a huge leap of faith, particularly when it is called a "cloud." Don't clouds kinda puff around at their own pace in unpredictable ways? Surely the weather isn't the first concept I'd throw on a whiteboard when coming up with images to express reliability, is it? ;)

Jul 21, 2010
Amy said...
I wish I had come across this post before I signed on with Rackspace.
My experience is that I spent an entire month going back and forth with Rackspace tech support trying to fix an NSN error. I got any number of techs who did not read the ticket and simply pasted or typed (badly) the same suggestions as their predecessors without reviewing the troubleshooting steps I had already completed. When I finally made enough noise to get the attention of a tech from Linux System Operations, it still took them nearly another month to actually fix the bug with their infrastructure. Meanwhile no offer was made to help me work around the bug, no compensation was offered, no real owning-up apology offered, nothing but canned, insincere "we apologize for any inconvenience" - like not being able to use a service I'm paying for? - and "Thanks for your patience" - because we don't care enough to fix this right away, so you'll just have to wait?
I called tech support to let them know it was unacceptable, I wrote "This is unacceptable service" in my tickets and they were left to close with no further comment from Rackspace.
I think cloud hosting has great potential, but if Rackspace doesn't shape up soon, well... I'm already shopping for a new host.
Jul 28, 2010
Kevin Crenshaw said...
Franz, a single point of contact CAN own the customer's problem. They just need internal support to empower the reps. The rep doesn't know everything, they only know some things, but they should have access to others and be the "point person" whose head is on the chopping block based on how happy the client is. If they really want fanatical support, I think it's essential. THIS IS NOT HARD TO IMPLEMENT, but yagottawanna.

BTW: Rackspace is moving forward in a major way (on the tech side) to prevent issues like these. See this game-changing announcement: http://j.mp/cnesww

On the positive side: our experience with Rackspace continues to be good and error-free since they resolved our NSN errors. They had a nasty hardware problem that took many minds to pin down. (They still don't have support rep ownership and client accountability in place, however; read the article above.)

Apr 05, 2011
Brent said...
Mr. Crenshaw, Thank you for taking the time to document your experience and thoughts here.

Sadly, I can report that the problems created by lack of ownership pointed out by Mr. Crenshaw still exist as of April 2011. I recently moved to Rackspace and experienced the typical "canned support responses loop" echoed by many comments here. This was in relation to three separate issues and not the NSN error.

I am about to attempt escalating the problem myself as recommended by Mr. Crenshaw here: http://j.mp/8bBsRz . We have postponed moving sites to Rackspace and are on the fence whether to continue or just go with a budget host. High support expectations were a primary reason we decided to pay the higher cost associated with Rackspace Cloud.

Interesting note: Due to a recurring issue with our previous host, I put Rackspace pre-sales through the "wringer" trying to make sure we would not have the same problem. As a potential customer, we were assigned a dedicated contact who went above and beyond, calling in higher-level support staff to answer our questions. It is a shame this practice doesn't continue for existing customers. Even if it only happened with issues that remained unresolved for 24 hours or more, it would be a huge step in the right direction.

Apr 05, 2011
Kevin Crenshaw said...
Brent, very helpful! We are deciding *right now* whether to deploy a major new site to Rackspace or Amazon or someone else. Our new site needs to be absolutely bullet-proof and will rely on managed hosting (many dedicated servers, rapidly scalable) instead of cloud hosting (shared servers, which we use now). That's a different department at Rackspace, but still...

As a CEO I preach a "three strikes" rule: if a supplier messes up three times in any consistent pattern, expect it to continue. Get a new supplier, pronto. Rackspace started giving us the NSNs and slowdowns again just last week. This feels like a systemic problem, not an isolated issue, and it's strike 5 or 10....

Anyone have experience with large clusters on Amazon?

Apr 05, 2011
Kevin Crenshaw said...
This information just in from our Chief Architect:

I finally have an answer explaining the NSNs. I also know why moving clusters helped, and why the problem returned.

Basically, they have a common “fallback” that they use when administering these servers to deal with any unknown issues they have. If they detect that a MySQL server is under heavy stress, but they can’t determine why, they’ll kill all currently active SELECT queries on that server. This results in everybody on that server getting “MySQL has gone away” errors. Most clients never notice, but because we have aggressive error logging in place, we’ve been able to detect it.

Sometimes they also have to reboot the server, resulting in a brief period of no connectivity following such an error.

This explains why every app using the same server would go down…the queries had been clobbered cluster-wide. Moving to a new cluster helps because we get away from whichever other client was actually creating the problem on the cluster in question. Ironically, this also explains why the query load always looked low immediately following such a problem: they had just killed all the open SELECT queries to that cluster.

John Crenshaw
Priacta, Inc.
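
A minimal sketch of the kind of aggressive error logging described above, not Priacta's actual code. It assumes the database driver raises an exception carrying the MySQL client error code; `DbError` and `run_logged` are hypothetical names used only for illustration. Codes 2006 ("MySQL server has gone away") and 2013 ("Lost connection to MySQL server during query") are the standard client codes for a dropped connection.

```python
import logging
import time

# MySQL client error codes that indicate a dropped connection.
CR_SERVER_GONE_ERROR = 2006  # "MySQL server has gone away"
CR_SERVER_LOST = 2013        # "Lost connection to MySQL server during query"
DROPPED = {CR_SERVER_GONE_ERROR, CR_SERVER_LOST}

log = logging.getLogger("db_watch")


class DbError(Exception):
    """Hypothetical stand-in for a driver exception that carries an error code."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


def run_logged(query_fn, retries=2, delay=0.0):
    """Run query_fn, logging each dropped-connection error and retrying."""
    for attempt in range(retries + 1):
        try:
            return query_fn()
        except DbError as exc:
            if exc.code not in DROPPED:
                raise  # unrelated errors are not retried
            log.warning("attempt %d: MySQL error %d: %s",
                        attempt + 1, exc.code, exc)
            if attempt == retries:
                raise  # out of retries, surface the failure
            time.sleep(delay)
```

Logging every occurrence, even when a retry succeeds, is what would make a server-side kill of active queries visible to the client at all.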

Apr 05, 2011
Kevin Crenshaw said...
So, it looks to me like MySQL is a big part of the problem. Does MySQL just have inadequate reporting/logging to let admins track down query offenders on a shared db cluster? (Our new site does not rely on any flavor of SQL. Not scalable enough for our needs.)
Apr 08, 2011
Chris said...
I posted in January 2010, and we had been at Rackspace Cloud Sites (RSCS), aka Mosso, for a while before that. My experience is that some sites work great at RSCS, while others do not; they are just too slow.

The two that do not in my experience are:
1. SSL sites, because they all go through a single load balancer to a single server rather than a cluster. This is what one of the more senior tech support people finally told me. We had one site there and switched it to a non-SSL server, and performance improved by about 3-5 times. It took 6 months for someone at RSCS to suggest that this could be the problem, though, which was extremely frustrating. Again, someone "owning" a problem is still lacking in the "fanatical support," regardless of assurances to the contrary. Having SSL work with similar performance characteristics is important for most sites, so this is one area where Rackspace Cloud is lacking.

2. Sites using MySQL seem to have worse performance in some cases if tables are any reasonable size. PHPBB and WordPress seem to work nicely under lower loads, but one site that makes use of MySQL in only a limited fashion is usually extremely slow compared to the same site hosted at Amazon ECC. (The funny thing was that I had it at Rackspace and they were monitoring it after several weeks of performance issues. It was running too slowly even with the improvements they were making, so I switched it to ECC, and within about 48 hours I had a message from RACKSPACE tech support that they were seeing "great performance" with the site and asking whether they could close the ticket. I said they might want to check the IP they were testing, because the improved performance they saw began when I moved it to ECC. It was NOT "FANATICAL SUPPORT.")

Since I still have many other sites at Rackspace, I keep a backup of this site there. When I tested its performance about 2 weeks ago it was still super slow, and it does not use MySQL excessively. The SQL servers are just overloaded in my experience.

Apr 08, 2011
Chris said...
p.s. I am hoping that as they upgrade infrastructure, it will *eventually* be able to support this one site (it isn't PHPBB or WordPress). I prefer the Rackspace Cloud system because for all but this one site I don't have to be root and fix things if there is a problem in the middle of the night or while on vacation. In many ways Rackspace Cloud could be great, but the database servers need upgrades, or maybe the network needs upgrades to support faster communication between the DB servers and the web servers.

Perhaps they will upgrade and use the Intel Thunderbolt technology to help improve performance and upgrade in other ways too. Abstracting the server away and so outsourcing the management to them is great for my needs.

The performance just is not there yet.

Apr 08, 2011
Kevin Crenshaw said...
Chris, Thanks for the precise intel on what is happening over there and how Amazon ECC stacks up in comparison. Very helpful.
