For this interview, we talked to Greg Wester, Senior Member Technical Staff; Craig Jennings, Senior Director, Quality Engineering; and Ritu Ganguly, QE Director, at Salesforce.
Salesforce.com is a cloud-based enterprise software company specializing in software as a service (SaaS). Best known for its Customer Relationship Management (CRM) product, it was ranked number 27 in Fortune’s 100 Best Companies to Work For in 2012.
What is big or complex about your system (users, physical size, data, load, distribution, safety, regulation, security, other)?
This should give you an idea. Salesforce processes 700 million highly complex business transactions per day for nearly 3 million active users, whose raw processing needs are growing at a 50% compounded annual rate. We expect to soon exceed 1 billion transactions a day across 6 global data centers housing 20 computing clusters, which we call “pods”. We have a multitenancy architecture where each customer’s data lives with a group of other customers in one of these pods. Within a pod we have a horizontally scaled application tier on x86 commodity hardware that, among other things, also hosts a distributed cache. We have a homegrown message queueing system that allows asynchronous processing to be scheduled in either the database tier or the application tier.
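As a rough check on “soon”, the sketch below estimates how long 700 million transactions a day would take to reach 1 billion. It assumes that transaction volume grows at the same 50% compounded annual rate quoted for raw processing needs, which the interview does not state. `GrowthEstimate` and `yearsToReach` are illustrative names.

```java
/** Illustrative arithmetic only; the growth assumption is not from the interview. */
final class GrowthEstimate {

    /** Years for `current` to reach `target` at a compounded annual growth rate. */
    static double yearsToReach(double current, double target, double annualGrowthRate) {
        // Solve current * (1 + rate)^t = target for t.
        return Math.log(target / current) / Math.log(1.0 + annualGrowthRate);
    }

    public static void main(String[] args) {
        double years = yearsToReach(700e6, 1e9, 0.50);
        System.out.printf("About %.2f years%n", years);
    }
}
```

Under that assumption the answer comes out to a little under one year.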
Do you remember any remarkable event that changed your mind about how big or complex your system is?
This company has a strong culture of “putting our money where our mouth is”, so it should be no surprise that we use our own product to run our business. It’s also well known that Salesforce employees collaborate, share, and align on our corporate social networking product, Chatter. When Chatter was still under development, our founder, Marc Benioff, encouraged every employee to share their vision and goals document on their Salesforce profile on Chatter. As a result, one of our non-customer-facing pods showed a brief performance decrease while we added physical storage to the file servers. This affirmed that monitoring and management tools are often as important as the product software itself in achieving high uptime. You have to watch what’s happening, you have to respond quickly, and you have to learn from what’s happened. Our early movement towards being an open social enterprise exceeded our estimates. However, we were prepared because using our own product to run our own business is in our DNA.
Do you document testing as you have in the past, or has documentation become leaner even with a big or complex system?
At Salesforce, we think automated test cases describe how a feature works far better and more efficiently than a design document. We’re an Agile shop, so our design documentation isn’t voluminous. However, the aggregate of tests that have passed and failed is a more current, accurate, and detailed description of our product than any written test document. We have made documentation more efficient based on our customers’ needs. We have reviewed traditional test plans and strategies and kept what is needed, but our philosophy is lean: less documentation and more testing. That’s not to say we don’t do test planning. Our Quality Engineers must first think about the feature at a high level, and we built a tool that encourages this. You must understand the customer’s use cases, list out all your assumptions, and plan out a testing strategy. After this, though, we get the engineers right into their coding environment and keep them there. They’ll stub out their test cases and write the intent of each test case and its expected result in Javadoc. When we check those tests in, we have a tool that parses the Javadoc and puts the test case name, description, and expected result into our test case repository automatically. It’s quite intelligent and keeps the engineers productive since they don’t have to switch contexts at all.
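The Javadoc-to-repository step could look something like the sketch below. This is not Salesforce’s tool: the `@expectedResult` tag, the `JavadocTestCaseExtractor` class, and the `TestCase` fields are all assumptions made for illustration, and a simple regular expression stands in for a real Javadoc parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: extracts test case records from Javadoc on test method stubs. */
final class JavadocTestCaseExtractor {

    /** One record destined for a test case repository (field names are assumptions). */
    static final class TestCase {
        final String name;
        final String description;
        final String expectedResult;

        TestCase(String name, String description, String expectedResult) {
            this.name = name;
            this.description = description;
            this.expectedResult = expectedResult;
        }
    }

    // "@expectedResult" is an assumed custom tag, not a standard Javadoc tag.
    private static final String EXPECTED_TAG = "@expectedResult";

    // A Javadoc comment, an optional @Test annotation, then a "public void name(" header.
    private static final Pattern STUB = Pattern.compile(
            "/\\*\\*((?:(?!\\*/).)*)\\*/\\s*(?:@Test\\s+)?public\\s+void\\s+(\\w+)\\s*\\(",
            Pattern.DOTALL);

    static List<TestCase> extract(String source) {
        List<TestCase> cases = new ArrayList<>();
        Matcher m = STUB.matcher(source);
        while (m.find()) {
            // Drop the leading " * " decoration from each comment line.
            String body = m.group(1).replaceAll("(?m)^[ \\t]*\\*[ \\t]?", "");
            int tagAt = body.indexOf(EXPECTED_TAG);
            String description = tagAt < 0 ? body : body.substring(0, tagAt);
            String expected = tagAt < 0 ? "" : body.substring(tagAt + EXPECTED_TAG.length());
            cases.add(new TestCase(m.group(2), squash(description), squash(expected)));
        }
        return cases;
    }

    // Collapses runs of whitespace so a multi-line comment becomes one line.
    private static String squash(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // A stubbed-out test: intent and expected result live in the Javadoc.
        String source =
                "/**\n"
              + " * Verify that a closed case can be reopened by its owner.\n"
              + " * @expectedResult Case status changes to Open.\n"
              + " */\n"
              + "@Test\n"
              + "public void testReopenClosedCase() {\n"
              + "    // body to be written\n"
              + "}\n";
        for (TestCase tc : extract(source)) {
            System.out.println(tc.name + " | " + tc.description + " | " + tc.expectedResult);
        }
    }
}
```

The point of the design is that the engineer only ever writes the stub; everything the repository needs is derived from it at check-in.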
What type of SDLC do you follow? Have you found limitations in SDLC due to the size of the system you support?
Salesforce has its own flavor of Agile called the Adaptive Delivery Methodology, or ADM. It’s pretty much textbook Scrum with a few twists. Product Owners prioritize features from a backlog based on customer interest and business opportunity. Teams of four to a dozen engineers in development, quality, performance, user experience, and documentation meet in a daily stand-up meeting and collaborate to deliver “potentially releasable” features each iteration. Each team chooses its own iteration length: most are on two-week iterations, but some are on week-long sprints. Our customer base requires that we introduce features with ample notice and staging beforehand on sandbox environments. We are very careful about what goes into a patch release, because a fix for one customer can turn out to be a bug for another.
What is the biggest problem you face in delivering your system to users?
Our platform is basically an ecosystem that is built and managed by us, but controlled by the customers. Our sales are increasing at over 38% year over year. As a result of this success, the performance tuning tweaks we’ve verified and deployed to our system today may be suboptimal a year from now, even if we made no major code changes. Scale is key. We know we have some of the best Technical Operation and R&D teams in the world. Their coordinated success is ultimately the foundation of our business model.
How has your testing strategy changed as your system got bigger or more complex?
Definitely. We have had to look at our customers’ complex implementation needs, complex business processes, and customizations, and ensure we represent real customer scenarios in our testing. As we grew at an enormous rate, we learned how support cases that escalate to R&D drag down the velocity of feature work. They mire teams in bug fixing. Too many bug fixes in patch releases also introduce risk to the product. The goal of our testing strategy is to minimize the amount of time spent supporting a feature after it’s released. In other words, our aim must be to find all of the impactful bugs, corner cases, and quirks. We leverage tests written by customers in our Apex Code language to verify their use cases before each major release. Since some bugs are inevitable even with a very thorough process, we then put effective monitoring and management systems in place so that we can react to issues immediately when they arise.
Did you change the experience or job requirements for test engineers as a result of a bigger or more complex system?
Salesforce’s customers expect our system to have minimal downtime each year, and no unscheduled downtime. We are delivering a service at a scale where every test must be automated, and most features have more source lines of code (SLOC) in their tests than in the application itself. Into Quality Engineering we hire only software engineers who can perform both white-box and black-box testing. Not every development engineer has the instinct to be a test engineer, and vice versa. Quality Engineering requires solid programming skills, a laser focus on customer service, a knack for risk management, and an eye for hidden or low-frequency, high-impact bugs. We have also expanded our performance testing team.
Have you changed your reliance on test automation due to size or complexity?
We rely on it more and more. This is the only way we can continue scaling. We think test automation has a maturity model by which you can measure the commitment to quality within an organization:
- Level 1 is unit test coverage, representing a certification by the developer that individual implementations of software classes function in a particular way.
- Level 2 includes functional testing of a module of software classes.
- Level 3 is end-to-end testing of every module in a particular application while it is running on a single host or node.
- Level 4 is testing the application under load for an extended period with all of its supporting subsystems including database, cache, message queues, etc.
- Level 5 has the same parameters as Level 4, with the added requirement that every piece of hardware, every operating system library, and every configuration is as it appears in a production environment with customer data.
When we start seeing diminishing returns from one level, we move to the next.
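The five levels and the rule for moving between them can be written down as a small data structure. This is a sketch only: the enum name, the constant names, and especially the “new defects found per iteration” signal for diminishing returns are assumptions, since the interview does not say how diminishing returns are measured.

```java
/** Sketch of the five-level test automation maturity model described above. */
enum TestMaturityLevel {
    UNIT("Unit tests certifying individual classes"),
    FUNCTIONAL("Functional tests of a module of classes"),
    END_TO_END("End-to-end tests of every module on a single host or node"),
    LOAD("Extended load tests with database, cache, message queues, etc."),
    PRODUCTION_LIKE("Level 4 on production hardware, libraries, configuration, and customer data");

    final String description;

    TestMaturityLevel(String description) {
        this.description = description;
    }

    /** Level number as used in the list, 1 through 5. */
    int level() {
        return ordinal() + 1;
    }

    /** The next level up; the top level returns itself. */
    TestMaturityLevel next() {
        TestMaturityLevel[] all = values();
        return all[Math.min(ordinal() + 1, all.length - 1)];
    }

    /**
     * Moves up one level when returns diminish. The signal used here, new defects
     * found in the last iteration falling below a threshold, is a hypothetical proxy.
     */
    static TestMaturityLevel advance(TestMaturityLevel current,
                                     int defectsFoundLastIteration,
                                     int threshold) {
        return defectsFoundLastIteration < threshold ? current.next() : current;
    }
}
```

For example, `TestMaturityLevel.advance(TestMaturityLevel.UNIT, 1, 3)` moves a team from unit testing to functional testing, while a team still finding plenty of defects at its current level stays put.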
What percentage of your tests are automated?
Over 90%. Our goal is that no team runs manual tests, unless you’re counting exploratory testing (which every team does). That last 10% is amazingly difficult. We’re forced to test manually when the tools for automation are in their infancy, such as on mobile platforms. We’ve made impressive leaps forward in these areas, but there’s still work to do. Talk to us next year. We’ll have solved some of those problems and be closer to 100%.
What do you see in the future for testing big or complex systems?
Mainstream tools like JUnit were designed for unit testing a single class but have evolved to accommodate complex functional testing scenarios. Unit testing on a simple piece of code has a binary outcome: pass or fail. Functional testing on a live distributed system is different, because its components are designed around loose service-level agreements to accommodate graceful degradation and the failure of neighbors. The tools for this are in their infancy and require engineers as creative and talented as the ones who designed the system to write frameworks for testing it. This presents an opportunity for a thought leader to emerge with an industry standard for distributed software testing. We’re proud of our accomplishments in this area, and are aiming for that goal.
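One way to picture the difference from a binary unit-test outcome is a check that grades a run against a service level instead of a single assertion. The sketch below is purely illustrative and does not describe any Salesforce framework: the `ServiceLevelCheck` class, the three-valued `Verdict`, and both thresholds are assumptions.

```java
/** Hypothetical sketch: grading a functional test run against a success-rate target. */
final class ServiceLevelCheck {

    /** A non-binary outcome: the system may be degraded but still within tolerance. */
    enum Verdict { PASS, DEGRADED, FAIL }

    /**
     * @param succeeded   number of requests that met their expectation
     * @param total       number of requests issued during the run
     * @param targetRate  success rate at or above which the run passes (assumed value)
     * @param minimumRate success rate below which the run fails outright (assumed value)
     */
    static Verdict evaluate(int succeeded, int total, double targetRate, double minimumRate) {
        if (total <= 0 || succeeded < 0 || succeeded > total) {
            throw new IllegalArgumentException("invalid sample counts");
        }
        double rate = (double) succeeded / total;
        if (rate >= targetRate) {
            return Verdict.PASS;
        }
        // Between the two thresholds the system is degrading gracefully, not failing.
        return rate >= minimumRate ? Verdict.DEGRADED : Verdict.FAIL;
    }

    public static void main(String[] args) {
        System.out.println(evaluate(995, 1000, 0.99, 0.95));
        System.out.println(evaluate(960, 1000, 0.99, 0.95));
        System.out.println(evaluate(900, 1000, 0.99, 0.95));
    }
}
```

A framework built on a verdict like this has to decide what a `DEGRADED` result means for a release, which is exactly the kind of question a pass/fail unit test never raises.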