Non-Functional Requirements and the Cloud

As I have discussed before, the term non-functional requirements really is a complete misnomer. Who would, after all, create a system based on requirements that were “not functional”? Non-functional requirements refer to the qualities that a system should have and the constraints under which it must operate. They are sometimes referred to as the “ilities”, because many end in “ility”: availability, reliability, maintainability and so on.

Non-functional requirements will, of course, have an impact on the functionality of the system. For example, a system may quite easily address all of the functional requirements specified for it, but if that system is not available at certain times of the day then it is quite useless, even though it may be functionally ‘complete’.

Non-functional requirements are not abstract things to be written down when considering the design of a system and then ignored; they must be engineered into the system’s design just as functional requirements are. Non-functional requirements can have a bigger impact on a system’s design than functional requirements, and getting them wrong can lead to more costly rework. Missing a functional requirement usually means adding it later or doing some rework. Getting a non-functional requirement wrong can lead to very costly rework, or even cancelled projects, with the knock-on effect that has on reputation.

From an architect’s point of view, defining how a system will address non-functional requirements mainly (though not exclusively) boils down to how the compute platforms (whether processors, storage or networking) are specified and configured to satisfy the qualities and constraints required of them. As more and more workloads move to the cloud, how much control do we as architects have in specifying the non-functional requirements for our systems, and which non-functionals should concern us most?

As ever, the answer to this question is “it depends”. Every situation is different, and in each case some things will matter more than others. If you are a bank or a government department holding sensitive customer data, the security of your provider’s cloud may be uppermost in your mind. If, on the other hand, you are an online retailer who wants customers to be able to shop at any time of day, then availability may be most important. If you are seeking a cloud platform on which to develop new services and products, then maybe the ease of use of the development tools is key. The real question, therefore, is not so much which non-functional requirements are important, but which ones should be considered in the context of a cloud platform.

Below are some of the key NFRs I would normally expect to be taken into consideration when looking at moving workloads to the cloud. These apply whether the clouds are public, private or a mix of the two, and to any of the layers of the cloud stack (i.e. Infrastructure, Platform or Software as a Service), but they will have an impact on different users. For example, the availability (or lack of it) of a SaaS service is likely to have more of an impact on the business user than on developers or IT operations, whereas the availability of the infrastructure will affect all users.

  • Availability – What percentage of time does the cloud vendor guarantee cloud services will be available (including scheduled maintenance down-times)? Bear in mind that although 99% availability may sound good, it actually equates to over three and a half days of potential downtime a year, and even 99.9% could mean nearly nine hours. Also consider, as part of this, the disaster recovery aspects of availability: if more than one physical data centre is used, where do they reside? The latter is especially important where data residency is an issue, i.e. where your data needs to reside on-shore for legal or regulatory reasons.
  • Elasticity (Scalability) – How easy is it to bring on line or take down compute resources (CPU, memory, network) as workload increases or decreases?
  • Interoperability – If using services from multiple cloud providers, how easy is it to move workloads between them? And what if you want to migrate from one cloud provider to another? (Hint: open standards help in both cases.)
  • Security – What security levels and standards are in place? For public or private clouds not in your own data centre, also consider the physical security of the cloud provider’s data centres as well as its networks. Data residency again needs to be considered as part of this.
  • Adaptability – How easy is it to extend, add to or grow services as business needs change? For example, if I want to change my business processes or connect to new back-end or external APIs, how easy would that be?
  • Performance – How well suited is my cloud infrastructure to supporting the workloads that will be deployed onto it, particularly as workloads grow?
  • Usability – This will differ depending on who the client is (business users, developers/architects or IT operations). In all cases, however, you need to consider the ease of use of the software and how well designed its interfaces are. IT is no longer hidden inside your own company; your systems of engagement are out there for all the world to see, and effective design of those systems is more important than ever before.
  • Maintainability – More from an IT operations and developer point of view: how easy is it to manage (and develop on) the cloud services?
  • Integration – In a world of hybrid cloud where some workloads and data need to remain in your own data centre (usually systems of record) whilst others need to be deployed in public or private clouds (usually systems of engagement) how those two clouds integrate is crucial.
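To make the availability point above concrete, here is a quick sketch (in Python, with purely illustrative SLA figures rather than any particular vendor’s) that converts an availability percentage into potential downtime per year:

```python
def annual_downtime_hours(availability_pct: float) -> float:
    """Convert an availability percentage into potential downtime per year."""
    hours_per_year = 365 * 24  # 8,760 hours in a non-leap year
    return hours_per_year * (1 - availability_pct / 100)

# Illustrative SLA levels, not any specific provider's guarantees.
for sla in (99.0, 99.9, 99.99, 99.999):
    hours = annual_downtime_hours(sla)
    if hours >= 24:
        print(f"{sla}% availability -> {hours / 24:.2f} days downtime/year")
    elif hours >= 1:
        print(f"{sla}% availability -> {hours:.2f} hours downtime/year")
    else:
        print(f"{sla}% availability -> {hours * 60:.0f} minutes downtime/year")
```

Running the numbers shows why each extra “nine” matters: 99% is about 3.65 days a year, 99.9% nearly nine hours, and 99.99% under an hour.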

I mentioned at the beginning of this post that non-functional requirements should be considered in terms of the qualities you want from your IT system as well as the constraints you will be operating under. The decision to move to cloud in many ways adds a constraint to what you are doing. You don’t have complete free rein to do whatever you want if you choose an off-premise cloud operated by a vendor; you have to align with the service levels they provide. An added bonus (or complication, depending on how you look at it) is that you can choose from different service levels to match what you want, and change these as and when your requirements change. Probably one of the most important decisions you need to make when choosing a cloud provider is whether they have the ability to expand with you and don’t lock you into their cloud architecture too much. This is a topic I’ll be looking at in a future post.

Consideration of non-functional requirements does not go away in the world of cloud. Cloud providers have very different capabilities, some of which will be more relevant to you than others. These, coupled with the fact that you also need to architect for both on-premise and off-premise clouds, actually make some of the architecture decisions more, not less, difficult. It seems the advent of cloud computing is not about to make us architects redundant just yet.

For a more detailed discussion of non-functional requirements and cloud computing see this article on IBM’s developerWorks site.


It’s the NFRs, Stupid

An almost apocryphal-sounding (to me at least) tale from Forbes provides a timely reminder that, even in this enlightened age of clouds that give you infrastructure (and more) in minutes, and analytical tools that business folk can use to quickly slice and dice data in all manner of ways, fundamentals like NFRs don’t (or shouldn’t) go out of fashion.

According to Forbes, the US retailer Target figured out that a teenager was pregnant before her parents did. Target analysed the buying behaviour of its customers and identified 25 products (e.g. cocoa-butter lotion, a purse large enough to double as a diaper bag, and zinc and magnesium supplements) that allowed them to assign each shopper a “pregnancy prediction” score. The retailer also reckoned they could estimate a shopper’s due date to within a small window, and so could send coupons timed to very specific stages of a pregnancy. In the case of this particular shopper, Target sent a letter containing coupons to a high-school pupil, whose father opened it and was aghast that the retailer should send coupons for baby clothes and cribs to a teenager. The disgruntled father visited his local Target store, accusing them of encouraging his daughter to get pregnant. The manager of the store apologised, and called the father again a few days later to repeat his apology. This time, however, the father was somewhat abashed and said he had spoken to his daughter, only to find out she was in fact pregnant and due in August. This time he apologised to the manager.

So, what’s the lesson here for architects? Here’s my zen take:

  1. Don’t assume that simply because technology seems to be more magical and advanced you can ignore fundamentals, in this case a person’s basic entitlement to privacy.
  2. With cloud and advanced analytics IT is (apparently) passing control back to the business which it has done in a cyclical fashion over the last 50 – 60 years (i.e. mainframe -> mini -> PC -> client-server -> browser -> cloud). Whoever “owns” the gateway to the system should not forget they should have the interests of the end user at heart. Ignore their wants and needs at your peril!
  3. Legislation, and the layman’s understanding of what technology can do, will always lag advances in technology itself. Part of an architect’s role is to explain not only the benefits of a new technology but also the potential downside to anyone who may be impacted by that technology. In the connected world we now live in, that can be a very large audience indeed.

Part of being an architect is to talk to everyone to explain not only your craft but also your work. Use every opportunity to do this and reject no one who might want to understand a technology. As Philippe Kruchten says in his brilliant interpretation of Lao-Tsu’s Tao Te Ching for the use of software architects:

The architect is available to everyone and rejects no one.
She is ready to use all situations and does not waste anything.
This is called embodying the light.

Make sure you repeatedly “embody the light”.

Oops There Goes Our Reputation

I’m guessing that up to two weeks ago most people, like me, had never heard of a company called Epsilon. Now, unfortunately for them, too many people know of them for all the wrong reasons. If you have signed up to any services from household names such as Marks and Spencer, Hilton, Marriott or McKinsey, you will probably have had several emails in the last two weeks advising you of a security breach which led to “unauthorized entry into Epsilon’s email system”. Because Epsilon is a marketing vendor that manages customer email lists for these and other well-known household brands, chances are your email address has been obtained through this unauthorised entry as well. Now, it just might be pure coincidence, but in the last two weeks I have also received emails from the Chinese government inviting me to a conference on some topic I’ve never heard of, from Kofi Annan, ex-Secretary-General of the United Nations, and from a lady in Nigeria asking for my bank account details so she can deposit $18.4M into the account and leave the country!

According to the information on Epsilon’s web site, the information obtained was limited to email addresses and/or customer names only. So, should we be worried by this, and what are the implications for the architecture of such systems?

I think we should be worried for at least three reasons:

  1. Whilst the increased spam that seems inevitable following an incident such as this is mildly annoying, a deeper concern is how the criminal elements who now have information on the places I do business on the web could put this information together to learn more about me and possibly construct a more sophisticated phishing attack. Unfortunately it’s not only the good guys who have access to data analytics tools.
  2. Many people probably have a single password to access multiple web sites. The criminals who now have your email as well as knowledge of which sites you do business at only have to crack one of these and potentially have access to multiple sites, some of which may have more sensitive information.
  3. Finally, how come information I trusted to well-known (and by implication ‘secure’) brands and their web sites has been handed over to a third party without me even knowing about it? Can I trust those companies not to be doing this with more sensitive information, and should I be withdrawing my business from them? This is a serious breach of trust, and I suspect that many of these brands’ own reputations will have been damaged.

So what are the impacts to us as IT architects in a case like this? Here are a few:

  1. As IT architects we make architectural decisions all the time. Some of these are relatively trivial (I’ll assign that function to that component, etc.) whereas others are not. Clearly, decisions about which part of the system to entrust personal information to are not trivial. I always advocate documenting significant architectural decisions in a formal way, capturing all the options you considered as well as the rationale and implications behind the decision you made. As our systems get ever more complex and distributed, the implications of particular decisions become harder to quantify. I wonder how many architects consider the implications for a company’s reputation of entrusting even seemingly low-grade personal information to third parties?
  2. It is very likely that incidents such as this are going to result in increased legislation that covers personal information just like there is legislation on Payment Card Industry (PCI) standards. This will demand more architectural rigour as new standards essentially impose new constraints on how we design our systems.
  3. As we trundle slowly to a world where more and more of our data is to be held in the cloud using a so called multi-tenant deployment model it’s surely only a matter of time before unauthorised access to one of our cloud data stores will result in access to many other data sources and a wealth of our personal information. What is needed here is new thinking around patterns of layered security that are tried and tested and, crucially, which can be ‘sold’ to consumers of these new services so they can be reassured that their data is secure. As Software-as-a-Service (SaaS) takes off and new providers join the market we will increasingly need to be reassured they are to be trusted with our personal data. After all if we cannot trust existing, large corporations how can we be expected to trust new, small startups?
  4. Finally, I suspect it is only a matter of time before legislation aimed at systems designers themselves makes us as IT architects liable for some of those architectural decisions I mentioned earlier. I imagine several lawyers are already engaged by the parties whose customers’ email addresses were obtained, and whose trust and reputation with those customers may now be compromised. I wonder if some of those lawyers will be thinking about the design of such systems in the first place and, by implication, the people who designed those systems?

How Much Does Your Software Weigh, Mr Architect?

Three apparently unrelated events actually have a serendipitous connection, which has led to the title of this week’s blog. First off, Norman Foster (he of “Gherkin” and “Wobbly Bridge” fame) has had a film released about his life and work called How Much Does Your Building Weigh, Mr Foster? As a result there has been a slew of articles about both Foster and the film, including this one in the Financial Times. One of the things that comes across from both the interviews and the articles is the passion Foster has for his work. After all, if you are still working at 75 then you must like your job a little bit! One of the quotes that stands out for me is this one from the FT article:

“The architect has no power; he is simply an advocate for the client. To be really effective as an architect or as a designer, you have to be a good listener.”

How true. Too often we sit down with clients and jump in with solutions before we have really got to the bottom of what the problem is. It’s not just about listening to what the client says, but also to what she doesn’t say. Sometimes people only say what they think you want to hear, not what they really feel. So it’s not just about listening but about developing empathy with the person you are architecting for. Related to this is not closing down discussions too early, before making sure everything has been said, which brings me to the second event.

I’m currently reading Resonate by Nancy Duarte, which is about how to put together presentations that really connect with your audience, using techniques adopted by professional storytellers (film makers, for example). In Duarte’s book I came across the diagram below, which Tim Brown also uses in his book Change by Design.

For me, the architect sits above the dotted line in this picture, ensuring as many choices as possible get made and then making decisions (or compromises) that strike the right balance between the sometimes opposing “forces” of the requirements that come from those multiple choices.

One of the big compromises that often needs to be made is how much can I deliver in the time I have available and, if it’s not everything, what gets dropped? Unless the timescale can change, it’s usually the odd bit of functionality (fine if those functions can be deferred to the next release) or quality (not good under any circumstances). This leads me to the third serendipitous event of the week: discovering “technical debt”.

Slightly embarrassingly, I had not heard of the concept of technical debt before, even though it has been around for a long time. It was originally proposed by Ward Cunningham in 1992, who said the following:

Shipping first time code is like going into debt. A little debt speeds development so long as it is paid back promptly with a rewrite… The danger occurs when the debt is not repaid. Every minute spent on not-quite-right code counts as interest on that debt. Entire engineering organizations can be brought to a stand-still under the debt load of an unconsolidated implementation.

Technical debt is a topic that has been taken up by the Software Engineering Institute (SEI), which is organising a workshop on the topic this year. One way of understanding technical debt is as the gap between the current state of the system and what was originally envisaged by the architecture; here, debt can be “measured” by the number of known defects and the features that have not yet been implemented. Another aspect of debt, however, is the amount of entropy that has set in as the system has decayed over time (changes have been made that were not in line with the specified architecture). This is more difficult to measure, but it has a definite cost in terms of ease of maintenance and the general understandability of the system.
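The first notion of debt described above (known defects plus unimplemented features, plus drift from the specified architecture) lends itself to a crude, back-of-envelope metric. A sketch, with entirely illustrative categories and weights:

```python
# Hypothetical, illustrative weights in relative effort units; the point is
# to make the "debt owed" visible, not to claim these numbers are "right".
DEBT_WEIGHTS = {
    "known_defect": 2,        # effort to fix a logged defect
    "missing_feature": 5,     # effort to build a planned-but-absent feature
    "architecture_drift": 8,  # effort to realign a component with the design
}

def technical_debt(items: dict) -> int:
    """Sum weighted counts of debt items into a single 'debt owed' figure."""
    return sum(DEBT_WEIGHTS[kind] * count for kind, count in items.items())

backlog = {"known_defect": 12, "missing_feature": 3, "architecture_drift": 2}
print(technical_debt(backlog))  # 12*2 + 3*5 + 2*8 = 55
```

Tracking a figure like this over releases would at least show whether the “interest” on the debt is being paid down or quietly accumulating.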

Which leads to the title of this week’s blog. Clearly software (being ‘soft’) carries no weight (the machines it runs on do not count), but it can nonetheless carry a huge, and potentially damaging, weight in terms of the debt held in unstructured, incorrect or hard-to-maintain code. Understanding the weight of this debt, and how to deal with it, should be part of the role of the architect. The weight of your software may not be measurable in kilograms, but it surely has a weight in terms of the “debt owed”.

When a Bridge Falls Down is the Architect to Blame?

Here’s a question: when a bridge or building falls down, whose “fault” is it? Is it the architect who designed the bridge or building in the first place, the builders and construction workers who did not build it to spec, the testers for not testing the worst-case scenario, or the people who maintain or operate it? How might we use disasters from the world of civil engineering to learn better ways of architecting software systems?

Here are four well-known examples of architectural disasters (resulting in increasing loss of life) from the world of civil engineering:

  1. The Millennium Bridge in London, a steel suspension bridge for pedestrians across the Thames. Construction of the bridge began in 1998, with the opening on 10 June 2000. Two days after the bridge opened, participants in a charity walk felt an unexpected swaying motion. The bridge was closed for almost two years while modifications were made to eliminate the “wobble”, which was caused by a positive feedback phenomenon known as Synchronous Lateral Excitation. The natural sway motion of people walking caused small sideways oscillations, which in turn caused people on the bridge to sway in step, increasing the amplitude of the bridge’s oscillations and continually reinforcing the effect. Engineers added dampers to the bridge to prevent the horizontal and vertical movement. No people or animals were injured in this incident.
  2. The Tacoma Narrows Bridge was opened to traffic on July 1st 1940 and collapsed four months later. At the time of its construction the bridge was the third longest suspension bridge in the world. Even from the time the deck was built, it began to move vertically in windy conditions, which led to it being given the nickname Galloping Gertie. Several measures aimed at stopping the motion were ineffective, and the bridge’s main span finally collapsed under 40 mph wind conditions on November 7, 1940. No people were injured in this incident but a dog was killed.
  3. On 28 January 2006, the roof of one of the buildings at the Katowice International Fair collapsed in Katowice, Poland. There were 700 people in the hall at the time. The collapse was due to the weight of the snow on the roof. A later inquiry found numerous design and construction flaws that contributed to the speed of collapse. 65 people were killed when the roof collapsed.
  4. The twin towers of the World Trade Center (WTC) in downtown Manhattan collapsed on September 11, 2001, when al-Qaeda terrorists hijacked two commercial passenger jets and flew them into the skyscrapers. A government report that looked at the collapse declared that the WTC design had been sound and attributed the collapses to “extraordinary factors beyond the control of the builders”. 2,752 people died, including all 157 passengers and crew aboard the two airplanes.

In at least one of these cases (the Katowice International Fair building) various people, including the designers, have been indicted for “directly endangering lives of other people” and face up to 12 years in prison. The building’s operator has also been charged with “gross negligence” for not removing snow quickly enough.

So what can we learn from these natural and man-made disasters and apply to the world of software architecture? In each of these cases the constructions were based on well-known “patterns” (suspension bridges, trade halls and skyscrapers have all been successfully built before without collapsing). What was different in each case was that the non-functional characteristics were not taken into account. In the case of the bridges, oscillations caused by external factors (people and winds) were not adequately catered for. In the case of the trade hall in Katowice, the building’s roof was not engineered to handle the additional weight of snow. Finally, in the case of the WTC, the impact of a modern passenger jet, fully laden with fuel, crashing into the building was simply not conceived of (although, interestingly, an “aircraft-impact analysis” involving the impact of a Boeing 707 at 600 mph was actually done, which concluded that although there would be “a horrendous fire” and “a lot of people would be killed”, the building itself would not collapse). Here are some lessons I would draw from these incidents, and how we might relate them to the field of software architecture:

  1. Architects need to take into account all non-functional requirements. Obviously this is easier said than done. Who would have thought of such an unexpected event as a passenger jet crashing into a skyscraper? Actually, to their credit, the building’s architects did; what they lacked was the ability to properly model the effect of such impacts on the structures, especially the effects of the fires.
  2. For complex systems, architects should build models of all aspects of the architecture. Tools appropriate to the task should be deployed and the right “level” of modelling needs to be done. Prototyping as a means of testing new or interesting technical challenges should also be adopted.
  3. Designers should faithfully implement the architectural blueprints and the architect should remain on the project during the design and implementation phases to check their blueprints are implemented as expected.
  4. Testing should be taken into account early and thought given to how the non-functional characteristics can be tested. Real limits should be applied taking into account the worst case (but realistic) scenario.
  5. Operations and maintenance should be involved from an early stage to make sure they are aware of the impact of unexpected events (for example a complete loss of all systems because of an aircraft crashing on the data centre) and have operational procedures in place to address such events.

As a final, and sobering, footnote to the above, here’s a quote from a report produced by the British Computer Society and the Royal Academy of Engineering called The Challenges of Complex IT Projects:

Compared with other branches of engineering in their formative years, not enough people (are known to) die from software errors. It is much easier to hide a failed software project than a collapsing bridge or an exploding chemical plant.

The implications of this statement would seem to be that it’s only when software has major, and very public, failures that people will really take note and maybe start to address problems before they occur. There are plenty of learning points (anti-patterns) in other industries that we can learn from and should probably do so before we start having major software errors that cause loss of life.

You may be interested in the follow up to the above report which describes some success stories and why they worked (just in case you thought it was all bad).

 

When Systems Fail

This week I was a direct victim of a systems failure, which set me thinking about how even mundane activities that we have been doing for tens, if not hundreds, of years, like checking into a hotel in this case, rely on systems that we take for granted and which, when they fail, throw everything into complete chaos.

It’s a long and not particularly interesting story, but in summary I checked into one of the large chain hotels, which I use a lot, only to find when I opened my room door that the room was in a state of complete chaos and had clearly not been visited by housekeeping that day. On trying to change to another room, I was told the system had been down since 4am that morning (it was now 8pm) and the staff could not tell what state the rooms were in. Clearly not a great state of affairs, and not great for client relations (there were a lot of grumpy people queueing in reception, some of whom, I would guess, will not be going back to that hotel). So what would an architect of such a system do to mitigate against such a failure?

  1. I don’t profess to know too much about how hotel management systems work, or whether they are provided centrally or locally; however, I would have thought one of the basic non-functional characteristics of such systems would be a less-than-one-hour recovery following a system failure (not 16 hours and counting). Learning point: clarify your availability non-functional requirements (NFRs) and realise them in the relevant parts of the system. Maybe not all components need to be highly available (checking in a customer may be more important than checking her out, for example), but those that do need to be suitably placed on high-availability platforms.
  2. There was a clear and apparent need for a disaster recovery plan that involved more than the staff apologising to customers. Learning point: Have a disaster recovery policy and test it regularly.
  3. A system is about more than just the technology; the people that use the system are a part of it as well. Learning point: The architecture of the system should include how the people that use that system interact with it during both normal and abnormal operating conditions.
  4. Often NFRs are not quantified in terms of their business value (or cost). When a problem occurs is the impact to the business (in terms of lost revenue, irate customers who won’t come back etc) really understood? Learning point: Risk associated with not meeting NFRs needs to be quantified so the right amount of engineering can be deployed to address problems that may occur when NFRs are not met.
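The last learning point above can be made concrete with a back-of-envelope risk calculation (all figures here are hypothetical, purely to illustrate the idea of putting a business value on an availability NFR):

```python
def downtime_risk(outages_per_year: float, hours_per_outage: float,
                  cost_per_hour: float) -> float:
    """Expected annual cost of not meeting an availability NFR."""
    return outages_per_year * hours_per_outage * cost_per_hour

# A hypothetical hotel chain: one 16-hour outage a year, at an assumed
# 2,000 per hour in lost bookings and goodwill, gives an annual exposure
# to weigh against the cost of engineering a high-availability platform.
print(downtime_risk(1, 16, 2000))  # 32000
```

If the high-availability platform costs less than that exposure over its lifetime, the engineering pays for itself; if not, perhaps a cheaper recovery approach is the right trade-off.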

Formal approaches to handling non-functional requirements in a system’s architecture are a little thin on the ground. One approach, suggested by the Software Engineering Institute, is the use of architectural tactics. An architectural tactic is a way of satisfying a desired system quality (such as performance) by considering the parameters involved (for example, desired execution time), applying one or more standard approaches or patterns (such as scheduling theory or queuing theory) to address potential combinations of parameters, and arriving at a reasoned (as opposed to random) architectural decision. Put another way, an architectural tactic is a way of putting a bit of “science” behind the sometimes arbitrary approach to making architectural decisions around satisfying non-functional requirements.
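As an illustration of that “bit of science”, a standard queuing-theory result (the M/M/1 single-server model, used here purely as an example of a performance tactic) lets you reason about whether a desired execution time is achievable at a given load, rather than guessing:

```python
def mm1_response_time(arrival_rate: float, service_rate: float) -> float:
    """Mean response time (queueing + service) for an M/M/1 queue.

    arrival_rate: requests per second arriving at the component
    service_rate: requests per second the component can process
    """
    if arrival_rate >= service_rate:
        raise ValueError("System is unstable: arrivals exceed capacity")
    return 1 / (service_rate - arrival_rate)

# A component serving 100 req/s with 80 req/s arriving averages 50 ms.
print(mm1_response_time(80, 100))  # 0.05 seconds
# At 95 req/s the same component degrades to 200 ms -- a reasoned basis
# for a capacity decision, rather than an arbitrary one.
print(mm1_response_time(95, 100))  # 0.2 seconds
```

The component names and rates here are hypothetical; the tactic is in using the model to turn a desired response time into a defensible capacity requirement.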

I think this is a field that is ripe for more work, with some practical examples required. Maybe a future hotel management system that adopts such an approach during its development will allow a smoother check-in process as well.

Five Tips for Finding Non-Functional Requirements

Following on from my previous blog on the challenge of capturing non-functional requirements, here are five tips for smoothing that process. These have been worked out through some often painful project experiences and will, I hope, help you avoid some of the mistakes I have seen made in the past.

  1. Really, really think about who your stakeholders are. The process of gathering non-functional requirements should begin with creating a list of the different stakeholders (that is, the people who have an interest or concern in the system under development). These are not just the people who will actually be using the system but also the ones who will be operating it, maintaining it, ensuring it is secure, and so on. One of the commonest mistakes I observe is that a key stakeholder is forgotten about until the very end of the project (sometimes just before the system is about to go live) and a critical NFR is not accounted for.
  2. Consider non-functional and functional requirements together. Some NFRs should be associated directly with functional requirements so need to be considered at the same time. For example discussions about response times (the system must provide a new account number in less than two seconds) are clearly related to the (functional) requirement for opening a new account. It is useful therefore to make sure NFRs are considered at the same time as functional requirements are gathered (otherwise you’re just going to go back and ask questions of the same stakeholders again).
  3. Be an “NFR evangelist”. NFRs seem to be one of those project artefacts which no one wants to own. If this is the case on your project then be that person. Read up and understand the importance of NFRs. Create checklists of the kind of NFRs that need to be gathered. Be ready and prepared to present to users and management alike the importance of NFRs. Have a list of anecdotes at the ready that illustrate the importance of NFRs.
  4. Realise once round the block won’t be enough. Despite your best efforts at understanding stakeholders and looking for NFRs at the time you gather functional requirements, there will always be some things that are simply not known early on in the project. It’s not always known early on what external regulations need to be conformed to, for example, or even what detailed availability or performance requirements are needed. I favour a breadth-first approach here: make sure that early on you at least capture the categories of NFR (performance, security, regulatory, etc.) together with the key requirements in each category. Later you can capture the missing detail.
  5. Be prepared to challenge people’s prejudices and unreasonableness. Believe it or not, some people ask for things which aren’t as reasonable as they might be: 24x7 availability and sub-second response times are typical examples. Whilst it may be possible to build systems that support such NFRs, the cost will probably be prohibitive. Be prepared, therefore, to challenge such requests and offer alternatives, or at least spell out what the costs will be.

Clearly there is more to gathering and analysing NFRs than is stated here; however, these are some of the considerations you might want to think about, and use to start the process, if you find yourself on a project where you are the one responsible for this.