KPIs for API Service Level Agreement
In my previous blog post I wrote about the absence of Service Level Agreements (SLAs) in the average business-to-business (B2B) integration and in APIs meant for business use.
To sum up the previous blog post: the absence of SLAs in B2B integrations and APIs leads to a significant amount of money and time being spent on useless meetings and blame games. To avoid the unnecessary costs and work, you need an agreement in place which states what level of service each party can expect.
If you read the post and maybe even agreed with the message, the next logical question is: what should the Service Level Agreement contain? The full SLA should of course contain non-technical details such as contact points, escalation paths, sanctions, and so on, but those are outside the scope of this blog post. I leave them to JustIneers with more experience in those matters. In this post I will be looking at the measurable technical KPIs.
KPIs
Depending on the industry and use case you may have different needs. In banking, the time it takes to handle each individual transaction may not be as relevant as in situations where a paper mill's telemetry data is fed to a management system for on-the-fly decision making.
So what is the bare minimum that should be mentioned in a Service Level Agreement in a B2B environment?
Availability
What it means: Availability means that the endpoint is up and running and can be called by external systems.
Why is it important: Each SLA should always state the agreed availability of the interface in question, and whether there are different availability levels for different time frames. This gives the user an indication of the reliability of the API and of the level of staffing needed for handling issues.
Example: The SLA may state that the endpoints are available 99.9% of the time, 24/7. Or it may state that the availability of the endpoint is 99.9% during business hours in Finland and 95% at other times. The measurement period could be agreed to be one month.
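To get a feel for what such a percentage means in practice, the arithmetic can be sketched as below. This is only an illustration: the function names are made up for this post, and the measurement period is assumed to be a 30-day month measured around the clock.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Downtime budget implied by an availability target over the period."""
    total_minutes = period_days * MINUTES_PER_DAY
    return total_minutes * (1 - availability_pct / 100)


def measured_availability_pct(downtime_minutes: float, period_days: int = 30) -> float:
    """Availability actually achieved, given the observed downtime."""
    total_minutes = period_days * MINUTES_PER_DAY
    return 100 * (total_minutes - downtime_minutes) / total_minutes


# Assumed targets taken from the example above: 99.9% and 95%.
budget_strict = allowed_downtime_minutes(99.9)   # roughly 43 minutes per 30 days
budget_relaxed = allowed_downtime_minutes(95.0)  # roughly 36 hours per 30 days

# One hour of downtime in the month would already break a 99.9% target.
breached = measured_availability_pct(60) < 99.9
```

In other words, 99.9% over a 30-day month leaves room for roughly 43 minutes of downtime, which is worth keeping in mind when deciding how issue handling is staffed.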
Pass rate
What it means: You may have a different term for this, but essentially the pass rate is the share of API calls that the provider guarantees will succeed, provided that the message is formatted correctly and contains suitable data. Everyone who has built B2B integrations knows that 100% is not a number you can get, but the number should be very close.
Why is it important: In many cases the API you call is just the gate into a larger (eco)system with multiple possible breaking points. It could be an ESB issue, an issue with a 3rd party integration platform, a data issue in the destination system, a network issue, or something else. Hence having the API endpoint available, so that you can pass the message to it, is not yet a guarantee that everything will work smoothly. The promised pass rate again indicates the level of staffing needed for issue handling, and it also indicates the level of trust the API provider has in their infrastructure, systems and external providers as a whole.
Example: The SLA may state that the pass rate for correctly formatted API calls is 99.9% as long as the transactions per time period rule is followed. The measurement period could be, for example, one month.
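As a rough sketch of how the pass rate could be calculated from call logs: the record layout below (a `well_formed` flag and a `succeeded` flag per call) is a hypothetical one chosen for illustration, not something any particular platform provides. The point is that calls rejected because of the caller's own malformed messages are left out of the calculation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class CallRecord:
    well_formed: bool  # message was correctly formatted and had suitable data
    succeeded: bool    # the call was processed successfully end to end


def pass_rate_pct(calls: Iterable[CallRecord]) -> Optional[float]:
    """Share of well-formed calls that succeeded, as a percentage.

    Returns None when there were no well-formed calls in the period.
    """
    eligible = [c for c in calls if c.well_formed]
    if not eligible:
        return None
    passed = sum(1 for c in eligible if c.succeeded)
    return 100 * passed / len(eligible)


# Illustrative month: 999 successes, 1 provider-side failure,
# and 50 malformed calls that do not count against the provider.
month = (
    [CallRecord(True, True)] * 999
    + [CallRecord(True, False)]
    + [CallRecord(False, False)] * 50
)
rate = pass_rate_pct(month)
```

How "correctly formatted" is decided (for example schema validation on the provider side) is something the parties would need to agree on separately.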
Transactions per time period
What it means: The number of transactions per time period the API endpoints must be able to handle.
Why is it important: You should always agree on the volume of transactions the API must be able to handle. If you expect your peak need to be 1000 transactions per second, then you should agree on a number greater than that. The purpose is to make sure the infrastructure and applications can handle the expected volume of transactions without issues.
Example: The SLA may state that the peak volume the API can handle is 1000 transactions per second as long as the average volume over an hour stays below 100 transactions per second. The API provider can throttle the API calls to make sure the levels are not exceeded, but the throttling rules should also be known to the API consumer.
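One common way to implement such throttling is a token bucket, sketched below. This is an assumption for illustration only, not a description of how any specific API provider or API Management tool does it. The class name and parameters are hypothetical, and the time is passed in explicitly so that the behaviour is easy to follow.

```python
class TokenBucket:
    """Allows short bursts up to `burst` calls while holding the
    long-run average at `rate_per_second` calls."""

    def __init__(self, rate_per_second: float, burst: int) -> None:
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # time of the previous call, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a call arriving at time `now` may proceed."""
        elapsed = now - self.last
        self.last = now
        # Refill according to elapsed time, never above the burst capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Assumed limits loosely based on the example above:
# bursts of up to 1000 calls, sustained average of 100 calls per second.
bucket = TokenBucket(rate_per_second=100, burst=1000)
burst_results = [bucket.allow(now=0.0) for _ in range(1001)]
later_results = [bucket.allow(now=1.0) for _ in range(101)]
```

In this sketch the first 1000 calls at time zero pass and the next one is rejected; one second later there is room for 100 more. A consumer who knows these rules can pace its calls instead of discovering the limits through rejected requests.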
Transaction time
What it means: The end-to-end time it takes to receive a response to the call.
Why is it important: If the API provider has multiple systems behind the API endpoint, the time it takes to provide a response message may rise to multiple seconds, sometimes even above the default request timeout, which will result in the calling system abandoning the transaction. Even if the API endpoint is not that slow, depending on the integration scenario the caller needs to have at least some idea whether the thread handling the call will be blocked for 100 ms or for 10000 ms while it waits for a response.
Example: The SLA may state that the maximum transaction time for a call is 5 seconds.
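On the caller's side, the transaction time can be measured with a small wrapper like the one below. This is a minimal sketch: `timed_call` and the constant are hypothetical names, the 5 second limit is taken from the example above, and the call being measured is a stand-in for a real API request.

```python
import time
from typing import Any, Callable, Tuple

MAX_TRANSACTION_SECONDS = 5.0  # assumed SLA limit from the example above


def timed_call(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float, bool]:
    """Run `func`, returning its result, the elapsed seconds,
    and whether the call stayed within the agreed maximum."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= MAX_TRANSACTION_SECONDS


def fake_api_call() -> str:
    """Stand-in for a real request to the API endpoint."""
    return "pong"


response, seconds, within_sla = timed_call(fake_api_call)
```

Collecting the elapsed times over the measurement period gives both parties the same numbers to look at. The caller may also want to check that its own request timeout is not shorter than the agreed maximum, so that slow but still acceptable responses are not abandoned.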
Conclusions
So those are the bare minimum KPIs you should have in your SLA for B2B APIs. Each API and each specific use case may of course have additional KPIs, such as how long a service interruption can last (Mean Time To Repair). The main thing is that the KPIs must be easy to measure and calculate. Some platforms make the measurement easier than others, so before committing yourself to anything, make sure you can measure what you agree to.
There are many API Management tools available which you can look into if your current systems are not up to the task. If you are into Gartner's Magic Quadrants, then the top three tool vendors are (in alphabetical order) Apigee, CA Technologies and MuleSoft. You will of course need to weigh the needs of your organization against the cost of the tools before making decisions.
If you are interested in discussing Service Management in the age of automation and APIs, do not hesitate to contact us via our pages, engage with us on LinkedIn or Twitter, or just reach us in person with a plain old Skype or phone call.
Published 06.03.2017