Application Integration Pattern : the Choreography way

Choreography is more than just an eventing mechanism; it's a tool to integrate legacy systems with new ones in a non-invasive manner.

Integration can be quite complex, especially when you are talking about making your swanky new application work with a legacy system. I've seen multiple generations of patterns for tackling this problem gain prominence over the past decade and a half: from SOAP web services to Message Bus, Enterprise Service Bus, REST APIs and even WebSockets! Each of them has been the blue-eyed boy, or heartthrob, of its generation and seen its popularity wane as the new kid on the block got more famous. Having said that, and allowing myself to feel relatively old: whatever the approach, the bigger challenge in integrating legacy services has been that they might get replaced with other services soon, or soon enough for you to get into yet another exercise of re-integrating with the new system all over again, with a sense of repetitive-labour-induced fatigue. Shameless plug: we at NimbleWork were facing this very strategic question when our SaaS journey started a couple of years ago, and this blog is about our learnings and experience in using Choreography as the cornerstone of our integration approach, just like it was for implementing Strangler Fig.

Choreography

There are many definitions of Choreography if you happen to browse popular articles or blogs on the web. The way I like to put it is:

Choreography represents the communication between microservices through the publishing and subscribing of domain events.

It essentially means that each of the services publishes a Domain Event notifying others of a change that has occurred as a result of an action initiated on the service. Services to whom this event is of interest subscribe to it and act accordingly. It's an extremely efficient, lightweight, distributed chain of command. Choreography has emerged as the mechanism of choice for implementing the Saga Pattern in Microservices, and I've seen it increasingly replace Orchestrators. I'd like to point out, however, that it's not a universally applicable solution or design in the microservices world. There's some criticism of this approach going around too, though I haven't burnt my hands using it yet, so I'd like to think the criticism stems more from design issues elsewhere which might have led to bottlenecks in using Choreography.
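To make the publish/subscribe flow concrete, here is a minimal sketch, assuming an in-memory bus standing in for a real broker. The EventBus class, the topic names (order.created, stock.reserved) and the two handlers are all hypothetical names chosen for illustration; they are not taken from any real system.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class DomainEvent:
    topic: str      # hypothetical topic name, e.g. "order.created"
    payload: dict   # the facts describing what changed


class EventBus:
    """In-memory stand-in for a message broker (assumption, not a real client)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[DomainEvent], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[DomainEvent], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, event: DomainEvent) -> None:
        # Deliver to every handler registered for this topic; no central coordinator.
        for handler in list(self._subscribers.get(event.topic, [])):
            handler(event)


bus = EventBus()
audit_log: List[str] = []


def reserve_stock(event: DomainEvent) -> None:
    # A new inventory service reacts to the order and emits its own event.
    order_id = event.payload["order_id"]
    audit_log.append(f"inventory reserved stock for {order_id}")
    bus.publish(DomainEvent("stock.reserved", {"order_id": order_id}))


def notify_customer(event: DomainEvent) -> None:
    # A legacy notifier only knows about the event it subscribed to.
    audit_log.append(f"legacy notifier emailed customer for {event.payload['order_id']}")


bus.subscribe("order.created", reserve_stock)
bus.subscribe("stock.reserved", notify_customer)
bus.publish(DomainEvent("order.created", {"order_id": "A-1"}))
```

Neither handler calls the other directly: each one reacts to an event and, if needed, publishes the next one, which is the "distributed chain of command" in miniature.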

The Approach

Here's our approach: both the legacy system and the new systems emit Domain Events, which are published to Kafka. Each Kafka topic is backed by a MongoDB collection, using the Kafka MongoDB connector, for fault tolerance and HA purposes. Each of these collections is replicated over a MongoDB replica set. If processing of a domain event fails, the failing side emits an event that triggers compensating transactions instead of a rollback. Any outage on either side is covered via two factors:

  1. Kafka messages are retained for longer than the outage of either of the services.

  2. On coming back online, each of the services resumes from the consumer offset it had reached when it went offline.
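The two factors above can be sketched with a toy append-only log. This is a simplified model under stated assumptions: TopicLog is a hypothetical class, not a Kafka client, it never expires records (real Kafka retention is configurable), and it commits the offset after each successfully handled record.

```python
from typing import Callable, Dict, List


class TopicLog:
    """Append-only log; a toy stand-in for a single topic partition."""

    def __init__(self) -> None:
        self._records: List[dict] = []
        self._committed: Dict[str, int] = {}  # consumer group -> next offset to read

    def append(self, record: dict) -> int:
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def poll(self, group: str, handler: Callable[[dict], None]) -> int:
        """Deliver every record at or after the group's committed offset."""
        offset = self._committed.get(group, 0)
        delivered = 0
        while offset < len(self._records):
            handler(self._records[offset])
            offset += 1
            self._committed[group] = offset  # commit only after the handler returned
            delivered += 1
        return delivered


log = TopicLog()
seen: List[dict] = []

log.append({"id": 1})
log.poll("legacy", seen.append)            # the legacy consumer is up and reads event 1

# The legacy consumer goes offline; producers keep publishing.
log.append({"id": 2})
log.append({"id": 3})

resumed = log.poll("legacy", seen.append)  # back online: picks up from offset 1
```

Because the records outlive the outage and the offset is tracked per consumer group, the returning consumer neither misses events nor re-reads the ones it already handled.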

There isn't much we have to do as developers for either of these provisions, but knowing that they exist is more than just handy knowledge.
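The compensating-event idea from the approach described earlier can be sketched as follows. Everything here is an assumption for illustration: the service names, the OrderCreated and OrderRejected event types, and the plain list standing in for the shared topic.

```python
from typing import Dict, List, Set

published: List[dict] = []  # stand-in for the topic both sides read from


class NewOrderService:
    def __init__(self) -> None:
        self.orders: Dict[str, str] = {}

    def create_order(self, order_id: str) -> None:
        self.orders[order_id] = "CREATED"
        published.append({"type": "OrderCreated", "order_id": order_id})

    def on_event(self, event: dict) -> None:
        # Compensating transaction: undo our own change in response to the
        # failure event, rather than relying on a distributed rollback.
        if event["type"] == "OrderRejected":
            self.orders[event["order_id"]] = "CANCELLED"


class LegacyBillingService:
    def __init__(self, blocked: Set[str]) -> None:
        self.blocked = blocked          # order ids this service cannot process
        self.invoices: List[str] = []

    def on_event(self, event: dict) -> None:
        if event["type"] != "OrderCreated":
            return
        if event["order_id"] in self.blocked:
            # Processing failed: emit an event instead of raising to the producer.
            published.append({"type": "OrderRejected", "order_id": event["order_id"]})
            return
        self.invoices.append(event["order_id"])


def drain(services: list) -> None:
    """Deliver each published event, in order, to every service."""
    index = 0
    while index < len(published):  # the list may grow while we iterate
        for service in services:
            service.on_event(published[index])
        index += 1


orders = NewOrderService()
billing = LegacyBillingService(blocked={"A-2"})
orders.create_order("A-1")
orders.create_order("A-2")
drain([orders, billing])
```

After draining, order A-1 is invoiced while A-2 ends up CANCELLED on the side that created it, with no transaction spanning the two systems.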

Handling Failures

As in most modern architectures, handling failures is a layered approach in this case. So let's look at them from the perspective of where we choose to tackle them.

  • Outages: these are best handled at the infrastructure layer. If you're deploying both the legacy and new services in Kubernetes (which makes sense, as you'll eventually be phasing out the legacy system in favour of newer services), you can leave it to Kubernetes liveness and readiness probes, coupled with deploying services as a StatefulSet or ReplicaSet, to get over the outage issue. If the legacy system is too old to run as a stateless service, you're still not out of luck: you can deploy traditional sticky-session clusters fronted by a load balancer, using internal ingresses in Kubernetes. There, too, the master-slave cluster gives you some degree of protection against outages.

  • Dirty Reads: if Choreography has scared the daylights out of any of your colleagues, it's most likely for fear of this. A comprehensive explanation of how to mitigate it via design is a whole blog post in itself, so I'd like to keep it short. Dirty reads are more likely to occur in the case of bi-directional integration, by which I mean both services reading from or writing to each other. If you find yourself doing that, just stop; you've perhaps created too many routes for data change. It's better to design for all operations on data owned by the new services to happen in the new service only, with the associated events flowing to the legacy app, and vice versa. Even if you have a duplex of Domain Events, don't allow the legacy system to update its copy of the new system's data directly, nor should the new system allow direct editing of its copy of the legacy system's data. Essentially, what I'm saying is: keep a clear separation of traffic.
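One way to picture this separation of traffic is a sketch in which the owning service is the only writer and the other side holds a copy that changes only through events. The class names, the CustomerChanged event type and the direct wiring in place of a broker are all assumptions made for illustration.

```python
from types import MappingProxyType
from typing import Callable, Dict


class CustomerOwner:
    """The new service: the only place where customer data may be changed."""

    def __init__(self, publish: Callable[[dict], None]) -> None:
        self._rows: Dict[str, dict] = {}
        self._publish = publish

    def update(self, customer_id: str, data: dict) -> None:
        self._rows[customer_id] = dict(data)
        # Every change leaves the owner as a Domain Event.
        self._publish({"type": "CustomerChanged", "id": customer_id, "data": dict(data)})


class CustomerReplica:
    """The legacy side's copy: written by events only, read-only to callers."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}

    def apply(self, event: dict) -> None:
        if event["type"] == "CustomerChanged":
            self._rows[event["id"]] = dict(event["data"])

    def get(self, customer_id: str) -> MappingProxyType:
        # A read-only view, so local code cannot edit the copy directly.
        return MappingProxyType(self._rows[customer_id])


replica = CustomerReplica()
owner = CustomerOwner(publish=replica.apply)  # direct wiring stands in for the broker

owner.update("c-1", {"name": "Ada"})
view = replica.get("c-1")

try:
    view["name"] = "Eve"          # an attempted direct edit on the legacy side
    direct_write_blocked = False
except TypeError:
    direct_write_blocked = True   # mappingproxy rejects item assignment
```

With a single route for each piece of data, the copy can lag behind the owner but it cannot diverge because of a competing local write.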

  • Idempotence: this is a tricky one. While idempotent API operations can be guaranteed within the context of each of the integrated systems, there's now an additional input, the Domain Event, whose consumption also has to preserve idempotence on both sides of the integration. A simple rule of thumb here is to follow what the APIs do. A CREATE API call is the easiest to handle: like the call itself, the Create Domain Event can never be idempotent. What about DELETE and PUT? That's where idempotence has to be preserved (assuming the legacy system respected this principle in the first place, or else God bless you!). We've noticed that using upsert operations for processing a PUT Domain Event helps a lot. DELETE idempotence is tricky: should I keep sending 200 OK when deleting an ID repeatedly, or throw an error the second time around? We chose the first option in our REST API design, so we followed the same norm when processing a Domain Event for DELETE, allowing it to fail silently when deleting an already deleted entity.
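The PUT and DELETE rules above can be sketched with a small event consumer. This is a minimal illustration, assuming a dictionary as the data store and a hypothetical event shape with type, id and data keys; with a real database the PUT branch would be an upsert.

```python
from typing import Dict


class EntityStore:
    """Consumes PUT and DELETE Domain Events idempotently."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}

    def handle(self, event: dict) -> str:
        kind, key = event["type"], event["id"]
        if kind == "PUT":
            # Upsert: insert if missing, overwrite if present. Replaying the
            # same event leaves the store in the same state.
            self.rows[key] = dict(event["data"])
            return "ok"
        if kind == "DELETE":
            # Deleting an already deleted entity succeeds silently,
            # mirroring an API that keeps answering 200 OK.
            self.rows.pop(key, None)
            return "ok"
        raise ValueError(f"unsupported event type: {kind}")


store = EntityStore()
put_event = {"type": "PUT", "id": "42", "data": {"status": "active"}}
delete_event = {"type": "DELETE", "id": "42"}

store.handle(put_event)
store.handle(put_event)                  # duplicate delivery, same end state
state_after_puts = dict(store.rows)

first_delete = store.handle(delete_event)
second_delete = store.handle(delete_event)  # entity is already gone
```

Duplicate deliveries are normal after a consumer restarts, so handlers shaped like this let an event be replayed without changing the outcome.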

Conclusion

Choreography is a strong pattern in Microservices and is here to stay. This blog post, however, was our attempt to show that we can improvise and use well-established patterns for much wider goals than they were envisioned for. It requires a bit of imagination, lots and lots of reading, hours of experimentation and, not to forget, the temperament to handle their failures. I hope this blog gives fellow engineers one more option to consider during the transition phase of their microservices journey, when new services have to integrate and play well with legacy systems that are built on completely different designs and patterns.