Australia Post tackles ‘observability’ after digital transformation
Australia Post’s adoption of microservices and cloud via a digital transformation allowed it to move faster, but also created a complex environment of interdependencies and a need to establish “observability” across it.
When Post kicked off its digital transformation in 2013, the overarching goal was speed.
“Our time to market was incredibly slow at the time,” recalls head of platform engineering Andrew Nette.
“It was taking up to 50 days for code to reach production, so environments were really slow in spinning up.”
The organisation set up an in-house digital delivery centre to help its application owners transform.
The centre was a predecessor to platform engineering, where 30 engineers now assist Post’s delivery teams to get products to market quickly, and then support the application once in production.
Nette told last month’s New Relic FutureStack19 conference that Post re-architected applications into arrays of cloud-hosted microservices.
That structure worked insofar as it reduced the time needed to get code into production.
“We were quite successful,” Nette said.
“We were able to get things into production in about 12 minutes, so our time to market has significantly improved.
“[But] we were using a microservices architecture, so our number of things in production scaled out as well, which means our environment got a lot more complex, and we had to think about the way we monitored those applications differently.”
The problems appeared “after 2013”, not long into the transformation.
“Microservices were proliferating, and it was really difficult for us to keep up and keep the focus on the number of microservices that we had,” Nette said.
Post’s challenge quickly became establishing “observability” over the transformed environment, and Nette said the organisation had spent “a lot of time” getting to a point of “100 percent visibility.”
“Observability is more than just monitoring – it’s the ability to understand what’s happening inside your application or inside your system through all its dependencies,” Nette said.
“Being able to understand if there’s an issue in the network layer, infrastructure layer, application layer, and even out to third party services that you’re utilising.
“If you can do that, then you have a truly observable system, and if you can do it all in one place then … you have your single pane of glass where you can see all of your issues.”
Nette continued: “If I think about the way Post’s observability platform or our monitoring ecosystem developed … we’ve spent a lot of time trying to develop our ecosystem so that we have 100 percent visibility and no gaps.”
Tool-wise, Australia Post uses a mix of “New Relic APM [application performance monitoring], synthetics, Sumo Logic event [management], even Bash scripts if that was what was required to get the visibility.”
Nette described finding the right mix of tools as Post’s “Goldilocks zone” – “not too many, not too few, just the right number.”
“The tools [also] need to provide value and not add more toil,” he said.
The ‘Vanilla Ice rule’
Post worked closely with delivery teams to instrument all parts of the environment.
“Collaboration was really important,” Nette said.
“We follow the Vanilla Ice rule: ‘stop, collaborate and listen’.
“It was a two-way conversation with delivery teams. They needed to understand why we were trying to do the things we were doing, and we needed to appreciate that they had other work, they were delivering features.”
One of the main things to monitor was Post’s set of customer-facing APIs, which large enterprise customers use to directly integrate with Post’s parcel delivery systems.
Post built all its APIs using a standard pattern that “had API health checks built in”, Nette said.
“It was really important that we worked with the delivery teams and the developers to make sure that those health checks were instrumented correctly, that they were calling their dependencies and that the dependencies were showing the correct states,” Nette said.
“Once we had a screen where all of our APIs were calling all of their dependencies, we could very quickly identify when there was an issue, and once we had that, we significantly reduced our mean time to identify issues and also our mean time to resolve.
“It was a big win, and it showed effective collaboration was really helpful.”
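As an illustration only, the pattern Nette describes, where each API's health check calls its dependencies and reports their states, might look something like the sketch below. This is not Post's implementation: the function `run_health_check`, the `UP`/`DEGRADED` labels, and the service and dependency names are all hypothetical, chosen just to show how one report per API could feed a single dashboard.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class DependencyStatus:
    name: str
    healthy: bool
    detail: str = ""


def run_health_check(service: str, probes: Dict[str, Callable[[], bool]]) -> dict:
    """Call every dependency probe and roll the results up into one report."""
    results = []
    for name, probe in probes.items():
        try:
            ok = bool(probe())
            detail = "" if ok else "probe returned unhealthy"
        except Exception as exc:
            # A probe that raises is treated as an unhealthy dependency.
            ok = False
            detail = f"{type(exc).__name__}: {exc}"
        results.append(DependencyStatus(name, ok, detail))

    # The API is only "UP" when every dependency reports healthy.
    overall = "UP" if all(r.healthy for r in results) else "DEGRADED"
    return {
        "service": service,
        "status": overall,
        "dependencies": {
            r.name: {"healthy": r.healthy, "detail": r.detail} for r in results
        },
    }


def failing_probe() -> bool:
    # Hypothetical dependency that times out.
    raise TimeoutError("no response")


# Hypothetical API with one healthy and one failing dependency.
report = run_health_check(
    "parcel-api",
    {"database": lambda: True, "address-service": failing_probe},
)
```

Because each dependency's state is reported by name, a dashboard built on reports like this can point to the failing dependency rather than just the affected API, which is the kind of shortcut to identification the quote refers to.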
Australia Post used an unspecified set of open source tools to create that “API health check dashboard”, followed by other “observability” dashboards useful to platform engineering.
It then started creating dashboards for individual delivery teams that showed them only the alerts they needed to see for the code and services they maintained.
In part, these dashboards helped the delivery teams to “evolve their DevOps capabilities”, Nette said, instead of fully relying on operations for alerting and support.
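The article does not say how alerts were scoped to each team. One simple way to picture it, assuming a hypothetical ownership map from service to delivery team, is to group alerts by owner so that each team's dashboard receives only its own. The names `SERVICE_OWNERS` and `alerts_by_team`, the alert fields, and the fallback to platform engineering for unowned services are all assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical ownership map: service name -> owning delivery team.
SERVICE_OWNERS = {
    "parcel-api": "parcels",
    "label-service": "parcels",
    "tracking-api": "tracking",
}


def alerts_by_team(
    alerts: Iterable[dict],
    owners: Dict[str, str],
    fallback: str = "platform-engineering",
) -> Dict[str, List[dict]]:
    """Group alerts by owning team; unowned services go to the fallback team."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for alert in alerts:
        team = owners.get(alert["service"], fallback)
        grouped[team].append(alert)
    return dict(grouped)


# Hypothetical alerts, including one for a service with no registered owner.
sample_alerts = [
    {"service": "parcel-api", "message": "error rate above threshold"},
    {"service": "tracking-api", "message": "latency above threshold"},
    {"service": "legacy-batch", "message": "job failed"},
]

team_views = alerts_by_team(sample_alerts, SERVICE_OWNERS)
```

Here `team_views["parcels"]` would hold only the parcel-api alert, while the unowned `legacy-batch` alert lands with the fallback team instead of being dropped.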
Platform engineering at Australia Post now offers “hybrid” support options to delivery teams.
“There’s full end-to-end support that we provide with delivery teams providing a third level [of] escalation, all the way up to the delivery teams providing full DevOps and doing all of the support [themselves], and [us just] providing advice where required,” Nette said.