In October 2017 I joined a new team in the BBC called Datalab. Our remit is to be the first dedicated BBC team working in data science and machine learning at scale. This is the story of how the team was formed.
Part 1: First Steps
As a newly-formed team we had some ideas of where we wanted to go and what we wanted to build, but we didn’t have any development processes to move forward with. The first task was to work out how our team should run, which meant settling on an Agile methodology (Agile being the de facto approach in the BBC). Our tooling supported two approaches: Kanban and Scrum. Initially we picked Kanban, as we felt it had a lower barrier to entry for the team, but we quickly moved to Scrum as it became clear that one of the keys to our success would be how we communicate progress to our external stakeholders. With its regular sprints and other metrics, Scrum supported our ethos.

We decided on a daily stand-up at 10am, as that falls within core hours and feels like a positive start to the day. Of course, stand-up doesn’t need to be at any particular time, but for us it feels like a good exercise to begin the working day. We began with the traditional “What did you do yesterday?” and “What will you do today?”, but that has developed into a much more focused process: we talk about anything new that’s been done, anything needing review, and anything blocked, and then anybody who would like to give an update on an in-progress task can do so, but doesn’t need to. This avoids people feeling embarrassed about continually reporting “Still working on X” and keeps the stand-ups focused.
So we now had a process, some ceremonies (stand-up, retrospective, elaboration, planning), and some tooling (we use Jira to track our issues, so we also developed a workflow to support our process). Next we needed to think about how we were going to build things!
We use Git/GitHub for code version control, so that was an easy decision: we could just set up our team in the organisation and start writing code. But we also needed to consider the skillsets of the team and our long- and short-term goals. We’re developing on Google Cloud Platform, where there are some golden languages, including Java, Go, Python, and Node.js, but as a team comprising software engineers, data engineers, and data scientists we needed to be careful about which technologies we used. We settled on Python as our house language so that all members of the team (who are already familiar with Python) could understand and work on all parts of our systems. We also decided that we wanted to run our code in containers rather than virtual machines, as this supports a nice separation of concerns. Finally, we decided to build using microservices because, while our system supports our short-term goals, we want it to be useful across the BBC; a flexible system means we’re able to pivot much more easily and support the varied needs of our clients.
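To give a flavour of what one of these services looks like (this is a hypothetical example, not one of our actual services — and our real services are Flask applications, whereas this sketch sticks to the standard library so it is self-contained), a containerised Python microservice can be as small as a single WSGI app with a health endpoint:

```python
# Minimal sketch of a containerised Python microservice (hypothetical
# service name and routes; standard library only).
import json

def app(environ, start_response):
    """Answer /healthz with a JSON status; 404 anything else."""
    if environ.get("PATH_INFO") == "/healthz":
        status, body = "200 OK", json.dumps({"status": "ok"}).encode()
    else:
        status, body = "404 Not Found", b'{"error": "not found"}'
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]

# In a container this would sit behind a WSGI server, e.g.:
#   wsgiref.simple_server.make_server("", 8080, app).serve_forever()
# (we use Green Unicorn in production).
```

Because the app is a plain WSGI callable, each microservice stays small enough for any team member to pick up, which was the point of settling on a single house language.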
Next was to put together an example microservice and deploy it. We considered our live runtime environment, and because we’re using GCP it was an obvious choice for us to use Google Kubernetes Engine as our container orchestrator and runtime environment. We needed a method of deploying to Kubernetes, and while GitOps was very tempting we have gone for Spinnaker, because it gives us the ability to do complex blue-green deployments and provides a great UI for team members who don’t need to deal with the underlying complexity of Kubernetes. Connecting this tooling together, we also use Google’s Container Builder with build triggers, which means we can support push-on-green deployments. When somebody merges to master in our repositories, our Continuous Integration pipeline will be (…we’re working on it) executed, which runs unit tests and integration tests, checks style and coverage, and then updates the module status. When we decide to release, we tag our master branch and the deployment pipeline takes over.
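As a rough sketch, a Container Builder trigger configuration for this flow might look something like the following — the step images and service name here are illustrative, not our actual config (`$PROJECT_ID` and `$TAG_NAME` are built-in Container Builder substitutions):

```yaml
# Hypothetical cloudbuild.yaml: run the checks, then build and push
# the container image when a release tag fires the trigger.
steps:
  - name: 'python:3.6'
    entrypoint: 'bash'
    args: ['-c', 'pip install -r requirements.txt && pytest && flake8']
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/example-service:$TAG_NAME', '.']
images:
  - 'gcr.io/$PROJECT_ID/example-service:$TAG_NAME'
```

Once the image lands in the Container Registry, Spinnaker picks it up from there.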
The Continuous Delivery pipeline is designed to support a team who don’t want to expend too much energy on releases, and as such (all being well) the promotion of a feature to production is a single click away after tagging a new release. When a new release is tagged in GitHub, a Container Builder trigger builds our container and pushes it to our Container Registry. Spinnaker then takes over, triggered by the new container image becoming available, and performs an immediate blue-green deployment onto our stage environment, destroying the previous environment. To support this we have static IP addresses for every microservice’s stage and production endpoint, and a cluster of that microservice (Kubernetes pods) sits behind that load balancer.
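To make the stage flow concrete, here is a toy Python model of that promotion (entirely illustrative — Spinnaker does the real orchestration): the static endpoint keeps its address while the cluster behind it is replaced.

```python
# Toy model of the blue-green stage deployment. The point: the static
# endpoint keeps its IP while the cluster behind it is swapped out.

class Endpoint:
    """A static IP fronting whichever cluster is currently live."""

    def __init__(self, ip):
        self.ip = ip
        self.cluster = None  # image tag currently serving traffic

    def deploy(self, image_tag):
        """Stand up the new cluster behind the load balancer and
        destroy the previous one, returning its tag."""
        previous = self.cluster
        self.cluster = image_tag
        return previous

stage = Endpoint("10.0.0.1")
stage.deploy("example-service:1.0.0")
destroyed = stage.deploy("example-service:1.0.1")
# stage.ip is unchanged; destroyed == "example-service:1.0.0"
```

The old environment is destroyed on stage because nothing user-facing depends on it; production, as described next, is handled more carefully.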
When the stage deployment is complete, Spinnaker messages us on Slack to tell us to perform our manual verification checks. All being well, an approval is made and the same container is picked up from stage and promoted to production. This blue-green deployment is more involved: we start traffic-splitting and performing canary tests to ensure the deployment is operating correctly, and when it is fully deployed the previous cluster is disabled and left in place for a few hours in case we need to switch back to it quickly. We could simply redeploy a previous version of the container instead, but we feel keeping the old cluster around is good practice. In future we’re planning a lot more functionality in this space, with internal traffic-splitting and so on, but for now we have a serviceable and strong pipeline.
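The production traffic-splitting can be pictured as a weighted router — again a hypothetical sketch, not how Spinnaker implements it — where the canary’s share of traffic ramps up as its checks keep passing:

```python
import random

# Toy sketch of traffic-splitting during a canary rollout.

def route(canary_weight, rng=random.random):
    """Send a given fraction of requests to the canary cluster.

    canary_weight is the share of traffic (0.0 to 1.0) for the new
    release; the rest stays on the previous, known-good cluster.
    """
    return "canary" if rng() < canary_weight else "stable"

# A rollout ramps the weight up as the canary's checks pass:
# at 0.0 every request hits the stable cluster; at 1.0 the canary
# takes all traffic and the old cluster can be disabled.
```

Injecting the random source (`rng`) keeps the routing decision testable; at the extremes the behaviour is deterministic.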
We’re running in the cloud now and have a great deployment pipeline, although it takes time to set all of this up. It would be easy to assume that everything in the cloud is resilient and stable and that we’ll never lose it all, but being sceptical, you have to treat that as a real possibility. In any event, we decided not to do any in-place updates of Kubernetes, as that could leave us in a broken state in production; instead we duplicate the entire infrastructure and then cut over using DNS. To support this we have some automation, including Slipway, which provisions everything we need quickly and in an easily reproducible manner. This also means we can get everything up and running in a new project within minutes, and our infrastructure is fully immutable.
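Slipway itself is internal tooling, but the cutover idea can be sketched like this — the dict stands in for a real DNS provider, and the function names are hypothetical, not Slipway’s API:

```python
# Hypothetical sketch of an immutable-infrastructure upgrade: rather
# than updating Kubernetes in place, provision a complete duplicate
# and cut over via DNS.

dns = {"api.example.internal": "cluster-a"}

def provision_cluster(name):
    """Stand up a fresh, fully-provisioned duplicate cluster (stub)."""
    return name

def cut_over(hostname, new_cluster):
    """Point DNS at the new cluster. The old cluster is untouched,
    so flipping back is just as quick if something goes wrong."""
    old_cluster = dns[hostname]
    dns[hostname] = new_cluster
    return old_cluster

new = provision_cluster("cluster-b")
rollback_target = cut_over("api.example.internal", new)
# dns now points at cluster-b; cluster-a is kept for a fast rollback.
```

Because nothing is ever mutated in place, the worst case after a bad upgrade is flipping the DNS record back.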
Our latest discussions have been about migrating our internal microservices from REST to gRPC. Our public API still runs REST, implemented as containerised Flask/Green Unicorn applications, while our internal microservices run on gRPC. More on that next time!
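In the meantime, to give a flavour of the difference (this is a hypothetical service, not one of ours): where a REST endpoint is a URL plus JSON, a gRPC interface is declared in a `.proto` file and the Python client and server stubs are generated from it.

```proto
// Hypothetical internal service definition in proto3; grpcio-tools
// generates the Python stubs from this.
syntax = "proto3";

service Recommender {
  rpc Recommend (RecommendRequest) returns (RecommendResponse);
}

message RecommendRequest {
  string user_id = 1;
}

message RecommendResponse {
  repeated string content_ids = 1;
}
```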