---
title: BlackRock Case Study
case_study_styles: true
cid: caseStudies
css: /css/style_blackrock.css
---
<div class="banner1">
<h1> CASE STUDY: <img src="/images/blackrock_logo.png" class="header_logo"><br>
<div class="subhead">Rolling Out Kubernetes in Production in 100 Days</div>
</h1>
</div>
<div class="details">
Company &nbsp;<b>BlackRock</b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Location &nbsp;<b>New York, NY</b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Industry &nbsp;<b>Financial Services</b>
</div>
<hr>
<section class="section1">
<div class="cols">
<div class="col1">
<h2>Challenge</h2>
The world's largest asset manager, <a href="https://www.blackrock.com/investing">BlackRock</a> operates a very controlled static deployment scheme, which has allowed for scalability over the years. But in their data science division, there was a need for more dynamic access to resources. "We want to be able to give every investor access to data science, meaning <a href="https://www.python.org">Python</a> notebooks, or even something much more advanced, like a MapReduce engine based on <a href="https://spark.apache.org">Spark</a>," says Michael Francis, a Managing Director in BlackRock's Product Group, which runs the company's investment management platform. "Managing complex Python installations on users' desktops is really hard because everyone ends up with slightly different environments. We have existing environments that do these things, but we needed to make it real, expansive and scalable. Being able to spin that up on demand, tear it down, make that much more dynamic, became a critical thought process for us. It's not so much that we had to solve our main core production problem, it's how do we extend that? How do we evolve?"
</div>
<div class="col2">
<h2>Solution</h2>
Drawing from what they learned during a pilot done last year using <a href="https://www.docker.com">Docker</a> environments, Francis put together a cross-sectional team of 20 to build an investor research web app using <a href="https://kubernetes.io">Kubernetes</a> with the goal of getting it into production within one quarter.
<br><br>
<h2>Impact</h2>
"Our goal was: How do you give people tools rapidly without having to install them on their desktop?" says Francis. And the team hit the goal within 100 days. Francis is pleased with the results and says, "We're going to use this infrastructure for lots of other application workloads as time goes on. It's not just data science; it's this style of application that needs the dynamism. But I think we're 6-12 months away from making a [large scale] decision. We need to gain experience of running the system in production, we need to understand failure modes and how best to manage operational issues. What's interesting is that just having this technology there is changing the way our developers are starting to think about their future development."
</div>
</div>
</section>
<div class="banner2">
<div class="banner2text">
"My message to other enterprises like us is you can actually integrate Kubernetes into an existing, well-orchestrated machinery. You don't have to throw out everything you do. And using Kubernetes made a complex problem significantly easier."<br style="height:25px"><span style="font-size:14px;letter-spacing:2px;text-transform:uppercase;margin-top:5% !important;"><br>- Michael Francis, Managing Director, BlackRock</span>
</div>
</div>
<section class="section2">
<div class="fullcol">
One of the management objectives for BlackRock's Product Group employees in 2017 was to "build cool stuff." Led by Managing Director Michael Francis, a cross-sectional group of 20 did just that: They rolled out a full production Kubernetes environment and released a new investor research web app on it. In 100 days.<br><br>
For a company that's the world's largest asset manager, "just equipment procurement can take 100 days sometimes, let alone from inception to delivery," says Karl Wieman, a Senior System Administrator. "It was an aggressive schedule. But it moved the dial."<br><br>
In fact, the project achieved two goals: It solved a business problem (creating the needed web app) and provided real-world, in-production experience with Kubernetes, a cloud-native technology that the company was eager to explore. "It's not so much that we had to solve our main core production problem, it's how do we extend that? How do we evolve?" says Francis. The ultimate success of this project, beyond delivering the app, lies in the fact that "we've managed to integrate a radically new thought process into a controlled infrastructure that we didn't want to change."<br><br>
After all, in its three decades of existence, BlackRock has "a very well-established environment for managing our compute resources," says Francis. "We manage large cluster processes on machines, so we do a lot of orchestration and management for our main production processes in a way that's very cloudish in concept. We're able to manage them in a very controlled, static deployment scheme, and that has given us a huge amount of scalability."<br><br>
Though that works well for the core production, the company has found that some data science workloads require more dynamic access to resources. "It's a very bursty process," says Francis, who is head of data for the company's Aladdin investment management platform division.<br><br>
Aladdin, which connects the people, information and technology needed for money management in real time, is used internally and is also sold as a platform to other asset managers and insurance companies. "We want to be able to give every investor access to data science, meaning <a href="https://www.python.org">Python</a> notebooks, or even something much more advanced, like a MapReduce engine based on <a href="https://spark.apache.org">Spark</a>," says Francis. But "managing complex Python installations on users' desktops is really hard because everyone ends up with slightly different environments. Docker allows us to flatten that environment."
</div>
</section>
<div class="banner3">
<div class="banner3text">
"We manage large cluster processes on machines, so we do a lot of orchestration and management for our main production processes in a way that's very cloudish in concept. We're able to manage them in a very controlled, static deployment scheme, and that has given us a huge amount of scalability."
</div>
</div>
<section class="section3">
<div class="fullcol">
Still, challenges remain. "If you have a shared cluster, you get this storming herd problem where everyone wants to do the same thing at the same time," says Francis. "You could put limits on it, but you'd have to build an infrastructure to define limits for our processes, and the Python notebooks weren't really designed for that. We have existing environments that do these things, but we needed to make it real, expansive, and scalable. Being able to spin that up on demand, tear it down, and make that much more dynamic, became a critical thought process for us."<br><br>
Made up of managers from technology, infrastructure, production operations, development and information security, Francis's team was able to look at the problem holistically and come up with a solution that made sense for BlackRock. "Our initial straw man was that we were going to build everything using <a href="https://www.ansible.com">Ansible</a> and run it all using some completely different distributed environment," says Francis. "That would have been absolutely the wrong thing to do. Had we gone off on our own as the dev team and developed this solution, it would have been a very different product. And it would have been very expensive. We would not have gone down the route of running under our existing orchestration system. Because we don't understand it. These guys [in operations and infrastructure] understand it. Having the multidisciplinary team allowed us to get to the right solutions and that actually meant we didn't build anywhere near the amount we thought we were going to end up building."<br><br>
In search of a solution in which they could manage usage on a user-by-user level, Francis's team gravitated to Red Hat's <a href="https://www.openshift.com">OpenShift</a> Kubernetes offering. The company had already experimented with other cloud-native environments, but the team liked that Kubernetes was open source, and "we felt the winds were blowing in the direction of Kubernetes long term," says Francis. "Typically we make technology choices that we believe are going to be here in 5-10 years' time, in some form. And right now, in this space, Kubernetes feels like the one that's going to be there." Adds Uri Morris, Vice President of Production Operations: "When you see that the non-Google committers to Kubernetes overtook the Google committers, that's an indicator of the momentum."<br><br>
Once that decision was made, the major challenge was figuring out how to make Kubernetes work within BlackRock's existing framework. "It's about understanding how we can operate, manage and support a platform like this, in addition to tacking it onto our existing technology platform," says Project Manager Michael Maskallis. "All the controls we have in place, the change management process, the software development lifecycle, onboarding processes we go through: how can we do all these things?"<br><br>
The first (anticipated) speed bump was working around issues behind BlackRock's corporate firewalls. "One of our challenges is there are no firewalls in most open source software," says Francis. "So almost all install scripts fail in some bizarre way, and pulling down packages doesn't necessarily work." The team ran into these types of problems using <a href="/docs/getting-started-guides/minikube/">Minikube</a> and did a few small pushes back to the open source project.
</div>
</section>
<div class="banner4">
<div class="banner4text">
"Typically we make technology choices that we believe are going to be here in 5-10 years' time, in some form. And right now, in this space, Kubernetes feels like the one that's going to be there."
</div>
</div>
<section class="section4">
<div class="fullcol">
There were also questions about service discovery. "You can think of Aladdin as a cloud of services with APIs between them that allows us to build applications rapidly," says Francis. "It's all on a proprietary message bus, which gives us all sorts of advantages but at the same time, how does that play in a third party [platform]?"<br><br>
Another issue they had to navigate was that in BlackRock's existing system, the messaging protocol has different instances in the different development, test and production environments. While Kubernetes enables a more DevOps-style model, it didn't make sense for BlackRock. "I think what we are very proud of is that the ability for us to push into production is still incredibly rapid in this [new] infrastructure, but we have the control points in place, and we didn't have to disrupt everything," says Francis. "A lot of the cost of this development was thinking how best to leverage our internal tools. So it was less costly than we actually thought it was going to be."<br><br>
The project leveraged tools associated with the messaging bus, for example. "The way that the Kubernetes cluster will talk to our internal messaging platform is through a gateway program, and this gateway program already has built-in checks and throttles," says Morris. "We can use them to control and potentially throttle the requests coming in from Kubernetes's very elastic infrastructure to the production infrastructure. We'll continue to go in that direction. It enables us to scale as we need to from the operational perspective."<br><br>
The solution also had to be complementary with BlackRock's centralized operational support team structure. "The core infrastructure components of Kubernetes are hooked into our existing orchestration framework, which means that anyone in our support team has both control and visibility to the cluster using the existing operational tools," Morris explains. "That means that I don't need to hire more people."<br><br>
With those points established, the team created a procedure for the project: "We rolled this out first to a development environment, then moved on to a testing environment and then eventually to two production environments, in that sequential order," says Maskallis. "That drove a lot of our learning curve. We have all these moving parts, the software components on the infrastructure side, the software components with Kubernetes directly, the interconnectivity with the rest of the environment that we operate here at BlackRock, and how we connect all these pieces. If we came across issues, we fixed them, and then moved on to the different environments to replicate that until we eventually ended up in our production environment where this particular cluster is supposed to live."<br><br>
The team had weekly one-hour working sessions with all the members (who are located around the world) participating, and smaller breakout or deep-dive meetings focusing on specific technical details. Possible solutions would be reported back to the group and debated the following week. "I think what made it a successful experiment was people had to work to learn, and they shared their experiences with others," says Vice President and Software Developer Fouad Semaan. Then, Francis says, "We gave our engineers the space to do what they're good at. This hasn't been top-down."
</div>
</section>
<div class="banner5">
<div class="banner5text">
"The core infrastructure components of Kubernetes are hooked into our existing orchestration framework, which means that anyone in our support team has both control and visibility to the cluster using the existing operational tools. That means that I don't need to hire more people."
</div>
</div>
<section class="section5">
<div class="fullcol">
They were led by one key axiom: To stay focused and avoid scope creep. This meant that they wouldn't use features that weren't in the core of Kubernetes and Docker. But if there was a real need, they'd build the features themselves. Luckily, Francis says, "Because of the rapidity of the development, a lot of things we thought we would have to build ourselves have been rolled into the core product. [The package manager <a href="https://helm.sh">Helm</a> is one example]. People have similar problems."<br><br>
By the end of the 100 days, the app was up and running for internal BlackRock users. The initial capacity of 30 users was hit within hours, and quickly increased to 150. "People were immediately all over it," says Francis. In the next phase of this project, they are planning to scale up the cluster to have more capacity.<br><br>
Even more importantly, they now have in-production experience with Kubernetes that they can continue to build on, and a complete framework for rolling out new applications. "We're going to use this infrastructure for lots of other application workloads as time goes on. It's not just data science; it's this style of application that needs the dynamism," says Francis. "Is it the right place to move our core production processes onto? It might be. We're not at a point where we can say yes or no, but we felt that having real production experience with something like Kubernetes at some form and scale would allow us to understand that. I think we're 6-12 months away from making a [large scale] decision. We need to gain experience of running the system in production, we need to understand failure modes and how best to manage operational issues."<br><br>
For other big companies considering a project like this, Francis says commitment and dedication are key: "We got the signoff from [senior management] from day one, with the commitment that we were able to get the right people. If I had to isolate what makes something complex like this succeed, I would say senior hands-on people who can actually drive it make a huge difference." With that in place, he adds, "My message to other enterprises like us is you can actually integrate Kubernetes into an existing, well-orchestrated machinery. You don't have to throw out everything you do. And using Kubernetes made a complex problem significantly easier."
</div>
</section>