148 lines
26 KiB
HTML
148 lines
26 KiB
HTML
---
|
||
title: 案例研究:Workiva
|
||
|
||
case_study_styles: true
|
||
cid: caseStudies
|
||
css: /css/style_workiva.css
|
||
---
|
||
|
||
<div class="banner1">
|
||
<!-- <h1> CASE STUDY:<img src="/images/workiva_logo.png" style="margin-bottom:0%" class="header_logo"><br> <div class="subhead">Using OpenTracing to Help Pinpoint the Bottlenecks
|
||
|
||
</div></h1> -->
|
||
<h1> 案例研究:<img src="/images/workiva_logo.png" style="margin-bottom:0%" class="header_logo"><br> <div class="subhead">使用 OpenTracing 帮助查找瓶颈
|
||
|
||
</div></h1>
|
||
|
||
|
||
</div>
|
||
|
||
<!-- <div class="details">
|
||
Company <b>Workiva</b> Location <b>Ames, Iowa</b> Industry <b>Enterprise Software</b>
|
||
</div> -->
|
||
<div class="details">
|
||
公司 <b>Workiva</b> Location <b>艾姆斯,爱荷华州</b> 行业 <b>企业软件</b>
|
||
</div>
|
||
|
||
|
||
<hr>
|
||
<section class="section1">
|
||
<div class="cols">
|
||
<div class="col1">
|
||
<h2>挑战</h2>
|
||
<!-- <a href="https://www.workiva.com/">Workiva</a> offers a cloud-based platform for managing and reporting business data. This SaaS product, Wdesk, is used by more than 70 percent of the Fortune 500 companies. As the company made the shift from a monolith to a more distributed, microservice-based system, "We had a number of people working on this, all on different teams, so we needed to identify what the issues were and where the bottlenecks were," says Senior Software Architect MacLeod Broad. With back-end code running on Google App Engine, Google Compute Engine, as well as Amazon Web Services, Workiva needed a tracing system that was agnostic of platform. While preparing one of the company’s first products utilizing AWS, which involved a "sync and link" feature that linked data from spreadsheets built in the new application with documents created in the old application on Workiva’s existing system, Broad’s team found an ideal use case for tracing: There were circular dependencies, and optimizations often turned out to be micro-optimizations that didn’t impact overall speed. -->
|
||
<a href="https://www.workiva.com/">Workiva</a>提供了一个基于云的平台,用于管理和报告业务数据。此 SaaS 产品 Wdesk 被 70% 以上的财富 500 强企业使用。随着公司从单体系统向更分散的基于微服务的系统转变,“我们有许多人在研究这个问题,他们都在不同的团队中,所以我们需要确定问题是什么,瓶颈在哪里,”高级软件架构师 MacLeod Broad 说。随着后端代码在 Google App Engine,Google Compute Engine,以及 Amazon Web Services 上运行,Workiva 需要一个还未被开发的跟踪系统。在准备公司首批使用 AWS 的产品时,Broad 的团队将新应用程序中构建的电子表格数据与 Workiva 现有系统上的旧应用程序中创建的文档相关联,该功能涉及“同步和链接”功能,从而发现了一个理想的跟踪用例:存在循环依赖关系,而优化通常证明是不影响整体速度的微优化。
|
||
|
||
<br>
|
||
|
||
</div>
|
||
|
||
<div class="col2">
|
||
<h2>解决方案</h2>
|
||
<!-- Broad’s team introduced the platform-agnostic distributed tracing system OpenTracing to help them pinpoint the bottlenecks. -->
|
||
Broad 的团队推出了与平台无关的分布式跟踪系统 OpenTracing,帮助他们找出瓶颈。
|
||
<br>
|
||
<h2>影响</h2>
|
||
<!-- Now used throughout the company, OpenTracing produced immediate results. Software Engineer Michael Davis reports: "Tracing has given us immediate, actionable insight into how to improve our service. Through a combination of seeing where each call spends its time, as well as which calls are most often used, we were able to reduce our average response time by 95 percent (from 600ms to 30ms) in a single fix." -->
|
||
现在在整个公司使用,OpenTracing 产生了立竿见影的效果。软件工程师 Michael Davis 报告说:“跟踪让我们能够立即了解如何改进我们的服务。通过查看每个调用的花费时间以及最常使用哪些调用的组合,我们能够在单个修复中将平均响应时间减少 95%(从 600ms 到 30ms)。”
|
||
|
||
</div>
|
||
|
||
</div>
|
||
</section>
|
||
<div class="banner2">
|
||
<div class="banner2text">
|
||
<!-- "With OpenTracing, my team was able to look at a trace and make optimization suggestions to another team without ever looking at their code." <span style="font-size:14px;letter-spacing:0.12em;padding-top:20px;text-transform:uppercase"><br>— MacLeod Broad, Senior Software Architect at Workiva</span> -->
|
||
使用 OpenTracing,我的团队能够查看跟踪,并针对其他团队提出优化建议,而无需查看他们的代码。<span style="font-size:14px;letter-spacing:0.12em;padding-top:20px;text-transform:uppercase"><br>— MacLeod Broad, Workiva 高级软件架构师</span>
|
||
|
||
</div>
|
||
</div>
|
||
<section class="section2">
|
||
<div class="fullcol">
|
||
<!-- <h2>Last fall, MacLeod Broad’s platform team at Workiva was prepping one of the company’s first products utilizing <a href="https://aws.amazon.com/">Amazon Web Services</a> when they ran into a roadblock.</h2> -->
|
||
<h2>去年秋天,MacLeod Broad 在 Workiva 的平台团队在准备该公司的首批产品之一,当遇到障碍时就使用 <a href="https://aws.amazon.com/">Amazon Web Services</a> 。</h2>
|
||
<!-- Early on, Workiva’s backend had run mostly on <a href="https://cloud.google.com/appengine/">Google App Engine</a>. But things changed along the way as Workiva’s SaaS offering, <a href="https://www.workiva.com/wdesk">Wdesk</a>, a cloud-based platform for managing and reporting business data, grew its customer base to more than 70 percent of the Fortune 500 companies. "As customer needs grew and the product offering expanded, we started to leverage a wider offering of services such as Amazon Web Services as well as other Google Cloud Platform services, creating a multi-vendor environment."<br><br> -->
|
||
早期,Workiva 的后端主要运行在<a href="https://cloud.google.com/appengine/">Google App Engine</a>上。但随着 Workiva 的 SaaS 产品,<a href="https://www.workiva.com/wdesk">Wdesk</a>,一种基于云的管理和报告业务数据平台,将客户群增长到财富 500 强公司的 70% 以上,情况发生了变化。“随着客户需求的增长和产品供应的扩大,我们开始利用更广泛的服务,如 Amazon Web Services 以及其他 Google Cloud Platform 服务,从而创建一个多供应商环境。”
|
||
|
||
<!-- With this new product, there was a "sync and link" feature by which data "went through a whole host of services starting with the new spreadsheet system [<a href="https://aws.amazon.com/rds/aurora/">Amazon Aurora</a>] into what we called our linking system, and then pushed through http to our existing system, and then a number of calculations would go on, and the results would be transmitted back into the new system," says Broad. "We were trying to optimize that for speed. We thought we had made this great optimization and then it would turn out to be a micro optimization, which didn’t really affect the overall speed of things." <br><br> -->
|
||
有了这个新产品,具备“同步和链接”功能,通过它,数据“经历了一大堆服务,从新的电子表格系统[<a href="https://aws.amazon.com/rds/aurora/">Amazon Aurora</a>]到我们所谓的链接系统,然后通过 http 推送到我们现有的系统,然后进行一些计算,结果将传回新系统,”Broad 说。“我们当时进行优化是为了速度。我们认为做了这种极好的优化,然后它会变成一个微观优化,这并没有真正影响系统的整体速度。”<br><br>
|
||
<!-- The challenges faced by Broad’s team may sound familiar to other companies that have also made the shift from monoliths to more distributed, microservice-based systems. "We had a number of people working on this, all on different teams, so it was difficult to get our head around what the issues were and where the bottlenecks were," says Broad.<br><br> -->
|
||
Broad 团队面临的挑战听起来可能为其他公司所熟悉,这些公司也从单体系统转向更分散的基于微服务的系统。Broad 说:“我们有许多人从事这方面的工作,他们都在不同的团队中,因此很难了解问题是什么以及瓶颈在哪里。”<br><br>
|
||
<!-- "Each service team was going through different iterations of their architecture and it was very hard to follow what was actually going on in each teams’ system," he adds. "We had circular dependencies where we’d have three or four different service teams unsure of where the issues really were, requiring a lot of back and forth communication. So we wasted a lot of time saying, ‘What part of this is slow? Which part of this is sometimes slow depending on the use case? Which part is degrading over time? Which part of this process is asynchronous so it doesn’t really matter if it’s long-running or not? What are we doing that’s redundant, and which part of this is buggy?’" -->
|
||
“每个服务团队都经历着不同的架构迭代,很难了解每个团队系统中的实际内容,”他补充道。“我们有循环依赖关系,其中有三个或四个不同的服务团队不确定问题的真正位置,需要大量的来回沟通。因此,我们浪费了很多时间说,‘这个哪一部分比较慢?根据应用场景的不同,哪一部分有时很慢?随着时间推移,哪一部分会逐渐变慢?这个进程的哪一部分是异步的,因此,它是否长时间运行并不重要?我们在做的哪些是多余的,其中哪一部分是错误?’”
|
||
|
||
|
||
</div>
|
||
</section>
|
||
<div class="banner3">
|
||
<div class="banner3text">
|
||
<!-- "A tracing system can at a glance explain an architecture, narrow down a performance bottleneck and zero in on it, and generally just help direct an investigation at a high level. Being able to do that at a glance is much faster than at a meeting or with three days of debugging, and it’s a lot faster than never figuring out the problem and just moving on."<span style="font-size:14px;letter-spacing:0.12em;padding-top:20px;text-transform:uppercase"><br>— MACLEOD BROAD, SENIOR SOFTWARE ARCHITECT AT WORKIVA</span> -->
|
||
“跟踪系统可以一目了然地解释体系结构,缩小性能瓶颈并归零,通常只是帮助指导高级别的调查。一眼就能做到这一点,比在会议或三天的调试中要快得多,而且比从来不找出问题得过且过要快得多。”<span style="font-size:14px;letter-spacing:0.12em;padding-top:20px;text-transform:uppercase"><br>— MACLEOD BROAD, WORKIVA 高级软件架构师</span>
|
||
|
||
</div>
|
||
</div>
|
||
<section class="section3">
|
||
<div class="fullcol">
|
||
|
||
<!-- Simply put, it was an ideal use case for tracing. "A tracing system can at a glance explain an architecture, narrow down a performance bottleneck and zero in on it, and generally just help direct an investigation at a high level," says Broad. "Being able to do that at a glance is much faster than at a meeting or with three days of debugging, and it’s a lot faster than never figuring out the problem and just moving on."<br><br> -->
|
||
简而言之,它是跟踪的理想用例。Broad 说:“跟踪系统可以一目了然地解释体系结构,缩小性能瓶颈并归零,通常只是帮助在高级别指导调查。”“一眼就能做到这一点,比在会议或三天的调试中要快得多,而且比从不找出问题得过且过要快得多。”<br><br>
|
||
<!-- With Workiva’s back-end code running on <a href="https://cloud.google.com/compute/">Google Compute Engine</a> as well as App Engine and AWS, Broad knew that he needed a tracing system that was platform agnostic. "We were looking at different tracing solutions," he says, "and we decided that because it seemed to be a very evolving market, we didn’t want to get stuck with one vendor. So OpenTracing seemed like the cleanest way to avoid vendor lock-in on what backend we actually had to use."<br><br> -->
|
||
Workiva 的后端代码在 <a href="https://cloud.google.com/compute/">Google Compute Engine</a> 、App Engine 和 AWS 上运行,Broad 知道他需要一个与平台无关的跟踪系统。“我们正在研究不同的追踪解决方案,”他说,“我们决定,由于市场似乎是一个不断发展的市场,我们不想被一个供应商卡住。因此,OpenTracing 似乎是避免供应商限制我们实际使用的后端的最简捷方法。”<br><br>
|
||
<!-- Once they introduced OpenTracing into this first use case, Broad says, "The trace made it super obvious where the bottlenecks were." Even though everyone had assumed it was Workiva’s existing code that was slowing things down, that wasn’t exactly the case. "It looked like the existing code was slow only because it was reaching out to our next-generation services, and they were taking a very long time to service all those requests," says Broad. "On the waterfall graph you can see the exact same work being done on every request when it was calling back in. So every service request would look the exact same for every response being paged out. And then it was just a no-brainer of, ‘Why is it doing all this work again?’"<br><br> -->
|
||
Broad 说,一旦他们将 OpenTrace 引入到第一个用例中,“跟踪使得瓶颈所在位置变得非常明显。”尽管每个人都认为是 Workiva 的现有代码减缓了速度,但事实并非如此。Broad 说:“看起来现有代码速度很慢,只是因为它能够达到到我们下一代服务的水平,而且它们需要很长时间才能处理所有这些请求。”“在瀑布图上,你可以看到每个请求在得到响应时所做的完全相同的工作。因此,对于被分页的每个响应,每个服务请求看起来都完全相同。然后,就不用费神去思考‘为什么它再次做所有这些工作?’”<br><br>
|
||
<!-- Using the insight OpenTracing gave them, "My team was able to look at a trace and make optimization suggestions to another team without ever looking at their code," says Broad. "The way we named our traces gave us insight whether it’s doing a SQL call or it’s making an RPC. And so it was really easy to say, ‘OK, we know that it’s going to page through all these requests. Do the work once and stuff it in cache.’ And we were done basically. All those calls became sub-second calls immediately."<br><br> -->
|
||
使用 OpenTrace 的跟踪功能,“我的团队能够查看跟踪,并针对另一个团队提出优化建议,而无需查看他们的代码,”Broad 说。“我们命名跟踪的方式可以说明请求是执行 SQL 调用还是创建 RPC。因此,可以非常简单地说,‘好吧,我们知道它能处理所有这些请求。处理一次缓存一次。’我们基本上完成了。所有这些调用立即变成了亚秒级的调用。”<br><br>
|
||
|
||
</div>
|
||
</section>
|
||
|
||
<div class="banner4">
|
||
<div class="banner4text">
|
||
<!-- "We were looking at different tracing solutions and we decided that because it seemed to be a very evolving market, we didn’t want to get stuck with one vendor. So OpenTracing seemed like the cleanest way to avoid vendor lock-in on what backend we actually had to use." <span style="font-size:14px;letter-spacing:0.12em;padding-top:20px;text-transform:uppercase"><br>— MACLEOD BROAD, SENIOR SOFTWARE ARCHITECT AT WORKIVA</span> -->
|
||
“我们正在研究不同的跟踪解决方案,我们决定使用 OpenTracing,由于市场似乎是一个不断发展的市场,我们不想被一个供应商卡住。因此,OpenTracing 似乎是避免供应商限制我们实际使用的后端的最简捷方法。”<span style="font-size:14px;letter-spacing:0.12em;padding-top:20px;text-transform:uppercase"><br>— MACLEOD BROAD, WORKIVA 高级软件架构师</span>
|
||
|
||
</div>
|
||
</div>
|
||
|
||
<section class="section4">
|
||
<div class="fullcol">
|
||
<!-- After the success of the first use case, everyone involved in the trial went back and fully instrumented their products. Tracing was added to a few more use cases. "We wanted to get through the initial implementation pains early without bringing the whole department along for the ride," says Broad. "Now, a lot of teams add it when they’re starting up a new service. We’re really pushing adoption now more than we were before." <br><br> -->
|
||
在第一个用例成功后,参与尝试的每个人都回去重新编排了他们的产品。跟踪功能已添加到几个更多的用例中。Broad 说:“我们希望尽早度过最初的实施难关,而不会让整个部门一起渡过难关。”“现在,许多团队在启动新服务时添加它。我们现在确实比以前更将推动采用。”<br><br>
|
||
<!-- Some teams were won over quickly. "Tracing has given us immediate, actionable insight into how to improve our [Workspaces] service," says Software Engineer Michael Davis. "Through a combination of seeing where each call spends its time, as well as which calls are most often used, we were able to reduce our average response time by 95 percent (from 600ms to 30ms) in a single fix." <br><br> -->
|
||
一些团队很快就赢了。软件工程师 Michael Davis 说:“跟踪让我们能够立即了解如何改进我们的(工作空间)服务。通过查看每个调用的花费时间以及最常使用哪些调用的组合,我们能够在单个修复中将平均响应时间减少 95%(从 600ms 到 30ms)”。<br><br>
|
||
<!-- Most of Workiva’s major products are now traced using OpenTracing, with data pushed into <a href="https://cloud.google.com/stackdriver/">Google StackDriver</a>. Even the products that aren’t fully traced have some components and libraries that are. <br><br> -->
|
||
Workiva 的大部分主要产品现在都使用 OpenTracing 进行跟踪,将数据推送到 <a href="https://cloud.google.com/stackdriver/">Google StackDriver</a>。即使未完全使用跟踪功能的产品,也具有一些组件和库。<br><br>
|
||
<!-- Broad points out that because some of the engineers were working on App Engine and already had experience with the platform’s Appstats library for profiling performance, it didn’t take much to get them used to using OpenTracing. But others were a little more reluctant. "The biggest hindrance to adoption I think has been the concern about how much latency is introducing tracing [and StackDriver] going to cost," he says. "People are also very concerned about adding middleware to whatever they’re working on. Questions about passing the context around and how that’s done were common. A lot of our Go developers were fine with it, because they were already doing that in one form or another. Our Java developers were not super keen on doing that because they’d used other systems that didn’t require that."<br><br> -->
|
||
Broad 指出,由于一些工程师正在使用 App Engine,并且已经拥有平台的 Appstats 库分析性能的经验,因此他们不需要花太多时间才能习惯使用 OpenTracing。但其他人则有点不情愿。“我认为,使用 OpenTracing 的最大障碍是担心引入跟踪和 StackDriver 的成本会是多少,”他说。“人们也非常关心将中间件添加到他们正在处理的任何内容中。有关传递上下文以及如何完成这些工作的问题很常见。我们的许多 Go 开发人员对此都很好,因为他们已经以这样或那样的形式这样做了。我们的 Java 开发人员并不热衷于这样做,因为他们使用了其他不需要该功能的系统。”<br><br>
|
||
<!-- But the benefits clearly outweighed the concerns, and today, Workiva’s official policy is to use tracing." -->
|
||
但好处显然超过了人们的担忧,而今天,Workiva 的官方政策是使用追踪。
|
||
<!-- In fact, Broad believes that tracing naturally fits in with Workiva’s existing logging and metrics systems. "This was the way we presented it internally, and also the way we designed our use," he says. "Our traces are logged in the exact same mechanism as our app metric and logging data, and they get pushed the exact same way. So we treat all that data exactly the same when it’s being created and when it’s being recorded. We have one internal library that we use for logging, telemetry, analytics and tracing." -->
|
||
事实上,Broad 认为跟踪天然符合 Workiva 现有的日志记录和指标系统。“这是我们在内部展示的方式,也是我们设计使用方式的方式,”他说。“我们的跟踪记录在与我们的应用指标和日志记录数据完全相同的机制中,并且它们被推送的方式完全相同。因此,在创建和记录所有数据时,我们对待这些数据完全一样。我们有一个内部库,用于日志记录、遥测、分析和跟踪。”
|
||
|
||
|
||
</div>
|
||
</section>
|
||
|
||
<div class="banner5">
|
||
<div class="banner5text">
|
||
<!-- "Tracing has given us immediate, actionable insight into how to improve our [Workspaces] service. Through a combination of seeing where each call spends its time, as well as which calls are most often used, we were able to reduce our average response time by 95 percent (from 600ms to 30ms) in a single fix." <span style="font-size:14px;letter-spacing:0.12em;padding-top:20px;text-transform:uppercase"><br>— Michael Davis, Software Engineer, Workiva </span> -->
|
||
“跟踪让我们能够立即、可操作地洞察如何改进我们的(工作空间)服务。通过查看每个调用花费的时间以及最常使用哪些调用的组合,我们能够在单个修复中将平均响应时间减少 95%(从 600ms 到 30ms)。”<span style="font-size:14px;letter-spacing:0.12em;padding-top:20px;text-transform:uppercase"><br>— Michael Davis, 软件工程师, Workiva </span>
|
||
|
||
</div>
|
||
</div>
|
||
|
||
<section class="section5">
|
||
<div class="fullcol">
|
||
<!-- For Workiva, OpenTracing has become an essential tool for zeroing in on optimizations and determining what’s actually a micro-optimization by observing usage patterns. "On some projects we often assume what the customer is doing, and we optimize for these crazy scale cases that we hit 1 percent of the time," says Broad. "It’s been really helpful to be able to say, ‘OK, we’re adding 100 milliseconds on every request that does X, and we only need to add that 100 milliseconds if it’s the worst of the worst case, which only happens one out of a thousand requests or one out of a million requests."<br><br> -->
|
||
对于 Workiva,OpenTracing 已成为一种重要工具,通过观察使用模式来聚焦最佳优化和确定什么是微观优化。Broad 说:“在某些项目中,我们经常假设客户正在做什么,我们针对这些疯狂的测试案例进行优化,这些案例的 1% 是同用户实际非常相符的。”“这样是非常有益的,‘好吧,我们在每个执行 X 的请求上添加 100 毫秒,如果最坏的情况,我们也只需要添加 100 毫秒,这仅发生在千分之一的请求或一百万个请求中的一个。’”<br><br>
|
||
<!-- Unlike many other companies, Workiva also traces the client side. "For us, the user experience is important—it doesn’t matter if the RPC takes 100 milliseconds if it still takes 5 seconds to do the rendering to show it in the browser," says Broad. "So for us, those client times are important. We trace it to see what parts of loading take a long time. We’re in the middle of working on a definition of what is ‘loaded.’ Is it when you have it, or when it’s rendered, or when you can interact with it? Those are things we’re planning to use tracing for to keep an eye on and to better understand."<br><br> -->
|
||
与许多其他公司不同,Workiva 还跟踪客户端。Broad 说:“对我们来说,用户体验很重要,如果 RPC 执行后还需要 5 秒才能在浏览器中显示,则 RPC 执行需要 100 毫秒就不重要了。”“因此,对于我们而言,这些客户端响应时间非常重要。我们跟踪它,看看加载的哪些部分需要很长时间。我们正在研究什么是“加载”的定义。是当你访问到它,还是当它被渲染时,或者当你可以与它交互时?我们计划使用跟踪来关注和更好地了解这些内容。”<br><br>
|
||
<!-- That also requires adjusting for differences in external and internal clocks. "Before time correcting, it was horrible; our traces were more misleading than anything," says Broad. "So we decided that we would return a timestamp on the response headers, and then have the client reorient its time based on that—not change its internal clock but just calculate the offset on the response time to when the client got it. And if you end up in an impossible situation where a client RPC spans 210 milliseconds but the time on the response time is outside of that window, then we have to reorient that."<br><br> -->
|
||
这还需要根据外部和内部时钟的差异进行调整。“在时间纠正之前,这是非常恐怖的;我们的跟踪比什么都更具有误导性,” Broad 说。“因此,我们决定在响应标头上返回一个时间戳,然后让客户端根据该时间调整其时间,而不是更改其内部时钟,而只需计算客户端收到时响应时间的偏移量。如果最终出现一种不可能的情况,客户端的 RPC 调用持续了 210 毫秒但响应返回时间并不在这个范围内,那就必须得重新调用请求。”<br><br>
|
||
<!-- Broad is excited about the impact OpenTracing has already had on the company, and is also looking ahead to what else the technology can enable. One possibility is using tracing to update documentation in real time. "Keeping documentation up to date with reality is a big challenge," he says. "Say, we just ran a trace simulation or we just ran a smoke test on this new deploy, and the architecture doesn’t match the documentation. We can find whose responsibility it is and let them know and have them update it. That’s one of the places I’d like to get in the future with tracing." -->
|
||
Broad 对 OpenTracing 对公司产生的影响感到兴奋,并且也在展望该技术还能实现什么。一种可能性是使用跟踪实时更新文档。他表示:“使文档与现实保持同步是一项重大挑战。”“例如,我们刚刚运行了跟踪模拟,或者我们刚刚在此新部署上运行了烟雾测试,但是结果显示架构与文档不匹配。我们可以找到是谁的责任,并通知他们进行更新。这是我希望在未来通过追踪获得的功能之一。”
|
||
|
||
</div>
|
||
|
||
</section>
|