How Customers Are Migrating Hadoop to Google Cloud Platform (Cloud Next ’19)

[MUSIC PLAYING] CHRISTOPHER CROSBIE: I’m
really excited for the session, because this is the one where
we get to talk to you guys about all of the
different ways and reasons that customers are
migrating their Hadoop footprints into Google Cloud. We are going to save some
time at the end for questions, because we know this
topic gets a lot of those, and a lot of unique questions. So please, we are going to
run a Dory out of the app. Go ahead and pop into the app. And as you think of questions
as we go through, put those in. And we’ll try to get to as
many as we can at the end. So I am going to hand off
to Blake in a little bit to tell you about how
customers are actually moving their Hadoop footprints to GCP. But before we do that, I'm going to discuss a little
bit of why they’re coming to Google Cloud in the
first place for their Hadoop workloads. So there’s a lot of
benefits of migrating Hadoop to Google Cloud. First of all, it's pay-for-use. So that's just like
the rest of the cloud. You can start to map your
infrastructure to real usage. Now usually that's a really great fit for these Hadoop and Spark analytic workloads, where you have a lot of variability. You're also going to get
managed hardware and software. So just think about
the task of keeping up with all of the
open source security patches that are out
there every week. That can be quite
a daunting task, and just one of the many
things GCP can help with. You can also, as you move
Hadoop into the cloud, you get to keep the majority
of your existing code. So usually there’s
only minimal changes that are needed to run your on
prem Hadoop and Spark software on Google Cloud. You also get a lot
of flexibility. So there’s a lot of knobs
that you can turn in GCP. And so most of the
things that you’re going to do to customize
your on prem cluster, you’ll be able to do that
in Google Cloud as well. And finally, and this
is the really big one, is that Google Cloud
can really help you unlock a lot
of your HDFS data that might be siloed in various
clusters and hard to tap into. And I’ll do a whole
section on that in a bit. So Google Cloud,
we’ve been trying to solve data problems
for a long time. It started in 2004, when we came out with the original MapReduce paper, and a lot of those elements got put into the original Apache Hadoop. About a year later we released a paper about Bigtable, which influenced a lot of NoSQL databases, and a lot of that got put into HBase. More recently, we've even open
sourced things with machine learning code in TensorFlow. Now about 10 years
back, we started to actually take a lot of that
infrastructure that was built– research papers,
things that were battle tested on Gmail and maps– and we started to put
those out as products that any company–
regardless of if you’re a startup in a garage
or a large enterprise– could go, use, and start to use
the same technology that Google was using. And so we took all
of this knowledge from research and open source
and running our own cloud platform going all
the way back to 2004 with that first MapReduce
paper, we took that and we put it into
Cloud Dataproc. And if you’re not familiar
with what that is, that is Google Cloud Platform's
fully managed Apache Spark and Apache Hadoop service. I’m saying Hadoop and
Spark, but really, this is an engine for running
lots of open source software within that ecosystem. And we’re going
to give you things like customizable machine types. So let’s say you have a batch
of machine learning jobs. Those could live
on a cluster that’s got very compute
intensive specifications. At the same time, you could
have a lot of BI applications and dashboards and
interactive SQL that’s running on a cluster
that's very memory heavy. The two never have to
contend for resources, but both could still be reading
and writing from the same data set in cloud storage. And we also offer a lot of
tools and features specifically designed to help you with
those kinds of architectures, as well as ephemeral Hadoop and
Spark architectures in general. We’re also going
to try to give you tight integration back to the
rest of Google Cloud Platform. The idea with Cloud Dataproc is
that we’re not just giving you yet another cluster to manage. We’re actually going to let you
run your open source software, but in a way that will
actually modernize your stack at the same time. We also give you a
lot of flexibility when running Dataproc. We exposed a lot of those
knobs that you can turn. Now we try to put some sane defaults on those knobs, so if you don't want
to deal with them, you can get started very easily. But if you do want to
start tuning the knobs, you have the ability to
get in there and do that. But that also means
that customers are using Cloud Dataproc
in a few different ways. So a couple of the common models
that we see customers using Cloud Dataproc
with are, first off, what we call the job
scoped cluster model. And this has been
really effective for Cloud Dataproc customers
with batch processing and ETL workloads. Essentially in this mode, you
can have a single command that sends a graph of jobs
to Cloud Dataproc, we will spin up a right sized
cluster, run those jobs, then take down the cluster,
while keeping everything logged into Stackdriver. And this is really
easy on Cloud Dataproc, really effective,
because we can usually spin up a fully loaded Hadoop
cluster in around 90 seconds. So when we only have about
a minute and a half overhead to get yourself a fully
loaded Hadoop cluster, you really can start to
think about jobs and clusters as a single entity. And Cloud Dataproc has a lot
of features like the Jobs API, workflow templates, and
scheduled cluster deletion that can help with this
model of job-scoped clusters.
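To make that model concrete, here is a minimal sketch using an inline Dataproc workflow template with the google-cloud-dataproc Python client: one call spins up a right-sized cluster, runs a small graph of jobs, and tears the cluster back down. The project, region, bucket, job files, and machine shapes are hypothetical placeholders, and exact request fields can vary a little between client library versions.

    from google.cloud import dataproc_v1

    project, region = "my-project", "us-central1"  # hypothetical project and region

    client = dataproc_v1.WorkflowTemplateServiceClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    template = {
        "placement": {
            "managed_cluster": {
                "cluster_name": "ephemeral-etl",
                "config": {
                    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
                    "worker_config": {"num_instances": 8, "machine_type_uri": "n1-highmem-8"},
                },
            }
        },
        "jobs": [
            # A two-step job graph: the second step waits on the first.
            {"step_id": "clean",
             "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/clean.py"}},
            {"step_id": "aggregate",
             "prerequisite_step_ids": ["clean"],
             "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/aggregate.py"}},
        ],
    }

    # Blocks until the whole job graph finishes and the managed cluster is deleted;
    # driver output and cluster logs end up in Stackdriver (Cloud Logging).
    client.instantiate_inline_workflow_template(
        request={"parent": f"projects/{project}/regions/{region}", "template": template}
    ).result()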
But we also have customers that have scenarios for long-running Dataproc clusters. And some of those situations
include a shared interactive or ad hoc analysis with
web-based notebooks like Jupyter or Zeppelin, or
BI dashboarding applications based on things like Druid. And we have a lot of features
here that help as well. Things like our auto scaler
or high availability mode. And if you want to
learn more, I actually did a top 10 tips for
long-running Dataproc clusters a couple months back. There's a webinar associated with it. So there's lots of material out there if you want to learn more about how to use long-running Cloud Dataproc clusters. But regardless of the
model that you choose, our goal is to change
open source software to be cloud native. And so this means
making features that are going to be fast,
easy, and cost effective. Last year alone, Cloud Dataproc
launched more than 30 features. So this is still an area
that Google Cloud is really investing in. And I’m calling out
cost effective here. And we actually had
some external validation that we’re building from
cost effective features. We worked with an analyst firm
ESG, who came back and told us that compared to an on prem
self-managed Hadoop cluster, Cloud Dataproc is about
60% less expensive. And even compared to some of
our competitors like Amazon EMR in the cloud space,
we still came in about 32% less expensive in
terms of total cost of ownership for these
Cloud Dataproc clusters. And so cost is an
obvious optimization. But most customers that
are coming to Blake and me, it's not cost that's
the main driver. They are really coming
to us because of a desire to make better data
driven decisions. And this is usually the
appeal of the cloud migrations much more so than
just cost alone. And the key to
making that happen is really unlocking all of
the data that is trapped in your HDFS clusters today. And so I’m actually going to
do a full section on converting HDFS to Google Cloud
Storage, because this is really imperative
to achieving cloud freedom for a cluster. This is what really severs
that dependency between storage and compute. Now luckily, because this lives in the Hadoop code itself, it's often one of the
easiest changes to make. Usually, it’s just as
simple as a prefix change.
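For example, here's what that prefix change looks like in a PySpark job. The bucket and paths are hypothetical; on Dataproc the GCS connector that makes gs:// paths work is preinstalled.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("prefix-change").getOrCreate()

    # Before: reading from HDFS on the cluster.
    events = spark.read.parquet("hdfs:///data/events/")

    # After: the only change is the URI scheme and bucket.
    events = spark.read.parquet("gs://my-bucket/data/events/")

    events.groupBy("event_type").count() \
        .write.parquet("gs://my-bucket/reports/event_counts/")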
But the bottom line is that you can't really scale HDFS because that storage
is still tied to compute. So you’re taking a very
pricey resource of compute, and tying yourself to a
relatively inexpensive resource of storage. Not only that, but you’re also
siloing all of your HDFS data to a single cluster
instead of exposing it to all of the
different possibilities that a cloud provider
like GCP could offer. Now there are some reasons to
still use HDFS in the cloud, especially on Dataproc. You might have some
local storage devices. So you can do things like
scratch space and shuffle data, or like, an LLAP cache. But the idea is that’s all
pretty much temporary data. You want everything that
you plan to maintain to be written to Cloud Storage. And the reason why you can do this file system substitution is because Hadoop has an abstract notion of a file system, and HDFS is just one implementation. There's a Java abstract class, org.apache.hadoop.fs.FileSystem, and that can call into various replacements for HDFS. All MapReduce or Spark really care about is getting blocks of data back. And so this concept provided a lot of advantages when HDFS was first developed,
and it still does now as well, as we move to the cloud.
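As a rough sketch of how that substitution is wired up outside of Dataproc, say on an on-prem cluster, you point Spark's Hadoop configuration at the open source GCS connector. The jar path and key file below are placeholders, and property names can differ a bit between connector versions, so treat this as illustrative.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("gcs-connector-example")
        # Ship the connector jar with the job (path is hypothetical).
        .config("spark.jars", "/opt/jars/gcs-connector-hadoop2-latest.jar")
        # Plug the connector into Hadoop's FileSystem abstraction for gs:// URIs.
        .config("spark.hadoop.fs.gs.impl",
                "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
        .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
                "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
        # Service account credentials for off-GCP clusters (path is hypothetical).
        .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
                "/etc/secrets/gcs-key.json")
        .getOrCreate()
    )

    # Once the connector is registered, gs:// paths behave like hdfs:// paths.
    spark.read.text("gs://my-bucket/raw/logs/").show(5)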
And I don't have time to deep dive on everything in this slide. I did release an article
a couple months back if you care about comparing
cloud storage directly to HDFS that does a deep
dive on a lot of this and does a full comparison. But the too-long-didn't-read version of that is that you can pretty much replace HDFS with Cloud Storage for most of your Hadoop jobs. And that does not mean
that you pull the data from cloud storage into HDFS. It means that your Hadoop and
Spark jobs can read directly from Cloud Storage, just as they can from HDFS, over our Google Cloud Storage connector,
which is an open source connector for doing just that. There’s also a couple
differentiators that you should be aware
of with Cloud Storage versus other cloud providers. So in particular,
our Cloud Storage, we’re going to offer a
strong global consistency for a lot of those operations
that you care a lot about with Hadoop and Spark. So you don’t have
to be afraid to use Cloud Storage as a destination. If you've worked with object stores on other cloud providers, you may have had to introduce some kludgy workarounds, like a NoSQL database that sits in the path to make up for
their eventual consistency, but you don’t have to
do that on Google Cloud. I was working with one customer. And they said they were trying
to move to another cloud provider for over a year. And they couldn't get it done. There were too many conflicts with all the different writes they were doing. Before giving up, they tried to do a POC with us. And in three weeks,
they were able to get done what they couldn’t
get done in a year trying to get on another
cloud provider because of this strong consistency. In addition, our renames
are metadata operations. So what that means is a lot
of other cloud providers– if you were to do a Hadoop
move, all of the data associated with that Hadoop move, that’s
going to actually get copied to a new part of object store. And then you have to wait
for that to happen, and then wait for all of the
old data to also get deleted before you get a
return from your Hadoop move. For us, all of that happens
as a metadata operation. You’re not actually going to
have to move the data around. Another differentiator
that we have is that across
Google Cloud Storage, we're going to give you different storage classes for cost optimization. But you're actually going
to get a consistent API across our storage portfolio. So you don’t actually have to
change your Hadoop or Spark code as you move across
these storage classes. You can get direct
access to the archive. You don’t have to go
to a different service, wait hours for a restore,
and then rerun your code. You can just go. Even between our regional
and nearline storage classes, you can even get the same less
than a millisecond latency on first byte reads. So in the context of
Hadoop, the way to think about our storage classes is this: regional storage is what you're going to use for your interactive Hive and Spark analysis or batch
jobs that might occur more than once a month. Next, you’re going to
have nearline storage. And this is for batch
jobs that are really used for reporting or aggregations. The rule of thumb
here is data that you expect to use about no
more than once a month you’ll put into nearline. And then finally,
Coldline storage. That's for your post-process data, things you really don't expect to use again, but that you want to keep around for either compliance reasons or just to have it, just in case. That can be moved to Coldline storage. The rule of thumb on that one is data you don't expect to touch
again more than once a year.
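If you want to encode those rules of thumb so data drifts to cheaper classes on its own, a sketch with the google-cloud-storage Python client might look like the following. The project and bucket are hypothetical, and the age-based conditions (days since an object was created) are only a rough proxy for the access-frequency rules of thumb above.

    from google.cloud import storage

    client = storage.Client(project="my-project")   # hypothetical project
    bucket = client.get_bucket("my-hadoop-data")     # hypothetical bucket

    # Objects older than ~30 days: move to Nearline.
    bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
    # Objects older than ~365 days: move to Coldline.
    bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=365)

    bucket.patch()  # apply the lifecycle rules to the bucket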
And so once you've made the shift, and you're able
to move your HDFS data into Cloud
Storage, that opens up a lot more possibilities of
things you can do with it. So if you have multiple
distributions of both Hortonworks and Cloudera in
your environment today, both of those can talk over
the GCS connector directly to Cloud Storage,
while at the same time, it opens you up to use a lot
more Google Cloud tooling. So you can start to incorporate
maybe Dataproc for some of your batch
processing, or maybe you can start to move some ad
hoc stuff into BigQuery and access Cloud Storage
via external tables, or even start to introduce
new streaming type applications with Dataflow on
top of all the same data sets. And you can be sure
that everything is strongly consistent. So if you want to
make the move, you can move an entire
petabyte of data from HDFS to Cloud Storage
in about three steps. What you would do is you would
come into the Google Cloud Console and request
a Transfer Appliance. We would then ship
you a storage device. It would look something like
this box on the screen here. You would put it into an
NFS share capture mode, mount those NFS shares
to each of your data nodes in your Hadoop cluster. And then it's just as simple as running the standard hadoop distcp command. And you'll copy all that
data over in parallel. At that point, you just
ship us the device back, and we’ll go ahead and load
that into Cloud Storage for you. Now if you don’t have
a petabyte of data that you want to
move right away, but you just want to start
with some proof of concept, or maybe a data set size
that can move comfortably across a VPN connection,
you can simply install the Google
Cloud Storage connector on an existing on
prem Hadoop cluster, and then copy the data
over with a distcp. I think, as far as I know, we’re
the only cloud provider that supports not only copying
of the data over distcp, but copying of the metadata
attributes from HDFS as well. So if you don't know what
I’m talking about when I say metadata attributes,
what this means is these are a file system
feature that are going to allow a
user applications to associate additional metadata
for a file or a directory. Oftentimes, you’ll see
applications do this for things like maybe what
encoding was used. But when I ran across this, I
was working with a large bank and they wanted to do
a full HDFS backup, keep everything encrypted as
is, just get it copied over to Cloud Storage. And what they were doing is
they had an on prem HSM that was storing the key locations
in the metadata attributes. And so when they’re
comparing cloud providers, they were looking
and saying, well if I go to some
other folks, I would have to figure out a new
encryption scheme or decrypt everything before I moved over. But with us, it
was just as simple as running that
distcp -px command. The -px stands for preserve attributes. And we were able to
take all of both the HDFS data and the
metadata attributes, and they could use
it as their backup.
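A minimal sketch of that kind of copy, driving hadoop distcp from Python, could look like this. The source and destination paths are hypothetical, and -px preserves the extended attributes alongside the data.

    import subprocess

    src = "hdfs://namenode:8020/data/secure"     # hypothetical HDFS source
    dst = "gs://my-backup-bucket/data/secure"    # hypothetical Cloud Storage destination

    subprocess.run(
        ["hadoop", "distcp",
         "-px",       # -p preserves attributes; x includes extended attributes (xattrs)
         "-update",   # only copy files that changed since the last run
         src, dst],
        check=True,
    )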
But once that data is available in Google Cloud Storage, that's going to let
you take a really hard look at your existing
Hadoop cluster and see what workloads might make
sense to move to Google Cloud. A common workload target
that I'll start with is to take an on-prem Hadoop cluster that might be pretty monolithic, sharing both ad hoc analysis and batch reporting. So they have business users in there. And they're struggling to
get their SQL queries to run, because over the last
few years, people have scheduled a bunch of jobs,
had a bunch of stuff running in batch. Those are just now piling up and contending for resources. So an easy way to get started in the cloud is to just start offloading a lot of that batch stuff to Cloud Dataproc, which frees up their on-prem resources for the business analysts. But Cloud Dataproc might
not always be the best choice, depending on your goals. For instance, a lot of ad hoc Hive SQL migrates really well to BigQuery, our serverless data warehouse. Also, there might be
streaming applications that fit better with Dataflow. But regardless of your
tool or application, having that data
in Cloud Storage, that’s what’s going to
give you the freedom to modernize your
environment and actually start to pick the
right tool for the job instead of just being
limited to the tools and the compute resources that
are on that single HDFS storage cluster. Now we aren’t doing this alone. We have a lot of partners
in this space that can help. We are partnered with all the
traditional Hadoop vendors to help you with
data processing. We have partners that can
help with data ingestion and movement, and
service partners that can actually be
your hands-on keyboard to actually get you migrated. In addition, we have some new
partnerships for some niche solution providers. They can bring a unified kind of
experience to Hadoop and Spark. And you’re going to hear
more about that in our data analytics keynote tomorrow. But like I said, the
big three on prem Hadoop platforms in this space, they
all support the GCS connector today. So you can move your data
between HDFS and Cloud Storage, or you can even process
Cloud Storage data directly from these platforms. We’re also partnered with
folks like Informatica among others who can support
very complex enterprise data workflows. And they really help because
they integrate deeply with GCP services,
but also your on-prem scheduling
systems today. And finally, again, we
have migration partners that actually can
get this stuff moved. We have a lot longer
list than just these, but these are the
partners that we know, that at least I know have
a lot of proof points for successfully migrating
customers from on prem Hadoop to GCP relatively quickly. So with this, I’m
going to hand this off to Blake, who’s going
to talk and give us some firsthand experience
on how he’s actually migrated some really massive
Hadoop footprints to the Cloud. BLAKE DUBOIS: Thanks, Chris. Hi, everyone. So first, we’re going to talk
about some reference cases. We tend to see a
few uber-patterns that people follow in their
move to Google Cloud Platform in terms of Hadoop and Spark. And we’re going to tell that
through a couple of customer stories. So first off, with
the uber-patterns, we tend to see three
primary use cases. So the first is lift and shift. These are customers
that are looking to lift and move to
GCP because they’re having pain around scaling,
elasticity, globalization, et cetera. And they sometimes want
to move very quickly, and sometimes they want
to move without risk. That can drive them to kind
of the lift and shift pattern. Sometimes customers will
decouple storage and compute, and sometimes they will not. And we'll talk a little
bit about that later. A customer example
we’re going to go over today is Twitter, which started
their Google Cloud Platform journey about a year ago, and
has quite a few talks here at Next ’19. And the kind of
partner technologies we kind of run into
in this case are things like on premises Hadoop
distributions, Cloudera, Hortonworks, MapR,
as well as customers that use Apache Big Top
and customize themselves. The key GCP services
that are used are GCE, Google Compute Engine, for raw compute, as well as Google Cloud Storage
that we kind of went over today. Next is lift and
re-architect or modernize. These are customers that want to realize more than just the decoupling of storage and compute or the
elasticity of the cloud. They want to reduce
operational overhead. They want auto scaling. They want to just pay for the resources that they need by using dynamic and ephemeral clusters, kind of like the pattern Chris was describing earlier. We're going to go
used here, on top of the two that we already
discussed, Dataproc for Managed Hadoop
and Spark Environment, as well as Composer, which is
our managed Apache Airflow. Next is transform
to cloud native. So customers that are
looking to kind of really transform how they’re doing
batch stream processing, and adopt kind of like,
Cloud native products. You know, these are customers
that are usually operating at the high end
of the data scale. They’re looking for advanced
things around Object Lifecycle Management. They’re looking to kind
of streamline their batch and streaming pipelines by using
a unified analytical engine, which we'll go into a little bit later. And on top of just using the technology that we talked about today, they're also doing things like ML Engine or BigQuery. And I guess one last point here
is that this is not necessarily a sequential journey. Some customers target lift and shift and then later move on to lift and re-architect or transform to cloud native, while other customers make that jump immediately to cloud native. So your journey is dependent
upon the time, the risk profile, and a lot
of other factors. And Google Cloud can help
you make the right choice. So the first pattern
we're going to go into is kind of lift and shift. So what we see here is that customers typically have a monolithic Hadoop cluster that's running on premises, co-locating storage and compute. And across an interconnect or VPN, we are moving data, using distcp or in some cases the Transfer Appliance that Chris talked about earlier, into Google Cloud Storage. That Google Cloud Storage can
be used in one of two ways. For clusters that
are being spun up with the goal of being
decoupled from their storage, the compute is created in
GCE with the distribution of your choice. Edge nodes are
added with the tools of your choice, things like
Jupiter Lab, Zeppelin, et cetera. And they directly use the GCS
connector and the GS prefix to access the data
in cloud storage. Four clusters are
also hosting data in on the GCE VM’s
local disk for HDFS. Customers are looking to use
tools such as Impala, which has limited connectivity
to Google Cloud Storage at this time
will then saturate the data in each shift that’s
from Google Cloud Storage. So a customer that is
kind of doing this very, very successfully– we
have a lot of customers actually, doing
this successfully as a starting point and
moving into the cloud, but one of the
largest is Twitter. So you know, Twitter
has currently on premises in their
various Hadoop clusters, around half a million
cores of Hadoop. Some of these clusters are
as large as 12,500 nodes. There’s under 300 petabytes
of data under management. And these systems process over
1 trillion messages per day. The breakdown of the workloads is shown in the top part of the image. We see that production batch processing using Scalding is about 30%, real-time production processing is about 60%, using quite a few tools, and ad hoc processing in tools like Presto and Hive is about 10%. Twitter is actually quite
vocal with their journey to the cloud. And they’re
continuing that trend by presenting multiple
talks at Google Next 2019. So we see here a couple of talks. They primarily break down into how Twitter is migrating that much data into GCP. So if you're interested
in understanding how large their interconnects
are, how they use distcp, how they verify the integrity
of the data, et cetera, the first talk is a
great one for you. If you're looking to understand how Twitter implements object lifecycle management policies in Google Cloud, basically extending the on-premises custom tooling that they open source, and what the architecture is for data storage in GCS and how user identity management comes into play, the last talk covers that. Great sessions. So the second talk, sorry, the second pattern, is lift and re-architect or modernize. And so now we're getting
into some of the things that Chris illustrated that Dataproc is very capable of doing. What we see here is that customers, once again, are moving data into Google Cloud Storage, either via the Transfer Appliance or distcp. But then it's really
unlocking the ability for us to bring very, very
defined resources to bear on that data, and that GCS
provides near unlimited ingress and egress bandwidth
to basically support all these analytical
engines using the data at the same time. And so here on the left, we see
Cloud Composer, our managed Apache Airflow, spinning up job-scoped clusters. This is traditionally done for batch or ETL processes. You can do things like understand how much data you have to ETL, dynamically size a cluster with x amount of nodes, spin that up, perform the ETL batch process, and then tear it down. You're only paying for
exactly what you use. And at the end, you’re
kind of persisting that analytical result back
to Google Cloud Storage.
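A minimal sketch of that Composer pattern, as an Airflow DAG, could look like the following. The operator import paths are the ones in recent apache-airflow-providers-google releases (older Composer images use the airflow.contrib paths, with slightly different names), and the project, region, machine shapes, and job file are hypothetical.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocCreateClusterOperator,
        DataprocDeleteClusterOperator,
        DataprocSubmitJobOperator,
    )

    PROJECT, REGION, CLUSTER = "my-project", "us-central1", "nightly-etl"

    with DAG("nightly_etl", start_date=datetime(2019, 4, 1),
             schedule_interval="@daily", catchup=False) as dag:

        create = DataprocCreateClusterOperator(
            task_id="create_cluster",
            project_id=PROJECT, region=REGION, cluster_name=CLUSTER,
            cluster_config={
                "worker_config": {"num_instances": 8,
                                  "machine_type_uri": "n1-highmem-8"},
            },
        )

        run_etl = DataprocSubmitJobOperator(
            task_id="run_etl",
            project_id=PROJECT, region=REGION,
            job={
                "placement": {"cluster_name": CLUSTER},
                "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/etl.py"},
            },
        )

        delete = DataprocDeleteClusterOperator(
            task_id="delete_cluster",
            project_id=PROJECT, region=REGION, cluster_name=CLUSTER,
            trigger_rule="all_done",  # tear the cluster down even if the job fails
        )

        create >> run_etl >> delete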
On the right-hand side, we see semi-long-lived clusters. So this aids in
quite a few things. What you see here
illustrated is kind of a red/black or blue/green kind of deployment, where we have version 1.1 of our Hadoop image running, but we're also running a canary for a newer version of our platform. While it used to be very hard to upgrade and test a new cluster on premises with a huge monolithic cluster, being elastic in the cloud, with the ability to spin up new clusters and basically load balance between them, makes that a lot easier. And then what we
also see here is that we can build clusters
for that our purpose built. We can have clusters that are
built to run HBase, Druid, Presto, Spark Streaming, et
cetera, any use case that requires long living clusters. On top of that, Dataproc
has native integration into Cloud Logging and
Google Stackdriver. And so you get metrics that the
cluster can use to auto scale. You get metrics that
help you understand the health of your jobs
and what’s going on through great operations tools. And then lastly, it becomes
very easy to take things like JupyterLab hosted in Cloud Datalab or JupyterLab hosted in the Google Cloud Machine
Learning environment, and point it at your
Dataproc clusters, and use them as an
analytical back end. And so we have 3PM Solutions. So 3PM is basically
a solution provider that basically pulls in
customer reviews and customer listings from Amazon, eBay, and
Walmart.com on a daily basis and helps sellers as well as
these retail organizations understand the signal from that data and whether or not some product listings are fraudulent. And they originally had
this solution on premises– and it was just unable to
scale due to peaks in demand during events like the very busy fourth-quarter retail season. So what they did is they basically moved into the cloud. They wanted to spin up clusters for exactly what they needed. They wanted to run with hundreds and hundreds of nodes to basically complete this
in a very tight SLA of about four hours. So what they were able
to do is analyze over 160 million customer
views, ETL that, and put that into
an HBase cluster, and make that available
for their end users. And what’s really
kind of amazing here is like, when they
built the solution, when it was operating at a much
lower scale, that solution just scaled with them. So the CI/CD and the
solution, et cetera, basically scaled with their business. And so they didn’t need
any additional Hadoop administrators to kind of scale
that solution as their business scale. Lastly, we have kind of the
lift and transform pattern. And it’s aptly named like,
lifting cloud native. This is kind of getting
the best and greatest out of Google Cloud Platform. And so what this is really
kind of trying to show is that there’s many
injection methods, whether it be Apache Kafka or
Pub/Sub or external providers such as DoubleClick,
YouTube, Adobe Analytics, et cetera, moving data
into Google Cloud Storage, Bigtable, et cetera, via
many analytical engines. We can then bring
any type of analytics on top of that, BigQuery,
Dataflow, Dataproc, and start doing machine learning
analysis on that data to support those use cases,
and then just integrating with any I or
visualization tool, almost any that’s
in the market today, as well as Google
Cloud Data Studio. This is kind of the
lifting cloud needed. A couple of points here
since they’re kind of new. Customers that tend to
lift in cloud native tend to do things like
adopt Google Cloud Dataflow due to its kind
of unified batch and stream engine. So with very little
modification, you can run pipelines both
in batch and streaming. While in the Hadoop world,
those were traditionally done in different
analytical engines.
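A hedged sketch of that unified model with the Apache Beam Python SDK is below: the same transform runs over a bounded Cloud Storage source or an unbounded Pub/Sub source. The bucket and topic are hypothetical, and the Dataflow-specific options (runner, project, temp location) are omitted for brevity.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms.window import FixedWindows

    def count_event_types(lines):
        # The same transform, reusable in batch or streaming pipelines.
        return (
            lines
            | "ParseType" >> beam.Map(lambda line: line.split(",")[0])
            | "PairWithOne" >> beam.Map(lambda t: (t, 1))
            | "Count" >> beam.CombinePerKey(sum)
        )

    # Batch: bounded input from files in Cloud Storage.
    with beam.Pipeline(options=PipelineOptions()) as p:
        count_event_types(p | beam.io.ReadFromText("gs://my-bucket/events/*.csv"))

    # Streaming: the same transform over an unbounded Pub/Sub source, windowed.
    with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
        count_event_types(
            p
            | beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
            | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
            | "Window" >> beam.WindowInto(FixedWindows(60))
        )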
Customers also start to adopt BigQuery. We see patterns where
customers will continue to do ETL in Hive
or Pig in Hadoop, but then look to do
interactive query in BigQuery. And then lastly, we
start to see customers that go cloud native really
go all in machine learning. So as customers become adept
at processing data, managing its lifecycle, increasing
its discoverability through other tools
to its end users, we really see that in
a lot of cases, queries per user in a company
double six to eight months, six to nine months after like,
a large scale adoption of GCP. And so we start to
see machine learning kind of enter the picture. And the company I want to
highlight here is Pandora. So we’ve been
working with Pandora for over the last six months. Pandora had quite
a large on premises Hadoop footprint of over 2,700
nodes, 7 petabytes of data. And internal analysts
basically participated in the creation of over 120,000 Hive tables that really drove all different corners of their business. And so if you're looking
to understand how Pandora is making this migration
and some of the decisions that they made along
the way, there’s a talk tomorrow afternoon
in the Palace Hotel as well that talks about
how they moved their on-premises analytics
and interactive query to Dataproc and BigQuery. And then there’s also a
talk on Thursday morning on composing Pandora’s
move to GCP, which really kind of illustrates the journey
for moving from on premises Apache Airflow to Google
Cloud Composer for a more managed experience. And so we’ve reached
the point where we understand the tools that
are available within GCP. We understand some of the common
patterns that people take. And so how do you start
your own migration? How do you start
your own journey? So first off, it
wasn’t quite mentioned, but I’m part of Google’s
Professional Services Organization. Google goes above
and beyond to– and are obsessed with
customer success. So we have a very unconventional
Professional Services Organization that’s not
a profit center, that looks to meet their
customer where they are, develops methods and processes
on the fly with our customers to basically target and
achieve whatever lift and shift pattern that they’ve chosen. And we do that while hoping you
get the most out of your Google Cloud products. And we often work
with partners that– kind of Chris
talked to earlier– partners that have specific
expertise and who do migrations to actually
execute this work. So Google Cloud
Professional Services has a data analytics
practice within it that has migrated
dozens of customers from on premises and other
public cloud providers to GCP through every
single one of the patterns that I discussed earlier today. And we commonly create
artifacts that we democratize via solutions on our website,
blog posts, on the Google Cloud data blog, as well as
various articles on medium. So we kind of have
over the dozens of migrations that
we’ve actually performed, we kind of come up
with an opinionated methodology for moving customers,
whether it’s lift and shift, lift and
re-architect, or lift in cloud native. It’s kind of broken down
into two major themes. The first is the plan. And that is comprised
of four discrete steps. And the next is the
deploy, which is six. So what typically happens is
over the course of a few weeks we lead discovery
workshops where we not only understand our
customer’s environment, but we also educate them on
the ins and outs
a very, very low level. We work with them to determine
end state, an end state that’s not only a solution
architecture, but also organizational changes
that are required as well, and other
ancillary things. Next, we work with customers
to prioritize their workloads and to actually deep dive
into their data sets, and then actually design
the migration path and plan. So that kind of wraps
up the planning phase. Moving into deployment,
we typically see customers immediately
start to move data to GCS. When this is typically done
over an interconnect or a VPN connection, there’s
usually a long tail effort to this type of activity. And so we kind of get that
kicked off pretty quick. Next is a deployment of the big data stacks. So we understand which components are required within the Hadoop ecosystem. And we work to build an environment that is optimized for those components and working end to end. Next comes moving the workloads. So we actually work to
and at the same time, integrate with other GCP
services like Stackdriver, Logging, Composer for
Orchestration, BigQuery, et cetera. And at the end, Google has
Cloud CRE and SRE processes that kind of help run the system
and make sure that there’s production operation readiness. I just want to cap off
this planning or migration section with a sample
Hadoop migration timeline. This is customized for every
customer, for every use case. And this is kind of typically
the high level activities that we kind of see
with the objectives that we try to reach. So first off, like, we
have analysis kickoff. And we understand
the architecture of the entire platform
and try to implement foundational things. So we find that to implement a big data analytics solution, it must be operating on kind of a firm bedrock foundation, and those are things like infrastructure setup. So you know, it's very important for us to ensure that there is a correct network design, that there is logging, et cetera, in place. Kind of flipping the Hadoop
migration from the previous slide in another direction, we see here, we talk about the data
migration, GCS configuration, understanding things like Object
Lifecycle Management, Identity, and access management. How not only are you going to
have that first initial sync of data into GCP,
but how you’re going to enable the continuous
delivery of data that you may be
generating on premises. So all these things are
kind of taken into account and processes are
created, automation is created to handle that. We find that it’s all about
getting that first cluster up and going, kind of defining
that data platform. Because once you
define it once, you can spin it up multiple times. And we tend to see two kind of
parallel threads where there’s kind of the migration
of existing workloads after this first
cluster has been created. And then at the same time,
other business groups within a customer will start
using this new data platform to kind of bring new
use cases online. And then lastly, there’s kind
of the go live preparedness. So bringing that production
operational readiness through SLAs, SLOs, et cetera, and then going live. And then we kind of
rinse and repeat. So this is an iterative process. So with that, I want to
point you to a few resources that you can basically
seek out to understand how do I make this real for me? So the first
article is migrating on premises Hadoop
infrastructure to Google Cloud Platform. This really goes into
the technical details of moving data into GCS,
as well as understanding what pattern is right for you. Lift and shift, lift
and re-architect, or transform to cloud
native, and what it’s like to create jobs and
workflows, et cetera. And lastly, I would
like to encourage you to follow with your
favorite RSS reader our Data Analytics Cloud blog. You know, Chris mentioned
some of the articles that he’s created. There’s a lot of great
Googlers who do the same. And you’ll hear interesting
product announcements as well. And so that's cloud.google.com/blog/products/data/analytics. [MUSIC PLAYING]
