ACE-12, GCP Associate Cloud Engineer – relational, transactional, storage

ACE-12, GCP Associate Cloud Engineer – relational, transactional, storage


Another day, another question and
solution on The Certified Q&A for the Google Cloud Platform Associate Cloud
Engineer. It’s very important to the process of understanding Google Cloud
and pass the certification exam that you go through the question and attempt
answering it yourself first. So pause the video, work through the question, we’ll
catch up in just a little while and I’ll show you how I do it. In this project
scenario, somebody’s approached your company asking for your guidance on
choosing the right storage system within Google Cloud. This company has a mission-critical
application that serves users globally. You need to select a
transactional, relational data storage system for this application. Which two
products should you consider: BigQuery, Cloud SQL, Cloud Spanner, Cloud Bigtable or
Cloud Datastore? Before we discuss within our team to figure out what is the
suitable storage system, let’s find out the key terms. We need to be able to
serve users globally, all around the world. The data needs to be transactional
and relational. What we will do is, we will create a table and mark off the products
that suit each of these requirements. First, which of these support relational
data? Before that, what is relational data? The relational database was defined
by somebody in IBM in 1970. Edgar Codd codified what defines an RDBMS or a
Relational Database Management System into a bunch of rules called
‘Codd’s 12 rules.’ Today’s commercial databases typically implement only a
subset of these rules. In a relational database, data is in rows, records or
tuples, and columns or attributes, and these are all
part of the table. The tables themselves are related to each other through common
columns. So, if you had a user column, maybe that person’s user ID is used to connect
that users name to probably his salary details and account details in
another table. So, they will have that common ID that connects the two tables,
and that’s what gives a relationship between this data. After Ted Codd wrote
about the relational model in 1970, a couple of other people at IBM,
Donald Chamberlin and Raymond Boyce, developed a way to interact with a
relational model. This eventually came to be called the Structured Query Language or
SQL. Ted Codd had defined operators to work
with RDBMS, most of which are supported by databases today, for example SELECT and UNION.
Commands like INSERT, UPDATE, DELETE modify the data, and SELECT queries the
data. Today’s SQL has become a standardized
way to work with relational data. This is great, because irrespective of
how the databases are implemented, we have a consistent way of working with the
database, which is using SQL. Relational data, typically accessd via
SQL, is referred to as structured data, because there is a predefined schema or a
structure with the tables, columns and rows. This is a flow chart that helps
you choose the right storage option. You need to be thorough with this flow chart. Looking at this flow chart, and just
marking off the portion that is required for relational data which is
accessed using SQL and typically structured, we can see it is this
highlighted portion on the top left. So, the three options we have are BigQuery
Cloud SQL and Cloud Spanner. Bigtable and Datastore are not relational data. Next,
let’s look at transactional data. What is a transaction in a database? So
some updates in a database, say like a financial transaction, involves multiple
steps that need to be done together. When you have two accounts, and you need to
move funds from one account to another, you should read the original account,
deduct the value from it, take that value and add it or deposit to the destination
account and update the balance there. It is very important to have all of these
steps happen or none of them. So after all happens, you can do a COMMIT and
treat that as a logical unit. Or if even one of them fails, you have to ROLLBACK
all of them. It is not allowed to have an intermediate state.
The transaction has to be complete. It is not allowed that you deduct an amount
from the source, but between deducting and and depositing
it into the destination account, if there is a failure, you cannot have the state
where money is already deducted here but not added to the other destination
account. So this is an all or nothing operation. When a sequence of data
based operations satisfies what are calling ACID properties, then it is
called a transactional database in database terminology. The letters of ACID
stands for Atomicity, Consistency Isolation, and Durability.
What is Atomicity? Each transaction is a single ‘unit’. Like we did for the
fund transfer, that whole fund transfer is treated as single unit. You can’t do
only just part of the withdrawal, and not do the deposit. So either it succeeds or
fails completely. When this is treated as a logical unit, it means that while the
transactions in progress another request should not be able to see that
intermediate state. A person trying to figure out the balance of an
account should not see it after the withdrawal of an amount but before it is
deposited to the destination account. So no intermediate states are allowed; it is
a single unit transaction. C stands for Consistency, which means that a
transaction in a database has to go from one valid state to another valid state.
Which means that, when a transaction completes it should be in a state that is valid according to all the
constraints defined by that database. For example, the data type should be as
defined in the columns, the data size should be as defined in the columns. So
those constraints have to be met. If there are cascade operations, update or a
change in one row has to make a bunch of other cascade or ripple changes in other
parts of the database, those have to happen. Triggers similarly, and any
combination of constraints, cascade, triggers, all of them have to be in a
valid state after the transaction is completed also. So in Consistency, it has
to go from one valid state to another. ‘I’ stands for Isolation, which means that
when a bunch of transactions are happening concurrently on a database,
which is very much a possibility, because these databases are being
accessed simultaneously, probably all across the world, from different
applications and different users … these transactions are going to happen
concurrently. Isolation means that when these transactions happen concurrently,
they should result in the same state as if those transactions happen sequentially.
Doing a transaction or multiple transactions concurrently should not
result in what is called a ‘race condition’. A race condition where,
depending on who got there first, if the results are different then that
would not be acceptable. ‘D’ stands for durability, which
guarantees that once a transaction is completed, it will remain on the system
even in the case of, say power failures or system failures. This essentially means
that all completed transactions are recorded in persistent storage, so that
the next time the system comes back up, you are able to
pull up all the transactions that happened, and they should all be in a
valid state. We also hear about this difference between an OLTP and a OLAP,
an Online Transaction Processing System and an Online Analytics
Processing System. What are the differences between these? So, OLTP
typically has recent data, things that are happening currently,
things that are happening real-time. Whereas, an Analytics Processing engine usually
has historic data. You want to be able to go back in time, and figure out
statistically or with data over many years,
what are the patterns in the data, essentially analytics. An OLTP is going
to be optimized for Inserts, Updates and Deletes, because this is going to happen
at real time. It is going to happen often, and therefore these OLTPs have
to support transactions and changes in the data,
like Insert, Update and Delete. An OLAP on the other hand, is optimized to read. You
might have large amounts of data and you want to be able to read it quickly. An
OLTP scales well to gigabytes or terabytes of data, whereas an OLAP scales well
to petabytes of data. So typically, you’re dealing with way larger data
when it comes to an analytics engine, as opposed to a transaction engine.
An OLTP is difficult to scale horizontally. When we say scale
horizontally, there are a couple of ways in which you can increase the throughput
in a system. Either you get more powerful machines with faster CPUs and more memory,
which means you’re scaling it vertically. Whereas a horizontal scaling means, you
get may be cheaper machines, but you keep adding machines when you want to achieve
greater throughput. Transaction Processing engines typically do not scale
horizontally very well, because it means that there are replicas
of this one system, and when you lock parts of the system for an update,
you need to make sure that all the other replicated systems are also locked for that
transaction. That is difficult to achieve at global scale. In an Analytics engine on
the other hand, mostly we want to read from it, and it will be okay to replicate
the data easily without having to lock the machines, or lock the database, and
therefore it’s easy to horizontally scale. OLTPs are mostly relational,
whereas OLAPs can be relational, or it could be columnar or other storage formats.
OLTPs are dealing with current data, which means you’re typically going
to deal with real time data inserts. As soon as a transaction or a fund transfer
is initiated, you want that to happen immediately. Analytics engines on the
other hand, because they’re working with historic data, and not always do they
need the absolute current data, it is okay to delay the information entering
the analytics engine, typically as batch insert maybe at the end of the day, end of the
week, maybe after a few hours, or it could be even after minutes. But it doesn’t
have to be real-time. The OLTPs we have on Google Cloud are Cloud SQL and
Cloud Spanner, and the OLAP we have on Google Cloud is BigQuery. Looking at this chart
now, we already know that our data is structured, SQL, relational, and we have
to choose between SQL, Cloud Spanner and BigQuery. Now that we know about
transactional databases, we have learnt that BigQuery is an Analytics engine
that is optimized for read, is an OLAP, whereas what we want is transactions. So,
the two databases that suit transactions or is an OLTP, are Cloud SQL and Cloud
Spanner. So we tick those two boxes, but we leave the others empty. Just for your
additional information though, Datastore and Firestore support ACID
transactions, so also does BigQuery. But these are not considered transactional
databases. Let’s confirm what we learnt about Bigtable a little bit, and also
about OLTPs, and what is available as OLTP within Google Cloud. So the
first line says that, Cloud Bigtable is not a relational database. That’s a good thing,
we have not marked that as one. It does not support SQL queries, joins, or
multi-row transactions. In the green box I’ve highlighted, if you need full SQL
support for an Online Transaction Processing System (OLTP), consider Cloud Spanner
or Cloud SQL. Those are the two that we have come to,
as a right choice for our transaction system. Alright, the next one
we need to consider is that our users are globally spread, and we need to serve
all of them. Looking at this, the two options that I have chosen are Cloud
Spanner and Bigtable as something that scales globally. Now, BigQuery and Firestore,
which is a newer version of Datastore, support what is called a multi-regional
option. Today that is either in Europe or in US. A multi-region is a large
geographic area that contains two or more geographic places.So instead of
choosing a particular region, there are some predefined regions chosen for you,
and the data is going to be replicated in that multi-region. This is not
necessarily globally available, but it is available in say Europe or US, to have a
multi-region which covers a large area of that particular region. I have not
chosen Cloud SQL as a option as yet, because that also tends to be regional.
Let us just look at the documentation on Bigtable Global Availability. It says
that replication for Cloud Bigtable enables you to increase the availability
and durability of the data by copying it across multiple regions or multiple
zones within the same region. In another section, every cluster in a replicated
instance accepts both reads and writes providing multi-primary replication with
‘eventual consistency.’ You can set up replication by adding one or more Cloud
Bigtable clusters, whether on the same continent or halfway around the world.
So, Bigtable is good for global availability. How about Cloud Spanner? The
landing page for Cloud Spanner reads out a few of the things we already know about Cloud Spanner.
It has global scale, horizontally scalable across rows, regions and
continents, from one to hundreds or thousands of nodes. It supports relational
semantics, and has ACID transactions in relational database-schemas, which
supports SQL queries. It also has transactional consistency, which is
purpose-built for external, strong global transactional consistency. So based
on all of this, our choices on Cloud Spanner and Bigtable are great. As I said
before, Cloud SQL is not a very firm option for me when it comes to serving
users globally because Cloud SQL is more of a regional database. You can have
replication across regions, but then it is a little more harder to achieve.
But as for serving users globally, I am not very sure about that. This leaves us
in a little bit of a quandary though. Because based on this table now, we can
see that Cloud SQL and Cloud Spanner definitely are ticked for relational
database and as a transaction database, whereas on the global scale, Cloud SQL
does not suit. However, given that we have to choose two products, and none of the
others qualify in any way, or they don’t qualify for most of the requirements, we
are going to choose Cloud SQL and Cloud Spanner. Now, it is not that Cloud
SQL cannot be used globally. You can have users from across the world
accessing Cloud SQL in one region, but then you’re going to have more latency
for international users. But, given that we have to choose two within these
options, these two, Cloud SQL and Cloud Spanner, are the best that I will
have. What all are the key things to study? You have to know the
storage option flow chart thoroughly, by heart that is. Let’s also, while we are in this subject,
see what are the differences between SQL, NoSQL, at what is
called NewSQL. SQL databases are typically transactional databases. This is
not necessarily true of every database, but the ones that are very popular today,
MySQL, MSSQL, Oracle, Postgres, all of them are SQL, and they happen to be
transactional databases. Transactional databases follow ACID principles, but
adhering to that means that it has reduced horizontal scalability. NoSQL
databases came into existence so that they can have horizontal scalability, and
serve users across the world, but they have initially had to forego some of the ACID
properties. An example in Google Cloud is Bigtable.
NoSQL, as the name suggests, does not use SQL to access the data. Instead, NoSQL
is a broad category of databases with different storage formats. It could
be a document database, a JSON database, a Key-Value pair. The data
inside that is non-relational, and you will have to look at each database and its
documentation to figure out what the storage format is. Examples of a
NoSQL database within Google Cloud are Datastore, Filestore, Memorystore,
Bigtable. NoSQL data is typically semi-structured. So there is going to be some
structure to the data, like for example JSON has a structure, a key-value pair
has a structure, and yet it is usually extensible. It is possible to have some
key types defined, in say Datastore and Firestore, but have
different attributes for it inside that JSON or document. This is in comparison
to what is unstructured data. Unstructured data would be like files,
PDFs, Word documents, images, videos, music, those come under the far right of
what would be unstructured data. In recent years, there’s been a new type of
database called a NewSQL database. A NewSQL database tries to combine
both transaction processing and horizontal scaling. And a NewSQL
database available within Google Cloud is Cloud Spanner. What are some of the
other key learnings that we should take away from this question? You
should know the global availability of of the storage options. Are they regional, or
are they global? Within each storage option, it’s
important to know the ceiling for the size of stored and processed data. Upto
what amount of data does it perform well, and after what point is there a
degradation in performance? Certain storage options like Bigtable will not
have a degradation in performance provided you
keep on adding nodes. So, those scale linearly. Whereas certain others like
MySQL, will hit limits in and around terabytes of data. You should know the
difference in these technologies. What is a relational vs. a non-relational
database? What are the differences between SQL, NoSQL
and NewSQL? How is an OLTP different from an OLAP? What is
structured, semi-structured and unstructured data, and which of the
storage options support which of these? Know what is a regional, a dual-regional or
multi-regional availability for different storage options, including for unstructured data like Cloud Storage and Filestore.
Also know what our ACID transactions, and what each of them stand for. With
that, I will leave you with a few references on Cloud Spanner, Bigtable, replication
of some of these technologies including Bigtable and Cloud SQL, and some of
the Wikipedia articles on database transactions and relational databases.
Now, it’s time to subscribe to all the great content we’ve got lined up for you,
to learn Google Cloud, and to help you the certifications.

Leave a Reply

Your email address will not be published. Required fields are marked *