Data Oriented Computing — DRE, not SRE

Jingdong Sun
Mar 16, 2022


In our current cloud native world, almost everyone knows of Site Reliability Engineering (SRE). After Ben Treynor Sloss founded the first SRE team at Google in 2003, SRE has become more and more important to companies and IT markets for maintaining the desired reliability of IT solutions and applications. SRE has been adopted across almost all IT companies. Although some call it by a different name (Facebook, for example, calls SRE Production Engineering), the basic responsibilities are the same: to create scalable and highly reliable software systems and applications.

However, as I mentioned in the first blog of this series, it is time to evolve to be data oriented. In other words, it is time to move to Data Reliability Engineering (DRE) instead of focusing only on a site, a system, or a service.

Why?

Before answering this question, let us first discuss data reliability as a concept and scope.

Data Reliability

Data reliability means that data is complete and accurate

According to Wikipedia, data reliability may also refer to data integrity.

Data integrity is the maintenance of, and the assurance of, data accuracy and consistency over its entire lifecycle

In my opinion, the definition of data integrity is closer to what data reliability should be, which includes three key parts: (1) maintenance and assurance; (2) data accuracy and consistency (a reliable state); and (3) continuation through the entire lifecycle of data. For simplicity’s sake, I will still refer to data reliability in my discussion.

Based on our definition, data reliability is about data accuracy, completeness, and consistency through the whole process, from data sources, to data collection, to analytics, and to reporting to end users. Sounds clear and simple? Not really. To bring further clarity, I want to break down data reliability into two scopes:

  1. Data origination reliability
  2. Data computing lifecycle reliability

Data Origination Reliability

Any data, whether static or realtime, has a starting point: the moment the data is generated.

The first element of reliability is the accuracy and completeness of data at the time it is generated. If the data is realtime data, this means it is correct and true at the moment it is generated. If the data is static, this means it is correct and true at the time it is saved/inserted into a database or data store.

However, figuring out whether data is correct at the time of generation is not easy, simply because users do not have absolute/common standards for all cases. So based on whether standards can be defined or not, data origination reliability can also be divided into two cases:

  1. Data with human definable standards. For example, 03/01/2022 means March 1st, 2022 in the U.S., but January 3rd, 2022 in Europe. To avoid errors, defining a standard can make data accurate without confusion/error.
  2. Data without human definable standards. This happens for one of two reasons: (1) people do not agree on a common standard because of culture, law, faith, philosophy, background, politics, etc. For example, of the countless chats, news stories, comments, blogs, and posts on social media, how many are correct and truthful by the same, commonly agreed standards? Or (2) a common standard is not yet achievable, given the limits of technology and of human knowledge of the natural sciences. This case affects the majority of data generated in the current world.

What we need to do is reduce the occurrence of case 2 where possible, and increase the occurrence of case 1 by defining standards, even if only within a limited scope. This is what The Open Group and similar organizations are doing in IT.
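To make the date example concrete, here is a minimal Python sketch (illustrative only) showing how the same string is ambiguous without a standard, and how adopting a standard such as ISO 8601 removes the ambiguity at the point of origination:

```python
from datetime import datetime

raw = "03/01/2022"

# Without an agreed standard, the same string yields two different dates.
us_reading = datetime.strptime(raw, "%m/%d/%Y")  # March 1st, 2022 in the U.S.
eu_reading = datetime.strptime(raw, "%d/%m/%Y")  # January 3rd, 2022 in Europe
assert us_reading != eu_reading

# Storing dates in a standard format such as ISO 8601 leaves only one reading.
unambiguous = datetime.fromisoformat("2022-03-01")
print(unambiguous.strftime("%B %d, %Y"))  # March 01, 2022
```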

Is it possible to totally avoid case 2? Ideally, YES, but in our current world the answer is NO. This question could trigger a huge discussion, and I welcome comments or messages on LinkedIn. Since it is not within the scope of this blog, let us continue with our focus on data reliability.

Data Computing Lifecycle Reliability

What I want to focus on in the data oriented computing world with this blog is the maintenance and assurance of data accuracy, completeness, and consistency over the data’s entire computing lifecycle: from the point at which the data enters that lifecycle (whether or not it is true at origination, as discussed in the “Data Origination Reliability” section) to the point at which it is shown, with value, to the end user.

As the image below shows, whether we call a deer a deer or call a deer a horse, once the data is generated it shall be kept accurate and consistent over its lifecycle and flow. Any update to the data also needs to be accurate and consistent.

Data Reliability vs System/service Reliability

Let’s continue with the “call a deer a horse” example to illustrate how data reliability differs from the system or service reliability that current SRE focuses on.

After I accidentally input “horse” for a deer image into the database (the starting point of the data lifecycle), we could see some of the scenarios below:

  1. I (as the database owner) see the error and correct it to “deer”.
  2. Someone else (who does not have authority) hacks into the database and corrects it to “deer” — a good change, but one that is not supposed to happen.
  3. After 1 or 2, the data flows through its lifecycle and many micro-services, and is shown to the end user correctly (the happy path).
  4. After 1 or 2, somehow the transaction is lost, so the end user still sees “horse”.
  5. Some computing logic or data analytics logic goes funny (for example, a new ML model is not well trained), so the end user sees “donkey”!
  6. Some services crash and cannot restart, so the end user cannot get anything.

Of all the above scenarios, only scenario 6 is related to system reliability. Even if the system/service is 100% reliable and available, scenarios 2, 4, and 5 can still happen. In these cases, the data is not reliable, due to security or logic issues.
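One way to catch the silent divergence of scenario 4, where every service reports success but the served data no longer matches its source, is to fingerprint a record at the start of its lifecycle and verify the fingerprint at the point of serving. Here is a hedged Python sketch (the record shapes and names are hypothetical):

```python
import hashlib

def fingerprint(record: dict) -> str:
    # Hash a canonical rendering of the record so any field change is detectable,
    # regardless of key order.
    canonical = "|".join(f"{key}={record[key]}" for key in sorted(record))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Record after the owner's correction in scenario 1.
source = {"id": 42, "label": "deer"}
source_fp = fingerprint(source)

# Record as served to the end user; in scenario 4 the correction was lost.
served = {"id": 42, "label": "horse"}

if fingerprint(served) != source_fp:
    print("data reliability violation: served record diverged from its source")
```

A real pipeline would store the source fingerprint alongside lineage metadata and verify it at each hop; the point here is only that this check operates on the data itself, not on service health.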

So, system reliability cannot guarantee data reliability, but data reliability surely requires that the system be reliable:

Data Reliability = System Reliability (SRE) + System Security + Data Security + Data Lineage + Data Governance + Data Quality + more…
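As a rough illustration of this sum (a sketch only, with the factors simplified to booleans), each term can be read as a necessary condition: the data is reliable only when all of them hold at once.

```python
def data_reliable(system_reliable: bool, system_secure: bool, data_secure: bool,
                  lineage_intact: bool, governance_ok: bool, quality_ok: bool) -> bool:
    # System reliability alone is not enough: scenarios 2, 4, and 5 above
    # all occur on a perfectly available system.
    return all([system_reliable, system_secure, data_secure,
                lineage_intact, governance_ok, quality_ok])

# A 100% reliable system with a lost transaction (scenario 4) still fails.
print(data_reliable(True, True, True, False, True, True))  # False
```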

It is necessary to change

Now back to the initial question — why is it time to move to data reliability engineering instead of site/system/service reliability engineering?

The simple answer is that data is becoming the center of the computing world. For an end user, data is what matters most, and data reliability engineering offers more comprehensive coverage than site reliability engineering.

To break this down into even greater detail:

  1. Data reliability covers more than system/service reliability, as mentioned above. In addition to system/service reliability, it also includes data and system security, governance, consistent data logic, data lineage, data quality, etc. When teams focus only on system/service reliability without syncing with these other areas (as we see in the current IT world, where these areas are generally covered by many different teams with different approaches, focuses, and designs), SRE will generate chaos, toil, and unnecessary or redundant work. In other words, putting all these areas under one pipeline and one design (DRE instead of SRE) will optimize both the pipeline and its operation.
  2. Obviously, with comprehensive design and coverage, DRE can catch many corner cases or areas that are missed today due to cross-team coordination issues, ultimately increasing (data) product quality.
  3. Optimized pipelines will also reduce hardware cost and human operation engagement.

Shall we get rid of SRE completely? No. But DRE is increasingly necessary in today’s world and can work in parallel with, or build on top of, SRE. We need to broaden our coverage and job responsibilities from system reliability to data reliability, but the basic SRE principles remain valid for DRE:

  1. embracing risk,
  2. Data Level Objectives (DLOs) instead of Service Level Objectives (SLOs),
  3. eliminating toil,
  4. monitoring,
  5. automation, and
  6. maintaining simplicity.

DRE teams still need to have these principles, but will need to use them in a wider scope, focusing on data reliability in a comprehensive design/architecture/pipeline.

Conclusion — DRE, not SRE

The conclusion is simple: it is time to move to data reliability engineering instead of staying with site reliability engineering. It is time to put this shift into action, since this operational approach will fit the current needs of data oriented computing and generate optimized results for products and companies.

This is the 2nd installment of my “Data Oriented Computing” series. Stay tuned for the next one.

  1. Data Oriented Computing: Scope and Mindset
  2. Data Oriented Computing: Operations
  3. Data Oriented Computing: Historical data vs Real-time data
  4. Data Oriented Computing: Architecture
  5. Data Oriented Computing: Languages
