Data Oriented Computing — Historical Data vs Real-time Data

Jingdong Sun
7 min read · Jun 14, 2022


This is the third installment in my Data Oriented Computing series.

In April this year, Google announced BigLake, which has repercussions for data oriented computing. It is now important to understand and discuss the positions of historical (batch) data and real-time data in the future of data oriented computing.

What are historical data and real-time data?

I admit that this is an oversimplification, but for the purposes of this article, any data that is not real-time is referred to as historical data, whether structured or unstructured, time-series or not.

So what is real-time data? According to Wikipedia,

Real-time data (RTD) is information that is delivered immediately after collection. There is no delay in the timeliness of the information provided.

Quite honestly, this definition is lacking. Even though immediate delivery is a clear tenet of real-time data, answers to questions like “What is used for collection?” or “How long does collection take?” are still unknown.

Another definition, from Splunk, makes it clearer: real-time data is immediately available and is not stored at all. That rules out batch processing (like Spark) and even storage in a time-series database, which many “real-time” solution architectures on the current market rely on.

Real-time data is data that is available as soon as it’s created and acquired. Rather than being stored, data is forwarded to users as soon as it’s collected and is immediately available — without any lag — which is crucial for supporting live, in-the-moment decision making.

Based on these two definitions, Google BigLake is technically for historical data, not real-time data:

Google BigLake unifies data warehouses and data lakes to enable organizations to store, manage and analyze their data via a single copy of data without having to duplicate or move it or to worry about the underlying storage format or system.

Google BigLake, https://www.nextplatform.com/2022/04/06/google-biglake-stretches-bigquery-across-all-data/

As a reader, you may by now have a question in mind: yes, I am clear about historical data and real-time data, and Google BigLake is for historical data, so why did its announcement make you decide to write this blog?

Before diving into the implications of Google BigLake, it is time for us to discuss the main topic of this blog:

Positions of historical and real-time data in future data oriented computing

There have been discussions comparing real-time data analytics with historical data analytics, as seen here:

However, I am not going to follow these references by comparing real-time analytics and historical analytics. Instead, I believe that, for the future of data oriented computing, real-time data analytics will be the only “production” data analysis to generate business insights (BI). Historical data will be used mainly for machine learning (ML) model training.

In other words, the positions of historical data and real-time data in the future of data oriented computing are:

  1. Real-time data: for data analysis and generating business results and insights
  2. Historical data: for training ML models only

It looks like below:

Why?

The main reason is simple: this is the way humans learn and work. We acquire knowledge (or take historical data as reference) at school, in life, and at work, and we make decisions based on real-time situations or environments.

Similarly, ML models will be trained on common historical data (like human pre-college study) and on domain-specific data (like a college major or postgraduate study) for particular job positions. The difference is that machines learn and gain knowledge much more quickly than humans, and certainly do not need to study for 16 years (5 years of elementary school, 3 years of middle school, 4 years of high school, and 4 years of college) before attaining college-level knowledge.

The MLOps cycle will remain applicable as ML models make real-time “decisions” and real-time data becomes historical data. This parallels humans, who keep learning in real time even as real time becomes history, gaining knowledge and skills that benefit future (real-time) decisions.
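To make the cycle a little more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: `ThresholdModel`, `run_cycle`, and the `retrain_every` parameter are hypothetical names, and the "model" is a toy that simply compares each value with the mean of the data it was trained on. The point is the loop itself: decide on the live value, let that value become history, and retrain from history periodically.

```python
from statistics import mean


class ThresholdModel:
    """Toy stand-in for an ML model (hypothetical, for illustration only)."""

    def __init__(self):
        self.threshold = 0.0

    def train(self, history):
        # "Training" here is just taking the mean of the historical data.
        self.threshold = mean(history) if history else 0.0

    def decide(self, value):
        # A real-time "decision": is this value above what history suggests?
        return value > self.threshold


def run_cycle(stream, retrain_every=3):
    """Decide on each live value, then fold it into history and retrain periodically."""
    model = ThresholdModel()
    history = []
    decisions = []
    for count, value in enumerate(stream, start=1):
        decisions.append(model.decide(value))  # real-time data drives the decision
        history.append(value)                  # real-time data becomes historical data
        if count % retrain_every == 0:
            model.train(history)               # historical data is used only for training
    return decisions, model.threshold
```

Calling `run_cycle([1.0, 2.0, 3.0, 1.0, 5.0, 6.0])` makes six decisions and retrains twice; the fourth value is judged against a threshold learned from the first three.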

I believe, with future data oriented computing, a new language will emerge and the MLOps cycle will be a part of the language, including ML model development, training, enhancements, and the usage of historical data. I’ll elucidate further details of this new language in my next blog — Data Oriented Computing: A New Language.

Using historical data to train models and real-time data for analysis to reach business decisions not only matches human practices, but also mitigates many data computing challenges that current IT and markets face, such as:

  1. Data fabric vs data mesh architecture: There is a lot of discussion about these two architectural approaches, but many teams still puzzle over how to apply them to a business case, and often settle for a less-than-ideal solution that creates more challenges later. When data analytics always uses real-time data, there is a single type of data source, real-time data, which eliminates the need to handle multiple structured/unstructured data stores or data lakes. Furthermore, data will no longer need to move from data lakes to data warehouses to data marts.
  2. Data transformation, ETL/ELT/reverse ETL, etc.: When real-time data is the data source, no ETL/ELT/reverse ETL is needed. Real-time data still needs to go through a data processing pipeline for analysis to reach the final business decision (as seen in real-time data stream designs), which requires operations such as enrichment, transformation, filtering, and merging, but no data needs to be transformed and moved from one data store to another.
  3. Data lakes, data warehouses, data stores: These are no longer needed for real-time analysis, so managing and using them becomes less of an issue.
  4. Built-in data lineage: The real-time data flow path is the data lineage path.
  5. Active metadata by design: Dealing with real-time data and streaming data flow means all metadata travels with the flow. In other words, we have smart real-time data tuples that carry not only data but also its related metadata. Please check my “smart tuple” related patents (go to this link, search “smart tuple”).
  6. Operational challenges: Traditional operational challenges like backup and restore, disaster recovery, fault tolerance, and high availability will need to be adapted to real-time streaming data processing. Some challenges may be resolved outright, such as storage transactions and checkpoints for backup and restore.
  7. System of records and system of engagement challenges for some domain specific IT solutions.
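Several of the points above (in-flow operations instead of ETL, lineage as the flow path, and metadata traveling with the data) can be sketched together in a few lines of Python. This is a rough illustration under my own assumptions, not the design described in the patents: the `SmartTuple` class, the stage functions, the `REGIONS` lookup, and the sensor fields are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical reference data used by the enrichment stage.
REGIONS = {"a": "east", "b": "west"}


@dataclass
class SmartTuple:
    """A stream tuple that carries its data, its metadata, and its own lineage."""
    data: dict
    metadata: dict = field(default_factory=dict)
    lineage: list = field(default_factory=list)


def enrich(t):
    # Enrichment: add a region looked up from the sensor id.
    t.data["region"] = REGIONS.get(t.data["sensor"], "unknown")
    return t


def to_celsius(t):
    # Transformation: convert in flight and record the unit as metadata.
    t.data["temp_c"] = round((t.data.pop("temp_f") - 32) * 5 / 9, 1)
    t.metadata["unit"] = "celsius"
    return t


def hot_only(t):
    # Filter: returning None drops the tuple from the stream.
    return t if t.data["temp_c"] >= 30 else None


def run_pipeline(source, stages):
    """Push each tuple through the stages; the path it takes is its lineage."""
    for t in source:
        for stage in stages:
            t.lineage.append(stage.__name__)
            t = stage(t)
            if t is None:
                break  # filtered out, nothing to emit
        else:
            yield t
```

For example, feeding `SmartTuple({"sensor": "a", "temp_f": 95.0}, {"source": "edge-1"})` through `[enrich, to_celsius, hot_only]` yields a tuple whose `lineage` lists those three stage names in order and whose `metadata` still holds the original `source` alongside the new `unit`. Nothing is written to or read from an intermediate data store along the way.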

The list goes on…

Are technologies mature enough to support this?

Yes, I believe so. With current fast-advancing computing technologies, deep learning, and quantum computing, ML models can learn and go through MLOps cycles much more quickly without affecting real-time usage.

With machines now able to gain all historical knowledge quickly, ML models can make predictions and assumptions using real-time data in real time.

Fast networks and infrastructure further enable this possibility, as discussed in the Forrester report The Future Of Edge Computing: “In the long term, 5G rollout, network advancement, and smaller converged devices and infrastructure will proliferate. This report examines the major expected technology changes across these horizons.”

My next blog is also going to discuss a new language needed for data oriented computing, including technologies to support this.

Conclusions

Based on the current market and technologies, I strongly believe IT research and solutions need to shift and correctly position historical data and real-time data. It is time to evolve, to make data oriented computing more powerful in supporting businesses in their real-time decisions.

That said, Google’s recent announcement of BigLake came as a surprise to me: since I believe real-time is the future direction, I would certainly expect Google to be going after it as well. On second thought, even as Google joins the historical data storage fight, it may also be looking to leap ahead toward real-time data. You never know until they announce :-)

This is the third installment of my “Data Oriented Computing” series. Stay tuned for the next one on Data Oriented Computing: Architecture.

  1. Data Oriented Computing: Scope and Mindset
  2. Data Oriented Computing: Operations
  3. Data Oriented Computing: Historical data vs Real-time data
  4. Data Oriented Computing: Architecture
  5. Data Oriented Computing: Languages
