Data, a First Class Citizen
Sep 1, 2022
When computers were born, data was on the “peripherals”! We programmed arithmetic and logic into computers, and we had inputs and outputs for data. Applications and services implemented logic in maintainable and programmable units: functions, modules, and libraries. These applications produced data as a result of executing that logic, and the data was stored on disks, transferred over networks, and displayed on screens. Later, we developed technologies to extract data from storage or in motion, and to transfer, transform, persist, and analyze it. We designed data warehouses, data marts, and modelling techniques to store and analyze this data efficiently. Businesses could make informed decisions using business intelligence tools operating on these data stores.
Today, in the age of AI/ML, data is as important as logic. It is intrinsic to intelligent systems. Often, we derive logic by training machine learning algorithms on large datasets. Further, that logic is enhanced in an automated, continuous, and progressive manner by leveraging new information streaming into data processing pipelines. Event streams and event processing are taking over the role of data extraction, transfer, and transformation. Event buckets have gone on to replace temporal databases too.
Databases are an accumulation of database commit logs (transactions), presenting a historical or up-to-date view of the entities stored. Event streams and buckets are time-bound accumulations of event logs, building and maintaining a temporal view of the entities that pass through an event pipeline. Data stores excel in long-term storage and persistent views of business rules. Event streams excel in real-time visibility, just-in-time availability, and pull-based processing.
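To make the contrast concrete, here is a minimal sketch in Python. It assumes a toy in-memory log; the `Event` type, the `database_view` and `bucket_view` functions, and the sample values are all hypothetical and only illustrate the idea. Replaying the whole log gives the database-style, up-to-date view, while replaying only a recent window gives the time-bound view of an event bucket.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Event:
    """One entry in a commit or event log (hypothetical shape)."""
    ts: int                # event time, in seconds
    entity_id: str
    value: Optional[int]   # None marks a delete


def database_view(log: Iterable[Event]) -> Dict[str, int]:
    """Replay the whole log in time order to get the latest state of every entity."""
    state: Dict[str, int] = {}
    for event in sorted(log, key=lambda e: e.ts):
        if event.value is None:
            state.pop(event.entity_id, None)
        else:
            state[event.entity_id] = event.value
    return state


def bucket_view(log: Iterable[Event], now: int, window: int) -> Dict[str, int]:
    """Replay only the events inside (now - window, now]: a time-bound view."""
    recent = [e for e in log if now - window < e.ts <= now]
    return database_view(recent)


log = [
    Event(ts=10, entity_id="sku-1", value=5),
    Event(ts=20, entity_id="sku-2", value=7),
    Event(ts=95, entity_id="sku-1", value=3),
]

print(database_view(log))                   # {'sku-1': 3, 'sku-2': 7}
print(bucket_view(log, now=100, window=30))  # {'sku-1': 3}
```

Both views come from the same log. The only difference is how much history is replayed, which is why the two approaches can complement each other rather than compete.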
If event processing is taking over the real-time aspects of data processing, we need to take a good look at how new architectures around data should be established. New business models demand that transactions and business operations happen in the context of recent information that remains active for a defined time period. Thus, there should also be an optimal segregation and harmony in data processing between event-driven architectures, search engines, data lookups, aggregations, and historical data processing.
As an example, consider sales transactions in a retail business. We often apply a single view of truth to something that is a single entity in the database: sales invoice documents. But it is perceived and used in different forms by different business functions. In practice, store operations look at recent data in detail, warehouse and supply chain operations look at aggregate data at day/product/location levels, finance looks at consolidated data at store/department level, and merchandise planning looks at normalized data at season/week/category levels. The context that data provides for these business functions varies in shape, grain, rules, accuracy, quality, speed, and intervals.
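The retail example can be sketched in a few lines of Python. The `InvoiceLine` fields, the sample rows, and the `aggregate` helper are all hypothetical and chosen only for illustration. The point is that one set of invoice lines yields views of different shape and grain for different business functions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Iterable


@dataclass(frozen=True)
class InvoiceLine:
    """One line of a sales invoice (hypothetical fields)."""
    invoice_id: str
    day: str
    store: str
    department: str
    product: str
    quantity: int
    amount: float


LINES = [
    InvoiceLine("inv-1", "2022-09-01", "store-A", "grocery", "milk", 2, 3.00),
    InvoiceLine("inv-1", "2022-09-01", "store-A", "grocery", "bread", 1, 2.50),
    InvoiceLine("inv-2", "2022-09-01", "store-B", "grocery", "milk", 4, 6.00),
    InvoiceLine("inv-3", "2022-09-02", "store-A", "apparel", "socks", 3, 9.00),
]


def aggregate(
    lines: Iterable[InvoiceLine],
    key: Callable[[InvoiceLine], Hashable],
    measure: Callable[[InvoiceLine], float],
) -> Dict[Hashable, float]:
    """Sum a measure over lines grouped by the given key (the 'grain' of the view)."""
    totals: Dict[Hashable, float] = defaultdict(float)
    for line in lines:
        totals[key(line)] += measure(line)
    return dict(totals)


# Store operations: recent data in full detail.
store_ops_view = [l for l in LINES if l.day == "2022-09-02"]

# Warehouse and supply chain: quantities at day/product/location grain.
supply_chain_view = aggregate(
    LINES, lambda l: (l.day, l.product, l.store), lambda l: l.quantity
)

# Finance: amounts consolidated at store/department grain.
finance_view = aggregate(LINES, lambda l: (l.store, l.department), lambda l: l.amount)

print(finance_view[("store-A", "grocery")])  # 5.5
```

Each view is derived from the same entity but answers a different question, so no single shape of the data serves every function equally well.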
And even these views drift over time. We know that business models, practices, requirements, and rules evolve over time. Similarly, it should be possible to evolve the shape of data to suit the changing business landscape. Holding historical data in a frozen form tends to add invisible costs in repeated processing and in the effort to shape and fit data into today’s business models and requirements. With AI/ML, the shaping of data is a constant endeavour as we find new ways of modelling business problems, enriching data attributes, and leveraging complementary ecosystems. Centralized data management patterns tend to give false confidence in data and impose repeated processing and transformations. On-demand processing from monolithic data stores would give way to on-demand data access from up-to-date, distributed data stores organized by business function, as data products, data apps, and data services.
In future architectures of data, we would see the boundaries blur between application integration and data integration. Event streams, queues, processing systems, and data stores would complement each other to provide business capabilities on data. Just as apps and services wrap logic, we would have data apps and data services wrapping data entities. And, just as a business application is distributed over a network as microservices, data would be distributed over a network in the form of a data mesh or fabric. Data would be available through services, which could be orchestrated to create appropriate shapes, volumes, and packages of data that are fit for purpose. And, as artificially intelligent systems become more ubiquitous and accessible, we would see data standing alongside logic to enhance experiences, not just as an I/O operation from the peripherals!