Data Science

Big Data, Causal Inference, and the Next Frontier of Data Science

Prof. Rajesh Subramaniam
Indian Institute of Science, Bengaluru
26 May 2026

Data science has quietly become one of the most consequential disciplines of the 21st century — not because it invented new mathematics, but because it combined existing methods with cheap computation and abundant data to produce results that matter. This article traces the discipline's intellectual lineage, its current frontiers, and the structural challenges it must resolve to fulfil its promise in developing economies.

From Statistics to Data Science: A Genealogy

The intellectual ancestors of data science are well known: Ronald Fisher's analysis of variance, Claude Shannon's information theory, John Tukey's exploratory data analysis. What changed in the 1990s was not the mathematics but the substrate. As digital traces of human behaviour accumulated (web clicks, transaction records, sensor readings), statisticians found themselves holding far more data than their classical methods had been designed to handle.

The response was pragmatic and eclectic. Researchers borrowed from computer science, machine learning, and database engineering. The term "data science" emerged to describe this synthesis — a field defined less by a unified theory than by a shared toolkit and a disposition toward empirical problem-solving.

The Big Data Inflection Point

The launch of the Hadoop framework in 2006, followed by Spark in 2009, democratised large-scale data processing. Organisations that previously required specialised supercomputing infrastructure could now process petabytes of data on commodity hardware. This technical shift had a profound epistemic consequence: the distinction between a "sample" and a "population" began to blur. When "everything" became cheap to store and process, new questions arose about what statistical inference even means when you have access to the whole population.

Simultaneously, the rise of deep learning, enabled by GPU-accelerated computing and large labelled datasets, demonstrated that neural architectures with tens of millions of parameters or more could solve perception tasks that had resisted decades of hand-crafted approaches. Image classification, speech recognition, machine translation: all fell to deep learning within roughly five years (2012–2017).

Frontiers: Causal Inference and Federated Learning

Two methodological frontiers deserve particular attention. First, causal inference. Most data science tools are optimised for prediction: given inputs X, predict output Y. But most high-stakes decisions require causal understanding: does intervention X cause outcome Y? The distinction matters enormously. Judea Pearl's do-calculus and the potential outcomes framework provide the theoretical tools; integrating them with machine learning pipelines at scale is an active and important research frontier.
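The gap between prediction and causation can be made concrete with a small simulation. The sketch below is illustrative only: the data-generating process, the binary confounder z, and the helper names (simulate, naive_difference, backdoor_adjusted) are assumptions introduced here, not taken from any particular library or study. It compares a naive treated-versus-control contrast with a simple backdoor adjustment that stratifies on the confounder.

```python
import random


def simulate(n=20000, seed=0):
    """Generate (z, x, y) rows. Assumed setup: the true effect of x on y is 1.0,
    and the binary confounder z raises both the chance of treatment and the outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0
        x = 1 if rng.random() < (0.8 if z else 0.2) else 0
        y = 1.0 * x + 2.0 * z + rng.gauss(0, 1)
        rows.append((z, x, y))
    return rows


def mean(values):
    return sum(values) / len(values)


def naive_difference(rows):
    """Predictive contrast: E[y | x=1] - E[y | x=0], ignoring z."""
    treated = [y for z, x, y in rows if x == 1]
    control = [y for z, x, y in rows if x == 0]
    return mean(treated) - mean(control)


def backdoor_adjusted(rows):
    """Backdoor adjustment: contrast within each stratum of z, weighted by P(z)."""
    effect = 0.0
    for level in (0, 1):
        stratum = [r for r in rows if r[0] == level]
        treated = [y for z, x, y in stratum if x == 1]
        control = [y for z, x, y in stratum if x == 0]
        effect += (len(stratum) / len(rows)) * (mean(treated) - mean(control))
    return effect


rows = simulate()
naive = naive_difference(rows)
adjusted = backdoor_adjusted(rows)
print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}")
```

Under these assumed parameters the naive contrast should land near 2.2, because it absorbs the confounder's influence, while the adjusted estimate should land near the true effect of 1.0. The adjustment works here only because z is observed and is the sole confounder; real pipelines rarely have that guarantee.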

Second, federated learning. Many of the world's most valuable datasets — medical records, financial transactions, personal communications — are siloed for privacy and regulatory reasons. Federated learning trains models across distributed datasets without centralising sensitive data. This approach is already deployed in Google's Gboard keyboard prediction and is being piloted in clinical research consortia.
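A minimal sketch of the idea, in the style of federated averaging, is given below. Everything in it is an assumption made for illustration: a one-parameter linear model, three synthetic clients of different sizes, and the helper names make_client_data, local_update, and federated_average. It is not a description of how Gboard or any clinical consortium implements training. The point it shows is that only model parameters travel between clients and server, never the raw records.

```python
import random


def make_client_data(n, seed, true_w=3.0):
    """Synthetic private dataset for one client: y = true_w * x + noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        data.append((x, true_w * x + rng.gauss(0, 0.1)))
    return data


def local_update(w, data, lr=0.1, epochs=5):
    """Run a few full-batch gradient steps on squared error, using only local data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(clients, rounds=50):
    """Server loop: broadcast w, collect local weights, average by client size."""
    w = 0.0
    total = sum(len(d) for d in clients)
    for _ in range(rounds):
        local = [(local_update(w, d), len(d)) for d in clients]
        w = sum(local_w * n for local_w, n in local) / total
    return w


# Three hypothetical clients holding 50, 200, and 500 records respectively.
clients = [make_client_data(n, seed) for seed, n in enumerate([50, 200, 500])]
w = federated_average(clients)
print(f"learned weight: {w:.3f}")
```

The learned weight should end up close to the assumed true value of 3.0. Weighting each client's update by its record count keeps large and small data holders in proportion. Production systems add pieces this sketch omits, such as secure aggregation, client sampling, and handling of clients whose data distributions differ.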

The Developing-Economy Gap

Global data science talent and infrastructure are heavily concentrated. A 2023 survey found that 67% of published data science research originates from five countries: the United States, China, the United Kingdom, Germany, and Canada. India, despite producing approximately 1.5 million STEM graduates annually, accounts for less than 4% of high-impact data science publications. The gap is not primarily one of talent but of structure: underfunded university research programmes, limited access to large proprietary datasets, and a brain drain that concentrates expertise in industry rather than academia.

Conclusion

Data science is at an inflection point. Its first generation of methods is now commoditised. The discipline's next contribution will come from harder problems: causal understanding, privacy-preserving computation, robustness under distribution shift, and equitable deployment across resource-heterogeneous environments. These are scientific challenges as much as engineering ones, and they demand the rigour and peer accountability that academic research is uniquely positioned to provide.


About the Author
Prof. Rajesh Subramaniam
Indian Institute of Science, Bengaluru, India

Professor of Data Science at IISc Bengaluru with 18 years of experience in big data analytics, distributed systems, and statistical modelling. Author of two textbooks on applied machine learning.
