Parallel and distributed systems are a pervasive component of the modern computing environment. Today, large-scale data centers and supercomputing facilities have become ubiquitous. They consist of heterogeneous, geo-distributed clusters with hundreds of thousands of general-purpose multicores, energy-efficient cores, and specialized accelerators such as GPUs and FPGAs. Such computing infrastructure powers not only some of the most popular consumer applications (Internet services such as web search and social networks) but also a growing number of scientific, big data, and enterprise workloads. Given the growing importance of these diverse applications, my research focuses on building software systems for this new computing infrastructure.
In this talk, I present an overview of my research group "Parallel and Distributed Systems" at TU Dresden. The mission of my group is to build adaptive software systems targeting parallel and distributed computing. To achieve adaptiveness, we follow three core design principles: (1) _Resiliency_ against fail-stop and Byzantine faults to ensure the safety and security of applications; (2) _Efficiency_ of applications by enabling a systematic trade-off between application performance (latency/throughput) and resource utilization/energy consumption; and (3) _Scalability_ to seamlessly support ever-growing application workloads with an increasing number of cores while embracing the heterogeneity of the underlying computing platform.
As I show in my talk, we follow these three design principles at all levels of the software stack, from operating systems, storage and file systems, and parallelizing compilers and run-time libraries all the way to distributed middleware. Our approach transparently supports existing applications: it requires neither a radical departure from current programming models nor complex, error-prone application-specific modifications.