Proceedings of the 3rd International Conference on Cloud Computing Technologies and Science (IEEE Cloudcom '11), Athens, Greece, December 2011
This paper presents the design, implementation, and evaluation of a new system for on-demand provisioning of Hadoop clusters across multiple cloud domains. The clusters are composed of virtual machines from multiple cloud sites linked by bandwidth-provisioned network pipes. The prototype uses an existing federated cloud control framework, the Open Resource Control Architecture (ORCA), which orchestrates the leasing and configuration of virtual infrastructure from multiple autonomous cloud sites and network providers. ORCA enables computational and network resources from multiple clouds and network substrates to be aggregated into a single virtual “slice” of resources, built to order for the needs of the application. The experiments examine provisioning alternatives by evaluating the performance of representative Hadoop benchmarks and applications on resource topologies with varying bandwidths. The evaluations identify conditions under which multi-cloud Hadoop deployments offer significant advantages or disadvantages during Map/Reduce/Shuffle operations. Further, the experiments compare multi-cloud Hadoop deployments with single-cloud deployments and investigate Hadoop Distributed File System (HDFS) performance under varying network configurations. The results show that networked clouds make cross-cloud Hadoop deployment feasible when high-bandwidth network links connect the clouds. As expected, performance for some benchmarks degrades rapidly when inter-cloud bandwidth is constrained. MapReduce shuffle patterns and certain HDFS operations that span the constrained links are particularly sensitive to network performance. Hadoop’s topology-awareness feature can mitigate these penalties to a modest degree in these hybrid bandwidth scenarios.
Additional observations show that contention among co-located virtual machines is a source of irregular performance for Hadoop applications on virtual cloud infrastructure.
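
To illustrate the topology-awareness feature mentioned above, the following is a minimal sketch of a Hadoop topology script, not the configuration used in the paper. Hadoop can be pointed at such a script through the `topology.script.file.name` property (`net.topology.script.file.name` in later releases); it calls the script with one or more node addresses and reads one network location per address from standard output. The address prefixes, site names, and the `resolve` helper are hypothetical and chosen only for illustration.

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop topology script that maps nodes to cloud sites."""
import sys

# Hypothetical address prefixes for two cloud sites (assumed, not from the paper).
SITE_BY_PREFIX = {
    "10.1.": "/site-a/rack-1",
    "10.2.": "/site-b/rack-1",
}

# Fallback location; kept at the same depth as the known locations,
# since Hadoop expects all network locations to have a uniform depth.
DEFAULT_LOCATION = "/default-site/default-rack"


def resolve(host):
    """Return the network location for one host name or IP address."""
    for prefix, location in SITE_BY_PREFIX.items():
        if host.startswith(prefix):
            return location
    return DEFAULT_LOCATION


def main(argv):
    # Print one location per input address, separated by spaces.
    print(" ".join(resolve(host) for host in argv))


if __name__ == "__main__":
    main(sys.argv[1:])
```

With a mapping of this kind, nodes at the same site share a network location, so Hadoop can prefer site-local replicas and task placement over transfers that cross the constrained inter-cloud link.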