Big Data in the Cloud: AWS Data Pipeline and Amazon Redshift

Amazon has significantly upgraded its cloud infrastructure for big data. With AWS Data Pipeline, a service (currently in beta) is now available that automatically moves and processes data across different systems. Amazon Redshift is a data warehouse in the cloud that is said to be ten times faster than previously available solutions.

AWS Data Pipeline

With AWS Data Pipeline, Amazon wants to improve access to the steadily growing volume of data that resides on distributed systems and in different formats. For example, the service loads text files from Amazon EC2, processes them and saves them on Amazon S3. The central hub is the AWS Management Console, where pipelines are defined together with their sources, conditions, targets and commands. Schedules determine when each job is processed. AWS Data Pipeline thus determines from which system and under which condition data is loaded and processed, and where it is stored afterwards.
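To make the building blocks more concrete, here is a minimal sketch in Python of how a pipeline with a source, a condition, a command and a target fits together. It is a toy model only and does not use the AWS Data Pipeline API: the `Pipeline` class, its fields and the in-memory dictionaries standing in for EC2 and S3 are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical in-memory stand-ins for real storage systems (not AWS calls).
Store = Dict[str, str]


@dataclass
class Pipeline:
    """Toy model of one pipeline: source, condition, command, target."""
    name: str
    source: Store
    target: Store
    precondition: Callable[[Store], bool]  # must hold before the job runs
    command: Callable[[str], str]          # processing step applied per item
    schedule: str = "daily"                # label only; no scheduler is modeled

    def run(self) -> int:
        """Run the job once and return the number of items written."""
        if not self.precondition(self.source):
            return 0
        for key, text in self.source.items():
            self.target[key] = self.command(text)
        return len(self.source)


def keep_error_lines(text: str) -> str:
    """Example command: keep only the lines that start with 'error'."""
    return "\n".join(line for line in text.splitlines() if line.startswith("error"))


# Assumed sample data: a text file on an "EC2" host and an empty "S3" bucket.
ec2_logs: Store = {"app.log": "error: disk full\ninfo: ok\n"}
s3_bucket: Store = {}

pipeline = Pipeline(
    name="logs-to-s3",
    source=ec2_logs,
    target=s3_bucket,
    precondition=lambda src: len(src) > 0,  # only run if there is data
    command=keep_error_lines,
)
written = pipeline.run()
```

In the real service the definition lives in the AWS Management Console and the schedule triggers the run; here `run()` is called by hand to keep the sketch self-contained.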

Data can be processed directly in the Amazon cloud on EC2 instances or in the customer's own data center. For this purpose, the open source tool Task Runner is used, which communicates with AWS Data Pipeline. Task Runner must run on every system that processes data.

Amazon Redshift

Amazon’s cloud data warehouse, Amazon Redshift, helps to analyze huge amounts of data in a short time frame. It can store up to 1.6 petabytes of data, which can be queried using SQL. The service is basically billed on a pay-as-you-go basis. However, customers who sign a three-year contract and fully utilize their virtual infrastructure pay from 1,000 USD per terabyte per year. Amazon compares this with figures from IBM, which charges between 19,000 USD and 25,000 USD per terabyte per year for a data warehouse.

The first Amazon Redshift beta users include Netflix, JPL and Flipboard, whose queries ran 10 to 150 times faster than on their existing systems.
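The price comparison quoted above can be worked through with a few lines of Python. This is a rough sketch using only the per-terabyte figures from the text; the constant and function names are made up for illustration, and the Redshift figure is a "from" price that applies only to the three-year commitment.

```python
# Figures quoted in the article, in USD per terabyte per year.
REDSHIFT_USD_PER_TB_YEAR = 1_000          # lowest price, three-year contract
IBM_USD_PER_TB_YEAR = (19_000, 25_000)    # quoted range for IBM


def annual_cost_usd(terabytes: float, usd_per_tb_year: float) -> float:
    """Yearly cost for a warehouse of the given size."""
    return terabytes * usd_per_tb_year


def price_ratio_range() -> tuple:
    """How many times more expensive the IBM range is per terabyte."""
    low, high = IBM_USD_PER_TB_YEAR
    return (low / REDSHIFT_USD_PER_TB_YEAR, high / REDSHIFT_USD_PER_TB_YEAR)


# Example: a 10 terabyte warehouse (the size is an assumed sample value).
redshift_cost = annual_cost_usd(10, REDSHIFT_USD_PER_TB_YEAR)
ibm_cost_low = annual_cost_usd(10, IBM_USD_PER_TB_YEAR[0])
ibm_cost_high = annual_cost_usd(10, IBM_USD_PER_TB_YEAR[1])
```

On these numbers the quoted IBM prices are 19 to 25 times the lowest Redshift price per terabyte.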

Amazon Redshift can be used as a single-node cluster with one server and a maximum of 2 terabytes of storage, or as a multi-node cluster consisting of at least two compute nodes and one leader node. The leader node is responsible for connection management, parsing queries, creating execution plans and managing the requests for each compute node. The main processing is done on the compute nodes. Compute nodes are provided as hs1.xlarge with 2 terabytes of storage and as hs1.8xlarge with 16 terabytes of storage. A cluster can contain a maximum of 32 hs1.xlarge or 100 hs1.8xlarge compute nodes. This results in a maximum storage capacity of 64 terabytes or 1.6 petabytes, respectively. All compute nodes are connected over a separate 10 gigabit/s backbone.
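The capacity limits follow directly from the node sizes and node counts given above. As a quick check, the arithmetic can be sketched in Python; the dictionary layout and function name are illustrative assumptions, and only the numbers come from the text.

```python
# Node types as described in the article: storage per node and node limit.
NODE_TYPES = {
    "hs1.xlarge": {"storage_tb": 2, "max_nodes": 32},
    "hs1.8xlarge": {"storage_tb": 16, "max_nodes": 100},
}


def max_cluster_storage_tb(node_type: str) -> int:
    """Maximum cluster storage in terabytes for one node type."""
    spec = NODE_TYPES[node_type]
    return spec["storage_tb"] * spec["max_nodes"]


small_tb = max_cluster_storage_tb("hs1.xlarge")   # 32 nodes x 2 TB
large_tb = max_cluster_storage_tb("hs1.8xlarge")  # 100 nodes x 16 TB
large_pb = large_tb / 1000                        # decimal petabytes
```

The larger configuration comes to 1,600 terabytes, which matches the 1.6 petabytes quoted for Redshift as a whole.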


Regardless of the competition, Amazon keeps expanding its cloud services portfolio. As a result, one can sometimes get the impression that all the other IaaS providers are marking time, given the innovative power of Amazon Web Services. I can only stress here once again that "Value added services are the future of infrastructure-as-a-service" and "Don’t compete against the Amazon Web Services just with infrastructure".

If we look at the latest developments, we see a steadily increasing demand for solutions that process large amounts of structured and unstructured data. Barack Obama’s campaign is just one use case that shows how important the possession of high-quality information is for gaining competitive advantages in the future. And even though many see Amazon Web Services as "just" a pure infrastructure-as-a-service provider (I don’t), Amazon is, more than any other (IaaS) provider, a front-runner in the battle for big data solutions, and not only because of the knowledge gained from operating Amazon.com.

By Rene Buest

Rene Buest is a Gartner analyst covering Infrastructure Services & Digital Operations. Prior to that he was Director of Technology Research at Arago, Senior Analyst and Cloud Practice Lead at Crisp Research, Principal Analyst at New Age Disruption and a member of the worldwide Gigaom Research Analyst Network. Rene is considered a top cloud computing analyst in Germany and one of the top analysts worldwide in this area. In addition, he is one of the world’s top cloud computing influencers and belongs to the top 100 cloud computing experts on Twitter and Google+. Since the mid-90s he has focused on the strategic use of information technology in businesses and the impact of IT on our society, as well as on disruptive technologies.

Rene Buest is the author of numerous professional technology articles. He regularly writes for well-known IT publications such as Computerwoche, CIO Magazin, LANline and Silicon.de, and is cited in German and international media, including New York Times, Forbes Magazin, Handelsblatt, Frankfurter Allgemeine Zeitung, Wirtschaftswoche, Computerwoche, CIO, Manager Magazin and Harvard Business Manager. Furthermore, Rene Buest is a speaker and a participant in expert panels. He is the founder of CloudUser.de and writes about cloud computing, IT infrastructure, technologies, management and strategies. He holds a diploma in computer engineering from the Hochschule Bremen (Dipl.-Informatiker (FH)) as well as an M.Sc. in IT-Management and Information Systems from the FHDW Paderborn.
