
AWS vs Azure vs Google Cloud Platform – Analytics & Big Data

by Jess Panni


Choosing the right cloud platform provider can be a daunting task. Take the big three, AWS, Azure, and Google Cloud Platform: each offers a huge number of products and services, but understanding how they meet your specific needs is not easy. Since most organisations plan to migrate existing applications, it is important to understand how these systems will operate in the cloud. Through our work helping customers move to the cloud, we have compared all three providers’ offerings in relation to three typical migration scenarios:

  • Lift and shift – the cloud service can support running legacy systems with minimal change
  • Consume PaaS services – the cloud offering is a managed service that can be consumed by existing solutions with minimal architectural change
  • Re-architect for cloud – the cloud technology is typically used in solution architectures that have been optimised for cloud

Choosing the right strategy will depend on the nature of the applications being migrated, the business landscape and internal constraints.

In this series, we’re comparing cloud services from AWS, Azure and Google Cloud Platform. A full breakdown and comparison of cloud providers and their services is available in this handy poster.

We have grouped all services into 10 categories:

  1. Compute
  2. Storage and Content Delivery
  3. Database
  4. Analytics & Big Data
  5. Internet of Things
  6. Mobile Services
  7. Networking
  8. Security & Identity
  9. Management & Monitoring
  10. Hybrid

In this post we are looking at…

Analytics & Big Data

Cloud is transforming the way organisations are thinking about their data. Advanced analytics is now driving business decision making and opening up new business opportunities.

AWS

Cloud platforms need a cost-effective way to process the vast amounts of data they store. At the centre of Amazon’s analytics offerings is Amazon Elastic MapReduce (EMR), a managed Hadoop, Spark and Presto solution. EMR takes care of setting up an underlying EC2 cluster and provides integration with a number of AWS services including S3 and DynamoDB. EMR pricing is based on an hourly cost for each node plus the price of the underlying EC2 instance. To reduce costs, it is possible to take advantage of Amazon’s EC2 reserved instance pricing to run jobs against a known schedule. Clusters can be created and deleted on demand to process specific jobs or kept running for extended periods of time. Clusters typically take around 15 minutes to provision before job execution begins.
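As a rough illustration of the pricing model described above (an hourly EMR charge per node on top of the underlying EC2 instance price), the sketch below estimates the cost of a short-lived cluster. The rates are hypothetical placeholders chosen only to make the arithmetic easy to follow, not published AWS prices, and the function name is ours.

```python
def emr_cluster_cost(nodes: int, hours: float, ec2_rate: float, emr_rate: float) -> float:
    """Estimate cluster cost: each node pays the EC2 rate plus the EMR rate per hour."""
    return nodes * hours * (ec2_rate + emr_rate)


# Hypothetical rates in USD per node-hour, for illustration only.
EC2_RATE = 0.20
EMR_RATE = 0.05

# A 10-node cluster that is created for a 3-hour job and then deleted.
cost = emr_cluster_cost(nodes=10, hours=3, ec2_rate=EC2_RATE, emr_rate=EMR_RATE)
print(f"Estimated cost: ${cost:.2f}")
```

The same function can be used to compare an on-demand cluster with one kept running for extended periods by varying `hours`.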

Data Pipeline is a data orchestration product that moves, copies, transforms and enriches data. Data Pipeline manages the scheduling, orchestration and monitoring of the pipeline activities as well as any logic required to handle failure scenarios. Data Pipeline can read and write data from most AWS storage services and supports a range of data processing activities including EMR, Hive and Pig, and can execute Unix/Linux shell commands.

For high frequency real-time analytics AWS has Kinesis Streams. Data producers can push data in real time to a Kinesis stream, where it is processed by consuming applications using the Kinesis Client Library and Connector Library. Alternatively, it is possible to connect Kinesis to an Apache Storm cluster.
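To make the producer side of this concrete, here is a minimal sketch of pushing a single event to a Kinesis stream. It assumes a boto3-style Kinesis client, whose `put_record` call takes `StreamName`, `Data` and `PartitionKey`. The stream name, event fields and the `FakeKinesisClient` stand-in are hypothetical and exist only so the example runs without AWS credentials.

```python
import json


def send_event(client, stream_name: str, event: dict, key_field: str = "device_id"):
    """Serialise one event as JSON and push it to a Kinesis stream."""
    return client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        # Records with the same partition key are routed to the same shard.
        PartitionKey=str(event[key_field]),
    )


class FakeKinesisClient:
    """Offline stand-in (hypothetical) that records calls instead of contacting AWS."""

    def __init__(self):
        self.records = []

    def put_record(self, **kwargs):
        self.records.append(kwargs)
        return {"ShardId": "shardId-000000000000", "SequenceNumber": str(len(self.records))}


# In real use this would be boto3.client("kinesis"); the fake keeps the sketch runnable.
client = FakeKinesisClient()
response = send_event(client, "clickstream", {"device_id": "sensor-7", "temperature": 21.5})
print(response["ShardId"])
```

Passing the client in as a parameter keeps the producer logic testable without a live stream.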

Kinesis Firehose can be used for large scale data ingestion. Data is pushed to a Kinesis Firehose delivery stream, from where it is automatically routed to S3, Redshift or Elasticsearch Service. The service supports client side compression and server side encryption.

Predictive analytics is possible through Amazon Machine Learning. AWS makes it very easy to create predictive models without the need to learn complex algorithms. To create a model, users are guided through the process of selecting data, preparing data, training and evaluating models through a simple wizard based UI. It is also possible to create models via the AWS SDK. Once trained, the model can be used to create predictions via an online API (request/response) or a batch API for processing multiple input records.

To make sense of data through dashboards and data visualisations, AWS offers QuickSight (currently in preview). Dashboards can be built from data stored across most AWS data storage services, and QuickSight also supports a number of third party solutions. The number of third party data connectors is currently limited but is likely to grow over time.

Azure

Azure provides a suite of analytical products called Cortana Intelligence. Included is HDInsight, Azure’s managed Apache platform, which comes with Hadoop, Spark, Storm or HBase. The platform has a standard and a premium tier, the latter including the option of running R Server, Microsoft’s enterprise solution for building and running R models at scale. Azure has a simple pricing model that is based on the number and type of nodes running. The node type governs the number of cores, RAM and disk space available on each node. HDInsight comes with a local HDFS and can also connect to Blob storage or Data Lake Store. Data stored on the local HDFS is lost when the cluster is shut down. Clusters can be automatically created and deleted on a schedule using PowerShell and Automation; alternatively, on-demand HDInsight clusters can be created for specific jobs invoked through Data Factory.

Azure Data Factory is a data orchestration service that is used to build data processing pipelines. Data Factory can read data from a range of Azure and third party data sources and, through Data Management Gateway, can connect to and consume on-premises data. Data Factory comes with a range of activities that can run compute tasks in HDInsight, Azure Machine Learning, stored procedures, Data Lake and custom code running on Batch.

Azure has recently added Azure Data Lake Analytics to its portfolio. Data Lake is a new serverless hyper-scale data storage and analytical platform (currently in preview). It has been designed to run analytical jobs of any scale without the need to provision or manage compute clusters. Workloads are optimised to run against data in Data Lake Store, a high performance HDFS compliant storage solution, but can also access data in Blob Storage, SQL Database and SQL Data Warehouse. The service is charged according to the number of analytical units (compute containers) used and jobs completed.

For processing realtime data Azure has Stream Analytics. Stream Analytics can process data from Blob storage or data streamed through Event Hubs and IoT Hub. Queries are written in a SQL-like language that supports time series based queries and can call into Azure Machine Learning to score data in realtime.
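To give a feel for what a time series based query over a stream computes, the sketch below averages readings over fixed, non-overlapping ("tumbling") windows in plain Python. This is not the Stream Analytics query language and does not call the service; the event data and function name are hypothetical and only illustrate the windowing idea.

```python
from collections import defaultdict


def tumbling_window_avg(events, window_seconds):
    """Group (timestamp, value) pairs into fixed, non-overlapping windows and average each."""
    buckets = defaultdict(list)
    for ts, value in events:
        # Every timestamp belongs to exactly one window, identified by its start time.
        window_start = (ts // window_seconds) * window_seconds
        buckets[window_start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


# Hypothetical readings as (seconds since start, temperature).
events = [(0, 20.0), (4, 22.0), (11, 30.0), (14, 34.0), (25, 18.0)]
averages = tumbling_window_avg(events, window_seconds=10)
print(averages)  # {0: 21.0, 10: 32.0, 20: 18.0}
```

In a streaming service the same grouping is applied continuously as events arrive, rather than over a list held in memory.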

Azure Machine Learning is a fully managed data science platform that is used to build and deploy powerful predictive and statistical models. Azure Machine Learning comes with a flexible UI canvas and a set of predefined modules that can be used to build and run powerful data science experiments. The platform comes with a series of predefined machine learning models and includes the ability to run custom R or Python code. R code can be packaged as a custom module that can then be reused between experiments if required. Trained models can be published as web services for consumption either as a realtime request/response API or for batch execution. Azure Machine Learning also comes with interactive Jupyter notebooks for recording and documenting lab notes.

Microsoft DeployR is another option when it comes to developing and operationalising high performance R models. It provides a set of web services that can be used to integrate R models into applications and is useful when control over the R execution environment and underlying compute nodes is required.

For dashboards and visualisations Azure has Power BI. Power BI can consume data from a range of Azure and third party services, and can also connect to on-premises data sources. Users can choose from a set of built-in visualisations, create their own or select from a custom visuals gallery. Power BI also allows users to run R scripts and embed R generated visuals. Recently Microsoft added Power BI Embedded, a new offering that allows Power BI driven content to be embedded within custom applications.

Azure Data Catalog is a registry of data assets within an organisation. Information is captured about each source including data structures and connection information. Technical and business users can then use Data Catalog to discover datasets and their intent.

Cognitive Services is a suite of ready-made intelligence APIs that make it easy to integrate advanced speech, vision and natural language capabilities into business solutions.

Google Cloud Platform

Cloud Dataproc is Google’s fully managed Hadoop and Spark offering. Google boasts an impressive 90 second lead time to start or scale Cloud Dataproc clusters, by far the quickest of the three providers. Pricing is based on the underlying Compute Engine costs plus an additional charge per vCPU per minute. An HDFS compliant connector is available for Cloud Storage that can be used to store data that needs to survive after the cluster has been shut down. There is no built-in support for on-demand clusters; however, full control over the cluster is available through the gcloud CLI, REST API or SDK, so this can be automated if required.
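The Dataproc pricing model described above (Compute Engine cost plus a charge per vCPU per minute) can be sketched as a small calculation. All rates below are hypothetical placeholders rather than Google's published prices, and the function and parameter names are ours.

```python
def dataproc_job_cost(nodes, vcpus_per_node, minutes,
                      compute_rate_per_node_hour, dataproc_rate_per_vcpu_minute):
    """Estimate cost as Compute Engine usage plus the Dataproc per-vCPU-minute charge."""
    compute = nodes * minutes * (compute_rate_per_node_hour / 60)
    premium = nodes * vcpus_per_node * minutes * dataproc_rate_per_vcpu_minute
    return compute + premium


# Hypothetical figures: five 4-vCPU nodes running a 30-minute job.
job_cost = dataproc_job_cost(
    nodes=5,
    vcpus_per_node=4,
    minutes=30,
    compute_rate_per_node_hour=0.20,       # assumed USD per node-hour
    dataproc_rate_per_vcpu_minute=0.0002,  # assumed USD per vCPU-minute
)
print(f"Estimated job cost: ${job_cost:.2f}")
```

Because the charge is metered per minute, short jobs on quickly provisioned clusters only pay for the minutes they actually run.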

Data processing pipelines can be built using Cloud Dataflow. Google has taken a different approach to AWS and Azure, both of which have gone with a declarative model that delegates processing work to other services such as Hadoop. Cloud Dataflow, on the other hand, provides a fully programmable framework, available for Java and Python, and a distributed compute platform. The programming model and SDK were recently submitted to the Apache Foundation and have become Apache Beam, which can use Cloud Dataflow as well as Spark for pipeline execution. Cloud Dataflow supports both batch and streaming workers. By default, the number of workers is pre-defined when the service is created, although batch workers have the option to auto-scale based on demand. The latest pricing is based on the aggregate CPU, memory and storage consumed, and varies according to whether batch or streaming workers are used.

Google offers Machine Learning as a fully managed platform for training and hosting TensorFlow models. It relies on Cloud Dataflow for data and feature processing and Cloud Storage for data storage. There is also Cloud Datalab, a lab notebook environment based on Jupyter. A set of pre-trained models is also available: Vision API detects features in images, such as text, faces or company logos; Speech API converts audio to text across a range of languages; Natural Language API can be used to extract meaning from text; and there is an API for translation. In addition, Google Cloud Prediction API sits somewhere in the middle, allowing users to easily train categorical or regression models depending on the nature of the training set. This simply requires users to upload a training dataset and specify an answer column to predict, and Prediction API will do the rest.

An obvious omission from Google’s stack is any form of business facing dashboards and visualisations. For this Google relies on a range of third party partners.

Conclusion

Cloud analytics is clearly a competitive space. It is becoming a critical component of modern business and a core capability that is driving cloud adoption. All three providers offer similar building blocks: data processing, data orchestration, streaming analytics, machine learning and visualisations.

AWS certainly has all the bases covered with a solid set of products that will meet most needs. Minor omissions include pre-trained machine learning models and managed lab notebooks but otherwise AWS scores highly across the board.

Azure offers a comprehensive and impressive suite of managed analytical products. They support open source big data solutions alongside new serverless analytical products such as Data Lake. Azure is very strong in the machine learning space, offering pre-trained models through to custom R models running over big data, and is the only provider to offer the capability for organisations to track and document their data assets.

Google provides its own twist on cloud analytics with its range of services. With Dataproc and Dataflow, Google has a strong core to its proposition. TensorFlow has been getting a lot of attention recently and there will be many who will be keen to see Machine Learning come out of preview. Google has a rich set of pre-trained APIs but lacks BI dashboards and visualisations.

Next up we will be looking at Internet of Things.

Cloud Comparison Poster

About the author

Jess has over 18 years’ experience helping companies succeed through the smart use of technology. He has spent most of his career working for leading Microsoft partners across the UK and Australia and is now Principal at endjin, working with clients to envision and execute disciplined innovation programmes. You can follow Jess on Twitter.