How to Succeed with Big Data on Azure
As your business grows and the amount of data you’re amassing grows with it, you’ll likely begin considering a move to the cloud to meet your big data analytics needs. With significant offerings available from AWS, Azure, and Google, it can be hard to know which is best for you. To help you make an informed decision, we’ll look at some services specific to Azure as well as some best practices for maximizing your system performance.
Why Use Azure?
First, big data analytics in the cloud grants the typical benefits of flexibility, scalability, increased performance, and reduced resource cost. Second, Azure has made an effort to make big data usable by businesses of any size, regardless of analytics expertise. Many of the services available are managed, and most can be used in combination with custom modeling and third-party tools, particularly those based on Hadoop. With these services, you can easily run real-time analytics on structured or unstructured data from a variety of sources.
If you’d prefer to outsource your analytics, you can take advantage of several different big data-as-a-service options, such as Cloudera, Qubole, and Cazena, available through Azure Marketplace.
Know Your Service Options
Microsoft offers over 50 services associated with big data, many more than can be covered here. Although not a comprehensive guide, this breakdown of Azure's more robust services should give you a better idea of what's available and some insight into which services are worth a closer look.
Your database options in Azure include self-managed Table storage, self-managed SQL servers hosted on VMs, and a range of managed databases, including SQL, PostgreSQL, MySQL, and MariaDB.
Azure offers a fully managed database service called Cosmos DB. Cosmos DB is an elastically scalable, low-latency service with global distribution and multi-master replication. It exposes APIs compatible with Cassandra, MongoDB, SQL, Gremlin, etcd, and Table, and includes support for both Apache Spark and Jupyter notebooks.
Azure also provides services and support for SQL data warehousing and data lake storage.
For analytics, the two most significant services are Azure Analysis Services and HDInsight.
Analysis Services is an enterprise-grade analytics engine as a service that lets you combine data from multiple sources into an easy-to-use BI semantic model. With it, you can incorporate pre-built database models and embed interactive reports and dashboards into your applications without having to code or maintain the analytics yourself.
Azure offers a host of AI and Machine Learning (ML) services, from speech recognition to video indexing, many of which can be directly integrated into your applications and services. For customized ML, however, your best option will probably be Azure Machine Learning. The service supports building, training, and deploying models at every skill level, from no-code (via a visual drag-and-drop interface) to code-first. It includes features for automated feature engineering, algorithm selection, and hyperparameter sweeping, and has built-in support for open-source tools and frameworks, including PyTorch, TensorFlow, ONNX, and scikit-learn.
Finally, for orchestrating your analytics system, Azure offers two main services: Data Factory and Data Catalog.
Data Factory is a serverless data integration service that works across on-premises and cloud data silos. It lets you construct Extract, Load, Transform (ELT) or Extract, Transform, Load (ETL) processes with or without scripts and includes more than 80 native connectors. Data Factory integrates with Azure Monitor, so you can track and manage pipeline performance as part of your CI/CD workflow, and pipelines can be automated with schedule, tumbling-window, or event-based triggers.
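To make the tumbling-window trigger concrete, here is a minimal sketch (plain Python, not the Data Factory API) of how such a trigger slices time into contiguous, non-overlapping windows, each of which would drive one pipeline run:

```python
from datetime import datetime, timedelta

def tumbling_windows(start, end, interval):
    """Yield contiguous, non-overlapping (window_start, window_end) pairs,
    mirroring how a tumbling-window trigger slices a time range."""
    current = start
    while current < end:
        window_end = min(current + interval, end)
        yield (current, window_end)
        current = window_end

# Three hours at a one-hour interval produce three back-to-back windows.
windows = list(tumbling_windows(
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 3, 0),
    timedelta(hours=1),
))
```

Because each window abuts the next with no gaps or overlaps, every slice of source data is processed exactly once, which is what makes tumbling windows well suited to incremental loads.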
Data Catalog is a fully managed service that assists in the discovery and understanding of data sources. It uses a crowdsourcing model of metadata and annotations that allows all users to contribute knowledge, making data more accessible to all through clear identification and searchable indexing.
Once you’re familiar with your service options, you’ll likely want to focus on optimizing your configuration to ensure you’re getting the benefits that big data in Azure can provide. The best practices covered here are a good place to start.
Use a Data Lake
Using a data lake allows you to store data in its native format, speeding up storage and avoiding potentially unnecessary processing. If you combine it with schema-on-read semantics, which apply a schema during processing rather than at storage time, you can avoid bottlenecks caused by data validation and type checking. A data lake also makes data more easily accessible to a variety of users, since you only need to grant them permissions on the lake rather than on individual services.
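The schema-on-read idea can be sketched in a few lines: raw records land untyped (even with fields missing), and types and defaults are applied only when the data is read. The field names here are illustrative, not from any particular dataset:

```python
import json

# Raw events stored as-is; note the second record is missing a field,
# which is fine because no schema is enforced at write time.
RAW_EVENTS = [
    '{"user": "alice", "amount": "12.50", "ts": "2024-01-01T00:00:00"}',
    '{"user": "bob", "amount": "7.25"}',
]

def read_with_schema(lines):
    """Apply types and defaults while reading, not while storing."""
    for line in lines:
        record = json.loads(line)
        yield {
            "user": str(record["user"]),
            "amount": float(record.get("amount", 0.0)),  # cast on read
            "ts": record.get("ts"),  # optional column, defaults to None
        }

rows = list(read_with_schema(RAW_EVENTS))
```

Validation cost is paid only by the jobs that actually consume the data, so ingestion stays fast and schema changes don't force a rewrite of what's already stored.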
Limit Your Liability
You can limit your liability with sensitive data if you scrub it of identifying information before storing it in your data lake. This will help ensure that you meet regulatory standards and reduce the amount of sensitive data that can be exposed should someone gain unauthorized access to your system.
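One common scrubbing approach is to replace identifying fields with a one-way hash before the record lands in the lake. This sketch uses a hypothetical field list and a hard-coded salt for illustration; in practice the salt would be a managed secret:

```python
import hashlib

PII_FIELDS = {"email", "name", "phone"}  # illustrative field names

def scrub(record, salt="example-salt"):  # real deployments use a secret salt
    """Replace identifying fields with a one-way hash so records remain
    joinable (same input -> same token) without exposing the raw value."""
    clean = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            digest = hashlib.sha256((salt + str(value)).encode()).hexdigest()
            clean[key] = digest[:16]
        else:
            clean[key] = value
    return clean

event = {"email": "alice@example.com", "plan": "pro"}
safe = scrub(event)
```

Hashing (rather than deleting) the fields keeps analytics such as per-user aggregation possible, since the same input always maps to the same token.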
You can further limit your liability by using Azure Backup, with perhaps a few extra precautions, to ensure that your data remains available and help protect you from data loss due to database corruption, human error, or criminal activity.
Maximize Parallelism
You can reduce overall job times by processing queries across multiple cluster nodes run in parallel. Separating your clusters by workload can give an additional boost, and using a distributed file system, like the Hadoop Distributed File System (HDFS), will let you optimize read/write performance.
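The split/scan/combine pattern behind parallel query processing can be sketched locally; here a thread pool stands in for cluster nodes (a real cluster would distribute the partitions across machines, but the shape of the computation is the same):

```python
from concurrent.futures import ThreadPoolExecutor

def count_matches(chunk, needle):
    """Per-node work: scan one partition of the data."""
    return sum(1 for row in chunk if needle in row)

data = [f"row-{i}" for i in range(1000)]
chunks = [data[i::4] for i in range(4)]  # split across 4 "nodes"

# Each worker scans its own partition independently.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(lambda c: count_matches(c, "row-99"), chunks))

# Combine the partial results, as a cluster scheduler would.
total = sum(partials)
```

Because each partition is scanned independently, the scan time shrinks roughly in proportion to the number of nodes, with only the cheap combine step running serially.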
If you partition data files and tables according to periods that match your processing schedules, you can simplify data ingestion, job scheduling, and troubleshooting. Additionally, partitioning tables used in Hive, U-SQL, or SQL queries can improve query performance.
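As a small illustration of schedule-aligned partitioning, the sketch below builds a Hive-style partition path (the `table` name and path layout are illustrative) so that a daily job reads and writes exactly one directory per run:

```python
from datetime import datetime

def partition_path(table, ts):
    """Build a Hive-style partition path (year=/month=/day=) so a daily
    job can target exactly one directory per scheduled run."""
    return f"{table}/year={ts.year}/month={ts.month:02d}/day={ts.day:02d}"

path = partition_path("sales", datetime(2024, 3, 7))
# "sales/year=2024/month=03/day=07"
```

Query engines that understand this layout can prune partitions, scanning only the directories a date filter actually touches instead of the whole table.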
Orchestrate Data Ingestion
Although some applications can write data for batch processing directly into blob containers for direct use with HDInsight or Azure Data Lake Analytics, most need to land in a data lake first. By orchestrating ingestion with a pipeline or workflow tool, like Azure Data Factory or Oozie, you can centralize the management of your data ingestion and simplify the process.
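The value of centralizing ingestion is that every source goes through the same land-and-record step. Here is a minimal sketch of that pattern in plain Python (the source names, the `lake` dict, and the helper functions are all illustrative stand-ins, not any orchestrator's API):

```python
lake = {}  # stand-in for data lake storage

def land_in_lake(source_name, payload):
    """Land the raw payload under a per-source path in the lake."""
    path = f"raw/{source_name}"
    lake[path] = payload
    return path

def ingest(source_name, fetch, land):
    """One centralized ingestion step: pull from a source, land the raw
    data, and record what happened for monitoring."""
    payload = fetch()
    path = land(source_name, payload)
    return {"source": source_name, "rows": len(payload), "path": path}

# Every source flows through the same step, so monitoring and retries
# live in one place instead of being duplicated per application.
runs = [
    ingest("crm", lambda: [{"id": 1}, {"id": 2}], land_in_lake),
    ingest("web_logs", lambda: [{"id": 3}], land_in_lake),
]
```

An orchestrator like Data Factory or Oozie plays the role of the loop here, adding scheduling, retries, and monitoring on top of the same basic shape.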
Process Data In-Place
Switching from ETL to ELT will grant you better overall performance and reduce the amount of maintenance you need to perform. By loading all of your data before transforming it, you ensure that your data is always available, reduce processing time, and take better advantage of the scalability of cloud services.
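The ELT ordering can be shown in a few lines: data lands raw first, and transformation runs afterward against what is already stored. The two in-memory lists are stand-ins for the lake's raw and curated zones:

```python
raw_zone = []      # stand-in for the lake's raw zone
curated_zone = []  # stand-in for the transformed/curated zone

def load(records):
    """Extract-Load: land data untouched, so ingestion never blocks on
    transformation logic and the raw data is always available."""
    raw_zone.extend(records)

def transform():
    """Transform in place, after the data is already safely stored,
    using whatever compute the platform can scale out."""
    curated_zone.clear()
    for r in raw_zone:
        curated_zone.append({
            "user": r["user"].lower(),
            "amount": round(r["amount"], 2),
        })

load([{"user": "Alice", "amount": 12.499}, {"user": "BOB", "amount": 7.0}])
transform()
```

If the transformation logic changes, only `transform()` is rerun; the raw zone is untouched, which is where the reduced maintenance comes from.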
The large selection of services offered, in combination with the general benefits of operating in the cloud, namely scalability and reduced cost, makes Azure an appealing choice for working with big data. Whether you’re looking for a way to simplify running analytics or you’ve simply outgrown your on-premises resources, the services and best practices covered here should give you a better idea of big data processes in Azure and how to make sure you’re getting the most from your specific configuration.