Create a secure serverless streaming pipeline with Amazon MSK Serverless and Amazon EMR

Streaming data is everywhere, and organizations increasingly depend on it to run their businesses. It arrives from sources such as social media, IoT sensors, and user interactions, and real-time analytics turn it into insights that help companies make informed decisions, spot trends early, and stay competitive.

Many streaming applications use Apache Kafka for data ingestion and Apache Spark Structured Streaming for processing. However, connecting and securing these components is complex and requires specialized skills. A managed, serverless approach simplifies the setup and makes these components easier to integrate.

Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a managed service that simplifies data ingestion and processing by removing the work of managing clusters and scaling operations. Amazon MSK Serverless adds integration with AWS Identity and Access Management (IAM) for access control, so you no longer have to manage certificates and keys for client authentication. When a client connects to the cluster, MSK Serverless uses IAM to verify the client's identity and permissions, so that only authorized clients can read or write data.
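To make the client side of IAM authentication concrete, here is a minimal sketch of the Kafka client properties a Spark job would typically pass when connecting to the cluster. It assumes the open source aws-msk-iam-auth library is on the job's classpath. The function name and the bootstrap address are hypothetical placeholders, and port 9098 is the port Amazon MSK uses for IAM-authenticated connections.

```python
def msk_iam_kafka_options(bootstrap_servers: str) -> dict:
    """Return Spark Kafka-source options for IAM authentication.

    Assumption: the aws-msk-iam-auth JAR is available to the Spark job.
    Spark forwards every option prefixed with "kafka." to the Kafka client.
    """
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        # IAM authentication runs over SASL on a TLS-encrypted connection.
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.mechanism": "AWS_MSK_IAM",
        "kafka.sasl.jaas.config": (
            "software.amazon.msk.auth.iam.IAMLoginModule required;"
        ),
        "kafka.sasl.client.callback.handler.class": (
            "software.amazon.msk.auth.iam.IAMClientCallbackHandler"
        ),
    }


# Placeholder address; use the bootstrap string shown for your own cluster.
options = msk_iam_kafka_options(
    "boot-example.kafka-serverless.us-east-2.amazonaws.com:9098"
)
for key, value in sorted(options.items()):
    print(f"{key}={value}")
```

No certificate or key material appears in these settings. The client signs its connection request with the IAM credentials available in its environment, such as the job's execution role.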

To process the data, Amazon EMR Serverless running a Spark application built on the Spark Structured Streaming framework provides near real-time processing. This setup reads large data volumes from MSK Serverless and uses IAM authentication to secure the connection.
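As a rough sketch of how such a Spark application might be submitted to EMR Serverless, the function below wraps the `start_job_run` call of the boto3 `emr-serverless` client. The function name and all argument values are hypothetical. Setting `executionTimeoutMinutes` to 0 is intended to let a streaming job run without a time limit, which is worth confirming against the current EMR Serverless documentation.

```python
def submit_streaming_job(emr, application_id, execution_role_arn,
                         entry_point, args=(), spark_params=None):
    """Submit a Spark job to an EMR Serverless application.

    `emr` is expected to be a boto3 "emr-serverless" client. It is passed
    in so the function is easy to exercise with a stand-in object.
    """
    spark_submit = {
        "entryPoint": entry_point,            # for example, an S3 URI of a script
        "entryPointArguments": list(args),
    }
    if spark_params:
        # Extra spark-submit flags, such as --jars for the Kafka connector.
        spark_submit["sparkSubmitParameters"] = spark_params

    response = emr.start_job_run(
        applicationId=application_id,
        executionRoleArn=execution_role_arn,
        jobDriver={"sparkSubmit": spark_submit},
        # Assumption: 0 disables the default timeout for long-running jobs.
        executionTimeoutMinutes=0,
    )
    return response["jobRunId"]
```

In practice, `emr` would come from `boto3.client("emr-serverless")`, and the execution role is the IAM identity that MSK Serverless checks when the job connects.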

This post walks through an end-to-end solution for processing data from MSK Serverless with an EMR Serverless Spark Streaming job, secured with IAM authentication. It also shows how to query the processed data with Amazon Athena, so you can analyze the most recent data from the pipeline in near real time.

To get started, you need an AWS account with billing enabled and an IAM user with permissions to create and manage VPCs, subnets, security groups, IAM roles, Amazon EC2 instances, MSK Serverless clusters, EMR Serverless applications, and Amazon S3 buckets. An IAM user with administrator access works for this tutorial, but it is good practice to follow the principle of least privilege and create custom IAM policies scoped to what each task needs.
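As one illustration of least privilege, the sketch below builds a policy document that lets a Kafka consumer connect to one cluster, read one topic, and use one consumer group, and nothing else. The function name and the ARNs are hypothetical placeholders. The action names follow the `kafka-cluster` permissions used by IAM access control for Amazon MSK, and are worth verifying against the current documentation before use.

```python
import json


def kafka_consumer_policy(cluster_arn: str, topic_arn: str, group_arn: str) -> dict:
    """Build a read-only IAM policy for a single topic and consumer group."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Allow the client to open a connection to the cluster.
                "Effect": "Allow",
                "Action": ["kafka-cluster:Connect"],
                "Resource": [cluster_arn],
            },
            {   # Allow reading from one topic only.
                "Effect": "Allow",
                "Action": ["kafka-cluster:DescribeTopic", "kafka-cluster:ReadData"],
                "Resource": [topic_arn],
            },
            {   # Allow joining one consumer group and committing offsets.
                "Effect": "Allow",
                "Action": ["kafka-cluster:DescribeGroup", "kafka-cluster:AlterGroup"],
                "Resource": [group_arn],
            },
        ],
    }


# Placeholder ARNs for illustration only.
policy = kafka_consumer_policy(
    "arn:aws:kafka:us-east-2:111122223333:cluster/example-cluster/EXAMPLE-UUID",
    "arn:aws:kafka:us-east-2:111122223333:topic/example-cluster/EXAMPLE-UUID/example-topic",
    "arn:aws:kafka:us-east-2:111122223333:group/example-cluster/EXAMPLE-UUID/example-group",
)
print(json.dumps(policy, indent=2))
```

A producer would need a similar policy with `kafka-cluster:WriteData` on the topic in place of the read and group permissions.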

In this tutorial, we’ll create the necessary resources in the us-east-2 Region using AWS CloudFormation templates. The detailed steps on configuring your resources and implementing the solution are provided in the following sections.

The process begins with setting up an MSK Serverless cluster with IAM authentication. A Python script, producer.py, then runs on an Amazon EC2 instance and produces sample data to a Kafka topic in the cluster. A Spark Streaming job consumes the data from the Kafka topic, stores it in Amazon S3, and creates a corresponding table in the AWS Glue Data Catalog. The job runs continuously so that the stored data stays current with the stream, and it uses checkpointing so that processing can resume after a failure.
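The consume-and-store step can be sketched roughly as follows. This is not the job used in this post, only a minimal outline under a few assumptions: the function name, topic, and paths are hypothetical, the output format is Parquet, and registering the table in the Glue Data Catalog is left out. `kafka_options` is expected to carry the bootstrap address and the IAM authentication settings.

```python
def start_stream(spark, kafka_options, topic, output_path, checkpoint_path):
    """Read a Kafka topic as a stream and append it to S3 as Parquet.

    `spark` is expected to be an active SparkSession.
    """
    reader = spark.readStream.format("kafka")
    for key, value in kafka_options.items():
        reader = reader.option(key, value)

    raw = (
        reader.option("subscribe", topic)
        # Start from the oldest available records on the first run.
        .option("startingOffsets", "earliest")
        .load()
    )

    # Kafka delivers key and value as binary, so cast them to strings.
    events = raw.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "timestamp",
    )

    return (
        events.writeStream.format("parquet")
        .option("path", output_path)
        # Offsets and progress are saved here so a restarted job can
        # resume where the previous run stopped.
        .option("checkpointLocation", checkpoint_path)
        .outputMode("append")
        .start()
    )
```

In a real job, `spark` would come from `SparkSession.builder.getOrCreate()`, and calling `awaitTermination()` on the returned query keeps the job running. Keeping the checkpoint location stable across restarts is what lets the job pick up after a failure.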

For data analysis, you can use Athena, a serverless query service that runs interactive SQL directly against data in Amazon S3 with no infrastructure to manage. Together, these services form a pipeline that is serverless from ingestion to analysis and secured with IAM.
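Queries can also be run programmatically. The sketch below is one possible way to do that with the boto3 Athena client: it starts a query, polls until the query finishes, and returns the first page of results. The function name, database, and output location are hypothetical, and pagination of large result sets is omitted.

```python
import time


def run_athena_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Run a SQL query in Athena and return the first page of rows.

    `athena` is expected to be a boto3 "athena" client.
    """
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes result files to this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")

    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    # Each cell is a dict; NULL values have no "VarCharValue" key.
    return [[cell.get("VarCharValue") for cell in row["Data"]] for row in rows]
```

With a client created by `boto3.client("athena")`, a call such as `run_athena_query(client, "SELECT COUNT(*) FROM example_table", "example_db", "s3://example-bucket/athena-results/")` returns a list of rows, where the first row of a SELECT result holds the column names.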