# hdinsight-java-storm-eventhub

A basic example of reading and writing to Azure Event Hub from an Apache Storm topology written in Java. It also demonstrates how to write from Storm topologies to WASB, which is the default storage for Azure HDInsight clusters.

Note: This repo is no longer maintained, as I haven't worked with or had access to HDInsight for several years now. Archiving.

This example demonstrates how to read from and write to Azure Event Hub using an Apache Storm topology (written in Java) on an Azure HDInsight cluster.

## What does it do?

com.microsoft.example.EventHubWriter writes random data to an Azure Event Hub. The data is generated by a spout and consists of a random device ID and device value, so it simulates some hardware that emits a string ID and a numeric value.
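
The spout itself is straightforward. Here is a minimal sketch of such a random-device spout; the class and field names are hypothetical and the repo's actual spout may differ:

    import java.util.Map;
    import java.util.Random;
    import java.util.UUID;

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;

    // Hypothetical spout that emits a random device ID and value,
    // like the one feeding EventHubWriter.
    public class DeviceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private Random random;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
            this.random = new Random();
        }

        @Override
        public void nextTuple() {
            // Simulate a hardware device: a string ID and a numeric value.
            String deviceId = UUID.randomUUID().toString();
            int deviceValue = random.nextInt(100);
            collector.emit(new Values(deviceId, deviceValue));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("deviceId", "deviceValue"));
        }
    }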

com.microsoft.example.EventHubReader reads data from Event Hub (the data written by EventHubWriter) and stores it in HDFS (WASB in this case, since this was written and tested with Azure HDInsight) in the /devicedata directory.
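
At a high level, the reader topology is just the Event Hub spout wired to a bolt that writes to HDFS/WASB. A rough sketch of that wiring follows; the spout and bolt are passed in as parameters because their exact construction depends on the Event Hub spout and HDFS bolt versions, so check the repo source for the real classes:

    import backtype.storm.generated.StormTopology;
    import backtype.storm.topology.IRichBolt;
    import backtype.storm.topology.IRichSpout;
    import backtype.storm.topology.TopologyBuilder;

    // Hypothetical helper showing how a reader topology like EventHubReader is typically wired.
    public class ReaderTopologySketch {
        public static StormTopology build(IRichSpout eventHubSpout, IRichBolt hdfsBolt, int partitionCount) {
            TopologyBuilder builder = new TopologyBuilder();
            // One spout task per Event Hub partition.
            builder.setSpout("eventhub-spout", eventHubSpout, partitionCount);
            // The HDFS bolt writes the deviceId,deviceValue records under /devicedata on WASB.
            builder.setBolt("hdfs-bolt", hdfsBolt, partitionCount)
                   .shuffleGrouping("eventhub-spout");
            return builder.createTopology();
        }
    }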

The data format in Event Hub is a JSON document with the following format:

    { "deviceId": "unique identifier", "deviceValue": some value }
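
For example, the writer side can produce that message body with plain string formatting. This is a hypothetical helper; the repo may build the JSON differently:

    // Hypothetical helper: format one device reading as the JSON document described above.
    static String toJson(String deviceId, int deviceValue) {
        return String.format("{ \"deviceId\": \"%s\", \"deviceValue\": %d }", deviceId, deviceValue);
    }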

The reason it's stored as JSON is compatibility. I recently ran into someone who wasn't formatting data sent to Event Hub as JSON (it came from a Java application) and was reading it into another Java app. That worked fine. Then they wanted to replace the reading component with a C# application that expected JSON. Problem! Always store to a nice format that is future-proofed in case your components change.

## What do I need in order to use this?

  • A working Java JDK install and tools

  • Apache Maven

  • An Azure Event Hub with two shared access policies: one that has listen permissions, and one that has write (send) permissions. I will refer to these as "reader" and "writer", which is what I named mine.

  • The policy keys for the "reader" and "writer" policies

  • The name of your Event Hub.

  • The Service Bus namespace that your Event Hub was created in.

  • The number of partitions available with your Event Hub

  • The Azure Storage account that is the default storage for your HDInsight cluster

  • The access key for the storage account

  • The container name that is the default storage for your HDInsight cluster

You can find all of this in the Azure Portal by looking through the configuration for your HDInsight cluster and your Event Hub. For the storage account, once you find the name/container in the HDInsight configuration, you can go to the storage account configuration to find the access key.

## How do I use this?

  1. Fork & clone the repository so you have a local copy.

  2. Use the following to install a couple of components from the /lib directory of this repo into your local Maven repository. These are required for communicating with Azure Event Hubs and the WASB storage used by HDInsight. Eventually these will be included in the Hadoop/Storm bits on Maven, but they aren't yet.

     mvn -q install:install-file -Dfile=lib/eventhubs/eventhubs-storm-spout-0.9.3-jar-with-dependencies.jar -DgroupId=com.microsoft.eventhubs -DartifactId=eventhubs-storm-spout -Dversion=0.9.3 -Dpackaging=jar
    
     mvn -q org.apache.maven.plugins:maven-install-plugin:2.5.2:install-file -Dfile=lib/hadoop/hadoop-azure-3.0.0-SNAPSHOT.jar
     
     mvn -q org.apache.maven.plugins:maven-install-plugin:2.5.2:install-file -Dfile=lib/hadoop/hadoop-client-3.0.0-SNAPSHOT.jar
     
     mvn -q org.apache.maven.plugins:maven-install-plugin:2.5.2:install-file -Dfile=lib/hadoop/hadoop-hdfs-3.0.0-SNAPSHOT.jar
    
     mvn -q org.apache.maven.plugins:maven-install-plugin:2.5.2:install-file -Dfile=lib/hadoop/hadoop-common-3.0.0-SNAPSHOT.jar -DpomFile=lib/hadoop/hadoop-common-3.0.0-SNAPSHOT.pom
    
     mvn -q org.apache.maven.plugins:maven-install-plugin:2.5.2:install-file -Dfile=lib/hadoop/hadoop-project-dist-3.0.0-SNAPSHOT.pom -DpomFile=lib/hadoop/hadoop-project-dist-3.0.0-SNAPSHOT.pom
     
     mvn -q org.apache.maven.plugins:maven-install-plugin:2.5.2:install-file -Dfile=lib/hadoop/hadoop-project-3.0.0-SNAPSHOT.pom -DpomFile=lib/hadoop/hadoop-project-3.0.0-SNAPSHOT.pom
     
     mvn -q org.apache.maven.plugins:maven-install-plugin:2.5.2:install-file -Dfile=lib/hadoop/hadoop-main-3.0.0-SNAPSHOT.pom -DpomFile=lib/hadoop/hadoop-main-3.0.0-SNAPSHOT.pom
    

    NOTE: If you're using PowerShell, you may have to put the -Dfoo=bar parameters in quotes. The whole thing: "-Dfoo=bar".

    Also note that I got these files from https://github.com/hdinsight/hdinsight-storm-examples, so you should look there for the latest versions.

  3. Add the Event Hub configuration to the /conf/EventHubs.properties file. This is used to configure the spout that reads from Event Hub and the bolt that writes to it (see the sketch after this list).

  4. Add the storage account information to the /conf/core-site.xml file. This is used to tell the HDFS bolt how to talk to HDInsight WASB, which is backed by Azure Storage (see the sketch after this list).

  5. Use mvn package to build everything.

  6. Use the EventHubExample-1.0-SNAPSHOT.jar in the /target folder with your HDInsight cluster. See https://azure.microsoft.com/en-us/documentation/articles/hdinsight-storm-deploy-monitor-topology/ for the steps to deploy, run, and monitor this on the cluster.

    Generally speaking, you want to use the web form to upload the package, and tell it to run com.microsoft.example.EventHubWriter, with a friendly name of 'writer' for the optional parameter. This will start the writer up, which will send messages to Event Hub.

    Then, use the form to run the package you just uploaded again, this time using com.microsoft.example.EventHubReader, with a friendly name of 'reader' for the optional parameter. This will start the reader, which will read messages from Event Hub and write them to WASB storage.
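
To give a rough idea of what steps 3 and 4 involve, here are sketches of the two config files. The key names below are illustrative; keep the key names already present in the repo's /conf/EventHubs.properties and /conf/core-site.xml rather than copying these verbatim.

    # EventHubs.properties sketch: connection settings for the Event Hub spout,
    # using the "reader" policy. The writer bolt's policy name and key go in the
    # same file under their own keys.
    eventhubspout.username = reader
    eventhubspout.password = <reader policy key>
    eventhubspout.namespace = <Service Bus namespace>
    eventhubspout.entitypath = <Event Hub name>
    eventhubspout.partitions.count = <number of partitions>

For core-site.xml, the HDFS bolt talks to WASB through hadoop-azure, which reads the storage account key from a property of this shape (replace the account name and key with your own):

    <!-- core-site.xml sketch: the key for the storage account backing WASB -->
    <configuration>
      <property>
        <name>fs.azure.account.key.YOURACCOUNT.blob.core.windows.net</name>
        <value>YOUR_STORAGE_ACCOUNT_KEY</value>
      </property>
    </configuration>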

## How do I know it's working?

You can use the Storm UI from the Storm dashboard on your cluster to monitor the topologies. If they have instances started and the acked, emitted, and transferred counts are climbing, things are working. Any errors will show up here as well.

Since we're writing data to WASB, you can also verify that data is actually getting written by using Hive. From the Storm Dashboard, click the Query Console tab at the top and run the following Hive statements:

    create external table devicedata (deviceid string, devicevalue int) row format delimited fields terminated by ',' stored as textfile location 'wasb:///devicedata/';
    select * from devicedata limit 10;

That defines a table over the data (an external table, so it just applies structure to the data already in that folder) and then returns 10 rows.

## How real world is this?

Since it's an example, there are some things that you might want to tweak. Notably, it has no error checking for bad data arriving in Event Hub. It also uses a 20 KB size for the files written to WASB storage. So I would recommend adding error checking and figuring out what the ideal file write size is for you.
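
If the repo's HDFS bolt is the standard storm-hdfs HdfsBolt (check the source to confirm), the file size is controlled by the rotation policy, so tuning it looks roughly like this; the 64 MB value and the wasb:/// URL are assumptions you would adjust for your cluster:

    import org.apache.storm.hdfs.bolt.HdfsBolt;
    import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
    import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
    import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
    import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
    import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;

    public class HdfsBoltSketch {
        // Sketch: rotate output files at 64 MB instead of 20 KB (pick whatever size suits your workload).
        static HdfsBolt buildDeviceDataBolt() {
            return new HdfsBolt()
                    .withFsUrl("wasb:///")   // assumption: the cluster's default filesystem is its WASB container
                    .withFileNameFormat(new DefaultFileNameFormat().withPath("/devicedata/"))
                    .withRecordFormat(new DelimitedRecordFormat().withFieldDelimiter(","))
                    .withSyncPolicy(new CountSyncPolicy(1000))
                    .withRotationPolicy(new FileSizeRotationPolicy(64.0f, Units.MB));
        }
    }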
