-------------------------------------------------------------
::  Table of Contents
-------------------------------------------------------------

1)  Overview
     a.  What is Tusk?
	
     b.  What is needed to run tusk?
	
     c.  Current Known Bugs/Workarounds
	
     d.  Current Installed Software Versions
	 
2)  Setting up your Development Environment
	a.  Maven

	b.  Git
		1. Installing Git
		2. Registering a GitHub Account
		3. Generating SSH Key
		4. Configuring your Git Settings
		5. Installing egit and m2e git scm connector for your IDE
		6. Importing the git repository into your IDE

	c.  JBoss Installation
		  1. JDK
		  2. Apache Ant
		  3. EDSP, SOA-P & BRMS

	d.  Tomcat Setup

	e.  Hadoop (HDFS, HBase, MapReduce, Hive) Setup
		  1. Hadoop
		  2. Hadoop Services
		  3. HBase DDL
		  4. Hive DDL

	f.  Cassandra Setup

3)  Building
	a.  Installing Artifacts

	b.  Deploying Artifacts

4)  Running Tusk
	

-------------------------------------------------------------
:: 1) Overview
-------------------------------------------------------------

:: (A) What is Tusk?
-------------------------------------------------------------
Tusk is an open-source project that integrates JBoss Middleware technologies with various Big Data technologies.  Its purpose is to:

     - House bindings between various JBoss & Big Data tools.
	 
     - Provide demo environments that show what becomes possible when you integrate enterprise-class middleware with industry-leading Big Data tools.
	 
     - Provide ideas for how JBoss technologies can augment Big Data tools, and how Big Data tools can augment JBoss.
	 	 
:: (B) What is needed to run tusk?
-------------------------------------------------------------
To run Tusk in a "real" environment, you need a running SOA-P server (with Enterprise Data Services Platform (EDSP) installed) and either a running Hadoop cluster or a running Cassandra cluster.  This currently applies to the unit tests as well, although one of the first tasks will be to provide embedded Hadoop (HBase, ZooKeeper) and Cassandra servers.  Once the embedded versions are used to validate functionality, the standalone versions of these servers are adequate until the move to production or until performance testing is done.


:: (C) Current Known Bugs/Workarounds
-------------------------------------------------------------
Due to unresolved classloading conflicts, the current version of Tusk cannot deploy the ISPN-integration jar file within SOA-P.  Therefore, the ISPN integration code runs inside a RESTful web service war file in Tomcat.  When the SOA-P service needs to write the message index to the Infinispan cache, it makes a REST call to Tomcat.  This will hopefully be resolved fairly soon.

* As of 10/8/2011, there is an exception when the service handles the message.
This is because Infinispan requires a newer version of JBoss Logging (3.0) than
the one that comes with EDSP 5.1. It shouldn't be too hard to fix the deployment so that
the ESB uses the deployed version of JBoss Logging (e.g. jboss-logging-3.0.0.GA.jar)
instead of the one packaged in EDSP, which is probably version 2.1 (unconfirmed).


:: (D) Current Installed Software Versions
-------------------------------------------------------------
JBoss SOA-P                                     5.2
JBoss Enterprise Data Services Platform         5.2
JBoss Business Rules Management System          5.2
Apache Ant                                      1.8.x
Apache Maven                                    3.x
Apache Tomcat                                   6.x
JDK                                             1.6.x

Hadoop (Cloudera Distribution for Hadoop)       0.20
HBase (Cloudera Distribution for Hadoop)        0.20
Hive (Cloudera Distribution for Hadoop)         0.20

Apache Cassandra                                1.0
------------------------------
For Development Environments:
------------------------------
Eclipse Helios (or a later version)
JBoss Developer Studio
-------------------------------------------------------------

*Note: if you only want to run against Hadoop or Cassandra, you don't have to install the other one.


-------------------------------------------------------------
:: 2) Setting up your Development Environment
-------------------------------------------------------------

Eclipse Helios was used to develop the Tusk project so it (or a later version) is recommended for development.  JBoss Developer Studio will also work. 


:: (A) Maven Installation
-------------------------------------------------------------
For either environment, both the M2Eclipse (m2e) plugin and Maven need to be installed.

Maven can be installed via the command line with the following command:

"sudo yum install maven"

*Note: You can find a sample Maven settings.xml file in the conf directory.  It must have a jboss.soa.path variable that points to the jboss-as directory of your SOA-P installation.  


:: (B) Git
-------------------------------------------------------------
Tusk uses Git for version control.  The git command line client must be installed in order to pull down the code repository.  You can follow http://help.github.com/linux-set-up-git/ or use the steps below to install git.  The most important parts of this installation are the SSH setup and your local config settings for name, email and GitHub username.  If you are using the TuskVM Virtual Machine, you do not need to perform the SSH config steps because a key already exists in ~/.ssh/.  You will still need to set your git name, email and GitHub username.


::  [1] Installing Git
-------------------------------------------------------------
To install git use the following command:

"sudo yum install git bash-completion"

::  [2] Registering a GitHub Account
-------------------------------------------------------------

If you do not already have a GitHub account, navigate to https://github.com and register for one.  Make a note of your login credentials, as you will need them in the following step.

::  [3] Generating SSH Key
-------------------------------------------------------------
Once you have successfully registered for GitHub, you need to set up your SSH key and link it to the GitHub account you just created.  To generate a new SSH key use the following command:

ssh-keygen -t rsa -C "your_email@redhat.com"

Navigate back to the GitHub website and log in to the account you created in step [2].  Select 'Account Settings' --> 'SSH Keys' --> 'Add SSH Key'

In a terminal window type 'cat ~/.ssh/id_rsa.pub' to view your public SSH key.  Copy and paste this output into the key field and click 'Add Key' to complete this step.

To test that you have successfully added your key use the following command:

ssh -T git@github.com

After answering yes to the host-key prompt, if GitHub responds that you are authenticated, you can move on to configuring your git settings.


::  [4] Configuring your Git Settings
-------------------------------------------------------------

To configure git, open terminal and type the following commands:

git config --global user.name "FirstName LastName"
git config --global user.email "your_email@redhat.com"
git config --global github.user "your_github_username"
git config --global push.default tracking
git config --global branch.autosetuprebase always
git config --global branch.autosetupmerge true

Now log back in to your GitHub account.  Navigate to 'Account Settings' (middle icon, top right).  On the left-hand side, click 'Account Settings'.  Copy the API token listed there to your clipboard.  Paste your token in place of the dummy token string in the command below.

git config --global github.token 0123456789yourf0123456789token


To clone the Tusk repo, navigate to the directory where you would like the repository to be copied (such as /home/user/Development) and use the following command:

git clone https://github.com/jboss-tusk/tusk.git <FolderName you Want>

**You will also need to install the egit plugin for eclipse as well as the maven m2e scm connector to work with git inside of your IDE.

::  [5] Installing egit and m2e git scm connector for your IDE
-------------------------------------------------------------

            Eclipse Helios Instructions:
-------------------------------------------------------------
-  Install egit (team provider) plugin from Eclipse Marketplace
	* Help -> Eclipse Marketplace -> Eclipse Marketplace
	* Change "All categories" dropdown to "SCM"
	* Click on "Browse for more solutions" link
	* Search on "egit"
	* Install the "EGit - Git Team Provider" plugin
	
-  Install m2e git scm connector
	* Go to File -> New -> Other
	* Choose Maven -> Checkout Maven Projects from SCM
	* Click on the link to find more SCM connectors from the m2e marketplace
	* Type "egit" in the filter
	* Install the m2e-egit connector


            JBoss Developer Studio Instructions:
-------------------------------------------------------------

	<TO BE FILLED OUT >

::  [6] Importing the git repository into your IDE
-------------------------------------------------------------
Now that we have our Git code repository pulled down locally we need to import it into the IDE you installed above.  This can be done via the following:

1. Open Eclipse Helios/JBoss Developer Studio
2. File -> Import
3. Git-Projects from Git
4. Choose "uri"
5. Paste "git@github.com:jboss-tusk/tusk.git" into URI field
6. Choose http for protocol
7. Next, Next, Next
8. Choose "Import as general project"
9. Finish
10. Right-click project root directory and choose "Configure -> Convert to Maven Project"
11. Copy tusk/conf/example-settings.xml into ~/.m2/settings.xml
12. Make sure that the soa.path property in ~/.m2/settings.xml is a valid path to your SOA-P

* To add the project's modules (e.g. esb-integration) as Eclipse projects, do the following:
	1. Import
	2. Maven - Existing Maven Project
	3. Browse to find the module within the tusk project. This should be in /home/tusk/git/tusk
	4. Click OK/Next/Finish until you are done

* Note:  You can perform steps 8-12 for all of the modules except conf and bin to treat each as its own project (i.e., so you can do Maven builds on just one module instead of the entire Tusk trunk).  This is not required though.

In order to push to master (assuming you have appropriate read+write permissions) you should set up SSH within your IDE.  An SSH key was created during the git setup above, and the IDE needs to be told to use that private key.  This can be done via the following:

1.  Window -> Preferences Menu
2.  General -> Network Connections -> SSH2
3.  Add the private key that you created during GitHub setup.

:: (C) JBoss Installation
-------------------------------------------------------------
The following will guide you through the installation of the required JBoss components.  

::  [1] JDK
-------------------------------------------------------------
You can download the JDK RPM installer for the latest JDK 6 from http://www.oracle.com/technetwork/java/javase/downloads/index.html.

* At the time of writing this, the latest JDK 6 rpm installation for 64-bit machines is:

	jdk-6u31-linux-x64.bin

1.  Execute the following command:  "sudo chmod 755 jdk-6u31-linux-x64.bin"
2.  Execute the following command:  "sudo ./jdk-6u31-linux-x64.bin"
3.  Now we need to set the Oracle JDK to be the default java via:

     sudo alternatives --install /usr/bin/java java /usr/java/jdk1.6.0_31/jre/bin/java 20000
     sudo alternatives --install /usr/bin/javaws javaws /usr/java/jdk1.6.0_31/jre/bin/javaws 20000
     sudo alternatives --install /usr/lib/mozilla/plugins/libjavaplugin.so libjavaplugin.so /usr/java/jdk1.6.0_31/jre/lib/amd64/libnpjp2.so 20000
     sudo alternatives --install /usr/bin/javac javac /usr/java/jdk1.6.0_31/bin/javac 20000
     sudo alternatives --install /usr/bin/jar jar /usr/java/jdk1.6.0_31/bin/jar 20000

::  [2] Apache Ant
-------------------------------------------------------------
Depending on your Linux flavor and version, Apache Ant might actually already be installed -- in which case you could skip this step.  To check, enter the following command:

	"sudo yum list ant"

If it is not installed, continue with the following steps to install Apache Ant.

1.  Download the latest Apache Ant from http://ant.apache.org/bindownload.cgi
2.  Unzip the file downloaded into /opt.
3.  Create an ANT_HOME environment variable by adding the following lines to /etc/profile

        export ANT_HOME=/opt/apache-ant-1.8.3
        export PATH=$PATH:$ANT_HOME/bin

::  [3] EDSP, SOA-P & BRMS
-------------------------------------------------------------
Red Hat JBoss Enterprise Data Services Platform (EDSP) and SOA-P will need to be downloaded from the Red Hat Customer Access Portal.  This site is accessible via <INSERT THE SITE NAME HERE>.

The following files will need to be downloaded to accomplish this step:

            SOA-P:   soa-p-5.2.0.GA.zip
            EDSP:    eds-p-5.2.0.GA.zip
            BRMS:    brms-p-5.2.0.GA-standalone.zip

1.  Start by opening up terminal and unzipping soa-p-5.2.0.GA.zip into the JBoss installation directory (referred to as $JBOSS_HOME).  For example, unzip into /usr/local/jboss/ to have $JBOSS_HOME=/usr/local/jboss/jboss-soa-p-5
2.  Unzip eds-p-5.2.0.GA.zip into $JBOSS_HOME
3.  Change directory into $JBOSS_HOME/eds.
4.  Run the following command: "ant"
     - Choose the "default" server profile.
     - Install Apache CXF
5.  Unzip jboss-brms.war into the $JBOSS_HOME/jboss-as/server/default/deploy directory.
6.  Change directory to $JBOSS_HOME/jboss-as/server/default/conf/props.
7.  Edit soa-users.properties.
     -  admin = admin
     -  user1=password
8.  Edit soa-roles.properties.
     -  admin=JBossAdmin,HttpInvoker,user,admin
     -  user1=admin,JBossAdmin,READWRITE
9.  Edit teiid-security-users.properties.
     -  user1=password
10. Edit teiid-security-roles.properties.
     -  admin=admin
     -  user1=audit,log
11. Change directory to $JBOSS_HOME/jboss-as/bin
12. Start EDSP via the following command:
      "./run.sh -c <profile>"

:: (D) Tomcat Installation
-------------------------------------------------------------
Tomcat can be installed anywhere.  Either unzip the distribution and run it on demand, or install it as a service.  The easiest way to install Tomcat is via the command line:

"sudo yum install tomcat"

The most important thing is to change the configuration to use port 8888 for HTTP and port 8889 for AJP.  To do this, edit the conf/server.xml file and change the HTTP and AJP connectors.  If you installed Tomcat via yum, the file is located at /etc/tomcat/server.xml.
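
As a quick sanity check after editing server.xml, the sketch below reports whether a connector is declared on each of the two ports named above.  The helper names has_connector_port and check_tusk_ports are hypothetical and not part of Tusk or Tomcat.  The check is a plain text match on the port attribute, so it would also match a connector that sits inside an XML comment.

```shell
#!/bin/sh
# has_connector_port FILE PORT   (hypothetical helper)
# Succeeds if FILE contains the attribute port="PORT".
# The match is case-sensitive, so redirectPort="..." is not counted,
# but a commented-out connector would be.
has_connector_port() {
    grep -q "port=\"$2\"" "$1"
}

# check_tusk_ports FILE   (hypothetical helper)
# Prints one line per port this README asks for: 8888 (HTTP) and 8889 (AJP).
check_tusk_ports() {
    for port in 8888 8889; do
        if has_connector_port "$1" "$port"; then
            echo "port $port: configured"
        else
            echo "port $port: missing"
        fi
    done
}
```

For a yum-installed Tomcat this could be run as "check_tusk_ports /etc/tomcat/server.xml".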

:: (E) Hadoop (HDFS, HBase, MapReduce, Hive) Setup
-------------------------------------------------------------
Once Hadoop is installed the main config files for the Hadoop services are at:

     /etc/hadoop/conf
     /etc/hbase/conf
     /etc/zookeeper
     /etc/hive/conf

::  [1] Hadoop
-------------------------------------------------------------
To install the Hadoop services perform the following steps in order:

wget http://archive.cloudera.com/redhat/cdh/cdh3-repository-1.0-1.noarch.rpm
sudo yum --nogpgcheck localinstall cdh3-repository-1.0-1.noarch.rpm
sudo rpm --import http://archive.cloudera.com/redhat/cdh/RPM-GPG-KEY-cloudera
sudo yum install hadoop-0.20-conf-pseudo
sudo yum install hadoop-hbase

*Note:  This causes Hadoop Zookeeper to be installed in the latest version, so you might not have to install that below.  It will tell you if it is already installed when you try.

sudo yum install hadoop-hive
sudo vim /etc/security/limits.conf
       hdfs  -    nofile   32768
       hbase -    nofile   32768

sudo vim /etc/alternatives/hadoop-etc/conf/hdfs-site.xml
      <property>
         <name>dfs.datanode.max.xcievers</name>
         <value>4096</value>
      </property>

::  [2] Hadoop Services
-------------------------------------------------------------
sudo service hadoop-0.20-namenode start
sudo service hadoop-0.20-secondarynamenode start
sudo service hadoop-0.20-datanode start
sudo service hadoop-0.20-tasktracker start
sudo service hadoop-0.20-jobtracker start
sudo yum install hadoop-hbase-master

sudo vim /etc/hbase/conf/hbase-site.xml
   <configuration>
      <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
      </property>
      <property>
        <name>hbase.rootdir</name>
        <value>hdfs://localhost/hbase</value>
        <description>The directory shared by RegionServers.</description>
      </property>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
        <description>The replication count for HLog and HFile storage. Should not be greater than HDFS datanode count.</description>
      </property>
      <property>
        <name>hbase.zookeeper.property.maxClientCnxns</name>
        <value>200</value>
        <final>true</final>
      </property>
   </configuration>

sudo yum install hadoop-zookeeper-server
sudo service hadoop-zookeeper-server start
sudo vim /etc/zookeeper/zoo.cfg

     Change the following line in this file:
     -  "server.0=localhost:2888:3888"   --->   "server.0=<HOST_NAME>:2888:3888"
     
     * NOTE: <HOST_NAME> must be a valid host name that resolves to localhost in /etc/hosts.
     * You can add a host name to your hosts file if necessary.  It CANNOT be "localhost" or 
     * "127.0.0.1"

     Add the following line in at the bottom of this file:
           maxClientCnxns=200
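
Before restarting ZooKeeper it can help to confirm that the chosen host name satisfies the note above.  The sketch below is one way to check; the function name valid_zk_host is hypothetical, and it assumes that "resolves to localhost" means the name appears on a 127.x.x.x line of the hosts file.

```shell
#!/bin/sh
# valid_zk_host HOSTNAME [HOSTS_FILE]   (hypothetical helper)
# Succeeds if HOSTNAME is neither "localhost" nor "127.0.0.1" and is
# listed after a 127.x.x.x address in HOSTS_FILE (default: /etc/hosts).
valid_zk_host() {
    host=$1
    hosts_file=${2:-/etc/hosts}

    # The README forbids these two values outright.
    case $host in
        localhost|127.0.0.1) return 1 ;;
    esac

    # Skip comment lines, then look for the name among the aliases
    # of any loopback entry.
    awk -v h="$host" '
        /^[ \t]*#/ { next }
        $1 ~ /^127\./ { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$hosts_file"
}
```

For example, "valid_zk_host tuskhost && echo ok" prints "ok" only if /etc/hosts contains a line such as "127.0.0.1 localhost tuskhost" (tuskhost is a made-up name).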

sudo service hadoop-hbase-master start
sudo yum install hadoop-hbase-regionserver
sudo service hadoop-hbase-regionserver start

mkdir ~/bin

Now navigate to the bin directory within the tusk project you pulled down earlier from git.  Copy the *-tusk.sh scripts into the ~/bin directory you just created.  Make sure their mode is 744.

Also make sure that your ~/bin directory is on your path.  You can do this by adding the following lines to your ~/.bash_profile
        PATH=$PATH:$HOME/bin
        export PATH
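
The copy, chmod and PATH steps above can be sketched as two small shell functions.  The function names are hypothetical and not part of the Tusk scripts; path_append only adds the directory when it is not already listed, so sourcing the profile twice does not duplicate the entry.

```shell
#!/bin/sh
# install_tusk_scripts SRC_DIR DEST_DIR   (hypothetical helper)
# Copies every *-tusk.sh script from SRC_DIR into DEST_DIR with mode 744.
install_tusk_scripts() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    for script in "$src"/*-tusk.sh; do
        [ -e "$script" ] || continue      # the glob matched nothing
        cp "$script" "$dest/" || return 1
        chmod 744 "$dest/${script##*/}" || return 1
    done
}

# path_append DIR   (hypothetical helper)
# Appends DIR to PATH unless it is already listed.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;                      # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}
```

Assuming the repository was cloned to ~/git/tusk, usage would look like "install_tusk_scripts ~/git/tusk/bin ~/bin" followed by "path_append "$HOME/bin"; export PATH".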

sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod g+w /tmp
sudo -u hdfs hadoop fs -mkdir /user/hive
sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse
sudo -u hdfs hadoop fs -chmod g+w /user/hive/warehouse

sudo vim /etc/hive/conf/hive-site.xml; add the following, updating paths as necessary for your configuration:

       <property>
         <name>hive.aux.jars.path</name>
         <value>file:///usr/lib/hive/lib/hive-hbase-handler-0.7.1-cdh3u1.jar,file:///usr/lib/zookeeper/zookeeper-3.3.3-cdh3u1.jar,file:///usr/lib/hbase/hbase-0.90.3-cdh3u1.jar,file:///usr/lib/hbase/lib/guava-r06.jar</value>
       </property>

::  [3] HBase DDL
-------------------------------------------------------------
The following steps will help you to setup HBase DDL.

Run the following commands to create the HBase structures:

         $ hbase shell
         > create 'messages' , 'data' , 'metadata'
         > create 'message-index' , 'fields'

::  [4] Hive DDL
-------------------------------------------------------------
**TODO:  We need to validate these as they are probably wrong

The following steps will help you set up Hive DDL

Run the following commands to create the Hive structures:

         $ hive
         > create external table hbase_message_index(key string, diseases string, groupId string, patientId string, planId string, state string) stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with SERDEPROPERTIES ("hbase.columns.mapping" = ":key,fields:diseases,fields:groupId,fields:patientId,fields:planId,fields:state")  TBLPROPERTIES("hbase.table.name" = "message-index");

** NOTE:  Sometimes the hadoop-hbase-master, hadoop-hbase-regionserver, and/or hadoop-zookeeper-server daemons die and have to be started again.  If Hive or HBase is not working, the first thing to do is make sure these daemons are running via the following commands:

         $  sudo service hadoop-zookeeper-server restart
         $  sudo service hadoop-hbase-master restart
         $  sudo service hadoop-hbase-regionserver restart
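
Instead of restarting all three daemons unconditionally, a small wrapper can restart only those that are down.  This is a sketch under one assumption: that each init script's "status" action exits non-zero when its daemon is not running.  The function name ensure_running and the SERVICE_CMD variable are hypothetical.

```shell
#!/bin/sh
# SERVICE_CMD is the command used to control daemons (hypothetical variable).
# It is overridable so the function can be tried without root access.
SERVICE_CMD=${SERVICE_CMD:-"sudo service"}

# ensure_running NAME...   (hypothetical helper)
# Restarts each named service whose "status" action reports failure.
# Assumes a non-zero exit code from "status" means the daemon is down.
ensure_running() {
    for svc in "$@"; do
        if $SERVICE_CMD "$svc" status >/dev/null 2>&1; then
            echo "$svc: running"
        else
            echo "$svc: not running, restarting"
            $SERVICE_CMD "$svc" restart
        fi
    done
}
```

To check the daemons in the same order as the commands above: "ensure_running hadoop-zookeeper-server hadoop-hbase-master hadoop-hbase-regionserver".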

:: (F) Cassandra Setup
-------------------------------------------------------------

You can install Cassandra anywhere and run it on-demand or as a service.

If running Tusk against Cassandra, you must start the Cassandra Cluster before running the application.  Do this on a single node by running the "sudo cassandra" command in the cassandra/bin directory.  The keyspaces for the Infinispan cachestore will be created automatically if they are not there so no need to create them manually.  You must create the keyspace for the main message data store, which is "TuskData" by default (see the KeyspaceDefinition lines of code in BigDataExtractor.java in the esb-integration project).  The ispn-integration/src/main/resources/cassandra-schema.txt file contains the commands to create the Cassandra schema.

The current implementation requires a running Cassandra server.  There is an embedded Cassandra server that can be used for testing.  Check out the Cassandra cachestore module in the Infinispan codebase for the code.  If we use it we would have to pre-create the keyspace and column family automatically each time the embedded server is started.  The Infinispan Cassandra cachestore automatically creates a keyspace so we can use that as a guide.


-------------------------------------------------------------
:: 3) Building
-------------------------------------------------------------
Tusk uses Maven for builds and dependency management.  

:: (A) Installing Artifacts
-------------------------------------------------------------

To install the artifacts into the local repository without executing the unit tests use the following command:

        mvn -Dmaven.test.skip.exec=true clean install

To install the artifacts into the local repository and execute the unit tests:

        mvn clean install

:: (B) Deploying Artifacts
-------------------------------------------------------------
Once the artifacts are built, deploy them via the following steps:

1.  Start SOA-P
       ie:  ~/java/jboss-soa-p-5/jboss-as/bin/run.sh

2.  Deploy the BRMS rules
       a.  Go to http://localhost:8080/jboss-brms and log on with admin/admin.  Do not install samples.
       b.  Click on 'Administration' in the left hand accordion menu, then click on 'Import - Export'.
       c.  Browse to import ~/git/tusk/esb-integration/src/test/resources/repository_export.xml
       d.  For future rule/model changes, deploy those directly in the package.

3.  Copy the esb-integration-*.esb file into the SOA-P deploy directory.
       ie:  cp ~/git/tusk/esb-integration/target/esb-integration-0.0.1-SNAPSHOT.esb ~/java/jboss-soa-p-5/jboss-as/server/default/deploy/

4.  Copy the TuskUI.war file into the Tomcat webapps directory (must do this as root)
       ie:  sudo cp ~/git/tusk/TuskUI/target/TuskUI.war /var/lib/tomcat/webapps/

5.  Start Tomcat
       ie:  sudo service tomcat start


You can then verify that the artifacts were deployed properly by doing the following:

1.  Go to the JMX console (http://localhost:8080/jmx-console) and find the BigDataMessengerManagementBean (the second one, with the longer name) and invoke the stubMessages operation to add a message into the data intake pipeline.

2.  View the SOA-P log file to verify that the message was handled properly.  You will see something like:

        14:17:08,881 INFO    [BigDataExtractor]  Extracted index: id[0]
        14:17:08,891 INFO    [BigDataExtractor]  Extracted index: addressLine[210 N. Church Street]

3.  Copy the ~/.m2/repository/org/infinispan/infinispan-cachestore-hbase/5.2.0-SNAPSHOT directory to ~/.m2/repository/org/infinispan/infinispan-cachestore-hbase/5.1.0-SNAPSHOT

4.  Rename the ~/.m2/repository/org/infinispan/infinispan-cachestore-hbase/5.1.0-SNAPSHOT/*-5.2.0* files to *-5.1.0*

      NOTE:  Yes, this is making a copy of the 5.2 version as 5.1.  It's a hack, but it should work without having to backport the hbase cachestore to 5.1.

      NOTE:  The configured maven repositories do not have jbossesb-rosetta.jar, so do the following:

      Copy /path/to/jboss-soa-p-5/jboss-as/server/production/deployers/esb.deployer/lib/jbossesb-rosetta.jar    TO   ~/.m2/repository/org/jboss/soa/jbossesb-rosetta/5.2.0.SOA/jbossesb-rosetta-5.2.0.SOA.jar
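
The rename in step 4 can be done in one pass with a shell loop.  The sketch below uses a hypothetical function, rename_version, that replaces the first occurrence of one version string with another in every matching file name inside a directory.

```shell
#!/bin/sh
# rename_version DIR FROM TO   (hypothetical helper)
# Renames every file in DIR whose name contains "-FROM" so that the
# first "-FROM" becomes "-TO".  Files without "-FROM" are left alone.
rename_version() {
    dir=$1
    from=$2
    to=$3
    for f in "$dir"/*-"$from"*; do
        [ -e "$f" ] || continue           # the glob matched nothing
        name=${f##*/}                     # strip the directory part
        prefix=${name%%-"$from"*}         # text before the first "-FROM"
        suffix=${name#*-"$from"}          # text after the first "-FROM"
        mv "$f" "$dir/$prefix-$to$suffix" || return 1
    done
}
```

For the directory named in step 4, usage would be "rename_version ~/.m2/repository/org/infinispan/infinispan-cachestore-hbase/5.1.0-SNAPSHOT 5.2.0 5.1.0".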

-------------------------------------------------------------
:: 4) Running Tusk
-------------------------------------------------------------
There are convenience scripts inside the tusk/bin directory that start/stop/restart (some of) the Tusk services in the correct order.  Currently these only cover the Hadoop services.  After running the start or restart script, give the services about a minute to initialize before attempting to use them.

These scripts can be updated to manage SOA-P, Tomcat, and Cassandra if necessary.

**IMPORTANT:  The Tusk application can be run against either HBase or Cassandra.  To change which data store it runs against, update the TuskConfiguration.java class in the common module and then redeploy (or re-run the unit tests).  The dataStore field contains the data store to run against.

TODO:  Update this to read from a config file or run.conf
