
Tuesday, July 17, 2012

Run Sqoop on Amazon Elastic MapReduce (EMR) with Amazon RDS

Amazon EMR doesn't come with Sqoop installed, but it is possible to run Sqoop on Amazon EMR.  The following blog post shows how to install and run it:

http://blog.kylemulka.com/2012/04/how-to-install-sqoop-on-amazon-elastic-map-reduce-emr/

However, the solution isn't perfect, since the input files are usually in S3 and Sqoop doesn't support S3 directly.  Here is my script to install Sqoop and export data from S3 to Amazon RDS (MySQL):

#!/bin/bash
BUCKET_NAME=zli-emr-test
SQOOP_FOLDER=sqoop-1.4.1-incubating__hadoop-0.20
SQOOP_TAR=$SQOOP_FOLDER.tar.gz

##change to home directory
cd ~

##Install sqoop on emr
hadoop fs -copyToLocal s3n://$BUCKET_NAME/$SQOOP_TAR $SQOOP_TAR
tar -xzf $SQOOP_TAR

##Install the JDBC driver (e.g. mysql-connector-java.jar) into the sqoop lib folder
hadoop fs -copyToLocal s3n://$BUCKET_NAME/mysql-connector-java-5.1.19.jar ~/$SQOOP_FOLDER/lib/

##Copy the input file from S3 to HDFS
HADOOP_INPUT=hdfs:///user/hadoop/myinput
hadoop distcp s3://$BUCKET_NAME/myinput $HADOOP_INPUT

##Export the HDFS data into the MySQL table on RDS
~/$SQOOP_FOLDER/bin/sqoop export \
  --connect jdbc:mysql://RDS-Host-name:3306/DB_NAME \
  --username USERNAME --password PASSWORD \
  --table TABLE_NAME \
  --export-dir $HADOOP_INPUT \
  --input-fields-terminated-by '\t'
The script assumes that the Sqoop tarball and mysql-connector-java.jar are already in the S3 bucket, and that the input file is in S3 as well.
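
For reference, here is a minimal sketch of how those prerequisites could be staged in S3 before launching the cluster, assuming s3cmd is installed and configured locally and the input is a single file (the bucket and file names simply mirror the variables used in the script above):

#!/bin/bash
##Hypothetical staging step: upload the Sqoop tarball, the JDBC driver and the input data
BUCKET_NAME=zli-emr-test
s3cmd put sqoop-1.4.1-incubating__hadoop-0.20.tar.gz s3://$BUCKET_NAME/
s3cmd put mysql-connector-java-5.1.19.jar s3://$BUCKET_NAME/
s3cmd put myinput s3://$BUCKET_NAME/myinput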

Note that RDS needs to be configured to allow access to the database from the following two EC2 security groups:
ElasticMapReduce-master
ElasticMapReduce-slave
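
Once those security groups are authorized, a quick sanity check is to SSH into the EMR master node and connect with the plain MySQL client. This is just a sketch; it assumes the mysql client is available on the node, and the host, user and database names are the same placeholders used in the Sqoop command above:

##Run on the EMR master node; it will prompt for the password
mysql -h RDS-Host-name -P 3306 -u USERNAME -p DB_NAME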

Tuesday, July 10, 2012

Run a custom jar with Elastic MapReduce (EMR) on the command line

I followed the instructions to run a custom jar on EMR:

http://aws.amazon.com/articles/3938

I got stuck at step 5:
5. Run the job flow.
 $ ./elasticmapreduce-client.rb RunJobFlow streaming_jobflow.json 
I couldn't find the file "elasticmapreduce-client.rb" at all.  After some online searching, I got it to work.  The correct command is:
./elastic-mapreduce --create --json path/to/your/flow
Here is what my flow file looks like:
   [
      {
         "Name": "Custom Jar Grep Example 1",
         "ActionOnFailure": "CONTINUE",
         "HadoopJarStep":
         {
            "Jar": "s3n://YOUR_BUCKET/hadoop-examples-0.20.2-cdh3u4.jar",
                       ##"MainClass": "fully-qualified-class-name", 
            "Args":
            [
               "grep",
               "s3n://YOUR_BUCKET/input/example",
               "s3n://YOUR_BUCKET/output/example",
               "dfs[a-z.]+"
            ]
         }
      }
   ]
If your jar's manifest doesn't specify a main class, you can also add an optional "MainClass" field (a fully-qualified class name) next to "Jar"; note that JSON doesn't allow comments, so you can't simply comment the field out in the file.  The flow corresponds to the following Hadoop command:
hadoop jar hadoop-examples-0.20.2-cdh3u4.jar grep input output 'dfs[a-z.]+'
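
Once the step finishes, the grep results end up under the S3 output path given in "Args".  A minimal sketch for inspecting them, assuming you run it on the EMR master node (or anywhere hadoop is configured with your AWS credentials):

##List and print the result files written by the grep example
hadoop fs -ls s3n://YOUR_BUCKET/output/example/
hadoop fs -cat s3n://YOUR_BUCKET/output/example/part-*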

Some useful tips:

1. Show the logs of a job flow:
 ./elastic-mapreduce --jobflow JOB_ID --logs
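
2. Check on job flows (if I remember the Ruby CLI's options correctly, --list shows recent job flows and --describe dumps the details of one):
 ./elastic-mapreduce --list
 ./elastic-mapreduce --describe --jobflow JOB_ID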