http://blog.kylemulka.com/2012/04/how-to-install-sqoop-on-amazon-elastic-map-reduce-emr/
However, the solution isn't perfect since the input files are usually in S3 and Sqoop doesn't support S3 directly. Here is my script to install sqoop and export data from S3 to Amazon RDS (mysql):
#!/bin/bash
BUCKET_NAME=zli-emr-test
SQOOP_FOLDER=sqoop-1.4.1-incubating__hadoop-0.20
SQOOP_TAR=$SQOOP_FOLDER.tar.gz
##change to home directory
cd ~
##Install sqoop on emr
hadoop fs -copyToLocal s3n://$BUCKET_NAME/$SQOOP_TAR $SQOOP_TAR
tar -xzf $SQOOP_TAR
##Install jdbc driver (ex mysql-connection-java.jar) to sqoop lib folder
hadoop fs -copyToLocal s3n://$BUCKET_NAME/mysql-connector-java-5.1.19.jar ~/$SQOOP_FOLDER/lib/
##Copy input file from S3 to hdf
HADOOP_INPUT=hdfs:///user/hadoop/myinput
hadoop distcp s3://$BUCKET_NAME/myinput $HADOOP_INPUT
~/$SQOOP_FOLDER/bin/sqoop export --connect jdbc:mysql://RDS-Host-name:3306/DB_NAME --username USERNAME --password PASSWORD --table TABLE_NAME --export-dir $HADOOP_INPUT --input-fields-terminated-by='\t'The script assumes that sqoop tar ball and mysql-connector-java.jar are in S3 bucket, as well as the input file are in S3 too.
Note that, RDS needs to be configured to allow access to the database with the following 2 EC2 security groups:
ElasticMapReduce-master
ElasticMapReduce-slave
Thank you for your article
ReplyDeleteShould one define HADOOP_INPUT or is it expected to be defined when the script runs? What should be the value?
Thank you
HADOOP_INPUT should be defined in the script. The format was messed up when I did copy and paste. I just fixed the format.
Delete