Friday, December 7, 2012
1. Find the total line count of certain files in a folder:
find test -iname '*.java' | xargs wc -l | tail -1
2. Count the number of files in each folder, recursively:
for t in `find . -type d -ls | awk '{print $11}'`; do echo "$t `find "$t" -type f | wc -l`" ; done > ~/count.txt
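Both commands break on paths that contain spaces (the first can be hardened with find -print0 | xargs -0). A more robust sketch of the counting loop, writing to the same example output file:
#!/bin/bash
# Count files under each directory, recursively; null-delimited names survive spaces.
find . -type d -print0 | while IFS= read -r -d '' dir; do
  printf '%s %d\n' "$dir" "$(find "$dir" -type f | wc -l)"
done > ~/count.txt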
Thursday, November 1, 2012
Some svn commands
Create tag from trunk or branch:
svn copy http://svn.mydomain.com/repository/myproject/trunk \
         http://svn.mydomain.com/repository/myproject/tags/release-1.0 \
         -m "Tagging the 1.0 release."
The tag created is a snapshot of the trunk at the time "svn copy" is executed.
To be more precise, a revision number can be passed to "svn copy":
svn copy -r 12345 http://svn.mydomain.com/repository/myproject/trunk \
         http://svn.mydomain.com/repository/myproject/tags/release-1.0 \
         -m "Tagging the 1.0 release."
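To verify that the tag was created, list the tags directory (same hypothetical repository URL as above):
svn list http://svn.mydomain.com/repository/myproject/tags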
Merge from Trunk to a branch:
1. Check out the branch (assume the path is /path/to/mybranch).
2. Go to that folder.
3. Run the following command:
svn merge http://svn.mydomain.com/repository/myproject/trunk .
It will merge all changes from trunk into mybranch since the last merge.
The above command is the same as:
svn merge -rLastMergedRevision:HEAD http://svn.mydomain.com/repository/myproject/trunk .
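The last merged revision is recorded in the svn:mergeinfo property (Subversion 1.5+), so you can look it up before merging (the output shown is hypothetical):
cd /path/to/mybranch
svn propget svn:mergeinfo .
# e.g. /myproject/trunk:196340-196401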
View merge history:
"svn log" doesn't display the merge history. It only shows the merge commit:
svn log
------------------------------------------------------------------------
r196402 | liz | 2012-11-01 10:39:13 -0700 (Thu, 01 Nov 2012) | 1 line

Merging r196340 through r196401
------------------------------------------------------------------------
r196340 | liz | 2012-10-31 14:52:06 -0700 (Wed, 31 Oct 2012) | 1 line

development branch for new feature
If you need to see the merge history, use the --use-merge-history (-g) option with svn log:
svn log -g
------------------------------------------------------------------------
r196402 | liz | 2012-11-01 10:39:13 -0700 (Thu, 01 Nov 2012) | 1 line

Merging r196340 through r196401
------------------------------------------------------------------------
r196388 | xyz | 2012-11-01 09:50:28 -0700 (Thu, 01 Nov 2012) | 2 lines
Merged via: r196402

Added new unit tests
------------------------------------------------------------------------
r196340 | liz | 2012-10-31 14:52:06 -0700 (Wed, 31 Oct 2012) | 1 line

development branch for new feature
Tuesday, July 17, 2012
Run Sqoop on Amazon Elastic MapReduce (EMR) with Amazon RDS
Amazon EMR doesn't come with Sqoop installed, but it is possible to run Sqoop on EMR. The following blog shows how to install and run it:
http://blog.kylemulka.com/2012/04/how-to-install-sqoop-on-amazon-elastic-map-reduce-emr/
However, the solution isn't perfect: the input files are usually in S3, and Sqoop doesn't support S3 directly. Here is my script to install Sqoop and export data from S3 to Amazon RDS (MySQL):
#!/bin/bash
BUCKET_NAME=zli-emr-test
SQOOP_FOLDER=sqoop-1.4.1-incubating__hadoop-0.20
SQOOP_TAR=$SQOOP_FOLDER.tar.gz
##change to home directory
cd ~
##Install sqoop on emr
hadoop fs -copyToLocal s3n://$BUCKET_NAME/$SQOOP_TAR $SQOOP_TAR
tar -xzf $SQOOP_TAR
##Install JDBC driver (e.g. mysql-connector-java.jar) into the sqoop lib folder
hadoop fs -copyToLocal s3n://$BUCKET_NAME/mysql-connector-java-5.1.19.jar ~/$SQOOP_FOLDER/lib/
##Copy input file from S3 to HDFS
HADOOP_INPUT=hdfs:///user/hadoop/myinput
hadoop distcp s3://$BUCKET_NAME/myinput $HADOOP_INPUT
~/$SQOOP_FOLDER/bin/sqoop export --connect jdbc:mysql://RDS-Host-name:3306/DB_NAME --username USERNAME --password PASSWORD --table TABLE_NAME --export-dir $HADOOP_INPUT --input-fields-terminated-by '\t'
The script assumes that the Sqoop tarball and mysql-connector-java.jar are in the S3 bucket, and that the input file is in S3 as well.
Note that RDS needs to be configured to allow database access from the following two EC2 security groups:
ElasticMapReduce-master
ElasticMapReduce-slave
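Once the export finishes, a quick sanity check that the rows arrived (a sketch, reusing the script's placeholder host, credentials, and table; mysql will prompt for the password):
mysql -h RDS-Host-name -u USERNAME -p DB_NAME -e "SELECT COUNT(*) FROM TABLE_NAME;"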
Tuesday, July 10, 2012
Run custom jar with elastic map reduce (EMR) on command line
I followed these instructions to run a custom jar on EMR:
http://aws.amazon.com/articles/3938
I got stuck with step 5:
5. Run the job flow.
$ ./elasticmapreduce-client.rb RunJobFlow streaming_jobflow.json
I couldn't find the file "elasticmapreduce-client.rb" at all. After some online searching, I got it to work. The correct command is:
./elastic-mapreduce --create --json path/to/your/flow
Here is what my flow file looks like:
[
  {
    "Name": "Custom Jar Grep Example 1",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep":
    {
      "Jar": "s3n://YOUR_BUCKET/hadoop-examples-0.20.2-cdh3u4.jar",
      ##"MainClass": "fully-qualified-class-name",
      "Args":
      [
        "grep",
        "s3n://YOUR_BUCKET/input/example",
        "s3n://YOUR_BUCKET/output/example",
        "dfs[a-z.]+"
      ]
    }
  }
]
The flow corresponds to the following hadoop command:
hadoop jar hadoop-examples-0.20.2-cdh3u4.jar grep input output 'dfs[a-z.]+'
Some useful tips:
1. Show the logs:
./elastic-mapreduce --jobflow JOB_ID --logs
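2. To find the JOB_ID in the first place, the same CLI can list your job flows (as far as I recall the Ruby client supports --list; check your version):
./elastic-mapreduce --list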
Labels: command line, custom jar, elastic map reduce, emr, hadoop
Friday, June 29, 2012
mvn test: LocalJobRunner.java with java.lang.OutOfMemoryError
When I ran mvn test to unit test some Hadoop-related code, I got the following error:
java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:949)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
I tried to set MAVEN_OPTS=-Xmx2048m, but it didn't work: Surefire forks a separate JVM for the tests by default, so MAVEN_OPTS never reaches the test JVM. After some online research, I fixed the problem by adding the following to the pom:
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<configuration>
<argLine>-Xms256m -Xmx512m</argLine>
</configuration>
</plugin>
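Note that Surefire exposes argLine as a user property, so when the pom doesn't hard-code it, the same setting can also be passed on the command line:
mvn test -DargLine="-Xms256m -Xmx512m"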
Labels: Java heap space, maven, maven-surefire-plugin, mvn, OutOfMemoryError, plugin, test, unit test
Thursday, June 28, 2012
maven shade plugin: Invalid signature file digest for Manifest main attributes
If you get the following error message with maven shade plugin:
Exception in thread "main" java.lang.SecurityException: Invalid signature file digest for Manifest main attributes
You need to add the following to pom.xml:
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
Explanation:
The above configuration excludes all files in META-INF ending with .SF, .DSA, and .RSA, for all artifacts (*:*), when creating the uber-jar.
The java.lang.SecurityException is raised because some dependency jars are signed. A jar is signed using jarsigner, which creates two additional files and places them in META-INF:
- a signature file, with a .SF extension, and
- a signature block file, with a .DSA, .RSA, or .EC extension.
See the jarsigner documentation for a detailed explanation of the JAR Signing and Verification Tool.
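If the error persists after shading, it can help to check whether any signature files made it into the uber-jar (a sketch; the jar path is just an example):
unzip -l target/myapp-shaded.jar | grep -E 'META-INF/.*\.(SF|DSA|RSA)'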