Tuesday, April 9, 2013

Riak: configure memory_backend

Riak supports several backends:
1. bitcask
2. memory
3. leveldb
4. multi

My recent project required me to use the memory backend.  I have a test server with 24GB of memory, and I tried to use 4GB as the max_memory.  It seemed quite easy at the beginning: I just followed the doc from basho.com and replaced riak_kv_bitcask_backend with riak_kv_memory_backend:

{riak_kv, [
    {storage_backend, riak_kv_memory_backend},
    ...
]},
Then I wanted to see how much data I could load with the default configuration.  I loaded a little over 11 million records, and Riak crashed.  So I set max_memory to 4GB by adding the following right under the {eleveldb, ...} section:

{memory_backend, [
        ...,
            {max_memory, 4096}, %% 4GB in megabytes
        ...
]}
 
(The snippet above is copied from Riak's doc.)  Interestingly, Riak crashed again at the same number of records (11M+).  I then changed max_memory to 8192, but Riak still crashed at the same point.  I spent a couple of days and couldn't figure out the reason, so I went to the #riak IRC channel and asked; the folks there were very responsive and solved my problem right away:


[10:30] <zhentao> hi folks, how to configure the max_memory for memory_backend?
[10:30] <@alexmoore> Hi zhentao
[10:30] <zhentao> Hi
[10:30] <zhentao> I added the following to config file:
[10:30] <zhentao> {memory_backend, [              {max_memory, 8192} %% 8GB in megabytes             ]},
[10:31] <zhentao> but it didn't work
[10:31] <@alexmoore> In regards to your earlier question, if you don't specify a max_memory, or a TTL, the memory backend will continue to grow until it runs out of memory.
[10:31] <@alexmoore> Let me look at your config here
[10:31] <zhentao> it seems like that
[10:32] <zhentao> the node crashed after I loaded some data
[10:32] <zhentao> then I specify the max_memory to 8gb, and it still crashed with same amount of data
[10:32] <@alexmoore> How much RAM does the machine have?
[10:33] <sully_> Cluster health seems like it's degrading again.  We're starting to see the same errors as before.
[10:34] <zhentao> 24 GB
[10:34] <sully_> We are planning to add 3 more nodes.
[10:34] <@evanmcc> zhentao: that's 8GB per vnode
[10:34] <@rzezeski> sully_ ok, can you tar.gz the latest log files again, that might allow me to find the cause before it gets rotated out by the logger
[10:34] <@jcaprice> zhentao: did you restart the node after adding the memory constraint?
[10:35] <sully_> Getting you the latest logs.
[10:35] <zhentao> yes, I restarted it
[10:35] <@evanmcc> zhentao: how many nodes, and what is your ring size?
[10:36] <@alexmoore> zhentao, how many physical machines do you have in your cluster, and what is your ring size?
[10:36] <zhentao> it is a test server and just one machine
[10:36] <@jcaprice> ring size?
[10:37] <zhentao> i am new to riak, and where to find th ring size?
[10:37] <@evanmcc> if you didn't set it, it's 64
[10:37] <@alexmoore> In your vm.args file
[10:37] <@evanmcc> so you're limiting memory to 8GB * 64
[10:38] <zhentao> let me check
[10:38] <@evanmcc> alexmoore: it's in app.config under riak_core
[10:38] <@evanmcc> ring_creation_size
[10:38] <@alexmoore> Whoops, make that the app.config
[10:39] <zhentao> it is 64
[10:39] <zhentao> {ring_creation_size, 64},
[10:39] <@evanmcc> so you want to change max memory to 128
[10:39] <@evanmcc> if you want to limit it at 8GB
[10:40] <@jcaprice> zhentao: max_memory limits the amount of memory used per vnode, not for the node itself
[10:40] <zhentao> @jcaprice, so what number I should use for max_memory for my test server?
[10:41] <@evanmcc> 128, like I said above
[10:41] <zhentao> 128 mb?
[10:41] <@evanmcc> yes
[10:41] <@jcaprice> as evanmcc said, you'll want 8192 / 64
[10:41] <zhentao> thx, let me try it
 
So in summary, {max_memory, 4096} means 4096MB for each vnode, not for each machine.  Since I wanted to limit the memory usage on one machine to 4GB, I should use the following:
 
{max_memory, 64}
 
The reason is that ring_creation_size is 64, which means there are 64 vnodes on my single test server. 
 
64MB * 64 = 4096MB
 
After I changed max_memory to 64, Riak was happy: it didn't crash when I loaded 20MM records.
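
To sanity-check the numbers on a node, something like this works (a quick sketch; the /etc/riak/app.config path assumes a packaged install, so adjust it for your setup):

# ring_creation_size lives in app.config under riak_core (64 by default)
grep ring_creation_size /etc/riak/app.config
# per-vnode max_memory in MB = desired total in MB / ring size
echo $(( 4096 / 64 ))   # prints 64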
 
Some interesting things I noticed:
 
1. Since Riak uses LRU eviction for the memory backend, the oldest records are evicted once max_memory can't hold them all.
 
2. With a key like "70f21766-1cde-38d4-5920-a380003723b3" and a value like "1365720608095#TQB:1.4:2:1":
 
4GB of memory with 64 vnodes can hold about 2.5MM records
4GB of memory with 4 vnodes can hold about 7MM records
 
It seems that a large number of vnodes per machine carries significant overhead.


 
 

Thursday, February 14, 2013

eclipse JUNO hangs on start up

I use Eclipse JUNO every day.  However, I couldn't start it today after I restarted my Mac (a Microsoft Office 2011 update forced the reboot).  The strange thing was that Eclipse simply got stuck on startup.  I manually killed the startup process several times, but it didn't help.  Then I googled around, found the solution in a blog post, and here is what I did:

rm $WORKSPACE_DIR/.metadata/.plugins/org.eclipse.e4.workbench/workbench.xmi
 
After that Eclipse started fine.  Happy coding.  
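
If you prefer to keep a copy of the old workbench state instead of deleting it outright (my own habit, not part of the original fix), move the file aside first:

mv $WORKSPACE_DIR/.metadata/.plugins/org.eclipse.e4.workbench/workbench.xmi ~/workbench.xmi.bak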

Friday, December 7, 2012

my shell script cheat sheet

1. find the total number of lines of some files under a folder:

find test -iname "*.java" | xargs wc -l | tail -1

2. count # of files in each folder recursively (a space-safe variant is sketched below):

for t in `find . -type d -ls | awk '{print $11}'`; do echo "$t `find $t -type f  | wc -l`" ; done > ~/count.txt
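
The one-liner above breaks on directory names containing spaces; here is a variant I'd use in that case (my own sketch, writing to the same output file):

find . -type d -print0 | while IFS= read -r -d '' t; do
  printf '%s %d\n' "$t" "$(find "$t" -type f | wc -l)"
done > ~/count.txt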

insert control character with vi

To insert a literal (visible) ^A, press Ctrl-V followed by Ctrl-A in insert mode.
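
Outside of vi, the same byte can be produced in a shell script with printf (my addition to the cheat sheet):

# emit a literal Ctrl-A (0x01), e.g. to use as a field separator
printf '\001'
# inspect the byte
printf '\001' | od -c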

Thursday, November 1, 2012

Some svn commands

Create tag from trunk or branch:
svn copy http://svn.mydomain.com/repository/myproject/trunk \
           http://svn.mydomain.com/repository/myproject/tags/release-1.0 \
      -m "Tagging the 1.0 release." 
The tag created is a snapshot of the trunk at the time "svn copy" is executed.  
To be more precise, a revision number can be passed to "svn copy":
svn copy -r 12345 http://svn.mydomain.com/repository/myproject/trunk \
           http://svn.mydomain.com/repository/myproject/tags/release-1.0 \
      -m "Tagging the 1.0 release."
Merge from Trunk to a branch:

1. check out the branch (assume the path is /path/to/mybranch)
2. go to above folder
3. run the following command:
svn merge http://svn.mydomain.com/repository/myproject/trunk .
It will merge all changes from trunk into mybranch since the last merge. 
The above command is the same as:
svn merge -rLastMergedRevision:HEAD http://svn.mydomain.com/repository/myproject/trunk .
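
Keep in mind that svn merge only modifies the working copy; the merged result still has to be committed (a reminder I'm adding here, not part of the original notes):

svn commit -m "Merge trunk into mybranch"
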
View merge history:
"svn log" doesn't display the merge history.  It only shows the merge commit:
svn log
------------------------------------------------------------------------
r196402 | liz | 2012-11-01 10:39:13 -0700 (Thu, 01 Nov 2012) | 1 line

Merging r196340 through r196401
------------------------------------------------------------------------
r196340 | liz | 2012-10-31 14:52:06 -0700 (Wed, 31 Oct 2012) | 1 line

development branch for new feature
If you need to see the merge history, the --use-merge-history (-g) option can be used with svn log:
svn log -g
------------------------------------------------------------------------
r196402 | liz | 2012-11-01 10:39:13 -0700 (Thu, 01 Nov 2012) | 1 line

Merging r196340 through r196401
------------------------------------------------------------------------
r196388 | xyz | 2012-11-01 09:50:28 -0700 (Thu, 01 Nov 2012) | 2 lines
Merged via: r196402

Added new unit tests

------------------------------------------------------------------------
r196340 | liz | 2012-10-31 14:52:06 -0700 (Wed, 31 Oct 2012) | 1 line

development branch for new feature
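
Another way to inspect merge state (assuming Subversion 1.5 or later, since this relies on mergeinfo tracking) is svn mergeinfo, which lists the revisions already merged or still eligible:

svn mergeinfo --show-revs merged   http://svn.mydomain.com/repository/myproject/trunk .
svn mergeinfo --show-revs eligible http://svn.mydomain.com/repository/myproject/trunk .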
 
  

Tuesday, July 17, 2012

Run Sqoop on Amazon Elastic MapReduce (EMR) with Amazon RDS

Amazon EMR doesn't come with Sqoop installed, but it is possible to run Sqoop on EMR.  The following blog post shows how to install and run Sqoop:

http://blog.kylemulka.com/2012/04/how-to-install-sqoop-on-amazon-elastic-map-reduce-emr/

However, the solution isn't perfect, since the input files are usually in S3 and Sqoop doesn't support S3 directly.  Here is my script to install Sqoop and export data from S3 to Amazon RDS (MySQL):

#!/bin/bash
BUCKET_NAME=zli-emr-test
SQOOP_FOLDER=sqoop-1.4.1-incubating__hadoop-0.20
SQOOP_TAR=$SQOOP_FOLDER.tar.gz

##change to home directory
cd ~

##Install sqoop on emr
hadoop fs -copyToLocal s3n://$BUCKET_NAME/$SQOOP_TAR $SQOOP_TAR
tar -xzf $SQOOP_TAR

##Install the JDBC driver (mysql-connector-java.jar) into sqoop's lib folder
hadoop fs -copyToLocal s3n://$BUCKET_NAME/mysql-connector-java-5.1.19.jar ~/$SQOOP_FOLDER/lib/
##Copy the input file from S3 to HDFS
HADOOP_INPUT=hdfs:///user/hadoop/myinput
hadoop distcp s3://$BUCKET_NAME/myinput $HADOOP_INPUT
~/$SQOOP_FOLDER/bin/sqoop export --connect jdbc:mysql://RDS-Host-name:3306/DB_NAME --username USERNAME --password PASSWORD --table TABLE_NAME --export-dir $HADOOP_INPUT --input-fields-terminated-by='\t'
The script assumes that the Sqoop tarball and mysql-connector-java.jar are in the S3 bucket, and that the input file is in S3 as well.

Note that RDS needs to be configured to allow database access from the following two EC2 security groups:
ElasticMapReduce-master
ElasticMapReduce-slave
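
One way to run the script on the cluster is as an EMR step via the script-runner JAR; something like the following should work with the elastic-mapreduce CLI (a sketch: the script name is a placeholder, and the script-runner path is the one AWS documented at the time):

./elastic-mapreduce --create --alive --name "sqoop export" \
  --jar s3://elasticmapreduce/libs/script-runner/script-runner.jar \
  --arg s3://zli-emr-test/install_and_run_sqoop.sh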

Tuesday, July 10, 2012

Run custom jar with elastic map reduce (EMR) on command line

I followed these instructions to run a custom jar on EMR:

http://aws.amazon.com/articles/3938

I got stuck with step 5:
5. Run the job flow.
 $ ./elasticmapreduce-client.rb RunJobFlow streaming_jobflow.json 
I couldn't find the file "elasticmapreduce-client.rb" at all.  After some online searching, I got it to work.  The correct command is:
./elastic-mapreduce --create --json path/to/your/flow
Here is what my flow file looks like (the "MainClass" entry is shown commented out to indicate it's optional; JSON has no comment syntax, so drop that line before use):
   [
      {
         "Name": "Custom Jar Grep Example 1",
         "ActionOnFailure": "CONTINUE",
         "HadoopJarStep":
         {
            "Jar": "s3n://YOUR_BUCKET/hadoop-examples-0.20.2-cdh3u4.jar",
                       ##"MainClass": "fully-qualified-class-name", 
            "Args":
            [
               "grep",
               "s3n://YOUR_BUCKET/input/example",
               "s3n://YOUR_BUCKET/output/example",
               "dfs[a-z.]+"
            ]
         }
      }
   ]
The flow corresponds to the following hadoop command:
hadoop jar hadoop-examples-0.20.2-cdh3u4.jar grep input output 'dfs[a-z.]+'

Some useful tips:

1. show the log:
 ./elastic-mapreduce --jobflow JOB_ID --logs
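
2. list job flows and check their state (my addition; these flags are from the same elastic-mapreduce CLI, so double-check them against your client version):

./elastic-mapreduce --list --active
./elastic-mapreduce --describe --jobflow JOB_ID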