Wednesday, December 16, 2015

Grizzly 2.0: HttpServer API. Asynchronous HTTP Server - Part I

Notes:
    1. This article is reposted from Mytec Blog for oleksiys.
    2. It helps to first understand the worker-thread mode, the first of the Grizzly IOStrategies.
    3. For reference only.
In my previous blog entry I described the basic Grizzly HttpServer API abstractions and offered a couple of samples showing how one can implement a light-weight, Servlet-like web application.
Here I'll try to show how we can process HTTP requests asynchronously within an HttpHandler; in other words, how to implement an asynchronous HTTP application.
What do we mean by "asynchronous"?
Normally an HttpServer has a service thread pool, whose threads are used to process HTTP requests. Processing includes the following steps:
  1. parse the HTTP request;
  2. execute the processing logic by calling HttpHandler.service(Request, Response);
  3. flush the HTTP response;
  4. return the service thread to the thread pool.
Normally the steps above are executed sequentially on a service thread. Using the "asynchronous" feature, it's possible to delegate the execution of steps 2 and 3 to a custom thread, which lets us release the service thread sooner.
Why would we want to do that?
As noted above, the service thread pool is shared by all the HttpHandlers registered on an HttpServer. Assume we have application (HttpHandler) "A", which executes a long-lasting task (say, a SQL query against a busy DB server), and application "B", which serves static resources. It's easy to imagine a couple of application "A" clients blocking all the service threads while they wait for a response from the DB server. The main problem is that clients of application "B", which is pretty light-weight, cannot be served at the same time because no service threads are available. So it might be a good idea to isolate these applications by executing application "A"'s logic in a dedicated thread pool, so the service threads won't be blocked.
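The starvation scenario can actually be reproduced without Grizzly at all, using only java.util.concurrent. A minimal, hypothetical sketch (the class and method names are illustrative, not part of any API): a single shared "service thread" handles both a slow application "A" and a fast application "B", then "A" is moved to a dedicated pool.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Grizzly-free sketch of the starvation problem: one shared "service thread"
// versus a dedicated pool for the slow application.
public class StarvationDemo {

    // Submits a slow task "A" to slowPool and a fast task "B" to fastPool,
    // returning how many milliseconds "B" waited before it actually ran.
    static long fastTaskDelayMs(ExecutorService slowPool, ExecutorService fastPool)
            throws InterruptedException {
        CountDownLatch fastDone = new CountDownLatch(1);
        slowPool.execute(() -> {                    // application "A": long-lasting task
            try {
                TimeUnit.MILLISECONDS.sleep(500);
            } catch (InterruptedException ignored) {
            }
        });
        long submitted = System.nanoTime();
        fastPool.execute(fastDone::countDown);      // application "B": light-weight task
        fastDone.await();
        return (System.nanoTime() - submitted) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        // One shared service thread: "B" is stuck in the queue behind "A".
        ExecutorService shared = Executors.newFixedThreadPool(1);
        System.out.println("shared pool: B waited " + fastTaskDelayMs(shared, shared) + " ms");
        shared.shutdown();

        // "A" runs in its own dedicated pool: "B" starts almost immediately.
        ExecutorService service = Executors.newFixedThreadPool(1);
        ExecutorService dedicated = Executors.newFixedThreadPool(1);
        System.out.println("dedicated pool: B waited " + fastTaskDelayMs(dedicated, service) + " ms");
        service.shutdown();
        dedicated.shutdown();
    }
}
```

With the shared single-thread pool the fast task waits roughly the full duration of the slow one; with a dedicated pool for the slow task it starts almost immediately, which is exactly the isolation argued for above.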
Ok, let's do some coding and make sure the issue we've just described is real.
HttpServer httpServer = new HttpServer();

NetworkListener networkListener = new NetworkListener("sample-listener", "127.0.0.1", 18888);

// Configure NetworkListener thread pool to have just one thread,
// so it would be easier to reproduce the problem
ThreadPoolConfig threadPoolConfig = ThreadPoolConfig
        .defaultConfig()
        .setCorePoolSize(1)
        .setMaxPoolSize(1);

networkListener.getTransport().setWorkerThreadPoolConfig(threadPoolConfig);

httpServer.addListener(networkListener);

httpServer.getServerConfiguration().addHttpHandler(new HttpHandler() {
    @Override
    public void service(Request request, Response response) throws Exception {
        response.setContentType("text/plain");
        response.getWriter().write("Simple task is done!");
    }
}, "/simple");

httpServer.getServerConfiguration().addHttpHandler(new HttpHandler() {
    @Override
    public void service(Request request, Response response) throws Exception {
        response.setContentType("text/plain");
        // Simulate long lasting task
        Thread.sleep(10000);
        response.getWriter().write("Complex task is done!");
    }
}, "/complex");

try {
    httpServer.start();
    System.out.println("Press any key to stop the server...");
    System.in.read();
} catch (Exception e) {
    System.err.println(e);
}
In the sample above we create, initialize and run an HTTP server with two applications (HttpHandlers) registered: "simple" and "complex". To simulate a long-lasting task in the "complex" application we simply put the current thread to sleep for 10 seconds.
Now if you call the "simple" application from your Web browser using the URL http://localhost:18888/simple, you see the response immediately. However, if you call the "complex" application at http://localhost:18888/complex, you'll see the response in 10 seconds. That's fine. But try calling the "complex" application first and then quickly, in a different tab, calling the "simple" application. Do you see the response immediately? Probably not: you'll see it only after the "complex" application has completed. The sad thing is that the service thread executing the "complex" operation is idle (the same happens while waiting for a SQL query result), so the CPU is doing nothing, yet we're still unable to process another HTTP request.
How can we rework the "complex" application to execute its task in a custom thread pool? Normally the application (HttpHandler) logic is encapsulated in the HttpHandler.service(Request, Response) method; once we exit this method, Grizzly finishes and flushes the HTTP response. So, coming back to the service thread processing steps:
  1. parse the HTTP request;
  2. execute the processing logic by calling HttpHandler.service(Request, Response);
  3. flush the HTTP response;
  4. return the service thread to the thread pool.
we see that it isn't enough to delegate HTTP request processing to a custom thread in step 2, because in step 3 Grizzly will automatically flush the HTTP response back to the client in whatever state it currently is. We need a way to instruct Grizzly not to perform step 3 automatically on the service thread; instead, we want to perform this step ourselves once asynchronous processing is complete.
Using the Grizzly HttpServer API this can be achieved the following way:
  • Response.suspend(...) instructs Grizzly not to flush the HTTP response in the service thread;
  • Response.resume() finishes HTTP request processing and flushes the response back to the client.
So the asynchronous version of the "complex" application (HttpHandler) will look like:
httpServer.getServerConfiguration().addHttpHandler(new HttpHandler() {
    final ExecutorService complexAppExecutorService =
        GrizzlyExecutorService.createInstance(
            ThreadPoolConfig.defaultConfig()
            .copy()
            .setCorePoolSize(5)
            .setMaxPoolSize(5));
            
    @Override
    public void service(final Request request, final Response response) throws Exception {
                
        response.suspend(); // Instruct Grizzly to not flush response, once we exit the service(...) method 
                
        complexAppExecutorService.execute(new Runnable() {   // Execute long-lasting task in the custom thread
            public void run() {
                try {
                    response.setContentType("text/plain");
                    // Simulate long lasting task
                    Thread.sleep(10000);
                    response.getWriter().write("Complex task is done!");
                } catch (Exception e) {
                    response.setStatus(HttpStatus.INTERNAL_SERVER_ERROR_500);
                } finally {
                    response.resume();  // Finishing HTTP request processing and flushing the response to the client
                }
            }
        });
    }
}, "/complex");
  • As you might have noticed, the "complex" application uses the Grizzly ExecutorService implementation. This is the preferred approach, but you can still use your own ExecutorService.
The three most important steps in the code above are:
  1. Suspending HTTP response processing: response.suspend()
  2. Delegating the task to the custom thread pool: complexAppExecutorService.execute(...)
  3. Resuming HTTP response processing: response.resume()
Now, using your browser, you can verify that the "simple" and "complex" applications no longer affect each other: the "simple" application works just fine while the "complex" application is busy.

Notes

In the sample the author deliberately runs the thread pool with only a single worker thread.
In that setup, if we call complex first and then call simple, we find that simple has to wait for complex to finish before it can run.
This exposes a problem: when some expensive APIs are called, simpler requests may be left queueing behind them.
The author then spins up a separate thread pool dedicated to complex, so that even with a single worker thread there is no queueing problem.
My understanding of this article: long-running APIs should be handled separately from short-running ones, which keeps the server running much more smoothly.

Best Practices

Notes:
    1. This section comes from the official Grizzly documentation.
    2. Please treat the official documentation as authoritative; this is for reference only.
When developing a network application, we usually wonder how we can optimize it. How should the worker thread pool be sized? Which I/O strategy should be employed?
There is no general answer to these questions, but we'll try to provide some tips.
  • IOStrategy
    In the IOStrategy section, we introduced different Grizzly IOStrategies.
    By default, Grizzly Transports use the worker-thread IOStrategy, which is reliable for any possible use case. However, if the application's processing logic doesn't involve any blocking I/O operations, the same-thread IOStrategy can be used. For these cases, the same-thread strategy will yield better performance, as there are no thread context switches.
    For example, if we implement a general HTTP Servlet container, we can't be sure about the nature of the specific Servlets developers may deploy, so it's safer to use the worker-thread IOStrategy. However, if an application uses Grizzly's HttpServer and HttpHandler, which leverage NIO streams, then the same-thread strategy could be used to optimize processing time and resource consumption;
  • Selector runners count
    The Grizzly runtime will automatically set the SelectorRunner count equal to Runtime.getRuntime().availableProcessors(). Depending on the use case, developers may change this value to better suit their needs.
    Scott Oaks, from the Glassfish performance team, suggests that there should be one SelectorRunner for every 1-4 cores on your machine; no more than that;
  • Worker thread pool
    In the Configuration threadpool-config section, the different thread pool implementations, and their pros and cons, were discussed.
    All IOStrategies, except the same-thread IOStrategy, use worker threads to process the IOEvents which occur on Connections. A common question is how many worker threads an application will need.
    In his blog, Scott suggests: "How many is 'just enough'? It depends, of course. In a case where HTTP requests don't use any external resource and are hence CPU-bound, you want only as many HTTP request processing threads as you have CPUs on the machine. But if the HTTP request makes a database call (even indirectly, such as by using a JPA entity), the request will block while waiting for the database, and you could profitably run another thread. So this takes some trial and error, but start with the same number of threads as you have CPUs and increase them until you no longer see an improvement in throughput."
    Translating this to the general, non-HTTP use case: if IOEvent processing includes blocking I/O operation(s) which make a thread block doing nothing for some time (i.e., waiting for a result from a peer), it's best to have more worker threads so that other request processing isn't starved. For simpler application processing, the fewer threads, the better.
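These sizing rules of thumb can be written down as a small helper. This is a hypothetical sketch: the blockingFactor parameter is an assumption of this sketch, not a Grizzly concept, and the numbers come from the guidance quoted above (one selector per 1-4 cores; worker threads starting at the CPU count and growing for blocking workloads).

```java
// Sketch of the sizing heuristics discussed above. The formulas are rules
// of thumb, not Grizzly APIs.
public class PoolSizing {

    // One selector runner per `coresPerSelector` cores (1..4), at least one.
    static int selectorRunners(int cores, int coresPerSelector) {
        if (coresPerSelector < 1 || coresPerSelector > 4)
            throw new IllegalArgumentException("use 1-4 cores per selector");
        return Math.max(1, cores / coresPerSelector);
    }

    // Starting point for the worker pool: the CPU count for CPU-bound
    // handlers, scaled up by an estimated blocking factor otherwise.
    static int workerThreads(int cores, double blockingFactor) {
        return (int) Math.max(cores, Math.ceil(cores * blockingFactor));
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("selector runners: " + selectorRunners(cores, 2));
        System.out.println("worker threads (CPU-bound): " + workerThreads(cores, 1.0));
        System.out.println("worker threads (blocking I/O): " + workerThreads(cores, 4.0));
    }
}
```

In Grizzly itself the resulting worker count would be applied through the transport's configuration, e.g. setWorkerThreadPoolConfig(...) as in the earlier HttpServer samples; the final value should still be tuned by measuring throughput, as Scott suggests.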

Notes

  • Grizzly's tuning advice can be split into three areas: IOStrategy, selectors and the worker thread pool.
    • IOStrategy selects the server's operating mode: worker-thread (the default), same-thread, dynamic and leader-follower.
    • The selector count should follow the machine's core count, with one selector per 1-4 cores (the default is the value of Runtime.getRuntime().availableProcessors()).
    • The official advice is that the thread pool should be just big enough; bigger is not necessarily faster. Also, whether the worker thread pool matters at all depends on the IOStrategy; the same-thread mode, for example, does not appear to use workers.

Monday, November 16, 2015

Hadoop + HBase + Hive Setup Guide (Pseudo-Distributed Mode)

Notes:
    1. All commands are executed as root; this guide is for practice only.
    2. Some characters may be corrupted in the PDF version; double-check the symbols as you type them.

Package List

Package           Package Name                       Installation Path   Version
Oracle Java       jdk-7u79-linux-x64.rpm             /usr/java/java      7
Apache Hadoop     hadoop-2.4.1.tar.gz                /opt/hadoop         2.4.1
Apache HBase      hbase-0.98.13-hadoop2-bin.tar.gz   /opt/hbase          0.98.13
Apache Hive       apache-hive-1.2.1-bin.tar.gz       /opt/hive           1.2.1
Apache Zookeeper  zookeeper-3.4.6.tar.gz             /opt/zookeeper      3.4.6

Environment

OS          IP              Host Name
CentOS 6.7  192.168.60.101  master

Prerequisites

Install JDK

$ rpm -ivh /tmp/jdk-7u79-linux-x64.rpm
$ ln -s /usr/java/jdk1.7.0_79 /usr/java/java

Edit profile

$ vim /etc/profile
Add the following:
export JAVA_HOME=/usr/java/java
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib/rt.jar
export PATH=$PATH:$JAVA_HOME/bin

Reload profile

$ source /etc/profile

Generate an SSH key

$ ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ""
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys
$ ssh localhost exit

Disable SSH host key checking

$ vim /etc/ssh/ssh_config
Modify the following:
StrictHostKeyChecking no

Restart SSH

$ service sshd restart

Disable SELinux

$ setenforce 0

Disable permanently

$ vim /etc/selinux/config
Modify the following:
SELINUX=disabled

Stop the firewall

$ service iptables stop

Disable the firewall at boot

$ chkconfig iptables off

Apache Hadoop

Install Apache Hadoop

Extract and create a symlink

$ tar -zxvf /tmp/hadoop-2.4.1.tar.gz
$ mv hadoop-2.4.1 /opt
$ ln -s /opt/hadoop-2.4.1 /opt/hadoop

Create the Hadoop temp directory

$ mkdir -p /opt/hadoop/tmp

Edit hosts

$ vim /etc/hosts
Add the following:
192.168.60.101 master

Edit profile

$ vim /etc/profile
Add the following:
export HADOOP_HOME=/opt/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
## HADOOP-9450
export HADOOP_USER_CLASSPATH_FIRST=true
## Add 2016/03/14
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_PREFIX=$HADOOP_HOME

Reload profile

$ source /etc/profile

Edit slaves

$ vim $HADOOP_HOME/etc/hadoop/slaves
Replace the contents with:
master

Edit core-site.xml

$ vim $HADOOP_HOME/etc/hadoop/core-site.xml
Replace the contents with:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
   <property>
      <name>fs.defaultFS</name>
      <value>hdfs://master:9000</value>
   </property>
   <property>
      <name>hadoop.tmp.dir</name>
      <value>/opt/hadoop/tmp</value>
   </property>
</configuration>

Edit hdfs-site.xml

$ vim $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Replace the contents with:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.permissions</name>
      <value>false</value>
   </property>
</configuration>

Edit mapred-site.xml

$ vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
Replace the contents with:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>

Edit yarn-site.xml

$ vim $HADOOP_HOME/etc/hadoop/yarn-site.xml
Replace the contents with:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
   <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>master</value>
   </property>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
   <property>
      <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
   </property>
</configuration>

Edit hadoop-env.sh

$ vim $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Add or modify:
export JAVA_HOME=/usr/java/java
export HADOOP_LOG_DIR=/opt/hadoop/logs

Format the NameNode

$ hdfs namenode -format

Reboot all hosts

$ reboot

Start Apache Hadoop

$ start-dfs.sh
$ start-yarn.sh

Test

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -write
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -clean
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 5

Apache Zookeeper

Install ZooKeeper

Extract and create a symlink

$ tar -zxvf /tmp/zookeeper-3.4.6.tar.gz
$ mv zookeeper-3.4.6 /opt
$ ln -s /opt/zookeeper-3.4.6 /opt/zookeeper

Edit profile

$ vim /etc/profile
Add the following:
export ZOOKEEPER_HOME=/opt/zookeeper
export PATH=$PATH:$ZOOKEEPER_HOME/bin

Reload profile

$ source /etc/profile

Edit zoo.cfg

$ cp $ZOOKEEPER_HOME/conf/zoo_sample.cfg $ZOOKEEPER_HOME/conf/zoo.cfg
$ vim $ZOOKEEPER_HOME/conf/zoo.cfg
Add or modify:
dataDir=/opt/zookeeper
server.1=master:2888:3888

Edit myid

$ vim /opt/zookeeper/myid
Replace the contents with:
1

Start ZooKeeper

$ zkServer.sh start

Apache HBase

Install HBase

Extract and create a symlink

$ tar -zxvf /tmp/hbase-0.98.13-hadoop2-bin.tar.gz
$ mv hbase-0.98.13-hadoop2 /opt
$ ln -s /opt/hbase-0.98.13-hadoop2 /opt/hbase

Edit profile

$ vim /etc/profile
Add the following:
export HBASE_HOME=/opt/hbase
export PATH=$PATH:$HBASE_HOME/bin

Reload profile

$ source /etc/profile

Edit regionservers

$ vim $HBASE_HOME/conf/regionservers
Replace the contents with:
master

Edit hbase-env.sh

$ vim $HBASE_HOME/conf/hbase-env.sh
Add the following:
export JAVA_HOME=/usr/java/java
export HBASE_MANAGES_ZK=false

Edit hbase-site.xml

$ vim $HBASE_HOME/conf/hbase-site.xml
Replace the contents with:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
   <property>
      <name>hbase.rootdir</name>
      <value>hdfs://master:9000/hbase</value>
   </property>
   <property>
      <name>hbase.cluster.distributed</name>
      <value>true</value>
   </property>
   <property>
      <name>hbase.zookeeper.quorum</name>
      <value>master</value>
   </property>
   <property>
      <name>hbase.zookeeper.property.dataDir</name>
      <value>/opt/zookeeper</value>
   </property>
</configuration>

Remove HBase's bundled slf4j-log4j12 (works around a duplicate SLF4J binding)

$ rm -f $HBASE_HOME/lib/slf4j-log4j12-1.6.4.jar

Start HBase

$ start-hbase.sh

Apache Hive

Install Hive

Extract and create a symlink

$ tar -zxvf /tmp/apache-hive-1.2.1-bin.tar.gz
$ mv apache-hive-1.2.1-bin /opt
$ ln -s /opt/apache-hive-1.2.1-bin /opt/hive

Edit profile

$ vim /etc/profile
Add the following:
export HIVE_HOME=/opt/hive
export PATH=$PATH:$HIVE_HOME/bin

Reload profile

$ source /etc/profile

Create directories on HDFS

$ hadoop fs -mkdir /tmp
$ hadoop fs -mkdir -p /user/hive/warehouse

Change directory permissions

$ hadoop fs -chmod -R 777 /tmp
$ hadoop fs -chmod -R 777 /user/hive/warehouse

Start hiveserver2

$ hiveserver2 &

Connect (new-style client)

$ beeline -u jdbc:hive2://master:10000