Apache Giraph – From installation to Example execution

In this article, let me look into Apache Giraph and let’s run the example “SimpleShortestPathsComputation”

First, look at the introduction of Giraph (http://giraph.apache.org/)

Welcome to Apache Giraph!
Apache Giraph is an iterative graph processing system built for high scalability. For example, it is currently used at Facebook to analyze the social graph formed by users and their connections. Giraph originated as the open-source counterpart to Pregel, the graph processing architecture developed at Google and described in a 2010 paper. Both systems are inspired by the Bulk Synchronous Parallel model of distributed computation introduced by Leslie Valiant. Giraph adds several features beyond the basic Pregel model, including master computation, sharded aggregators, edge-oriented input, out-of-core computation, and more. With a steady development cycle and a growing community of users worldwide, Giraph is a natural choice for unleashing the potential of structured datasets at a massive scale.

The input to a Giraph computation is a graph composed of vertices and directed edges, see Figure 1. For example vertices can represent people, and edges friend requests. Each vertex stores a value, so does each edge. The input, thus, not only determines the graph topology, but also the initial values of vertices and edges.

Figure 1: An illustration of an execution of a single source shortest paths algorithm in Giraph. The input is a chain graph with three vertices (black) and two edges (green). The values of edges are 1 and 3 respectively. The algorithm computes distances from the leftmost vertex. The initial values of vertices are 0, ∞ and ∞ (top row). Distance upper bounds are sent as messages (blue), resulting in updates to vertex values (successive rows going down). The execution lasts three supersteps (separated by red lines)- cited from Apache Giraph introduction

It’s explained well in detail about the introduction of BSP for graph to Giraph framework to overcome the drawback of graph processing on Hadoop framework in article of  ‘s [How-to: Write and Run Apache Giraph Jobs on Apache Hadoop] in Cloudera Blog.

However, the iterative character of many graph algorithms is a poor fit for the MapReduce paradigm, especially if the graph is a very large one like the full set of interlinked Wikipedia pages. In MapReduce, a data set is streamed into the mappers and aggregation of intermediate results is done in reducers. One MapReduce job can implement one super-step. In a next super-step, the whole graph structure — together with the stored intermediate state of the previous step — have to be loaded from HDFS and stored at the end again. Between processing steps, the full graph is loaded from HDFS and stored there, which is a really large overhead. And let‘s not forget that the intermediate data requires local storage on each worker node while it passes the shuffle-sort phase.

For very large graphs, it is inefficient to repeatedly store and load the more or less fixed structural data. In contrast, the BSP approach loads the graph only once, at the outset. The algorithms assume that runtime-only messages are passed between the worker nodes. To support fault-tolerant operation the intermediate state of the graph can be stored from time to time, but usually not after each super-step.” (the zookeeper is probably used for that reason)

So, what is the BSP(Bulk Synchronous Parallel model)?

Let’s refer to the Graph page of Wekipedia in Giraph introduction page.

The Bulk Synchronous Parallel (BSP) abstract computer is a bridging model for designing parallel algorithms. It serves a purpose similar to the Parallel Random Access Machine (PRAM) model. BSP differs from PRAM by not taking communication and synchronization for granted. An important part of analyzing a BSP algorithm rests on quantifying the synchronization and communication needed.

bsp_abstract_model

Fig 2 : BSP Abstract Model proposed in 1989 byLeslie Valiant (http://www.staff.science.uu.nl/~bisse101/Book/PSC/psc1_2.pdf)

Referring to wiki,  BSP is consist of these components

  1. components capable of processing and/or local memory transactions (i.e., processors),
  2. a network that routes messages between pairs of such components,
  3. and a hardware facility that allows for the synchronization of all or a subset of components.

Fig 3. BSP structure and superstep

A BSP algorithm relies heavily on the third feature; a computation proceeds in a series of global supersteps, which consists of

three components:

    • Concurrent computation: every participating processor may perform local computations, i.e., each process can only make use of values stored in the local fast memory of the processor. The computations occur asynchronously of all the others but may overlap with communication.
    • Communication: The processes exchange data between themselves to facilitate remote data storage capabilities.
    • Barrier synchronisation: When a process reaches this point (the barrier), it waits until all other processes have reached the same barrier.

It is described about the “Single Source Shortest Path Algorithm” using Giraph in the Fig 1 above.

  • In the top row, initial values of 3 vertices are 0, ∞ and ∞. Each edge is 1, 3.
  • In the top row, BSP algorithm is executed. And the each process arrives to the Barrier, at this time the first superstep is completed.(the first red line)
  • Passing the output value as Message value, update as second vertices value.
  • Run the process again, second superstep is completed, output value as message update the third vertices.

Deployment

Test Environmnet – Virtual Hadoop Cluster

Under the 64-bit  4GB of RAM allocated virtual machines of Virtual Box

  • HDFS: Namenode+9 data nodes
  • YARN: Resourcemanager+9  nodemanager nodes
    (standalone host / pseudo mode OK for test)

Software/Hardware Specifications:

  • Hardware:
    • Dual-core 2.5 GHz CPU (64-bit architecture)
    • 4GB RAM (6GB would be much better)
    • 50GB hard drive space
    • 100 Mbps NIC
  • CentOS 7.2(64-bit)
  • Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
  • Apache Hadoop 2.7.2
  • Apache Giraph 1.1.0
  • Hadoop Owner Account: hadoop:hadoop

Hadoop Cluster Setup

Cloudera CDH version or download from Apache Hadoop

Maven Compile / Build

Giraph Building / Testing – Refer to the site http://giraph.apache.org/build.html

Be aware that Giraph 1.1 should be used to run the Giraph under Hadoop 2 (YARN Framework)

Deploy

  1. Go to the directory where giraph source will be located
  2. Download giraph source from git
  3. owner permission of downloaded source.
  4. maven build

    • Be careful of optioins – Consistent dependancy error while building source.
    •  set dependancy  option of the hadoop where giraph installed: -Dhadoop.version=2.x.x
    • “Omit the test” option of  -DskipTests clean install build. Without test skip option,  you’ll get the consistent compile error at giraph-core / giraph-kibble or giraph example continuously.
    • If it fails consistently, delete repository folder in ~/.m2 directory then clone and download source from git and compile again.
    • maven version 3 or above should be installed to use munge plugin. In a mac(OSX), it shoud be interlocking with Xcode.
  5. After the maven build is completed then,
  6. Move to Giraph Root where you downloaded the giraph
  7. Add environment variables for Giraph to the bash profile(~/.bash_profile) or /etc/profile.d/custom.sh
    예)
  • Test
  1. It is assumed that the giraph is installed on hadoop namenode.
  2. Prepare the input data.(http://giraph.apache.org/quick_start.html )
    • First, create an example graph under /tmp/tiny_graph.txt with the follwing:

      Save and close the file. Each line above has the format [source_id,source_value,[[dest_id, edge_value],...]]. In this graph, there are 5 nodes and 12 directed edges. Copy the input file to HDFS:

  3. Run the command below(probably you’ll get error)
    • Run the Simple Shortest Path Example
    • This error occurs because the  source_ id is set to 1 as hard coded in Giraph Example,so  refter to this url (http://stackoverflow.com/questions/27545664/giraph-tutorial-shortestpath-example-job-failing) modify the SimpleShortestPathsComputation.java , then build the source again.
      The location of source code  for “SimpleShortestPathComputation” exists in downloaded sources
    • Modify the source code  SimpleShortestPathsComputation.java in $GIRPH_HOME/giraph-examples/src/main/java/org/apache/giraph/eamples/
  4. Re-run the maven build again.
  5. Execute the example after rebuilding
      • At this time, if YARN server located remote place, you can assign the option like this:  -ca mapred.job.tracker=resourcemanager
      • If Zookeeper server located, you can use this option: -ca giraph.zkList=zookeeper:2081

  • Troubleshooting
    • Exception in thread “main” org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://namenode:9000/user/hadoop/output/shortestpaths already exists:
      이 경우는 이미 출력 파일이 존재하는 경우이므로, 다음 명령으로 출력 결과를 삭제하고 다시 실행한다. In this case, the output file already exists, run below to delete result:


Apache Giraph –

How-to: Write and Run Apache Giraph Jobs on Apache Hadoop – http://giraph.apache.org/

http://blog.cloudera.com/blog/2014/02/how-to-write-and-run-giraph-jobs-on-hadoop/

BSP Model –

http://www.staff.science.uu.nl/~bisse101/Book/PSC/psc1_2.pdf

1,946 total views, 7 views today

Leave a Reply