Halyard is a horizontally scalable RDF store with support for named graphs, designed for the storage and integration of extremely large semantic data models and for the execution of SPARQL 1.1 queries over snapshots of the entire linked data universe. Halyard is based on the Eclipse RDF4J framework and the Apache HBase database, and it is written entirely in Java.
Build environment prerequisites are:
In the Halyard project root directory, execute the command: mvn package
Optionally, you can build Halyard from NetBeans or another Java IDE.
Halyard is expected to run on an Apache Hadoop cluster node with a configured Apache HBase client. Apache Hadoop and Apache HBase components are not bundled with Halyard. The runtime requirements are:
Note: The recommended Apache Hadoop distributions are Hortonworks Data Platform (HDP) version 2.4.2 and Amazon Elastic MapReduce (EMR).
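Halyard picks up the HBase connection from the standard HBase client configuration present on the cluster node. As a minimal sketch only (the ZooKeeper host names below are placeholders, not part of any particular setup), a client-side hbase-site.xml might look like:

```xml
<?xml version="1.0"?>
<!-- Minimal HBase client configuration; host names are hypothetical. -->
<configuration>
  <property>
    <!-- ZooKeeper ensemble the HBase client connects to -->
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>
  <property>
    <!-- ZooKeeper client port (2181 is the usual default) -->
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>
```

On a properly configured cluster node this file is typically already in place (e.g. under the HBase configuration directory), so no manual editing should be needed.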
Hortonworks Data Platform (HDP) is a Hadoop distribution that includes all the important parts of Hadoop; however, it does not directly provide the hardware or the core OS.
The installation of the whole HDP stack through Ambari is described on the Hortonworks Data Platform - Apache Ambari Installation page.
It is possible to strip down the set of Hadoop components to HDFS, MapReduce2, YARN, HBase, ZooKeeper, and optionally Ambari Metrics for cluster monitoring.
Detailed documentation about Hortonworks Data Platform is available at http://docs.hortonworks.com.
Amazon Elastic MapReduce (EMR) is a service that provides both the hardware and the software stack needed to run Hadoop, and Halyard on top of it.
A sample Amazon EMR setup is described in the Amazon EMR Management Guide - Getting Started.
For the purposes of Halyard, it is important to perform the first two steps of the guide:
It is possible to strip down the set of provided components during Create Cluster by clicking Go to advanced options and selecting just Hadoop, ZooKeeper, HBase, and optionally Ganglia for cluster monitoring.
HBase for Halyard can run in either storage mode: HDFS or S3.
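The storage mode is determined by HBase's root directory setting. As a hedged sketch (the NameNode address and S3 bucket name below are placeholders), the relevant hbase-site.xml property for each mode might look like:

```xml
<?xml version="1.0"?>
<!-- Sketch of the HBase root directory setting; values are hypothetical. -->
<configuration>
  <property>
    <!-- HDFS-backed HBase: root directory on the cluster's HDFS -->
    <name>hbase.rootdir</name>
    <value>hdfs://namenode.example.com:8020/hbase</value>
  </property>
  <!-- For S3-backed HBase on EMR, hbase.rootdir would instead point
       to an S3 location such as s3://my-bucket/hbase; on EMR this is
       normally selected through the cluster configuration rather than
       by hand-editing this file. -->
</configuration>
```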
Instance types with redundant storage space, such as d2.xlarge, are highly recommended when you plan to bulk load large datasets using Halyard. Instance types with plenty of memory and fast disks for local caching, such as i2.xlarge, are recommended when the cluster will mainly serve data through Halyard.
Additional EMR Task Nodes can be used to host additional Halyard SPARQL endpoints.
Detailed documentation of Amazon EMR is available at https://aws.amazon.com/documentation/emr.