A simple walkthrough to get Pig up and running on CentOS 7 using an existing Hadoop install. If you have not installed Hadoop yet, please see my post on installing Apache Hadoop on CentOS 7 before continuing here.
As always, this is as much documentation for me as it is intended to be a tutorial, but suggested corrections, additions, and notes on omissions are welcome.
Since the advent of the Hadoop cluster, Hive has been the “path of least resistance” for getting SQL-savvy ETL engineers and data analysts onto Hadoop’s MapReduce framework. Consequently, Pig is often overlooked as the proper tool for constructing data flows because of the resources needed to learn yet another programming language.
Alan Gates, Pig architect at Yahoo!, makes a solid case for why a procedural language like Pig is preferable to a declarative one like Hive for data flows.
Furthermore, while I have used /opt here as my installation directory, none of the commands inherently require sudo; it is only needed because /opt is root-owned. You can install Pig to any directory you have write permission on.
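As a sketch of that sudo-free alternative, everything could live under your home directory instead; the `~/opt` path below is an assumption for illustration, not what the rest of this post uses:

```shell
# Sudo-free install sketch. NOTE: ~/opt is an assumed location;
# the rest of this walkthrough uses /opt instead.
INSTALL_DIR="$HOME/opt"
mkdir -p "$INSTALL_DIR"

# After downloading, extract the tarball there rather than /opt:
# tar zxvf pig-0.16.0.tar.gz -C "$INSTALL_DIR"

# Then point PIG_HOME at the user-owned copy:
export PIG_HOME="$INSTALL_DIR/pig-0.16.0"
export PATH="$PIG_HOME/bin:$PATH"
```

Everything else in the walkthrough is the same; only the paths change.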
1. Download Pig
Go to the Apache Pig download page and click “Download a release now!”. It will then suggest the correct mirror for you to use and take you to the index – select the version of Pig compatible with your version of Hadoop.
For instance, I’ve used:
```shell
cd /tmp
wget http://apache.mirrors.hoobly.com/pig/pig-0.16.0/pig-0.16.0.tar.gz
```
For 3rd party packages I like to use the convention of a directory per package under `/opt`:

```shell
sudo mkdir /opt/pig
sudo cp pig-0.16.0.tar.gz /opt/pig
cd /opt/pig
```
Un-gzip and un-tar:
```shell
tar zxvf pig-0.16.0.tar.gz
```
Add the following to the end of your shell profile (e.g. `~/.bashrc`):
```shell
export PIG_HOME=/opt/pig/pig-0.16.0
export PATH=$PIG_HOME/bin:$PATH
export PIG_CLASSPATH=$HADOOP_HOME/conf
```
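After editing, reload the file so your current shell picks up the changes. A quick sanity check, assuming the exports above were appended to `~/.bashrc`:

```shell
# Reload the profile in the current shell and verify the variables.
# Assumes the exports above were appended to ~/.bashrc.
source ~/.bashrc
echo "$PIG_HOME"                # expect /opt/pig/pig-0.16.0
echo "$PATH" | grep -q "$PIG_HOME/bin" && echo "pig is on PATH"
```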
And you’re all set!
```shell
pig -version
# Apache Pig version 0.16.0 (r1746530)
# compiled Jun 01 2016, 23:10:49
```
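To go one step further than `-version`, you can run a tiny Pig Latin job in local mode, which exercises the full parse-and-execute path without needing a running cluster. The file name and data below are hypothetical:

```shell
# Create a throwaway input file (hypothetical data).
printf 'alice\nbob\nalice\n' > /tmp/names.txt

# Run a minimal group-and-count job; -x local runs Pig without a cluster.
pig -x local -e "
  names   = LOAD '/tmp/names.txt' AS (name:chararray);
  grouped = GROUP names BY name;
  counts  = FOREACH grouped GENERATE group, COUNT(names);
  DUMP counts;
"
```

DUMP should print the tuples (alice,2) and (bob,1), confirming Pig can load, group, and execute locally.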