Building a Hadoop Cluster Using Ansible
Introduction To Hadoop -
* Big Data is not a technology; it is an umbrella term for the problems that arise when huge amounts of data arrive in many different formats.
* Hadoop is used to solve this Big Data problem. Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
What is Ansible?
Ansible is a software tool that provides simple but powerful automation for cross-platform computer support. It is primarily intended for IT professionals, who use it for application deployment, updates on workstations and servers, cloud provisioning, configuration management, intra-service orchestration, and nearly anything a systems administrator does on a weekly or daily basis. Ansible doesn’t depend on agent software and has no additional security infrastructure, so it’s easy to deploy.
Installing Ansible
Prerequisites
You install Ansible on a control node, which then uses SSH (by default) to communicate with your managed nodes (those end devices you want to automate).
Control node requirements
Currently Ansible can be run from any machine with Python 2 (version 2.7) or Python 3 (versions 3.5 and higher) installed. This includes Red Hat, Debian, CentOS, macOS, any of the BSDs, and so on. Windows is not supported for the control node.
pip3 install ansible
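Once the installation finishes, a quick sanity check (not part of the original steps) confirms the control node is ready -
# ansible --version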
Managed node requirements
On the managed nodes, you need a way to communicate, which is normally SSH. By default this uses SFTP. If that’s not available, you can switch to SCP in ansible.cfg. You also need Python 2 (version 2.6 or later) or Python 3 (version 3.5 or later).
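If SFTP is not available on a managed node, the switch to SCP mentioned above is a one-line change in ansible.cfg. A minimal sketch, assuming the default /etc/ansible/ansible.cfg location (the section and option names are standard Ansible settings; everything else about your configuration is up to you) -
[ssh_connection]
scp_if_ssh = True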
Our workspace: /root/ansible_playbook/Hadoop
For this practical I will use two virtual machines on my local system. My NameNode is at IP 192.168.185.3 and my DataNode is at IP 192.168.185.4.
Inventory file "/root/ip.txt" on the controller node (in my case I created two groups of managed nodes: namenode and datanode) -
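For reference, an inventory along these lines produces the two groups; the IPs are the ones used in this setup, while ansible_user and ansible_ssh_pass are placeholders you would replace with your own login details -
[namenode]
192.168.185.3 ansible_user=root ansible_ssh_pass=YourPassword

[datanode]
192.168.185.4 ansible_user=root ansible_ssh_pass=YourPassword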
Check the list of managed nodes with the ansible command -
To get the list of all managed node IPs -
# ansible all --list-hosts
To get the "namenode" group IPs -
# ansible namenode --list-hosts
To get the "datanode" group IPs -
# ansible datanode --list-hosts
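Before running the playbooks it is also worth confirming that Ansible can actually reach every node; the standard ping module does this (an extra check, not part of the original steps) -
# ansible all -m ping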
On both the NameNode and the DataNode, configure the Yum repositories, copy the Hadoop and JDK software, and install them.
- hosts: all
  tasks:
  # mount the installation DVD and create local yum repositories from it
  - file:
      state: directory
      path: "/dvd1"
  - mount:
      src: "/dev/cdrom"
      path: "/dvd1"
      state: mounted
      fstype: "iso9660"
  - yum_repository:
      baseurl: "/dvd1/AppStream"
      name: "mydvd1"
      description: "dvd1 for package"
      gpgcheck: no
  - yum_repository:
      baseurl: "/dvd1/BaseOS"
      name: "mydvd2"
      description: "dvd2 for package"
      gpgcheck: no
  # copy the Hadoop and JDK rpm files to every node and install them
  - name: "copy hadoop software"
    copy:
      src: "/root/hadoop-1.2.1-1.x86_64.rpm"
      dest: "/root/hadoop-1.2.1-1.x86_64.rpm"
  - name: "copy jdk software"
    copy:
      src: "/root/jdk-8u171-linux-x64.rpm"
      dest: "/root/jdk-8u171-linux-x64.rpm"
  - name: "install jdk software"
    shell: "rpm -ih jdk-8u171-linux-x64.rpm"
    ignore_errors: true
  - name: "install hadoop software"
    shell: "rpm -ih hadoop-1.2.1-1.x86_64.rpm --force"
    ignore_errors: true
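Assuming the play above is saved in the workspace as, say, hadoop_install.yml (the file name is only an example), run it with -
# ansible-playbook hadoop_install.yml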
Inside NameNode :-
- hosts: namenode
  vars_prompt:
  - name: namenode_ip
    prompt: What is your namenode ip?
    private: no
  tasks:
  - name: "create directory"
    file:
      path: /nn
      state: directory
  # point dfs.name.dir at /nn and fs.default.name at this NameNode
  - name: "configure hdfs-site.xml file"
    blockinfile:
      path: "/etc/hadoop/hdfs-site.xml"
      insertafter: "<configuration>"
      block: |
        <property>
        <name>dfs.name.dir</name>
        <value>/nn</value>
        </property>
  - name: "configure core-site.xml file"
    blockinfile:
      path: "/etc/hadoop/core-site.xml"
      insertafter: "<configuration>"
      block: |
        <property>
        <name>fs.default.name</name>
        <value>hdfs://{{ namenode_ip }}</value>
        </property>
  - name: "format namenode directory"
    shell: "echo Y | hadoop namenode -format"
    ignore_errors: true
  - selinux:
      state: disabled
  - name: "stop firewalld"
    shell: "systemctl stop firewalld"
  - name: "stop namenode"
    shell: "hadoop-daemon.sh stop namenode"
    ignore_errors: true
  - name: "start namenode"
    shell: "hadoop-daemon.sh start namenode"
    ignore_errors: true
  - name: "jps"
    shell: "jps"
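Assuming this play is saved as, for example, namenode.yml, run it from the controller; it prompts for the NameNode IP before the tasks execute -
# ansible-playbook namenode.yml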
Inside DataNode
- hosts: datanode
  vars_prompt:
  - name: namenode_ip
    prompt: What is your namenode ip?
    private: no
  tasks:
  - name: "create directory"
    file:
      path: /dn
      state: directory
  # point dfs.data.dir at /dn and fs.default.name at the NameNode
  - name: "configure hdfs-site.xml file"
    blockinfile:
      path: "/etc/hadoop/hdfs-site.xml"
      insertafter: "<configuration>"
      block: |
        <property>
        <name>dfs.data.dir</name>
        <value>/dn</value>
        </property>
  - name: "configure core-site.xml file"
    blockinfile:
      path: "/etc/hadoop/core-site.xml"
      insertafter: "<configuration>"
      block: |
        <property>
        <name>fs.default.name</name>
        <value>hdfs://{{ namenode_ip }}</value>
        </property>
  - selinux:
      state: disabled
  - name: "stop firewalld"
    shell: "systemctl stop firewalld"
  - name: "stop datanode"
    shell: "hadoop-daemon.sh stop datanode"
    ignore_errors: true
  - name: "start datanode"
    shell: "hadoop-daemon.sh start datanode"
    ignore_errors: true
  - name: "jps"
    shell: "jps"
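Likewise, saving this play as, for example, datanode.yml and running it from the controller (it again prompts for the NameNode IP so the DataNode knows which master to join) -
# ansible-playbook datanode.yml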
To check the Hadoop cluster report -
# hadoop dfsadmin -report
Hence, we have automated the complete setup using Ansible.
To add more DataNodes, just add the IP of the new node to the [datanode] group in the Ansible inventory.
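For example, if a second DataNode were available at a hypothetical IP such as 192.168.185.5, the datanode group in /root/ip.txt would simply grow by one line and the same playbooks could be re-run -
[datanode]
192.168.185.4 ansible_user=root ansible_ssh_pass=YourPassword
192.168.185.5 ansible_user=root ansible_ssh_pass=YourPassword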
Thank You
My LinkedIn: Himanshu Agrawal