Building Hadoop Cluster Using Ansible

Himanshu Agrawal
Mar 23, 2021


Introduction To Hadoop -

* Big Data is not a technology; it is an umbrella term for the set of problems that arise from huge amounts of data arriving in many different formats.

* Hadoop is used to solve these Big Data problems. Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.

What is Ansible?

Ansible is a software tool that provides simple but powerful automation for cross-platform computer support. It is primarily intended for IT professionals, who use it for application deployment, updates on workstations and servers, cloud provisioning, configuration management, intra-service orchestration, and nearly anything a systems administrator does on a weekly or daily basis. Ansible doesn’t depend on agent software and has no additional security infrastructure, so it’s easy to deploy.

Installing Ansible

Prerequisites

You install Ansible on a control node, which then uses SSH (by default) to communicate with your managed nodes (those end devices you want to automate).

Control node requirements

Currently Ansible can be run from any machine with Python 2 (version 2.7) or Python 3 (version 3.5 and higher) installed. This includes Red Hat, Debian, CentOS, macOS, any of the BSDs, and so on. Windows is not supported for the control node. Install Ansible on the control node with pip:

pip3 install ansible

Managed node requirements

On the managed nodes, you need a way to communicate, which is normally SSH. By default this uses SFTP. If that’s not available, you can switch to SCP in ansible.cfg. You also need Python 2 (version 2.6 or later) or Python 3 (version 3.5 or later).
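The SCP switch mentioned above is set in ansible.cfg, which is also where the inventory path can be configured. Here is a minimal sketch of the control-node configuration assumed for the rest of this article (the file path, host_key_checking choice, and comments are my own; only the inventory path /root/ip.txt comes from the setup below):

# /etc/ansible/ansible.cfg  (minimal sketch for this lab setup)
[defaults]
inventory = /root/ip.txt        # inventory file used in this article
host_key_checking = False       # assumption: skip SSH host-key prompts in a lab

[ssh_connection]
scp_if_ssh = True               # use SCP instead of SFTP for file transfers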

Our workspace: /root/ansible_playbook/Hadoop

For this practical I will use two virtual machines on my local system. My NameNode is at IP 192.168.185.3 and my DataNode is at IP 192.168.185.4.

Inventory file "/root/ip.txt" at the controller node (in my case I create two groups of managed nodes: first → namenode, second → datanode) -
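The inventory itself was shown as a screenshot in the original post; a minimal sketch of what /root/ip.txt looks like with the two IPs above (SSH credentials, whether key-based or via variables such as ansible_user, still need to be set up separately):

[namenode]
192.168.185.3

[datanode]
192.168.185.4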

Check the list of managed nodes with the ansible command -

To list all managed node IPs -
# ansible all --list-hosts
To list the "namenode" group IPs -
# ansible namenode --list-hosts
To list the "datanode" group IPs -
# ansible datanode --list-hosts

Inside both the NameNode and the DataNode, configure the Yum repository, copy the Hadoop and JDK software, and install them:

- hosts: all
  tasks:
    - file:
        state: directory
        path: "/dvd1"
    - mount:
        src: "/dev/cdrom"
        path: "/dvd1"
        state: mounted
        fstype: "iso9660"
    - yum_repository:
        baseurl: "file:///dvd1/AppStream"
        name: "mydvd1"
        description: "dvd1 for package"
        gpgcheck: no
    - yum_repository:
        baseurl: "file:///dvd1/BaseOS"
        name: "mydvd2"
        description: "dvd2 for package"
        gpgcheck: no
    - name: "copy hadoop software"
      copy:
        src: "/root/hadoop-1.2.1-1.x86_64.rpm"
        dest: "/root/hadoop-1.2.1-1.x86_64.rpm"
    - name: "copy jdk software"
      copy:
        src: "/root/jdk-8u171-linux-x64.rpm"
        dest: "/root/jdk-8u171-linux-x64.rpm"
    - name: "install jdk software"
      shell: "rpm -ih jdk-8u171-linux-x64.rpm"
      ignore_errors: true
    - name: "install hadoop software"
      shell: "rpm -ih hadoop-1.2.1-1.x86_64.rpm --force"
      ignore_errors: true
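Assuming the playbook above is saved as software.yml in the workspace (the filename is my own choice), it can be run against all nodes with the following command (add -i /root/ip.txt if the inventory path is not set in ansible.cfg):

# ansible-playbook software.yml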

Inside NameNode :-

- hosts: namenode
  vars_prompt:
    - name: namenode_ip
      prompt: What is your namenode ip?
      private: no
  tasks:
    - name: "create directory"
      file:
        path: /nn
        state: directory
    - name: "configure hdfs-site.xml file"
      blockinfile:
        path: "/etc/hadoop/hdfs-site.xml"
        insertafter: "<configuration>"
        block: |
          <property>
          <name>dfs.name.dir</name>
          <value>/nn</value>
          </property>
    - name: "configure core-site.xml file"
      blockinfile:
        path: "/etc/hadoop/core-site.xml"
        insertafter: "<configuration>"
        block: |
          <property>
          <name>fs.default.name</name>
          <value>hdfs://{{ namenode_ip }}</value>
          </property>
    - shell: "echo Y | hadoop namenode -format"
      ignore_errors: true
    - selinux:
        state: disabled
    - name: "stop firewalld"
      shell: "systemctl stop firewalld"
    - name: "stop namenode"
      shell: "hadoop-daemon.sh stop namenode"
      ignore_errors: true
    - name: "start namenode"
      shell: "hadoop-daemon.sh start namenode"
      ignore_errors: true
    - name: "jps"
      shell: "jps"

Inside DataNode :-

- hosts: datanode
  vars_prompt:
    - name: namenode_ip
      prompt: What is your namenode ip?
      private: no
  tasks:
    - name: "create directory"
      file:
        path: /dn
        state: directory
    - name: "configure hdfs-site.xml file"
      blockinfile:
        path: "/etc/hadoop/hdfs-site.xml"
        insertafter: "<configuration>"
        block: |
          <property>
          <name>dfs.data.dir</name>
          <value>/dn</value>
          </property>
    - name: "configure core-site.xml file"
      blockinfile:
        path: "/etc/hadoop/core-site.xml"
        insertafter: "<configuration>"
        block: |
          <property>
          <name>fs.default.name</name>
          <value>hdfs://{{ namenode_ip }}</value>
          </property>
    - selinux:
        state: disabled
    - name: "stop firewalld"
      shell: "systemctl stop firewalld"
    - name: "stop datanode"
      shell: "hadoop-daemon.sh stop datanode"
      ignore_errors: true
    - name: "start datanode"
      shell: "hadoop-daemon.sh start datanode"
      ignore_errors: true
    - name: "jps"
      shell: "jps"

To check the cluster report -

# hadoop dfsadmin -report

Hence, we have automated the complete setup using Ansible.

To add more DataNodes, just add the IP of each new DataNode to the [datanode] group in the Ansible inventory and rerun the DataNode playbook.
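For example, a second DataNode at a hypothetical IP 192.168.185.5 (purely illustrative) would be added to the inventory like this before rerunning datanode.yml:

[datanode]
192.168.185.4
192.168.185.5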

Thank You

My LinkedIn: Himanshu Agrawal
