Hadoop Streaming with Python

Hadoop Streaming is a utility that comes with the Hadoop distribution. It allows developers to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. For example: hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.5.0.jar \…
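To make the invocation concrete, here is a sketch of a full streaming command; the HDFS input/output paths and script names are placeholders for illustration, not taken from the post:

    hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.5.0.jar \
        -input /user/hduser/input \
        -output /user/hduser/output \
        -mapper mapper.py \
        -reducer reducer.py \
        -file mapper.py \
        -file reducer.py

The -file options ship the local scripts to every task node so the mapper and reducer can execute there.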

Monitoring Hadoop from the browser

Hadoop provides two web interfaces that you should become familiar with, one for HDFS and the other for MapReduce. Both are useful in pseudo-distributed mode and are critical tools when you have a fully distributed setup. The HDFS web UI…
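On a default Hadoop 1.x-era pseudo-distributed install, the NameNode UI is usually served on port 50070 and the JobTracker UI on port 50030; here is a quick reachability check, as a sketch assuming those defaults (adjust host and ports for your cluster):

    import urllib2

    # Probe the default web UI endpoints of a pseudo-distributed setup.
    for name, url in [("NameNode (HDFS)", "http://localhost:50070/"),
                      ("JobTracker (MapReduce)", "http://localhost:50030/")]:
        try:
            code = urllib2.urlopen(url, timeout=5).getcode()
            print "%s UI reachable at %s (HTTP %d)" % (name, url, code)
        except Exception as e:
            print "%s UI not reachable at %s: %s" % (name, url, e)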

Administering Hadoop

Namenode directory structure: a newly formatted namenode creates the following directory structure: ${dfs.name.dir}/current/VERSION, /edits, /fsimage, /fstime. On my machine: [root@myhostname current]# pwd /data/2/hadoop/tmp/dfs/name/current [root@myhostname current]# ll -lhtr total 16K -rw-r--r-- 1 root root 110 Jul 22…

MapReduce Job [hadoop]

Running our first MapReduce job: we will use the WordCount example job, which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word…
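A typical end-to-end run, sketched with placeholder HDFS paths (the examples jar name and location vary by distribution and version):

    # Copy local text files into HDFS, run WordCount, then inspect the result.
    hadoop fs -put /tmp/books /user/hduser/input
    hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount /user/hduser/input /user/hduser/output
    hadoop fs -cat /user/hduser/output/part-* | head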

Getting started with Hive

Hive is a data warehouse that uses MapReduce to analyze data stored on HDFS. In particular, it provides a query language called HiveQL that closely resembles the common Structured Query Language (SQL) standard. Prerequisites: unlike Hadoop, there are no Hive…
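To give a flavour of how close HiveQL sits to SQL, here is a minimal sketch (the table name, columns, and HDFS path are made up for illustration):

    -- Define a table over tab-separated files already in HDFS.
    CREATE TABLE word_counts (word STRING, cnt INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    LOAD DATA INPATH '/user/hduser/output' INTO TABLE word_counts;

    -- A plain SQL-style aggregation; Hive compiles it into MapReduce jobs.
    SELECT word, SUM(cnt) AS total
    FROM word_counts
    GROUP BY word
    ORDER BY total DESC
    LIMIT 10;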

Write File to HDFS / Read File from HDFS Using Java

import java.io.File;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

/**
 * @author Srinivas
 * @email  srinivas@dbversity.com
 * @Web    www.dbversity.com
 */
public class WritetoHDFSReadFromHDFSWritToLocal {
    private…
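The core of such a program, as a minimal sketch using the FileSystem API (the namenode URI and file paths below are placeholders; the imports are the ones listed above):

    public class HdfsReadWriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder namenode URI -- match it to your fs.default.name.
            FileSystem fs = FileSystem.get(new URI("hdfs://localhost:8020"), conf);

            Path file = new Path("/user/hduser/hello.txt");
            FSDataOutputStream out = fs.create(file);   // write a file into HDFS
            out.writeUTF("Hello HDFS");
            out.close();

            FSDataInputStream in = fs.open(file);       // read it back
            System.out.println(in.readUTF());
            in.close();
            fs.close();
        }
    }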

Writing Hadoop MapReduce Program in Python

Map step: mapper.py. It will read data from STDIN, split it into words, and write to STDOUT a list of lines mapping words to their (intermediate) counts. The Map script will not compute an (intermediate) sum of a word’s occurrences…
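A minimal sketch of such a mapper, plus the companion reducer that does the summing; the tab-separated "word\t1" output format is the usual streaming convention, and the Python 2 syntax matches the era of these posts:

    #!/usr/bin/env python
    # mapper.py -- emit "word\t1" for every word read from STDIN;
    # summing is deliberately deferred to the reduce step.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print '%s\t%s' % (word, 1)

    #!/usr/bin/env python
    # reducer.py -- sum the counts per word, relying on the streaming
    # framework having sorted the mapper output by key.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.strip().split('\t', 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print '%s\t%d' % (current_word, current_count)
            current_word, current_count = word, int(count)
    if current_word is not None:
        print '%s\t%d' % (current_word, current_count)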

Prevent accidental data loss in Hadoop

Sometimes you accidentally delete a file that you were not supposed to. So what do you do in that case? The option is to enable trash (a recycle bin) for Hadoop and set fs.trash.interval to one…
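The property goes in core-site.xml; here is a sketch, where 1440 (in minutes, i.e. 24 hours) is just an example retention window:

    <!-- core-site.xml: keep deleted files in .Trash for 24 hours -->
    <property>
      <name>fs.trash.interval</name>
      <value>1440</value>
    </property>

With this set, hadoop fs -rm moves files into the user's .Trash directory instead of deleting them outright, until the interval expires.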

Warning: $HADOOP_HOME is deprecated

Do you see the warning below for every command on your Hadoop set-up? Warning: $HADOOP_HOME is deprecated. [root@hostname logs]# hadoop dfs -ls / Warning: $HADOOP_HOME is deprecated. Found 3 items drwxr-xr-x - root supergroup 0 2014-07-09 09:06 /hadoop drwxr-xr-x…
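On Hadoop 1.x this warning is printed by the bin/hadoop launcher whenever $HADOOP_HOME is set; the usual remedies, as a sketch (the install path is an example):

    # Option 1: switch to HADOOP_PREFIX instead of the deprecated variable.
    unset HADOOP_HOME
    export HADOOP_PREFIX=/usr/local/hadoop

    # Option 2: keep HADOOP_HOME but suppress the warning
    # (add to conf/hadoop-env.sh or your shell profile).
    export HADOOP_HOME_WARN_SUPPRESS="TRUE"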