tag:blogger.com,1999:blog-87990307581533041842024-03-14T00:11:44.081-07:00Harini9505140022Anonymoushttp://www.blogger.com/profile/02735181906041798951noreply@blogger.comBlogger6125tag:blogger.com,1999:blog-8799030758153304184.post-35773869225238921412014-04-17T00:07:00.000-07:002014-04-17T00:07:11.661-07:00Hadoop definitive guide Harini 9505140022<div dir="ltr" style="text-align: left;" trbidi="on">
<h3 class="post-title entry-title" style="background-color: #fff9ee; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 24px; font-weight: normal; margin: 0.75em 0px 0px; position: relative;">
<a href="http://www.thecloudavenue.com/2011/10/hadoop-definitive-guide-book-review.html" style="color: #888888; text-decoration: none;">'Hadoop - The Definitive Guide' Book Review</a></h3>
<div class="post-header" style="background-color: #fff9ee; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13px; line-height: 1.6; margin: 0px 0px 1.5em;">
<div class="post-header-line-1">
</div>
</div>
<div class="post-body entry-content" id="post-body-8973228845725892390" style="background-color: #fff9ee; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 15px; line-height: 1.4; position: relative; width: 640px;">
<div dir="ltr" trbidi="on">
<div>
For anyone serious about getting into Hadoop, '<a href="http://www.amazon.com/gp/product/1449389732/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=hadtip-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=1449389732" style="color: #888888; text-decoration: none;" target="_blank">Hadoop - The Definitive Guide</a>' is a must-have book, alongside the tons of articles and tutorials on the Internet. Most tutorials stop at the 'Word Count' example, but this book goes to the next level, explaining the nuts and bolts of the Hadoop framework with a lot of examples and references. Most interesting and important, the book also explains why certain design decisions were made in Hadoop.</div>
<div>
<br /><div style="text-align: center;">
<a href="http://www.amazon.com/gp/product/1449389732/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=hadtip-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=1449389732" style="color: #888888; text-decoration: none;" target="_blank"><img height="200" src="http://2.bp.blogspot.com/-qgWBcTnGKrY/Tv_glEfrTDI/AAAAAAAAJMg/rWJhvbw7HEg/s200/hadoop-book-2ed.jpg" style="-webkit-box-shadow: rgba(0, 0, 0, 0.0980392) 1px 1px 5px; background-color: white; background-position: initial initial; background-repeat: initial initial; border: medium none; box-shadow: rgba(0, 0, 0, 0.0980392) 1px 1px 5px; padding: 5px; position: relative;" width="152" /></a></div>
</div>
<div>
The book not only covers HDFS and MapReduce, but also gives an overview of the layers that sit on top of Hadoop, such as <a href="http://pig.apache.org/" style="color: #888888; text-decoration: none;">Pig</a>, <a href="http://hive.apache.org/" style="color: #888888; text-decoration: none;">Hive</a>, <a href="http://hbase.apache.org/" style="color: #888888; text-decoration: none;">HBase</a>, <a href="http://zookeeper.apache.org/" style="color: #888888; text-decoration: none;">ZooKeeper</a> and <a href="http://incubator.apache.org/projects/sqoop.html" style="color: #888888; text-decoration: none;">Sqoop</a>.<br /><br />A couple of caveats are worth noting:<br /><ul style="line-height: 1.4; margin: 0.5em 0px; padding: 0px 2.5em;">
<li style="margin: 0px 0px 0.25em; padding: 0px;">MapReduce is covered in detail, but HDFS internals and fine-tuning are covered only at a high level.</li>
<li style="margin: 0px 0px 0.25em; padding: 0px;">Also, to stay in sync with Hadoop development and features, it's absolutely necessary to get the source from <a href="http://svn.apache.org/repos/asf/hadoop/common/trunk/" style="color: #888888; text-decoration: none;">trunk</a> or from <a href="http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.23/" style="color: #888888; text-decoration: none;">another branch</a>, then build, package and try it out.</li>
</ul>
</div>
</div>
</div>
</div>
Anonymoushttp://www.blogger.com/profile/02735181906041798951noreply@blogger.com1tag:blogger.com,1999:blog-8799030758153304184.post-26095210716942330072014-04-16T23:52:00.001-07:002014-04-16T23:52:14.359-07:00Hadoop Blogs Harini 9505140022<div dir="ltr" style="text-align: left;" trbidi="on">
<h3 class="post-title entry-title" style="background-color: #fff9ee; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 24px; font-weight: normal; margin: 0.75em 0px 0px; position: relative;">
<a href="http://www.thecloudavenue.com/2011/10/nice-blogs-on-hadoop-and-distributed.html" style="color: #888888; text-decoration: none;">Nice blogs on Hadoop and Distributed Computing</a></h3>
<div class="post-header" style="background-color: #fff9ee; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13px; line-height: 1.6; margin: 0px 0px 1.5em;">
<div class="post-header-line-1">
</div>
</div>
<div class="post-body entry-content" id="post-body-7836190692694795397" style="background-color: #fff9ee; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 15px; line-height: 1.4; position: relative; width: 640px;">
<div dir="ltr" trbidi="on">
Here are some of the blogs I have been following to keep up with the latest happenings around Hadoop and Distributed Computing. If I missed any blogs, please mention them in the comments and I will add them to the list.<br /><br /><br /><a href="http://www.allthingsdistributed.com/" style="color: #888888; text-decoration: none;">http://www.allthingsdistributed.com/</a><br /><br /><a href="http://allthingshadoop.com/" style="color: #888888; text-decoration: none;">http://allthingshadoop.com/</a><br /><br /><a href="http://www.asterdata.com/blog/" style="color: #888888; text-decoration: none;">http://www.asterdata.com/blog/</a><br /><br /><a href="http://atbrox.com/" style="color: #888888; text-decoration: none;">http://atbrox.com/</a><br /><br /><a href="http://www.cloudera.com/blog/" style="color: #888888; text-decoration: none;">http://www.cloudera.com/blog/</a><br /><br /><a href="http://developer.yahoo.com/blogs/hadoop/" style="color: #888888; text-decoration: none;">http://developer.yahoo.com/blogs/hadoop/</a><br /><br /><a href="http://hadoopblog.blogspot.com/" style="color: #888888; text-decoration: none;">http://hadoopblog.blogspot.com/</a><br /><br /><a href="http://www.hortonworks.com/blog/" style="color: #888888; text-decoration: none;">http://www.hortonworks.com/blog/</a><br /><br /><a href="http://www.mapr.com/blog/" style="color: #888888; text-decoration: none;">http://www.mapr.com/blog/</a><br /><br /><a href="http://www.michael-noll.com/blog/" style="color: #888888; text-decoration: none;">http://www.michael-noll.com/blog/</a><br /><br /><a href="http://platformcomputing.blogspot.com/" style="color: #888888; text-decoration: none;">http://platformcomputing.blogspot.com/</a><br /><br /><a href="http://steveloughran.blogspot.com/" style="color: #888888; text-decoration: none;">http://steveloughran.blogspot.com/</a><br /><br /><a href="http://tdunning.blogspot.com/" style="color: #888888; text-decoration: none;">http://tdunning.blogspot.com/</a></div>
</div>
</div>
Anonymoushttp://www.blogger.com/profile/02735181906041798951noreply@blogger.com0tag:blogger.com,1999:blog-8799030758153304184.post-24463895610736868582014-04-16T23:30:00.002-07:002014-04-16T23:30:13.784-07:00Retrieving Hadoop Counters in Map/Reduce Tasks Harini 9505140022<div dir="ltr" style="text-align: left;" trbidi="on">
<h3 class="post-title entry-title" style="background-color: #fff9ee; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 24px; font-weight: normal; margin: 0.75em 0px 0px; position: relative;">
<a href="http://www.thecloudavenue.com/2011/11/retrieving-hadoop-counters-in-mapreduce.html" style="color: #888888; text-decoration: none;">Retrieving Hadoop Counters in Map/Reduce Tasks</a></h3>
<div class="post-header" style="background-color: #fff9ee; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13px; line-height: 1.6; margin: 0px 0px 1.5em;">
<div class="post-header-line-1">
</div>
</div>
<div class="post-body entry-content" id="post-body-1100566630897281720" style="background-color: #fff9ee; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 15px; line-height: 1.4; position: relative; width: 640px;">
<div dir="ltr" trbidi="on">
Hadoop uses Counters to gather metrics/statistics which can later be analyzed for performance tuning or to find bugs in MapReduce programs. There are some predefined Counters, and custom Counters can also be defined. <a href="http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.21/mapreduce/src/java/org/apache/hadoop/mapreduce/JobCounter.java" style="color: #888888; text-decoration: none;">JobCounter</a> and <a href="http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.21/mapreduce/src/java/org/apache/hadoop/mapreduce/TaskCounter.java" style="color: #888888; text-decoration: none;">TaskCounter</a> contain the predefined Counters in Hadoop. There are a lot of <a href="http://www.blogger.com/" style="color: #888888; text-decoration: none;">tutorials</a> on incrementing Counters from the Map and Reduce tasks. But how do you fetch the current value of a Counter from within the Map and Reduce tasks?<br /><br />Counters can be incremented using the <a href="http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapred/Reporter.html" style="color: #888888; text-decoration: none;">Reporter</a> in the old MapReduce API or the <a href="http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/Mapper.Context.html" style="color: #888888; text-decoration: none;">Context</a> in the new MapReduce API. These Counters are sent to the TaskTracker, the TaskTracker sends them to the JobTracker, and the JobTracker consolidates them to produce a holistic view for the complete Job. The consolidated Counters are not relayed back to the Map and Reduce tasks by the JobTracker. So, the Map and Reduce tasks have to contact the JobTracker to get the current value of a Counter.<br /><br />This StackOverflow <a href="http://stackoverflow.com/questions/8009802/is-there-a-way-to-access-number-of-successful-map-tasks-from-a-reduce-task-in-an/8013573#8013573" style="color: #888888; text-decoration: none;">Query</a> has the details on how to get the current value of a Counter from within a Map and Reduce task.<br /><br /><b>Edit:</b> Looks like it's not a good practice to retrieve the Counters in the map and reduce tasks. Here is an <a href="http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201112.mbox/%3CCAFE9998.2FEF6%25evans@yahoo-inc.com%3E" style="color: #888888; text-decoration: none;">alternate approach</a> for passing the summary details from the mapper to the reducer. This approach requires some effort to code, but is doable. It would have been nice if this feature were part of Hadoop rather than something to hand-code.</div>
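<div dir="ltr" trbidi="on">
<br />The one-way flow of Counters described above can be illustrated with a small toy model. This is only a sketch in plain Python, not the Hadoop API: the names ToyJobTracker and ToyTask are made up for illustration, and the TaskTracker relay step is collapsed into a single heartbeat call. It shows that a task sees only its own local values unless it explicitly asks the tracker for the consolidated one.</div>

```python
from collections import Counter


class ToyJobTracker:
    """Hypothetical stand-in for the JobTracker: merges per-task counter reports."""

    def __init__(self):
        self.reports = {}  # task_id mapped to the latest counter snapshot from that task

    def report(self, task_id, counters):
        # Each report replaces the previous snapshot for that task.
        self.reports[task_id] = Counter(counters)

    def consolidated(self):
        # Job-wide view: sum of the latest snapshot from every task.
        total = Counter()
        for snapshot in self.reports.values():
            total.update(snapshot)
        return total


class ToyTask:
    """Hypothetical map/reduce task that only sees its own local counters."""

    def __init__(self, task_id, tracker):
        self.task_id = task_id
        self.tracker = tracker
        self.local = Counter()

    def increment(self, name, amount=1):
        self.local[name] += amount

    def heartbeat(self):
        # Push local counters up; nothing is pushed back down to the task.
        self.tracker.report(self.task_id, self.local)

    def job_wide_value(self, name):
        # The task has to ask the tracker explicitly for the job-wide value.
        return self.tracker.consolidated()[name]


tracker = ToyJobTracker()
map_1 = ToyTask("map_1", tracker)
map_2 = ToyTask("map_2", tracker)

map_1.increment("RECORDS", 3)
map_2.increment("RECORDS", 5)
map_1.heartbeat()
map_2.heartbeat()

print(map_1.local["RECORDS"])           # local view of map_1
print(map_1.job_wide_value("RECORDS"))  # job-wide view fetched from the tracker
```

<div dir="ltr" trbidi="on">
Note that the job-wide value is only as fresh as the last heartbeat from each task, which is one reason reading Counters from inside a task is fragile.</div>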
</div>
</div>
Anonymoushttp://www.blogger.com/profile/02735181906041798951noreply@blogger.com0tag:blogger.com,1999:blog-8799030758153304184.post-78810177703177329172014-04-16T23:28:00.004-07:002014-04-16T23:28:18.559-07:00HDFS Name Node High Availability Harini 9505140022<div dir="ltr" style="text-align: left;" trbidi="on">
<h3 class="post-title entry-title" style="background-color: #fff9ee; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 24px; font-weight: normal; margin: 0.75em 0px 0px; position: relative;">
<a href="http://www.thecloudavenue.com/2011/11/hdfs-name-node-high-availability.html" style="color: #888888; text-decoration: none;">HDFS Name Node High Availability</a></h3>
<div class="post-header" style="background-color: #fff9ee; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13px; line-height: 1.6; margin: 0px 0px 1.5em;">
<div class="post-header-line-1">
</div>
</div>
<div class="post-body entry-content" id="post-body-7781188256388813543" style="background-color: #fff9ee; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 15px; line-height: 1.4; position: relative; width: 640px;">
<div dir="ltr" trbidi="on">
<a href="http://wiki.apache.org/hadoop/NameNode" style="color: #888888; text-decoration: none;">NameNode</a> is the heart of HDFS. It stores the namespace for the filesystem and also tracks the location of the blocks in the cluster. The block locations are not persisted by the NameNode; instead, each DataNode reports the blocks it holds to the NameNode when the DataNode starts. If the NameNode instance is not available, HDFS is not accessible until it's back up and running.<br /><br />The Hadoop 0.23 release introduced <a href="http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/Federation.html" style="color: #888888; text-decoration: none;">HDFS federation</a>, which makes it possible to have multiple independent NameNodes in a cluster, wherein a particular DataNode can hold blocks for more than one NameNode. Federation provides horizontal scalability, better performance and isolation.<br /><br />HDFS NN HA (NameNode High Availability) is an area where active work is happening. Here are the <a href="https://issues.apache.org/jira/browse/HDFS-1623" style="color: #888888; text-decoration: none;">JIRA</a>, <a href="http://www.slideshare.net/hortonworks/nn-ha-hadoop-worldfinal" style="color: #888888; text-decoration: none;">Presentation</a> and <a href="http://www.youtube.com/watch?v=D9BOcIYnVrs" style="color: #888888; text-decoration: none;">Video</a> for the same. HDFS NN HA did not make the cut for the 0.23 release and will be part of later releases. Changes are going into the <a href="http://svn.apache.org/repos/asf/hadoop/common/branches/HDFS-1623/" style="color: #888888; text-decoration: none;">HDFS-1623</a> branch, if someone is interested in the code.</div>
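<div dir="ltr" trbidi="on">
<br />The split between the persisted namespace and the reported block locations can be pictured with a toy model. This is a sketch in plain Python under simplifying assumptions, not HDFS code: ToyNameNode, the file path, the block ids and the DataNode names are all hypothetical. It shows a freshly started NameNode that knows which blocks make up a file, but learns where they live only from DataNode reports.</div>

```python
class ToyNameNode:
    """Hypothetical NameNode: namespace survives restarts, block locations do not."""

    def __init__(self, persisted_namespace):
        # file path mapped to a list of block ids; stands in for the persisted namespace
        self.namespace = dict(persisted_namespace)
        # block id mapped to a set of DataNode names; held in memory only
        self.block_locations = {}

    def receive_block_report(self, datanode, block_ids):
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        # Returns (block id, sorted DataNode names) pairs for a file.
        return [
            (block_id, sorted(self.block_locations.get(block_id, set())))
            for block_id in self.namespace[path]
        ]


namespace_on_disk = {"/logs/app.log": ["blk_1", "blk_2"]}

# A freshly started NameNode knows the file's blocks but not where they live.
namenode = ToyNameNode(namespace_on_disk)
before_reports = namenode.locate("/logs/app.log")

# DataNodes report the blocks they hold as they start up.
namenode.receive_block_report("datanode-a", ["blk_1", "blk_2"])
namenode.receive_block_report("datanode-b", ["blk_2"])
after_reports = namenode.locate("/logs/app.log")

print(before_reports)
print(after_reports)
```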
</div>
</div>
Anonymoushttp://www.blogger.com/profile/02735181906041798951noreply@blogger.com0tag:blogger.com,1999:blog-8799030758153304184.post-35648065181753897042014-04-16T23:23:00.000-07:002014-04-16T23:23:26.596-07:00RDBMS vs NoSQL Data Flow Architecture Harini 9505140022<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="background-color: #fff9ee; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 15px; line-height: 21.559999465942383px;">I had an interesting conversation with an Oracle Database expert about the difference between an RDBMS and a NoSQL database. </span><br />
<span style="background-color: #fff9ee; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 15px; line-height: 21.559999465942383px;"><br /></span>
<div class="" style="background-color: #fff9ee; clear: both; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 15px; line-height: 21.559999465942383px;">
In a traditional RDBMS, the data is first written to the database, then to memory. When the memory reaches a certain threshold, it's written to the Logs. The Log files are used for recovery in case of a server crash. Before an RDBMS returns success on an insert/update to the client, the data has to be validated against the predefined schema, indexes have to be created, and other work done, all of which makes it a bit slow compared to the NoSQL approach discussed below.</div>
<div class="" style="background-color: #fff9ee; clear: both; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 15px; line-height: 21.559999465942383px;">
<br /></div>
<div class="" style="background-color: #fff9ee; clear: both; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 15px; line-height: 21.559999465942383px;">
In the case of a NoSQL database like HBase, the data is first written to the Log (<a href="http://hbase.apache.org/book.html#wal" style="color: #888888; text-decoration: none;">WAL</a>), then to memory. When the memory reaches a certain threshold, it's written to the Database. Before returning success for a put call, the data only has to be written to the Log file; there is no need for it to be written to the Database and validated against a schema.</div>
<div class="" style="background-color: #fff9ee; clear: both; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 15px; line-height: 21.559999465942383px;">
<br />Log files (the first step in NoSQL) are only appended to at the end, which is much faster than writing to the Database (the first step in RDBMS). The NoSQL data flow discussed above gives NoSQL databases a higher insert/update rate than an RDBMS.</div>
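<div class="" style="background-color: #fff9ee; clear: both; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 15px; line-height: 21.559999465942383px;">
<br />The log-first data flow can be sketched as a toy model. This is plain Python for illustration only, not HBase code: ToyLogFirstStore and its flush_threshold parameter are hypothetical, and the flush runs inline here for simplicity, whereas a real store would typically do it in the background. It shows a put appending to the log, updating memory, and reaching the "Database" only once the threshold is hit.</div>

```python
class ToyLogFirstStore:
    """Hypothetical log-first store: append to the log, buffer in memory, flush later."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.log = []        # append-only log (stands in for the WAL)
        self.memory = {}     # in-memory buffer of recent writes
        self.database = {}   # flushed data (stands in for the on-disk Database)

    def put(self, key, value):
        self.log.append((key, value))   # 1. append to the log
        self.memory[key] = value        # 2. update the in-memory buffer
        if len(self.memory) >= self.flush_threshold:
            self.flush()                # 3. flush once the buffer is large enough
        return True                     # acknowledge the write

    def flush(self):
        self.database.update(self.memory)
        self.memory.clear()

    def get(self, key):
        # Recent writes are served from memory, older ones from flushed data.
        if key in self.memory:
            return self.memory[key]
        return self.database.get(key)

    def recover(self):
        # After a crash the memory buffer is gone; replay the log to rebuild state.
        self.memory.clear()
        for key, value in self.log:
            self.database[key] = value


store = ToyLogFirstStore(flush_threshold=3)
store.put("row1", "a")
store.put("row2", "b")
print(len(store.log), len(store.memory), len(store.database))

store.put("row3", "c")  # third write reaches the threshold and triggers a flush
print(len(store.log), len(store.memory), len(store.database))
```

<div class="" style="background-color: #fff9ee; clear: both; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 15px; line-height: 21.559999465942383px;">
The recover method hints at why the log comes first: anything that was only in memory at crash time can be rebuilt by replaying the log.</div>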
</div>
Anonymoushttp://www.blogger.com/profile/02735181906041798951noreply@blogger.com0tag:blogger.com,1999:blog-8799030758153304184.post-74583140220427771082014-04-16T23:16:00.002-07:002014-04-16T23:21:40.423-07:009505140022 Harini<div dir="ltr" style="text-align: left;" trbidi="on">
<h3 class="post-title entry-title" style="background-color: #fff9ee; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 24px; font-weight: normal; margin: 0.75em 0px 0px; position: relative;">
<a href="http://www.thecloudavenue.com/2012/12/introduction-to-apache-hive-and-pig.html" style="color: #888888; text-decoration: none;">Introduction to Apache Hive and Pig</a></h3>
<div class="post-header" style="background-color: #fff9ee; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13px; line-height: 1.6; margin: 0px 0px 1.5em;">
<div class="post-header-line-1">
</div>
</div>
<div class="post-body entry-content" id="post-body-3237108795296235433" style="background-color: #fff9ee; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 15px; line-height: 1.4; position: relative; width: 640px;">
<div dir="ltr" trbidi="on">
Apache <a href="http://hive.apache.org/" style="color: #888888; text-decoration: none;">Hive</a> is a framework that sits on top of Hadoop for doing ad-hoc queries on data in Hadoop. Hive supports HiveQL, which is similar to SQL but doesn't support the full set of SQL constructs.<br />
<br />
Hive converts the HiveQL query into a Java MapReduce program and then submits it to the Hadoop cluster. The same outcome can be achieved using HiveQL or Java MapReduce, but Java MapReduce requires a lot more code to be written and debugged than HiveQL. So, using Hive increases developer productivity.<br />
<br />
To summarize, Hive, through the HiveQL language, provides a higher-level abstraction over Java MapReduce programming. As with any other high-level abstraction, there is a bit of performance overhead in using HiveQL when compared to Java MapReduce. But the Hive community is working to narrow this gap for most of the commonly used scenarios.<br />
<br />
Along the same lines, Pig provides a higher-level abstraction over MapReduce. Pig supports PigLatin constructs, which are converted into a Java MapReduce program and then submitted to the Hadoop cluster.</div>
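<div dir="ltr" trbidi="on">
<br />To get a feel for the amount of work a one-line query hides, here is a toy sketch in plain Python of the map, shuffle and reduce steps behind a simple grouped count. It is not what Hive or Pig actually generate; the page_views table, its rows and the function names are hypothetical and exist only for illustration.</div>

```python
from collections import defaultdict

# The query being imitated (hypothetical table name):
#   SELECT page, COUNT(*) FROM page_views GROUP BY page;

# Hypothetical rows of the page_views table: (user, page)
rows = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/cart"),
    ("carol", "/home"),
]


def map_phase(row):
    # Emit (page, 1) for every row, like the mapper of a counting job.
    _user, page = row
    yield page, 1


def shuffle(pairs):
    # Group values by key, which the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Sum the ones emitted for each page.
    return key, sum(values)


mapped = [pair for row in rows for pair in map_phase(row)]
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(result)
```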
<div dir="ltr" trbidi="on">
<br /></div>
<div dir="ltr" trbidi="on">
Harini 9505140022</div>
</div>
</div>
Anonymoushttp://www.blogger.com/profile/02735181906041798951noreply@blogger.com0