TechMilind

Reactions to Pivotal HD

We, at Greenplum, announced Pivotal HD on February 25, 2013, including HAWQ, the superfast Greenplum MPP SQL engine running directly on top of the Hadoop Distributed File System. It became the “talk of the town” at the O’Reilly Strata Conference. Almost every presenter at Strata who talked about their “SQL” on Hadoop got asked, “How does it compare to HAWQ?”

The blogosphere, analysts, news outlets, and competitors have been busy responding, too. Here is a compilation of some of the reactions that I have read, grouped by category, in no particular order:

EMC/Greenplum:

  • EMC Press Release
  • Live from “Hadoop: The Foundation for Change”
  • Introducing Pivotal HD
  • Demonstrating High Performance Future of Hadoop at Strata
  • HAWQ: The New Benchmark for SQL on Hadoop
  • How Hadoop Can Disrupt the Database Industry
  • Dear BI Users: Your Hadoop SQL Wish Has Finally Come True
  • Hadoop is Growing Up
  • New Pivotal HD Unlocks Hadoop’s Big Data Potential 

Competitors:

  • Hortonworks: Separating Open Source Signal from Enterprise Hadoop Noise
  • Hortonworks: Did EMC Just Say Fork You To The Hadoop Community?
  • Cloudera: Open Source, Flattery, and The Platform for Big Data
  • MapR: Insights from Strata Big Data Conference
  • Steve Loughran (Hortonworks): Enterprise Hadoop: yes, but how are you going to fix it? 

Analysts:

  • Tony Baer: SQL Collides into Hadoop
  • Taneja Group: Is Hadoop the New Data Center Platform for All Data ? 
  • iStockAnalyst: Vmware-EMC’s Pivotal HD: Negative for RedHat, Inc.
  • Ventana Research: EMC Looks to be Pivotal for Big Data
  • Wikibon: EMC Integrates Greenplum DB and Hadoop with Pivotal HD
  • Redmonk: Why Open Source Matters, and the Limits of Pivotal HD
  • Forbes: Why SQL Matters, the Limits of Open Source, and Other Lessons of EMC Greenplum’s Pivotal HD
  • Merv Adrian: Open Source “Purity”, Hadoop, and Market Realities
  • Matthew Aslett: What it means to be “all in” on Hadoop

News Outlets:

  • CMSWire: EMC Introduces Pivotal HD - May Be the Most Powerful, Friendly Hadoop Distribution Yet
  • GigaOm: EMC to Hadoop competition: “See ya, wouldn’t wanna be ya.”
  • Wired: Why Hadoop Is the Future of the Database
  • InformationWeek: EMC Brings Data Analysis Breakthrough to Hadoop
  • CIO: EMC Greenplum Tackles Big Data With Hadoop Distribution
  • Slashdot BI: EMC insists that technologies baked into its new Pivotal HD make the latter a speedy data-crunching platform
  • DataCenter Knowledge: EMC Supercharges Hadoop
  • V3.co.uk: EMC boosts Hadoop with Pivotal HD rollout
  • CTR: EMC debuts new distribution of Apache Hadoop: Pivotal HD
  • GeekZone: EMC introduces new Hadoop distribution Pivotal HD
  • Register: EMC touts screeching HAWQ SQL Performance on Hadoop

Others:

  • [Cisco] Cisco to support EMC Pivotal HD on UCS
  • [Cirro] Cirro Announces Support for EMC’s Pivotal Hadoop Distribution
  • [Many] Praise for Pivotal HD
    • #hadoop
    • #greenplum
    • #pivotal
    • #EMC
    • #open-source
    • #apache
  • 2 months ago

My Personal Computer History

In less than 18 months from today, our undergraduate class of 1985, from BITS Pilani, will celebrate 25 years since we graduated. I plan to go to the silver jubilee celebrations planned for October 2014. I hope several of my batchmates dispersed across the globe will try and join this big occasion.

Looking back at the four years (1985-1989) I spent studying Computer Science at BITS, it may seem similar to the first four years several young kids spend for the first time away from the comfort of home. However, for me, and several others, this was the period when we experienced seismic shifts in computing systems.

Until just a week ago, I did not consider myself to have been part of this “great generation”, the way many alumni from many schools trumpet themselves at such reunions. Sure, every batch of students admitted to any college will be touting themselves as the most fortunate to be living in those times, twenty-five years later. Immediately upon graduation, there is a feeling of “Been there, done that”. Somehow, on anniversaries that are 5-year multiples, the history improves gradually. For 10-year multiples, the history accommodates an increasing percentage of fiction, and it peaks on 25-year multiples. I totally understand history and story-telling getting fused around such milestones, because we become conquerors of time, and conquerors (re)write history.

But, because I imagine myself to be a data-driven person, I have tried to keep history as fact-based as possible, perhaps to the point that I am the most boring person to have around at such reunions, questioning others’ rosy interpretations of those times. I mention, all the time, our dissatisfaction with teachers and our hatred of hard tests, surprise quizzes, and exam-driven night-long studying of completely useless trivia. Those were not the best years of our lives, as those who view personal histories through rose-tinted glasses claim. Countless hours of our formative years were lost in memorizing trivia, which never helped us in real life in our productive years.

Except for the Computer Science class of about 40 students, who graduated in 1989 with a weird-sounding MSc(Tech) degree in CS.

I think we were the best CS batch in the entire history of that esteemed institution. Not because we were the best coming in. Not because the quality of teachers was somehow the best (far from it.) Not because our curriculum was the best (again, far from it.)

But, because we were fortunate to experience the computing systems that spanned three generations, within a four-year period.

And I realized this upon visiting the Computer History Museum in Mountain View, CA recently. My wife’s cousin and his wife were visiting us from New Jersey. They had visited the Bay Area many times previously, and had seen all the touristy places before. So, our challenge was to find other interesting spots in the Bay Area that they had not visited and that would interest them. The Computer History Museum was a natural choice for us. It took us all of three hours to realize that it was the best choice we could have made.

There, I saw the IBM 1130 machine, that I first learned to program on, in 1986, in BITS Pilani.

In the first semester of my second year in BITS, every undergraduate student from all disciplines had to take a computer programming course. The primary programming language was Pascal. And the textbook for that course was one of the most boring texts, all set in a fixed-width Courier font, written by an ex-professor (and two of his students) from BITS. After the semester started, I finished reading that book in 3 days, and wrote several Pascal programs on paper during the first month. I would go to the IPC (the computer center, called Information Processing Center) after dinner, punch out all those programs on stacks of punch cards, submit them to the operator at 11PM, just before the IPC closed, and go the next day at lunch time to pick up the output in the form of large perforated papers. The input data, the program, and the output were all printed on those papers.

In the remaining 4 months of that semester, having exhausted all the programming assignments from the textbook, I had to come up with many programming tasks on my own. Fortunately, I found a nice book on partial differential equations, and solving them on a computer seemed like a good use of my time. (Other courses that I had to take during that semester included general biology, complex algebra etc.) The sample code in that book was in Fortran 77. So, I had to learn that language, and translate those codes to Pascal. I remember learning about column-major and row-major matrices, wondering why each language made a different ordering choice, and getting frustrated by bugs caused by those orderings.
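
The ordering difference can be sketched in a few lines. This is a minimal illustration in Python, not the code from that book, and the function names are made up for the example. Fortran stores a two-dimensional array column by column, while Pascal stores it row by row, so the same (i, j) lands at a different offset in a flat block of memory.

```python
def row_major_offset(i, j, nrows, ncols):
    """Offset of element (i, j) when each row is contiguous (Pascal, C)."""
    return i * ncols + j


def column_major_offset(i, j, nrows, ncols):
    """Offset of element (i, j) when each column is contiguous (Fortran)."""
    return j * nrows + i


# The 2x3 matrix [[1, 2, 3], [4, 5, 6]] laid out both ways (0-based indices).
nrows, ncols = 2, 3
matrix = [[1, 2, 3], [4, 5, 6]]

row_major = [0] * (nrows * ncols)
column_major = [0] * (nrows * ncols)
for i in range(nrows):
    for j in range(ncols):
        row_major[row_major_offset(i, j, nrows, ncols)] = matrix[i][j]
        column_major[column_major_offset(i, j, nrows, ncols)] = matrix[i][j]

print(row_major)     # [1, 2, 3, 4, 5, 6]
print(column_major)  # [1, 4, 2, 5, 3, 6]
```

Translating a loop nest from one language to the other without swapping the index arithmetic reads the right values from the wrong places, which is the kind of bug described above.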

I enjoyed punching cards, submitting them to the operator, and getting the result back! Because of the lack of real-time feedback, eliminating syntax errors and thinking through the code before submitting were the skills I had to master. But soon, I realized that I need not have spent much time on them, because in the next semester, my beloved IBM 1130 was replaced by a minicomputer, the HP 1000. That had 8 CRT terminals, where one would type the program, and immediately try to compile and run it!

This was huge! Everyone had a login name and a password, and the operator was not involved at all in getting your program to run!

There was a guy a couple of years senior to me (I think his name was Suresh, but I am not sure now), who was involved in installing and setting up our HP 1000. He discovered that there was a full-fledged microprocessor in the terminal itself, that could be programmed. We found the instruction set of that microprocessor, and started programming it. So, even though we were reserving time slots on the terminals of the HP 1000, we were not running programs on its operating system (RTE-6), but rather building simple games and editors on the terminal microprocessor itself. Our computer operators discovered it, and banned us from using the HP 1000 for a week.

In the summer of my second undergraduate year, we had to undergo a 3-month long internship, called “Practice-School-1”. Ten of us were fortunate enough to get an internship at the Space Application Center in Ahmedabad. (Grade point average for the first two years was the criterion for selection, so we made sure to get good grades in seemingly irrelevant courses such as general biology, and even a management course.)

In those three months at the Space Application Center, I was fortunate to find a project with a great mentor, K. S. Dasgupta, who was getting frustrated trying to find computing time on the center’s shared VAX-11/780 for his image processing tasks. He had an IBM PC-XT in his lab, and wondered why he could not use it for analyzing satellite images instead of waiting in a queue to use the VAX machine. So, a couple of us started porting his image processing programs, originally written in Fortran, to Turbo Pascal on the IBM PC-XT. Even though they ran half as fast on the IBM PC (on its Intel 8088 processor), not having to wait a day for a time slice on the VAX was worth it. Another group of my fellow students got a project writing a “file-system” on a tape drive for storing and fetching satellite images. The tape drive had a Zilog Z-80 microprocessor (which was instruction-set compatible with Intel 8085). I enjoyed helping them, learning the instructions in hex code in the process. I still remember C3 as a “call”, and C9 as a “RET”, and several other codes.

When we came back to BITS Pilani for our third year, indeed, we had a PC Lab, with 8 PC-XTs. Having programmed the Z-80 in hex instructions during the summer, I realized that programming in assembly language was much easier, and so I launched a side project: writing an IDE for the 8085 using Turbo Pascal’s text-based graphics capabilities. The PC lab was a shared environment, where one had to book a 2-hour slot, and vacate it when the reservation expired. So, I had to write all the code in advance in a notebook during the day, and type it in, debug it, and run it during those two hours a day. This was a fun exercise, and once again, eliminating logic errors before typing the code into a computer was of utmost importance to make the most of those two hours of computer time.

One day, after I finished this IDE for 8085 assembly language, a fourth-year student came to talk to me. He had taken an elective course on Microprocessors, and had to submit a complete project to get a grade for that course. His project had several bugs, and he did not have time to finish it. So, he “bought” my assembler from me for the princely compensation of 2 plates of samosas and coffee at “Bank Canteen”, and submitted it as his project to get an A in that course. That was my first software sale.

C and Unix were introduced to us in the systems programming course in our third year, second semester. The machine running Unix was a new addition to our lab. It was built by a company called HCL, and was called Magnum. It was a six-processor shared-memory machine. My first parallel computer! It ran a modified version of Microsoft Xenix, where user processes ran on 5 of those processors, and the sixth (or 0’th) processor was reserved for the operating system. Any time a process made a system call, it was scheduled on processor 0 to wait for the system call to complete. Obviously, processor 0 became a bottleneck, so getting your programs to run well on the HCL Magnum involved batching system calls, and doing as much as one could at user level.
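
The batching idea still applies on any system where a system call is expensive. Here is a minimal sketch in Python, assuming nothing about the Magnum itself: the function names and the 4096-byte buffer size are invented for this illustration. Both functions write the same records, but the second accumulates them in user space and issues far fewer write() calls.

```python
import os
import tempfile


def write_unbatched(fd, records):
    """Issue one write() system call per record; return the number of calls."""
    calls = 0
    for record in records:
        os.write(fd, record)
        calls += 1
    return calls


def write_batched(fd, records, batch_bytes=4096):
    """Accumulate records in user space; write once per full buffer."""
    calls = 0
    buffer = bytearray()
    for record in records:
        buffer += record
        if len(buffer) >= batch_bytes:
            os.write(fd, buffer)
            calls += 1
            buffer.clear()
    if buffer:  # flush whatever is left over
        os.write(fd, buffer)
        calls += 1
    return calls


records = [b"line %d\n" % n for n in range(1000)]

with tempfile.TemporaryFile() as f:
    unbatched_calls = write_unbatched(f.fileno(), records)
with tempfile.TemporaryFile() as f:
    batched_calls = write_batched(f.fileno(), records)

print(unbatched_calls, batched_calls)
```

On a machine that funnels every system call through one processor, the first version would queue up on that processor a thousand times, and the second only a handful of times.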

From IBM 1130 punch cards to the 6-processor SMP HCL Magnum in less than 2 years was the journey that I and all my BITS batchmates took. I think it taught us a lot about the breakneck speed of evolution of computer technology, and therefore, I think our batch was the luckiest to be in BITS Pilani.

After those four years, I moved on to DEC-20, Convex, HP and Sun workstation networks at IIT Kanpur; India’s first indigenously built supercomputer, PARAM 8000 (using Inmos T800 chips), and PARAM 8600 (which combined Intel i860 processors and Inmos T800) at C-DAC; CM-5, IBM SP-2, Convex Exemplar, SGI Powerchallenge, Origin 2000, Cray T3E, ASCI Red while at UIUC; and now on to huge clusters of commodity servers running Hadoop at Yahoo, Linkedin, and Greenplum.

I have been fortunate to be hands-on with the last 25 years of computer history. I consider myself to be part of not only the best CS batch in BITS Pilani, but also the best generation to witness this computing revolution.

    • #bitspilani
    • #computerhistorymuseum
    • #ibm1130
    • #pc
    • #hclmagnum
  • 4 months ago

Important Data Scientist Trait

It is often said that ingenuity (aka resourcefulness) is the most important trait in great data scientists. I am often asked about real-world examples of such ingenuity. Unfortunately, almost all real examples of ingenuity among successful data scientists that I have witnessed reveal inside information about customers and partners with whom I have worked, in various companies, over the past few years. So, obviously, I could not give many compelling examples of ingenuity in my public talks on the subject so far.

However, I recently came across a great example of ingenuity in a “data scientist” in the book “The Information” by James Gleick. This is a well-known example, in fact, and almost 170 years old, from a time when data, let alone data science, was an alien concept.

Imagine it’s 1840 AD and someone called Samuel Morse is working towards fast transmission of information with a hot new technology, called the telegraph. You are Alfred Vail, who is working on minimizing the number of strokes on this small mouse-trap-like contraption that can transmit a dot, a dash, and can pause (Morse code is ternary, not binary.)

Your focus is on information encoded in 26 characters plus 10 digits. You determine that the most frequently used characters should need the fewest strokes on this contraption, in order to allow efficient transmission by human operators on a low-bandwidth channel.

In 1840, how do we go about determining which characters are used most in information that has to be transmitted using these 36 symbols?

If you had to do this today, you would download the entire English Wikipedia in a few minutes, write a 5-line Pig script, execute it in a few seconds on your 1000-node Hadoop cluster, come up with a character sequence in descending order of frequency, and apply the Huffman coding scheme to derive the optimal encoding. Done.
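
Today’s version of the exercise can be sketched on a single machine. The following is a minimal sketch in Python, not the Pig script mentioned above; the sample sentence merely stands in for a real corpus, and the function names are invented for this illustration. Note that Huffman coding yields a binary prefix code, which Morse code is not, so this only illustrates the “frequent symbols get short codes” idea.

```python
import heapq
from collections import Counter


def symbol_frequencies(text):
    """Count letters and digits, ignoring case and every other character."""
    return Counter(ch for ch in text.upper() if ch.isascii() and ch.isalnum())


def huffman_code(frequencies):
    """Return {symbol: bit string} for a binary Huffman code."""
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}); the
    # unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(weight, n, {symbol: ""})
            for n, (symbol, weight) in enumerate(sorted(frequencies.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:  # a single symbol still needs one bit
        return {symbol: "0" for symbol in heap[0][2]}
    tie_breaker = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # the two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (w1 + w2, tie_breaker, merged))
        tie_breaker += 1
    return heap[0][2]


# A stand-in sentence, NOT a representative corpus.
sample = "the quick brown fox jumps over the lazy dog while eleven eels sleep"
frequencies = symbol_frequencies(sample)
codes = huffman_code(frequencies)

for symbol, count in frequencies.most_common(5):
    print(symbol, count, codes[symbol])
```

Vail’s shortcut replaces the `symbol_frequencies` step: the printer’s type cases already held the counts.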

But, remember, this was 1840 AD. There was no Hadoop, no clusters, no readily analyzable text corpus, no computers, even.

But Alfred Vail had ingenuity, the most sought after trait in data scientists.

As the book, that ought to be the Bible/Koran/Geeta/Torah of Data Scientists, “How to Measure Anything” states: “Do not assume your problem is unique. Surely a similar problem has been solved before. Look for it. (paraphrasing).”

Vail visited a local newspaper printing press, in Morristown, NJ. He counted the movable type in the type-cases for each character there. After all, a newspaper printing press must have known the proportion of letters needed to print the daily newspaper, aka information to be communicated, and optimized the selection by not having a thousand Zs vs a hundred Es, right?

Recently, a study of the optimality of Vail’s encoding showed that it came within 15% of optimal. Not bad for a 170-year-old achievement.

Looking for similar problems that have already been solved, to avoid doing the unnecessary grunt work, is the primary component of data scientist ingenuity. Do the full-scan-based work only when absolutely essential. That’s what’s important to be a successful data scientist.

(p.s. wasn’t it Larry Wall of Perl fame who first said “Laziness is a virtue” in programmers? Well, it’s a virtue in Data Scientists as well. Not quite QED, but it suggests to me that Data Science should be a sub-discipline of computer programming.)

    • #datascience
    • #hadoop
    • #ingenuity
  • 10 months ago

Big Data Benchmarking Community

The last few months have been pretty hectic, but satisfying. The first ever Workshop on Big Data Benchmarking, on May 8-9, 2012, will soon become a reality. We are very excited about this effort. Thanks to my co-organizers, and all the participating organizations. Here is my effort to build a logo-cloud (logo sizes have been normalized to 120x60, and are in no particular order) of organizations who have confirmed their participation in this effort. I just searched for “<organization-name> logo” on Google images, and inserted the best image I could find. Thanks everyone.

Organizers:

Center for Large Data Systems, greenplum.com, Cisco, Seagate, Brocade, Hathitrust, Oracle, Mellanox, University of Toronto, NSF

Participants:

University of Washington, Red Hat, facebook, UC Irvine, Cloudera, HP, MapR, Whamcloud, Teradata, Intel, SDSC, Shell, AMD, Convey Computer, SAS, Twitter, Actian, BMMSoft, MonetDB, CA Labs, Hortonworks, Scripps, Netflix, Johns Hopkins, vmware, Linkedin, netapp, Dell, Microsoft, huawei, Google

  • 1 year ago

Workshop on Big Data Benchmarking

A constant debate in the nascent Big Data products space is how performant is *your product* versus *my product*. We take one workload, say, Terasort benchmark, and demonstrate that when Tera is interpreted as 100 Giga, someone with an expensive 100 Giga product is faster than a cheaper Tera. Anyone with a basic background in system architecture may say this is trivial, but by the time such a conclusion reaches the new broadcast media, aka Twitter, the agile analysts have already declared winners.

There is something wrong in this scenario, don’t you think ?

I have been dealing with such misinformation in the HPC world for the last 20 years (LINPACK, anyone?). So, when I met Dr. Chaitan Baru in 2008, and first discussed the need to have a rigorous benchmarking suite for Big Data, I was pleased to find a respected academic researcher who shared my view. When we met again at the SC’11 conference, and he described the effort he was leading towards a definitive big data benchmarking suite, I immediately joined that effort.

The first outcome of this collaboration is going to be an invitation-only workshop to be held in the Bay Area, on May 8-9, 2012. I am on the program committee, along with several leads on Big Data from UCSD, Cisco, Brocade, Oracle, Mellanox and several other industry and academic sponsors. NSF (National Science Foundation) will be a co-sponsor of this workshop.

Major Big Data workloads are from sensors (including software sensors producing logs), documents, science, and graphs. I will be leading the graph benchmarks area. I have already had preliminary discussions with data scientists at Linkedin, Facebook, and Twitter; owners of large human graphs. I had contacted graph researchers in Google and Yahoo, but for some reason, they never responded to my pings. (At least from Yahoo, I have received intent of participation out of band, but nothing so far from Google.)

My pet peeve for the last several years is that vendor-defined benchmarks are useless, since they get skewed towards the vendor’s features. Having an independent academic entity, such as the Center for Large Data Systems (CLDS), based in the San Diego Supercomputer Center and led by respectable, neutral researchers such as Dr. Chaitan Baru and Dr. James Short, is the best possible avenue for defining a suite of benchmarks that is actually representative of real workloads.

The expected outcome of this workshop is a benchmark suite for Big Data, that will be proposed as one of the TPC benchmark suites to the TPC technical committee, collocated with the VLDB 2012 conference in Istanbul, in August 2012.

I am very excited about the prospect of defining an industry standard big data benchmark that has a very good chance of being ratified by TPC as *the* big data benchmark.

No more running a 100GB sort in a terabyte of RAM, and declaring yourself the winner!

Join me in restoring sanity to big data product comparison.

    • #hadoop
    • #bigdata
    • #benchmarks
  • 1 year ago

Help me select panelists

GITPRO is a global networking platform for Indian Tech Professionals for their professional and self development and for contributing back to the development of India and their respective host society with their knowledge, skills and resources.

The GITPRO World 2012 Conference will be focused on Emerging Technologies and Opportunities for professionals and entrepreneurs. This conference will provide a unique opportunity to understand the wave of emerging technologies that will matter in 2012. This understanding will give attendees leverage to succeed in their professions and in entrepreneurship. The conference will also give them the opportunity to meet industry leaders, executives, and fellow professionals from various leading companies and startups.

As part of this conference, I have been asked to lead the panel discussion on Big Data, Hadoop and NoSQL Ecosystem. This panel discussion will be held in Mountain View, CA (most probably at Computer History Museum) on Saturday, February 18, 2012. It will be an hour long discussion with approximately 35 minutes of Q&A initiated by the moderator (me), before the floor is opened to the audience.

There will be 4-5 panelists, excluding the moderator. The panelists would represent the leaders in this ecosystem, and should be illuminating and entertaining, the kind one would pay real money to listen to.

So, please help me choose the panel. You can comment on this post with your recommendations, or DM me on twitter (@techmilind).

(Note: Even if GITPRO is an association of Indian tech professionals, panelists do not have to be Indian tech professionals.)

    • #hadoop
    • #nosql
    • #gitpro
    • #bigdata
  • 1 year ago

Cloud/Hadoop Security Grand Challenge

A couple of years ago, at the OpenCirrus (https://opencirrus.org/) summit, I was approached by a Professor from Carnegie Mellon University’s (CMU) Qatar campus (http://www.qatar.cmu.edu/) to conduct my Hadoop training sessions there. When I expressed interest, and we started talking, he mentioned that the Govt of Qatar had expressed a lot of interest in using the so-called Big Data technologies (i.e. Hadoop) in analyzing local Oil and Gas exploration data. As our discussions progressed on the sidelines of the summit, he also casually mentioned that the leakage of Oil and Gas exploration data is a criminal offense in Qatar, punishable by the death penalty!

Needless to say, I haven’t yet been to Qatar to train potential Hadoop programmers in Oil and Gas industry. :-)

It has been two years since that conversation, but it came to the top of my mental stack a couple of times in the last few weeks.

First, when, on the Apache Hadoop mailing list, there was a post accusing a closed-source distributed file system of being “insecure”, and claiming Apache Hadoop to be completely secure.

Second, while listening to a panel discussion on “HPC in Cloud” at the Supercomputing 2011 conference in Seattle, where several panelists, and many in the audience, named security as one of the prime concerns for serious HPC users moving to the public cloud.

In this panel, I also mentioned the role of Grand Challenge problems (http://en.wikipedia.org/wiki/Grand_Challenge) and awards to propel innovation. Looking back, many innovations in the HPC space, from vendors, academia, and government labs resulted as an answer to the grand challenge problems.

One way to bring Cloud/Hadoop security to the forefront, and to propel innovation in that space, is to construct a grand challenge problem for it, and handsomely reward those who win this grand challenge.

So, this is the grand challenge: Upload the Oil and Gas exploration data from Qatar, un-encrypted, to a public Hadoop cluster or public cloud, and keep it there for one year. Announce this fact all over the world. Do not officially give access to this dataset to anyone other than the person who uploaded it, i.e. chmod it 0700. And have the responsible person(s) reside in Qatar for that duration, after getting them to waive deportation from Qatar, and getting them to agree to be prosecuted under Qatar laws.

I urge those claiming that Apache Hadoop is secure, or that the Public Cloud is secure, to take this grand challenge.

(As much as I love Apache Hadoop, my position has always been that it is “securable”, not secure. Almost every computer system that I have used for the last 20 years, from my windows PC to my iPhone, is securable, but not secure. Security is achieved by processes, not software. Those who claim their software to be secure should find this grand challenge easy.)

  • 1 year ago

Inflection points for Hadoop

Inflection points in the context of history are points on the X-Axis, especially important and relevant when the X-Axis represents a timeline. When the Y-Axis is adoption for a new technology, such as Hadoop, these inflection points show what feature went in there that increased adoption of this new technology.

Since I have been involved with Hadoop since its first inflection point (being adopted by Yahoo), I thought I should do a little analysis on Hadoop usage on the Y-Axis and timeline (or rather, feature set) on the X-Axis.

Inflection points are identified by change of slope of the above-mentioned graph. If I really were a “data-scientist”, I would have produced numbers to back up claims. But I am just a historian here, so I can just say things, and based on my years of involvement in Hadoop, readers are just supposed to agree :-)

So, here it goes:

The first inflection point in Hadoop came with adoption at Yahoo in March 2006. This is when there was a 600-node cluster (called kryptonite), running Hadoop 0.1 that anyone in Yahoo could get access to, if they wanted. The problem was, not a lot of people within Yahoo knew that they wanted (or needed) access to this service. A few of them did, and to our surprise, they were not from the search team, as we had expected. They were from the team called the Applied Research team. (Almost all of this team is in Microsoft now.) So, their need proved the need for Hadoop in Yahoo. Anyone who claims that Yahoo developed Hadoop for Yahoo search (in particular Webmap) is both right (yes, that was the primary need, and it even had a project name called W1W), and wrong (W1W happened finally in 2009, much later than when Hadoop was already widespread).

The second inflection point for Hadoop was the introduction of Hadoop Streaming. This was a vision of Arkady Borkovsky, now a CTO at Yandex Labs, allowing common unix tools such as grep, sed, awk etc to be used with Hadoop. He was passionate about it, and donated a full-time engineer from his team, Michel Tourn, to build it. That attracted a lot of people from the Applied Research team in Yahoo to port their existing applications to Hadoop. Because now, all that one had to do was write a sequential program that reads lines from stdin, and writes lines to stdout. (I am told that even a program written in C# can do that!)
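
The contract described above (read lines from stdin, write lines to stdout) can be sketched as follows. This is a minimal illustration in Python, not code from Hadoop or from Yahoo; the function names are made up, and the `run_locally` helper only imitates, on one machine, the sort-by-key step that the framework performs between the map and reduce programs. In a real streaming job each function would live in its own script that iterates over `sys.stdin` and prints its results.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


print(run_locally(["to be or not to be"]))
# ['be\t2', 'not\t1', 'or\t1', 'to\t2']
```

Neither function knows anything about clusters or file systems, which is what made porting existing sequential tools so easy.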

The third inflection point came from outside of Yahoo!. Tom White (who was not at Cloudera then) decided to write a file system interface for Amazon S3, and made hadoop map-reduce run inside Amazon EC2. With these patches, Hadoop could run “On-Demand” for anyone not having their own infrastructure. Today, when people equate “Hadoop” to “Cloud”, they need to be thankful to Tom, because he made it happen. I remember being part of a discussion where Yahoo! sponsored his Hadoop book project with O’Reilly. One correction I want to make to all the historians out there is that Tom was not in Cloudera when he started writing that “Hadoop Definitive Guide”. In fact, I just confirmed with him today, that when he came to my Hadoop tutorial at Apachecon 2008 in New Orleans, he was an independent contributor to Hadoop, based in UK. He joined Cloudera later. (Edit: Thanks Tom for correcting me. At Apachecon 2008, Tom had already joined Cloudera, but was working from UK.)

The fourth inflection point, I think, is the formation of Cloudera. The founders included Christophe Bisciglia, who was featured on the cover of Business Week because he convinced Eric Schmidt (the CEO of Google, now an investor in Christophe’s company, Odiago) that Hadoop is a good recruiting tool for Google: since it is based on Google’s infrastructure, and is Open Source, Google does not have to open up its own code, and still gets bright students already trained on a distributed file system and a mapreduce framework. The second founder was Jeff Hammerbacher, who ran the Facebook data team, hired a lot of ex-Yahoo folks, and got them to build and open-source Hive. The third founder was Amr, who ran the search metrics team at Yahoo. The last project he did at Yahoo was to choose a vendor for a query called “Monster Query” (as it was known in GridMix2). This was a huge Click-View join, where Hadoop, Oracle, Greenplum DB, Aster etc competed. One of my team members, as well as the performance architect for Hadoop, were dedicated to this project, and proved that given enough nodes, Hadoop can actually beat all these commercial products. So, Christophe, Jeff, and Amr formed Cloudera, and started marketing Hadoop. That increased the visibility of Hadoop a lot. (Edit: Thanks again to Tom for pointing out that Mike Olson, the current CEO of Cloudera, was the fourth founder.)

I think these were the four inflection points for Hadoop that changed the slope of its adoption.

I will conclude this post with my prediction of the fifth inflection point. I think the refactoring of the mapreduce framework, which separated job scheduling from resource allocation, popularly known as YARN (yet another resource negotiator), will be the fifth inflection point. In three of the last four inflection points, there was a company that benefited from that move. This fifth inflection point is different, because the entire community is going to benefit from it.

I am very happy to be contributing to yet another hockey stick adoption curve.

(Edit: Thanks Anonymous for pointing out that my point-counting was wrong. I wasn’t using hadoop for counting :-)

  • 1 year ago

Modeling with Hadoop Tutorial Slides

For as long as I can remember, Hadoop tutorials and seminars began and ended with the WordCount example. (Not that there is anything wrong with that :-). As Peter Norvig has pointed out on several occasions, we may call it Statistical Language Models or whatever, but in reality, we are just counting.) But for most people, making the leap from the WordCount example to real-world use cases of Hadoop requires a lot of squinting, and a lot of imagination.

Late last year, when the KDD2011 tutorial organizer, Deepak Agarwal, emailed Vijay K Narayanan (Principal Scientist at Yahoo! Labs) and me about the possibility of a Hadoop tutorial at KDD2011, we proposed that we would include several real-world use cases of Hadoop, and not have WordCount mentioned even once during the tutorial (I had to rename that example to Unigrams :-)

It was a 3-hour tutorial, split into 3 parts. In the first part, I gave an overview of Hadoop, and discussed HDFS and the MapReduce framework in brief. Since many Machine-Learning researchers are using MapReduce in spite of its known deficiencies (especially when it comes to fine-grained iterative algorithms), I included a section about Next-Generation MapReduce (which, IMHO, is wrongly named; it should be called “Next Generation Resource Management”), which allows multiple parallel programming paradigms to co-exist with MapReduce on the same multi-tenant clustered resources. (Thanks to @acmurthy for sharing his powerpoint. It made my life easier.)

In the second part, Vijay described several real-world modeling algorithms, and how they are implemented using MapReduce in Hadoop. He covered Ensemble methods (random subspace bagging, robust subspace bagging, COMET), Statistical Query Models (SQM) with many classification and regression methods (linear, naive bayes, logistic regression, support vector machines, decision trees, k-means and canopy clustering).

In the third part, attendees did a hands-on exercise, training a linear regression model on a real dataset from the Marine Resources Division at the Department of Primary Industry and Fisheries, Tasmania. The task was to predict the age of abalone as a function of physical attributes. Since the dataset was small, we added Gaussian noise and replicated the dataset ten-fold, to give attendees a taste of real-world big datasets. These hands-on exercises were carried out on Greenplum HD (Community Edition) running in a VM.
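For readers who want a feel for that exercise without the slides, here is a minimal sketch in plain Python. It is not the tutorial's code: the data is synthetic (not the abalone dataset), the function names are made up for illustration, and the "map" and "reduce" steps are ordinary function calls rather than Hadoop jobs. It replicates a small dataset ten-fold with Gaussian noise, then fits a linear regression by summing per-partition statistics (X^T X and X^T y) and solving the normal equations once at the end.

```python
import random

def replicate_with_noise(rows, copies, sigma, rng):
    """Return `copies` noisy copies of every (features, target) row."""
    out = []
    for _ in range(copies):
        for features, target in rows:
            noisy = [x + rng.gauss(0.0, sigma) for x in features]
            out.append((noisy, target + rng.gauss(0.0, sigma)))
    return out

def map_partial_sums(partition):
    """Map step: accumulate X^T X and X^T y for one partition."""
    d = len(partition[0][0]) + 1          # +1 for the intercept term
    xtx = [[0.0] * d for _ in range(d)]
    xty = [0.0] * d
    for features, target in partition:
        x = [1.0] + list(features)
        for i in range(d):
            xty[i] += x[i] * target
            for j in range(d):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty

def reduce_sums(partials):
    """Reduce step: add the per-partition sums element-wise."""
    partials = list(partials)
    d = len(partials[0][1])
    xtx = [[sum(p[0][i][j] for p in partials) for j in range(d)] for i in range(d)]
    xty = [sum(p[1][i] for p in partials) for i in range(d)]
    return xtx, xty

def solve(a, b):
    """Solve a * w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (m[i][n] - sum(m[i][j] * w[j] for j in range(i + 1, n))) / m[i][i]
    return w

def fit_linear_regression(rows, num_partitions=4):
    """Split rows into partitions, map, reduce, then solve the normal equations."""
    partitions = [rows[i::num_partitions] for i in range(num_partitions)]
    xtx, xty = reduce_sums(map_partial_sums(p) for p in partitions)
    return solve(xtx, xty)

# Synthetic stand-in data (an assumption): target = 2 + 3*x1 - 1*x2.
rng = random.Random(0)
base = []
for _ in range(200):
    x1, x2 = rng.uniform(0, 10), rng.uniform(0, 10)
    base.append(([x1, x2], 2.0 + 3.0 * x1 - 1.0 * x2))

big = replicate_with_noise(base, copies=10, sigma=0.01, rng=rng)
weights = fit_linear_regression(big)   # [intercept, coefficient of x1, coefficient of x2]
print(len(big), weights)
```

The point of the summation form is that each partition only emits a small matrix and vector, whatever the number of rows it holds, so the reduce step stays cheap as the data grows.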

Slides for this tutorial are now available on slideshare in two parts: part 1, and part 2.

  • 1 year ago

PureStorage Opens Kimono for …. nothing

I have been following PureStorage for a long time. Their intentions, as expressed in tantalizing kimono-slits (prior to finally opening their kimono today), were to make SSD-based storage attractive.

Of course, SSD-based storage is already attractive, cost and reliability aside (the number of reliable writes per page is lower than for other forms of storage), at least for some of the prevalent data access patterns.

Storage access patterns are mainly classified into two buckets: input-output operations per second (IOPS), and bandwidth (MB read/written per second).

When a vendor highlights IOPS, the target audience is the traditional OLTP workload, where multiple IO operations are made to access a small amount of data per query, such as “select avg(user.age) from xyz where user.city==’new york’”. Such a query consults an index on user.city, chooses the on-disk blocks that refer to records matching this criterion, and reads only the records that satisfy it. Since the data this query touches is small relative to the total data size, and is reached through a number of separate accesses, IOPS is a better way to model its performance.

When a vendor highlights bandwidth, they are highlighting use cases that read an overwhelming portion (>60%) of the values in the database. For example, “select distinct(user.city) from xyz where user.age >10 and user.age <90”. Assuming that a majority of users fall within this range, the best choice is to bypass the age index, and scan all records.
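The two buckets can be put side by side with a back-of-the-envelope cost model. This is only a sketch: the device numbers (100,000 IOPS, 500 MB/s), the table size, and the function names are assumptions chosen for illustration, and the model ignores index traversal cost, caching, and the fact that several records may share a block. The crossover point it produces depends entirely on those assumed numbers.

```python
def index_lookup_seconds(num_records, selectivity, iops):
    """IOPS-bound plan: one random read per matching record."""
    return num_records * selectivity / iops

def full_scan_seconds(num_records, record_bytes, bandwidth_bytes_per_s):
    """Bandwidth-bound plan: read every record sequentially."""
    return num_records * record_bytes / bandwidth_bytes_per_s

def cheaper_plan(num_records, record_bytes, selectivity, iops, bandwidth_bytes_per_s):
    """Return the name and estimated time of the cheaper of the two plans."""
    idx = index_lookup_seconds(num_records, selectivity, iops)
    scan = full_scan_seconds(num_records, record_bytes, bandwidth_bytes_per_s)
    return ("index", idx) if idx <= scan else ("scan", scan)

# Hypothetical device and table (assumptions, not measurements).
IOPS = 100_000
BANDWIDTH = 500 * 10**6      # bytes per second
RECORDS = 10**8
RECORD_BYTES = 100

# A selective lookup (0.1% of records) versus a query touching 80% of records.
selective = cheaper_plan(RECORDS, RECORD_BYTES, 0.001, IOPS, BANDWIDTH)
broad = cheaper_plan(RECORDS, RECORD_BYTES, 0.8, IOPS, BANDWIDTH)
print(selective, broad)
```

Under these assumed numbers the selective query is served faster through the index, while the broad one is better off scanning everything, which is the distinction the two vendor metrics are meant to capture.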

What PureStorage promises is that all data is stored on SSDs, but is de-duplicated when it is stored, and restored when it is accessed. And, since in many cases data contains a lot of duplicates (in columns, rather than in rows), such data can be stored cost-effectively on SSDs.

A few facts that I have gathered in my past experience need to be mentioned here, since these are the target audiences for PureStorage (i.e. in the Big Data space):

Duplication is rare in rows, but it is very common in columns.

Separating out columns, or column groups, that compress well with basic data-agnostic compression schemes such as LZO (anyone familiar with the characteristics of fact tables versus dimension tables would concur) reduces the amount of data 20x, resulting in cost-effective fast storage.

De-duplication at the block level in row-oriented stores results in a lot less compression (unless one is dealing with trivial stores such as email collections, with a lot of forwards, replies etc.)

The cost of block-level de-duplication (with methods such as taking a one-way hash of a block, and comparing it against other hashes to determine equality of the underlying data) is higher than just storing the data block-compressed. (Simple math suggests that, since consulting a random-access store over TCP, with its three-way handshake, is slow, calculating hashes locally is much faster than comparing them against a global de-dup index.)

For large data with variety, block-level de-duplication does not result in much saving. A sliding-window based approach to near-duplicate detection results in much more saving, enough to justify high-cost sequential storage. Note: not for random-access storage.
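To make the block-level mechanics concrete, here is a toy sketch of hash-based de-duplication over fixed-size blocks. It is not PureStorage's implementation, which I have not seen: the BlockStore class, the 4096-byte block size, and the choice of SHA-256 are all assumptions for illustration, and the "global index" here is just an in-memory dictionary. It also shows why variety hurts: inserting a single byte at the front shifts every block boundary, so the shifted copy shares no blocks with the original.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # assumed block size, for illustration only

class BlockStore:
    """Toy content-addressed store: each distinct block is kept once."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = {}  # hex digest -> block bytes (stands in for a global index)

    def put(self, data):
        """Split data into fixed-size blocks and return the list of digests."""
        recipe = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # store only unseen blocks
            recipe.append(digest)
        return recipe

    def get(self, recipe):
        """Rebuild the original bytes from a list of digests."""
        return b"".join(self.blocks[d] for d in recipe)

# Four blocks of pseudo-random bytes stand in for "data with variety".
rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(4 * BLOCK_SIZE))

store = BlockStore()
first = store.put(data)
blocks_after_first = len(store.blocks)

second = store.put(data)                 # exact duplicate: nothing new stored
blocks_after_duplicate = len(store.blocks)

shifted = store.put(b"x" + data)         # one inserted byte misaligns every block
blocks_after_shift = len(store.blocks)

print(blocks_after_first, blocks_after_duplicate, blocks_after_shift)
```

An exact copy costs nothing extra, but the one-byte shift stores every block again, which is the gap that sliding-window (content-defined) approaches try to close.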

So, unless data is columnar, or access is mainly based on IOPS, De-dup and SSDs are not cost-effective as of now.

Given all these observations, PureStorage is clearly a future play, based on future applications and future access patterns. (At KDD 2011, I came across a lot of access patterns that might become popular in the future.) Current IOPS-oriented access patterns are not based on data that is block-based and dedup-oriented. And current bandwidth-oriented access patterns do not benefit from deduping, since these datasets are row-based and denormalized.

So, in my humble opinion, Pure Storage needs to wait a while before its open kimono becomes attractive. Unfortunately, the algorithms that are going to make these access patterns the norm in the industry will depend on traditional disk-based storage being accessed asynchronously over fast networking. More on that later :-)

  • 1 year ago


About

I was the founding member of the team at Yahoo! that took Apache Hadoop from a 20-node prototype to a datacenter-scale production system. I have been contributing to and working with Hadoop since version 0.1.0. I started the Yahoo! Grid solutions team, focused on training, consulting, and supporting hundreds of new migrants to Hadoop. Parallel programming languages and paradigms have been my area of focus for over 20 years. So far, I have worked at the Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems, Pathscale Inc. (acquired by QLogic), Yahoo! and LinkedIn. Currently, I am the Chief Architect at Greenplum Labs, part of EMC.

(Disclaimer: Opinions expressed in this blog are those of the author, and do not necessarily represent the views of any organization, past or present, the author might be affiliated with.)

Me, Elsewhere

  • @techmilind on Twitter
  • techmilind on Delicious
  • Linkedin Profile
