SQL-On-Hadoop: Pros and Cons
Tassos Sarbanes
Data Engineer
Credit Suisse


Wednesday, August 19, 2015
03:00 PM - 03:45 PM

Level:  Case Study

Organizations already using Hadoop in production are demanding interactive SQL query support and smooth integration with existing BI tools. Most current MapReduce-based analysis systems, including current versions of Hive, Pig, and Cascading, work well for non-interactive, batch SLAs. Many products now attempt to meet real-time and interactive SLAs by offering interactive "SQL-On-Hadoop" solutions. Use cases for these solutions include interactive ad-hoc queries, reporting and visualization through BI systems such as MicroStrategy and Tableau, and work with multi-source data.

Many of these SQL-On-Hadoop solutions have certain aspects in common:

  1. At the metadata level, HCatalog/Hive Metastore seems to be establishing itself as the de facto standard for managing schemata across different data sources.
  2. Certain data formats, such as Parquet and ORC, are becoming increasingly popular for selected workloads and more widely used in the wild.
  3. Most of the solutions seem to support a wide range of ANSI SQL (in different versions: 1992, 1999, 2003).

These commonalities should help users move between different SQL-On-Hadoop solutions without too much migration headache.

There are also some notable differences:

  • Some of the solutions are Apache-backed and therefore community-based (Stinger, Drill, Tajo), while others are owned by single entities (Impala, Phoenix, Presto)
  • Some can query only data sources within the Hadoop ecosystem, while others are architecturally more flexible and can also query relational databases and NoSQL data stores in situ (Presto, Drill)
  • Another difference is the set of operations allowed on the data: some are pure (distributed) query engines, while others permit update operations

I am a mathematician-turned-computer-scientist and an experienced IT professional, with capabilities and knowledge acquired in information systems, data management, data analysis (big, fast, and small), and data science in international roles. I am driven by my ambition to become an established data scientist by gaining new experience in the data analytics field and improving my skills in areas such as handling big data and new technologies, and I have a real love for science, technology, engineering, and mathematics (STEM). Combining technical expertise with commercial acumen, I particularly enjoy researching, exploring, and understanding specific areas of the financial industry, often preparing presentations, papers, and books on topics such as algorithmic development for mathematical and statistical problems and optimization in numerical computing.
