Frequently Asked Hive Interview Questions Part – 2
1) Explain Hive in brief?
Hive is an open source data warehouse framework built on top of Hadoop for querying and analyzing large data sets stored in HDFS.
Hive helps programmers perform the following operations easily and quickly:
Data encapsulation
Ad hoc queries
Analysis of large datasets
2) Explain the External Table features in Hive?
Unlike an RDBMS, where the data and the table are tightly coupled, an External Table in Hive and its data in HDFS are loosely coupled. External Tables are created on top of data already residing in HDFS. Even if we drop the external table in Hive, the data mapped to it continues to reside in HDFS; only the table metadata is removed.
3) Explain the Internal Table features in Hive?
An Internal Table (also called a managed table) in Hive is similar to the tables we have in an RDBMS. The data and the table schema are tightly coupled. If we drop the internal table in Hive, the data stored inside it is deleted as well.
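The difference in drop behavior between internal and external tables can be illustrated with a small Python model. This is only a sketch: ToyCatalog, its methods and the directory names are hypothetical stand-ins for the metastore and for HDFS locations, not Hive APIs.

```python
import os
import shutil
import tempfile


class ToyCatalog:
    """Hypothetical stand-in for a metastore: table name -> (data dir, is_external)."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, location, external):
        self.tables[name] = (location, external)

    def drop_table(self, name):
        location, external = self.tables.pop(name)
        # Internal (managed) table: dropping it removes the data too.
        # External table: only the metadata entry is removed.
        if not external:
            shutil.rmtree(location)


def demo():
    # A local temp directory plays the role of HDFS in this sketch.
    root = tempfile.mkdtemp()
    managed_dir = os.path.join(root, "managed_sales")
    external_dir = os.path.join(root, "external_sales")
    for data_dir in (managed_dir, external_dir):
        os.makedirs(data_dir)
        with open(os.path.join(data_dir, "part-00000"), "w") as f:
            f.write("1,book\n")

    catalog = ToyCatalog()
    catalog.create_table("managed_sales", managed_dir, external=False)
    catalog.create_table("external_sales", external_dir, external=True)
    catalog.drop_table("managed_sales")
    catalog.drop_table("external_sales")

    # (is the managed data still there?, is the external data still there?)
    result = (os.path.exists(managed_dir), os.path.exists(external_dir))
    shutil.rmtree(root)
    return result
```

Calling demo() reports whether each data directory survived its table being dropped.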
4) What type of read and write operations does Hive support?
Hive follows a write once, read many model.
5) Explain HCatalog Functionality in Hive?
HCatalog is a table and storage management layer for Hadoop that supports reading and writing files in any format for which a Hive SerDe (serializer-deserializer) can be written. By default, HCatalog supports the RCFile, CSV, JSON and SequenceFile formats. To use a custom file format, we have to provide the InputFormat, OutputFormat and SerDe for it.
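6) What does SerDe mean?
SerDe is short for Serializer/Deserializer. It is the framework Hive uses for IO: the deserializer turns stored records into rows Hive can process, and the serializer turns rows back into the storage format.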
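The serialize/deserialize round trip can be sketched in Python. This is a simplified illustration, not Hive's SerDe interface: the function names are hypothetical, and the Ctrl-A field delimiter and \N null marker mirror the usual defaults for Hive text tables.

```python
FIELD_DELIM = "\x01"   # Ctrl-A, the usual default field delimiter for Hive text tables
NULL_MARKER = "\\N"    # the usual default text representation of NULL


def serialize(row):
    """Turn a row (list of Python values) into one delimited text line."""
    return FIELD_DELIM.join(NULL_MARKER if v is None else str(v) for v in row)


def deserialize(line, types):
    """Turn a delimited text line back into typed values, given the column types."""
    values = line.split(FIELD_DELIM)
    return [None if v == NULL_MARKER else t(v) for v, t in zip(values, types)]
```

For example, deserialize(serialize([7, "alice", None, 3.5]), [int, str, str, float]) gives back the original row. The column types come from the table schema, which is why the same file can be read under different table definitions.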
7) What are windowing functions in Hive?
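Windowing functions compute a value for each row over a set of related rows (the window) without collapsing those rows into one. Examples include RANK, used together with the OVER clause, which defines how the window is partitioned and ordered.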
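What RANK with an OVER clause computes can be sketched in plain Python. This is a minimal model of the semantics only (partition the rows, order each partition, give ties the same rank and leave gaps after them); the function rank_over and its parameters are hypothetical names, not part of Hive.

```python
from collections import defaultdict


def rank_over(rows, partition_key, order_key, descending=True):
    """Mimic RANK() OVER (PARTITION BY partition_key ORDER BY order_key)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_key]].append(row)

    ranked = []
    for part in groups.values():
        ordered = sorted(part, key=lambda r: r[order_key], reverse=descending)
        prev_value, prev_rank = object(), 0  # sentinel never equals a real value
        for position, row in enumerate(ordered, start=1):
            # Ties share a rank; the next distinct value jumps to its position.
            rank = prev_rank if row[order_key] == prev_value else position
            ranked.append({**row, "rank": rank})
            prev_value, prev_rank = row[order_key], rank
    return ranked
```

Every input row is kept in the output with an extra rank field, which is the key difference from a GROUP BY aggregation.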
8) What are the different types of metastores that Hive provides?
Hive provides three metastore modes: embedded mode, local mode and remote mode.
9) How can a client interact with Hive?
Clients can interact with Hive through a web GUI and through the Java Database Connectivity (JDBC) interface. Most commonly, clients use the command line interface (CLI).
10) What are the file formats that Hive supports?
Hive supports several file formats, including TEXTFILE, SEQUENCEFILE, ORC and RCFILE (Record Columnar File).
11) When will Hive use a MySQL database for metadata storage?
Hive uses MySQL to store table and partition metadata when it has to handle multiple concurrent Hive sessions. The default embedded Derby database allows only one session at a time.
12) What is the storage location for internal table in Hive?
By default, /user/hive/warehouse in HDFS. The location is controlled by the hive.metastore.warehouse.dir property.
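13) What is the name of the advanced Hive alternative used in the Cloudera distribution, and what are its advantages?
The Cloudera distribution includes Impala. Strictly speaking, Impala is a separate SQL query engine rather than a version of Hive, although it can use the same metastore and table definitions. It can return query results much faster than Hive. Impala was originally developed by Cloudera.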
14) Explain partitions in Hive?
Partitions divide a table's data according to the values of its partition keys, which determine how the data is stored. Each distinct combination of partition key values is stored in its own subdirectory under the table's directory.
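The way partition keys map to storage can be sketched as a path-building function in Python. The function name, table directory and key names below are hypothetical; the key=value directory naming follows Hive's usual layout.

```python
import posixpath


def partition_path(table_dir, partition_spec):
    """Build the directory for one partition, e.g. .../year=2024/country=US.

    partition_spec is an ordered mapping of partition key -> value;
    each key adds one directory level under the table directory.
    """
    levels = [f"{key}={value}" for key, value in partition_spec.items()]
    return posixpath.join(table_dir, *levels)
```

For example, partition_path("/user/hive/warehouse/sales", {"year": 2024, "country": "US"}) produces "/user/hive/warehouse/sales/year=2024/country=US". A query that filters on year and country only needs to read files under that directory.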
15) Explain the use of buckets in the Hive warehouse?
Buckets are used for efficient querying of the data present in a partition. The data inside a partition can be further divided into buckets. This division is based on the hash of a column that we select in the table.
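The hash-based division into buckets can be sketched in Python. This is a simplified model under stated assumptions: integers hash to themselves and strings use a Java-style hashCode, which resembles Hive's classic behavior, but the actual hash depends on the column type and the Hive version. The function names are hypothetical.

```python
def java_string_hash(text):
    """Java-style String.hashCode(), as a signed 32-bit value (assumed hash)."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def bucket_for(value, num_buckets):
    """Pick a bucket as (hash of the column value) mod (number of buckets)."""
    h = value if isinstance(value, int) else java_string_hash(str(value))
    # Mask off the sign bit so the result is always in [0, num_buckets).
    return (h & 0x7FFFFFFF) % num_buckets
```

Rows with the same column value always land in the same bucket, which is what makes sampling and joins on the bucketed column cheaper.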
16) Explain Hive working modes?
Hive works in two modes: interactive mode and non-interactive mode. In interactive mode, typing hive at the console opens the Hive shell, where queries are entered one at a time. In non-interactive mode, queries are executed directly from the console or from a file, without opening the shell.
17) Explain MetaStore in Hive?
The MetaStore stores metadata for Hive tables, such as their columns, column types and partition structure, in an RDBMS. In embedded mode, the MetaStore service runs in the same JVM as the Hive service.
18) What is HiveServer in Hive?
HiveServer is a service that allows clients (for example, over JDBC) to execute queries on the Hive data warehouse and get the results back. To process and execute a query, the compiler and the execution engine in HiveServer interact with each other.
19) What are some key differences between Hive and relational databases?
In relational databases, tables are created first and data is then inserted into them. We can execute DML commands such as INSERT, UPDATE and DELETE on those tables. In Hive, data is stored first and tables are created on top of it. Traditionally, we cannot execute UPDATE and DELETE commands in Hive, because the underlying data is written once and replicated to multiple nodes.
20) Explain the function of Execution Engine in Hive Architecture?
The Execution Engine (EE) is a key component of Hive. It executes the query by communicating directly with the Job Tracker, the Name Node and the Data Nodes. When we execute a Hive query, it generates a series of MapReduce (MR) jobs in the backend. In this scenario, the execution engine acts as a bridge between Hive and Hadoop to process the query. For DFS operations, the Execution Engine communicates with the Name Node.