Essential Apache Pig Interview Questions

Compare Apache Pig And SQL.

Apache Pig differs from SQL in its use for ETL, its lazy evaluation, its ability to store data at any point in the pipeline, its support for pipeline splits, and its explicit declaration of execution plans. SQL is oriented around queries that produce a single result. SQL has no built-in mechanism for splitting a data processing stream and applying different operators to each sub-stream.

Apache Pig allows user code to be included at any point in the pipeline, whereas with SQL the data must first be imported into the database before cleaning and transformation can begin.

Explain The Need For MapReduce While Programming In Apache Pig.

Apache Pig programs are written in a query language known as Pig Latin, which is similar to SQL. Executing a query requires an execution engine. The Pig engine converts the queries into MapReduce jobs, so MapReduce acts as the execution engine and is needed to run the programs.

Explain About The BloomMapFile.

BloomMapFile is a class that extends the MapFile class. It is used in the HBase table format to provide a quick membership test for keys using dynamic Bloom filters.

What Do You Mean By A Bag In Pig?

A collection of tuples is referred to as a bag in Apache Pig.

What Is The Usage Of Foreach Operation In Pig Scripts?

The FOREACH operation in Apache Pig applies a transformation to each element in a data bag, so that the given expressions are evaluated for every element to generate new data items.

Syntax-

new_relation = FOREACH data_bagname GENERATE exp1, exp2;
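
As a rough mental model only, the per-tuple behaviour of FOREACH ... GENERATE can be sketched in plain Python. This is not Pig code: the helper foreach_generate, the relation employees and its fields are hypothetical names introduced for illustration.

```python
# Plain-Python model of a statement such as:
#   annual = FOREACH employees GENERATE name, salary * 12;
# The relation is modeled as a list of (name, salary) tuples (hypothetical data).

def foreach_generate(relation, *exprs):
    """Evaluate every expression against each tuple; emit one new tuple per input tuple."""
    return [tuple(expr(row) for expr in exprs) for row in relation]

employees = [("alice", 100), ("bob", 200)]
annual = foreach_generate(
    employees,
    lambda row: row[0],       # name
    lambda row: row[1] * 12,  # salary * 12
)
print(annual)  # [('alice', 1200), ('bob', 2400)]
```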

Explain About The Different Complex Data Types In Pig.

Apache Pig supports 3 complex data types:

Maps-

These are key-value stores in which each key is joined to its value using #, for example [name#John].

Tuples-

Similar to a row in a table, with the individual items separated by commas, for example (John, 25). Tuples can have multiple attributes.

Bags-

An unordered collection of tuples, for example {(John, 25), (Mary, 30)}. A bag can contain duplicate tuples.

What Does Flatten Do In Pig?

Sometimes data is nested inside a tuple or a bag, and we want to remove that level of nesting. The FLATTEN modifier in Pig does this: it un-nests bags and tuples. For a tuple, FLATTEN substitutes the fields of the tuple in place of the tuple itself. Un-nesting a bag is a little more complex because it requires creating new tuples, one for each tuple in the bag.
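
The two cases can be sketched in plain Python to make the difference concrete. This models the semantics only and is not Pig code; the helper names are hypothetical, rows are modeled as Python tuples and bags as lists of tuples.

```python
# Plain-Python model of FLATTEN semantics (illustrative only, not Pig itself).

def flatten_tuple_field(row, idx):
    """FLATTEN on a tuple field: splice the nested tuple's fields into the row."""
    return row[:idx] + tuple(row[idx]) + row[idx + 1:]

def flatten_bag_field(row, idx):
    """FLATTEN on a bag field: emit one new row per tuple in the bag."""
    return [row[:idx] + tuple(inner) + row[idx + 1:] for inner in row[idx]]

# Tuple case: one row in, one wider row out.
print(flatten_tuple_field(("a", (1, 2), "z"), 1))     # ('a', 1, 2, 'z')

# Bag case: one row in, one row out per tuple in the bag.
print(flatten_bag_field(("a", [(1, 2), (3, 4)]), 1))  # [('a', 1, 2), ('a', 3, 4)]
```

In this model an empty bag produces no output rows at all, which is why flattening a bag can change the number of records.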

How Do Users Interact With The Shell In Apache Pig?

Users interact with HDFS or the local file system through Grunt, Apache Pig's interactive shell.

To start Grunt, invoke Apache Pig without specifying a script to run. For example, executing the command

pig -x local

starts Pig in local mode and results in the prompt:

grunt>

From this prompt, Pig Latin statements can be run either in local mode or in cluster mode, with the cluster configuration made available through PIG_CLASSPATH.

To exit from the Grunt shell, press CTRL+D or type quit.

What Are The Debugging Tools Used For Apache Pig Scripts?

describe and explain are the important debugging utilities in Apache Pig.

The explain utility is helpful to Hadoop developers when debugging errors or optimizing Pig Latin scripts. explain can be applied to a particular alias in the script, or to the entire script, from the Grunt interactive shell. It produces several graphs in text format, which can be written to a file.

The describe utility is helpful when writing Pig scripts because it shows the schema of a relation in the script. Beginners learning Apache Pig can use describe to understand how each operator alters the data. A Pig script can contain multiple describe statements.

What Is Illustrate Used For In Apache Pig?

Executing Pig scripts on large data sets usually takes a long time. To work around this, developers run Pig scripts on sample data, but the selected sample might not exercise the script properly.

For instance, if the script has a join operator, there should be at least a few records in the sample data that have the same key; otherwise the join operation will not return any results.

To tackle this kind of issue, illustrate is used. illustrate takes a sample of the data and, whenever it comes across operators such as join or filter that remove data, it ensures that some records pass through and some do not, modifying records where necessary so that they meet the condition. illustrate just shows the output of each stage and does not run any MapReduce task.

Explain About The Execution Plans Of A Pig Script?
Or
Differentiate Between The Logical And Physical Plan Of An Apache Pig Script?

Logical and physical plans are created during the execution of a Pig script. Pig scripts are checked by an interpreter. The logical plan is produced after basic parsing and semantic checking, and no data processing takes place while it is created. For each line in the Pig script, a syntax check is performed on the operators and a logical plan is created.

Whenever an error is encountered within the script, an exception is thrown and program execution ends; otherwise, each statement in the script gets its own logical plan.

A logical plan contains the collection of operators in the script but does not contain the edges between the operators.

After the logical plan is generated, script execution moves to the physical plan, which describes the physical operators Apache Pig will use to execute the script. A physical plan is more or less a series of MapReduce jobs, but the plan does not state how it will be executed in MapReduce.

During the creation of the physical plan, the cogroup logical operator is converted into 3 physical operators: Local Rearrange, Global Rearrange and Package. Load and store functions usually get resolved in the physical plan.

What Do You Know About The Case Sensitivity Of Apache Pig?

Apache Pig is partly case sensitive. The names of user defined functions, relations and fields are case sensitive: the function COUNT is not the same as count, and X = load 'foo' is not the same as x = load 'foo'. Keywords, on the other hand, are case insensitive: LOAD is the same as load.

What Are Some Of The Apache Pig Use Cases You Can Think Of?

Apache Pig is used in particular for iterative processing, research on raw data and traditional ETL data pipelines. Because Pig can operate when the schema is unknown, inconsistent or incomplete, it is widely used by researchers who want to work with the data before it is cleaned and loaded into the data warehouse.

For instance, a website can use it to build behavior prediction models by tracking how visitors respond to various types of ads, images, articles, etc.

Differentiate Between Pig Latin And HiveQL?

It is necessary to specify the schema in HiveQL, whereas it is optional in Pig Latin.
HiveQL is a declarative language, whereas Pig Latin is procedural.
HiveQL follows a flat relational data model, whereas Pig Latin has a nested relational data model.

Is Pig Latin A Strongly Typed Language? If Yes, Then How Did You Come To The Conclusion?

In a strongly typed language, the user has to declare the type of all variables upfront. In Apache Pig, when you describe the schema of the data, it expects the data to arrive in the format you specified.

However, when the schema is not known, the script adapts to the actual data types at runtime. So Pig Latin can be said to be strongly typed in most cases but, in rare cases, gently typed, i.e. it continues to work with data that does not live up to its expectations.

What Do You Understand By An Inner Bag And Outer Bag In Pig?

An outer bag is just a relation in Pig, whereas a bag nested inside the tuples of another bag is referred to as an inner bag.

Explain The Difference Between COUNT_STAR And COUNT Functions In Apache Pig?

The COUNT function does not include NULL values when counting the number of elements in a bag, whereas the COUNT_STAR() function includes NULL values while counting.
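
The difference can be modeled in a few lines of plain Python. This is a sketch of the semantics, not Pig code; it assumes the documented rule that COUNT skips a tuple when its first field is NULL, and it models a bag as a list of tuples with None standing in for NULL.

```python
# Plain-Python model of COUNT versus COUNT_STAR (illustrative only).

def count(bag):
    """Like COUNT: skip tuples whose first field is null."""
    return sum(1 for t in bag if len(t) > 0 and t[0] is not None)

def count_star(bag):
    """Like COUNT_STAR: count every tuple, nulls included."""
    return len(bag)

bag = [(1,), (None,), (3,)]
print(count(bag))       # 2
print(count_star(bag))  # 3
```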

What Are The Various Diagnostic Operators Available In Apache Pig?

Dump Operator-

It is used to display the output of Pig Latin statements on the screen, so that developers can debug the code.

Describe Operator-

Explained above under the question on debugging tools for Apache Pig scripts.

Explain Operator-

Explained above under the question on debugging tools for Apache Pig scripts.

Illustrate Operator-

Explained above under the question on what illustrate is used for.

How Will You Merge The Contents Of Two Or More Relations And Divide A Single Relation Into Two Or More Relations?

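This can be accomplished using the UNION and SPLIT operators: UNION merges the contents of two or more relations, and SPLIT partitions a single relation into two or more relations.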
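
A plain-Python sketch of the two operators follows. It models the semantics only and is not Pig code; the helper names and the sample data are hypothetical. In this model, as in Pig, a tuple lands in every output whose condition it satisfies, so it can appear in several outputs or in none.

```python
# Plain-Python model of UNION and SPLIT (illustrative only).

def union(*relations):
    """Like UNION: concatenate relations, keeping duplicates."""
    merged = []
    for relation in relations:
        merged.extend(relation)
    return merged

def split(relation, **conditions):
    """Like SPLIT: route each tuple to every output whose condition holds."""
    return {
        name: [row for row in relation if cond(row)]
        for name, cond in conditions.items()
    }

numbers = [(1,), (5,), (9,)]
parts = split(numbers, small=lambda r: r[0] < 5, large=lambda r: r[0] >= 5)
print(parts["small"])  # [(1,)]
print(parts["large"])  # [(5,), (9,)]
print(union(parts["small"], parts["large"]))  # [(1,), (5,), (9,)]
```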

I Have A Relation R. How Can I Get The Top 10 Tuples From The Relation R?

The TOP() function returns the top N tuples from a bag of tuples or a relation. It takes three parameters: N, the column whose values are to be compared, and the relation R, as in TOP(10, column, R).
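
The selection TOP() performs can be modeled in plain Python with the standard heapq module. This is a sketch of the semantics, not Pig code; the helper top and the sample data are hypothetical, and the column is given as a zero-based index.

```python
import heapq

# Plain-Python model of TOP(n, column, bag) (illustrative only).

def top(n, column, bag):
    """Return the n tuples with the largest values in the given column."""
    return heapq.nlargest(n, bag, key=lambda t: t[column])

scores = [("a", 3), ("b", 9), ("c", 5)]
print(top(2, 1, scores))  # [('b', 9), ('c', 5)]
```

The Python helper happens to return its result sorted in descending order; a bag in Pig is unordered, so that ordering should not be relied on there.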

What Are The Commonalities Between Pig And Hive?

HiveQL and Pig Latin both convert commands into MapReduce jobs.
Neither can be used for OLAP transactions, as it is difficult to execute low-latency queries with them.

What Are The Different Types Of UDFs In Java Supported By Apache Pig?

Algebraic, Eval and Filter functions are the various types of UDFs supported in Pig.

You Have A File Employee.txt In The HDFS Directory With 100 Records. You Want To See Only The First 10 Records From The Employee.txt File. How Will You Do This?

The first step is to load the file employee.txt into a relation named Employee.

The first 10 records of the employee data can then be obtained using the LIMIT operator:

Result = LIMIT Employee 10;

Explain About The Scalar Datatypes In Apache Pig.

int, long, float, double, chararray and bytearray are the available scalar datatypes in Apache Pig.

How Do Users Interact With HDFS In Apache Pig?

Using the Grunt shell, which accepts file system commands such as fs -ls.

What Is The Use Of Having Filters In Apache Pig?

Just like the WHERE clause in SQL, Apache Pig has filters to extract records based on a given condition or predicate. A record is passed down the pipeline if the predicate evaluates to true. A predicate can contain operators such as ==, <=, != and >=.

Example:-

X = load 'inputs' as (name, address);

Y = filter X by name matches 'Mr.*';
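
The behaviour of a matches filter can be modeled in plain Python. This is a sketch, not Pig code: the helper filter_matches and the sample records are hypothetical. Pig's matches uses Java regular expressions and must match the whole field value, which is approximated here with Python's re.fullmatch; the two regex dialects differ in some details.

```python
import re

# Plain-Python model of: Y = filter X by name matches 'Mr.*';  (illustrative only)

def filter_matches(relation, field, pattern):
    """Keep rows whose field is non-null and matches the pattern in full."""
    regex = re.compile(pattern)
    return [
        row for row in relation
        if row[field] is not None and regex.fullmatch(row[field])
    ]

people = [("Mr. Smith", "NY"), ("Ms. Jones", "LA"), ("A Mr. X", "SF"), (None, "TX")]
print(filter_matches(people, 0, r"Mr.*"))  # [('Mr. Smith', 'NY')]
```

Note that "A Mr. X" is rejected: the pattern has to cover the entire value, not just a substring.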

What Is A UDF In Pig?

If the built-in operators do not provide some functionality, programmers can implement it by writing user defined functions in other programming languages such as Java, Python, Ruby, etc. These User Defined Functions (UDFs) can then be embedded in a Pig Latin script.

Can You Join Multiple Fields In Apache Pig Scripts?

Yes, it is possible to join on multiple fields in Pig scripts. A join takes records from one input and joins them with records from another input. The keys are specified for each input, and two rows are joined when their keys are equal; to join on several fields, list them in parentheses for each input.
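
A join on a composite key can be modeled in plain Python as a hash join. This is a sketch of the semantics, not Pig code; the helper join, the relations and their fields are hypothetical. It assumes inner-join behaviour, in which rows with a null in any key field never match.

```python
from collections import defaultdict

# Plain-Python model of: C = JOIN A BY (name, city), B BY (name, city);
# (illustrative only; relation and field names are hypothetical)

def join(left, left_keys, right, right_keys):
    """Inner join: pair rows whose key fields are all equal and non-null."""
    index = defaultdict(list)
    for row in right:
        index[tuple(row[k] for k in right_keys)].append(row)
    joined = []
    for row in left:
        key = tuple(row[k] for k in left_keys)
        if any(value is None for value in key):
            continue  # null keys never match in an inner join
        for match in index.get(key, []):
            joined.append(row + match)
    return joined

a = [("alice", "NY", 1), ("bob", "LA", 2), (None, "NY", 3)]
b = [("alice", "NY", "eng"), ("bob", "SF", "ops"), (None, "NY", "hr")]
print(join(a, (0, 1), b, (0, 1)))
# [('alice', 'NY', 1, 'alice', 'NY', 'eng')]
```

Only the alice row survives: bob's city differs between the two inputs, and the rows with a null name are dropped.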

Does Pig Support Multi-line Commands?

Yes. A Pig Latin statement can span multiple lines; it ends at the terminating semicolon.
