Hadoop Developer
Ref No.: 18-64809
Location: Edison, New Jersey
Position Type: Direct Placement
Start Date: 09/05/2018
 
Job Title: Hadoop Developer
Relevant Experience (in Yrs): 7+
Technical/Functional Skills:
Strong development skills around Hadoop, Spark, Airflow, and Hive
Strong SQL, Python, and shell scripting
Strong understanding of Hadoop internals
Experience (at least familiarity) with data warehousing, dimensional modeling, and ETL development
Experience with AWS components and services, particularly EMR and S3
Awareness of DevOps tools
Working experience in an Agile framework
Experience Required: 7+ years overall, with at least 4 years in the Big Data field.
Roles & Responsibilities:
Design and implement distributed data processing pipelines using Spark, Hive, SQL, and other tools and languages prevalent in the Hadoop ecosystem.
Build aggregates and curate the data sets needed for BI reporting.
Design and implement end-to-end solutions.
Build utilities, user-defined functions, and frameworks to better enable data flow patterns.
Research, evaluate, and utilize new technologies/tools/frameworks centered around Hadoop and other elements in the Big Data space.
Work with teams to resolve operational and performance issues.
Work with architecture/engineering leads and other teams to ensure quality solutions are implemented and engineering best practices are defined and adhered to.
Work in an Agile team structure.
Coordinate with offshore teams on project work and act as technical lead for onsite teams.
Generic Managerial Skills: Technical lead role preferred
Education: Bachelor's in Computer Science/IT preferred
Start Date (dd-mmm-yy): 17-Sep-18
Duration of Assignment (in Months): 18
Work Location (City, State, and Zip): Beaverton, Oregon, 97006
Salary: Full-time hire, $110,000 per year
Keywords to search in resume: Hadoop, Spark, Hive, AWS, SQL, Python, Agile, Airflow, EMR, S3, Big Data
Prescreening Questionnaire:
1. What is bucketing?
Ans: Bucketing is another technique for decomposing data sets into more manageable parts. Its main advantage is that the number of buckets is fixed, so it does not fluctuate with the data.
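A minimal PySpark sketch (the table name, columns, and bucket count are hypothetical; in Hive DDL the equivalent is CLUSTERED BY (user_id) INTO 32 BUCKETS):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame([(1, "click"), (2, "view")], ["user_id", "event"])

    # Hash rows into a fixed number of buckets by user_id; the bucket
    # count stays constant regardless of how much data arrives.
    (df.write
       .bucketBy(32, "user_id")
       .sortBy("user_id")
       .saveAsTable("user_events"))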
2. What is partitioning?
Ans: Partitioning distributes data horizontally (for example, one directory per date), which improves query performance through partition pruning and organizes the data in a logical fashion.
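A minimal PySpark sketch of a date-partitioned table (names and dates are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(1, "/home", "2018-09-01"), (2, "/cart", "2018-09-02")],
        ["user_id", "url", "dt"],
    )

    # One directory per dt value; the query below scans only the
    # matching partition instead of the whole table (partition pruning).
    df.write.partitionBy("dt").saveAsTable("page_views")
    spark.sql("SELECT COUNT(*) FROM page_views WHERE dt = '2018-09-01'").show()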
3. What is vectorization?
Ans: Vectorization allows Hive to process a batch of rows together instead of processing one row at a time.
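For reference, vectorized execution is switched on in Hive with SET hive.vectorized.execution.enabled = true; rows are then processed in batches (1,024 by default), and in the Hive versions current at this posting it requires tables stored as ORC.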
4. In Sqoop, how many mappers are used by default if none are specified? (Ans: 4; the default can be overridden with -m/--num-mappers.)
5. What file formats have you worked with? What is the difference between ORC and Parquet?
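(In brief: both are columnar, splittable, compressed formats; ORC is the more Hive-native choice, with built-in indexes and support for Hive ACID tables, while Parquet has broader ecosystem support, notably in Spark.)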
6. Can we run UNIX shell commands from Hive? Can Hive queries be executed from script files? How? Give an example.
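(For reference: yes on both counts. The Hive CLI runs shell commands prefixed with an exclamation mark, e.g. !ls;, and a query file can be run with hive -f query.hql, or with source /path/to/query.hql from inside the CLI.)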
7. What are DataFrames and how do they work? Can you explain?
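A minimal PySpark illustration with made-up data: a DataFrame is a distributed collection of rows with a named schema, and transformations on it are lazy, passing through the Catalyst optimizer before an action triggers execution:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Build a small DataFrame; filter() is lazy and only runs
    # when an action such as show() is called.
    df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
    df.filter(df.id > 1).show()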
8. What is a bucket in AWS S3? Can we define a region for S3? (Ans: A bucket is a top-level container for objects; bucket names are globally unique, but each bucket is created in a specific region.)
9. How can we identify an object within an S3 bucket?
Ans: Using its key name (and version ID when versioning is enabled).
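A short boto3 sketch (the bucket and key names are hypothetical):

    import boto3

    s3 = boto3.client("s3")
    # An object is addressed by its bucket and key; the response carries
    # a VersionId when bucket versioning is enabled, and a specific
    # version can be requested by passing VersionId to get_object.
    obj = s3.get_object(Bucket="example-bucket", Key="logs/2018/09/05/events.json")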
10. What are the different S3 object storage classes?
Ans:
S3 Standard
S3 Standard-IA (Infrequent Access)
S3 RRS (Reduced Redundancy Storage)
Glacier
11. What is the difference between RANK(), ROW_NUMBER(), and DENSE_RANK()?
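A small PySpark sketch of the difference, with made-up data: for salaries 90, 90, 85 ordered descending, ROW_NUMBER() gives 1, 2, 3; RANK() gives 1, 1, 3 (a gap after the tie); DENSE_RANK() gives 1, 1, 2 (no gap):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    spark.createDataFrame(
        [(1, 90), (2, 90), (3, 85)], ["emp_id", "salary"]
    ).createOrReplaceTempView("emp")

    # ROW_NUMBER assigns unique positions; RANK repeats on ties and
    # leaves gaps afterward; DENSE_RANK repeats on ties without gaps.
    spark.sql("""
        SELECT emp_id, salary,
               ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn,
               RANK()       OVER (ORDER BY salary DESC) AS rnk,
               DENSE_RANK() OVER (ORDER BY salary DESC) AS drnk
        FROM emp
    """).show()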
12. We have an employee table with employee id and salary columns. Find the following (see the sketch after this question):
a. The emp ids of the top 3 earners. (Ans: use ORDER BY with LIMIT/TOP.)
b. The 4th-highest-paid employee. (Ans: use a RANK or ROW_NUMBER window function.)
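A sketch of both answers in PySpark SQL, against a hypothetical emp view with made-up salaries:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    spark.createDataFrame(
        [(1, 90), (2, 90), (3, 85), (4, 80), (5, 75)], ["emp_id", "salary"]
    ).createOrReplaceTempView("emp")

    # a. Top 3 earners: sort descending and keep the first three rows.
    spark.sql("SELECT emp_id FROM emp ORDER BY salary DESC LIMIT 3").show()

    # b. 4th-highest salary: rank rows in a subquery, then filter on the rank.
    spark.sql("""
        SELECT emp_id, salary FROM (
            SELECT emp_id, salary,
                   DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
            FROM emp
        ) t
        WHERE rnk = 4
    """).show()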
13. In a table, 'male' is wrongly stored as 'female' and 'female' as 'male'. How would you fix this? (Ans: Use a CASE statement; a sketch follows.)
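A minimal sketch, assuming a hypothetical gender column on the same kind of emp view:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    spark.createDataFrame(
        [(1, "male"), (2, "female")], ["emp_id", "gender"]
    ).createOrReplaceTempView("emp")

    # Swap the two mislabeled values in one pass; any other value
    # passes through unchanged.
    spark.sql("""
        SELECT emp_id,
               CASE gender
                   WHEN 'male'   THEN 'female'
                   WHEN 'female' THEN 'male'
                   ELSE gender
               END AS gender
        FROM emp
    """).show()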