Strong development skills around Hadoop, Spark, Airflow, and Hive.
Strong SQL, Python, and shell scripting.
Strong understanding of Hadoop internals.
Experience (at least familiarity) with data warehousing, dimensional modeling, and ETL development.
Experience with AWS components and services, particularly EMR and S3.
Awareness of DevOps tools.
Working experience in an Agile framework.
7+ years of overall experience, with at least 4 years in the Big Data field.
Roles & Responsibilities
Design and implement distributed data processing pipelines using Spark, Hive, SQL, and other tools and languages prevalent in the Hadoop ecosystem.
Build aggregates and curate data sets needed to solve BI reporting needs.
Design and implement end-to-end solutions.
Build utilities, user-defined functions, and frameworks to better enable data flow patterns.
Research, evaluate, and utilize new technologies/tools/frameworks centered around Hadoop and other elements in the Big Data space.
Work with teams to resolve operational and performance issues.
Work with architecture/engineering leads and other teams to ensure quality solutions are implemented and engineering best practices are defined and adhered to.
Work in an Agile team structure.
Coordinate with offshore teams on project-related work and act as technical lead for onsite teams.
Generic Managerial Skills
Technical lead role preferred
Bachelor's degree in Computer Science/IT preferred
Start date (dd-mmm-yy)
17-Sep-18
Duration of assignment (in Months)
Work Location(State, City and Zip)
Beaverton, Oregon, 97006
Full time hire - $110,000 per year
Key words to search in resume
Hadoop, Spark, Hive, AWS, SQL, Python, Agile, Airflow, EMR, S3, Big Data
1. What is bucketing? Ans: Bucketing is another technique for decomposing data sets into more manageable parts. It has several advantages: the number of buckets is fixed at table creation time, so it does not fluctuate with the data.
2. What is partitioning? Ans: Partitioning is often used to distribute load horizontally; it has performance benefits and helps organize data in a logical fashion.
3. What is vectorization? Ans: Vectorization allows Hive to process a batch of rows together instead of processing one row at a time.
4. In Sqoop, how many mappers are used by default if none are specified? (Ans: 4)
5. What file formats have you worked with? What is the difference between the ORC and Parquet file formats?
6. Can we run UNIX shell commands from Hive? Can Hive queries be executed from script files? How? Give an example.
7. What are DataFrames and how do they work? Can you explain?
8. What is a bucket in AWS S3? Can we define a region for S3? (Ans: Bucket names are globally unique, but each bucket is created in a specific region.)
9. How do we identify an object within an S3 bucket? Ans: Using its key name and version ID.
10. What are the different S3 object storage classes? Ans: S3 Standard, S3-IA (Infrequent Access), S3-RRS (Reduced Redundancy Storage), and Glacier.
11. What is the difference between rank(), row_number(), and dense_rank()?
12. We have an employee table with employee id and salary columns. Find the below: a. The employee ids of the top 3 earners. (Ans: use ORDER BY with LIMIT/TOP.) b. The 4th highest salary. (Ans: use the rank or row_number window function.)
13. In a table, Male is wrongly recorded as Female and Female is wrongly recorded as Male. How do you resolve this? (Ans: use a CASE statement.)
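The bucketing and partitioning answers in questions 1 and 2 can be illustrated with a small Python sketch of how Hive lays data out on disk: rows first go into one directory per distinct partition value, then within each partition they are spread across a fixed number of buckets by hashing the clustering column. The table, columns, and bucket count below are illustrative, and Python's hash() stands in for Hive's own hash function.

```python
from collections import defaultdict

# Toy rows: (country, user_id), standing in for a table
# PARTITIONED BY (country) CLUSTERED BY (user_id) INTO 4 BUCKETS.
rows = [("US", 101), ("US", 102), ("IN", 103), ("IN", 104), ("US", 105)]

NUM_BUCKETS = 4  # fixed at table creation time, so it never fluctuates with the data

layout = defaultdict(list)
for country, user_id in rows:
    # Partitioning: one "directory" per distinct partition value.
    # Bucketing: hash of the clustering column modulo the bucket count.
    bucket = hash(user_id) % NUM_BUCKETS
    layout[(f"country={country}", f"bucket_{bucket}")].append(user_id)

for path, ids in sorted(layout.items()):
    print(path, ids)
```

Partition pruning skips whole directories when a query filters on the partition column; bucketing additionally makes joins and sampling on the clustering column cheaper because matching keys always land in the same bucket.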
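For question 11, the difference between the three ranking functions is easiest to see on a column with ties. The sketch below uses Python's built-in sqlite3 (SQLite 3.25+ supports window functions, so the syntax carries over to Hive/Spark SQL); the employee table and its values are made up for illustration.

```python
import sqlite3  # window functions require SQLite 3.25+, bundled with recent Pythons

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employee (emp_id INTEGER, salary INTEGER)")
con.executemany("INSERT INTO employee VALUES (?, ?)",
                [(1, 900), (2, 800), (3, 800), (4, 700)])

rows = con.execute("""
    SELECT emp_id, salary,
           rank()       OVER (ORDER BY salary DESC) AS rnk,
           dense_rank() OVER (ORDER BY salary DESC) AS drnk,
           row_number() OVER (ORDER BY salary DESC) AS rn
    FROM employee
    ORDER BY salary DESC
""").fetchall()

for r in rows:
    print(r)
# rank() leaves a gap after the tie:      1, 2, 2, 4
# dense_rank() leaves no gap:             1, 2, 2, 3
# row_number() is unique even for ties:   1, 2, 3, 4
```

Note that row_number() is nondeterministic between tied rows unless the ORDER BY is made unique, which matters when deduplicating with it.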
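Question 12's two expected answers can be sketched the same way: ORDER BY with LIMIT for the top 3, and a ranking window function filtered in a subquery for the 4th highest. The employee ids and salaries are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employee (emp_id INTEGER, salary INTEGER)")
con.executemany("INSERT INTO employee VALUES (?, ?)",
                [(1, 500), (2, 900), (3, 700), (4, 800), (5, 600)])

# a. Top 3 earners: ORDER BY plus LIMIT (TOP 3 in SQL Server dialect).
top3 = con.execute(
    "SELECT emp_id FROM employee ORDER BY salary DESC LIMIT 3"
).fetchall()
print(top3)  # [(2,), (4,), (3,)]

# b. 4th highest salary: rank the rows, then filter on the rank.
fourth = con.execute("""
    SELECT emp_id, salary FROM (
        SELECT emp_id, salary,
               row_number() OVER (ORDER BY salary DESC) AS rn
        FROM employee
    ) WHERE rn = 4
""").fetchone()
print(fourth)  # (5, 600)
```

If duplicate salaries should count once, swap row_number() for dense_rank() in part b.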
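Question 13's CASE-statement answer can be shown as a single UPDATE; the point of CASE here is that two sequential UPDATEs would overwrite each other (after `SET gender='Female' WHERE gender='Male'`, every row is Female). Table and column names are illustrative, again using sqlite3 as the stand-in database.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employee (emp_id INTEGER, gender TEXT)")
con.executemany("INSERT INTO employee VALUES (?, ?)",
                [(1, "Male"), (2, "Female"), (3, "Male")])

# Swap the wrongly recorded values in one pass with a CASE expression.
con.execute("""
    UPDATE employee
    SET gender = CASE gender
                     WHEN 'Male'   THEN 'Female'
                     WHEN 'Female' THEN 'Male'
                     ELSE gender
                 END
""")
rows = con.execute(
    "SELECT emp_id, gender FROM employee ORDER BY emp_id"
).fetchall()
print(rows)  # [(1, 'Female'), (2, 'Male'), (3, 'Female')]
```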