Software Engineer (Machine Learning)
Ref No.: 19-06991
Location: Redmond, Washington
Duration: Through 11/2020
Company: A leader in social media networking

Responsibilities:
Build software to facilitate routine operations of the research group
Maintain and improve scripts to run distributed jobs on an HPC cluster of Linux-based machines connected via a high-speed network
Troubleshoot performance- and resource-related problems with payload and cluster utilization
Build a cluster utilization dashboard and a monitoring and alerting system
Manage and evolve high performance compute infrastructure used by research scientists for deep learning model training

Required Skills:
Experienced C/C++, Python, Ruby software developer
Expert-level knowledge of Linux-based systems and cluster management
High speed network performance profiling and optimization
Advanced understanding of Linux containers
Understand which hardware, software, and software frameworks are used to speed up model training and inference, and why (TensorFlow, PyTorch, CUDA, GPUs, NVIDIA DGX)
Advanced knowledge of cluster resource managers such as Slurm, Kubernetes, and Docker Swarm
Basic understanding of machine learning and, more specifically, deep learning (gradient descent, stochastic gradient descent, online learning)
Understand the software used to cluster the related hardware (Docker, Docker Swarm, Slurm, Kubernetes)

Desired Skills:
Previous experience with MPI and InfiniBand is very welcome
Bachelor's degree in Computer Science, Mathematics, or a related field, or 5 years of relevant experience