Turn Job Alerts On

Kasmo Global

Charlotte,NC

Description

Senior Hadoop Administrator

Looking for senior resource here. This is an Ops role, one needs to know more admin, administration of Hadoop environment. Location Charlotte, NC (Onsite); local candidates preferred. Max number of submissions: 2.

In this role, you will:

Lead complex technology initiatives towards Hadoop platform stability, automation initiatives.
Platform Incident and Problem management
- Be part of 24/7 platform support team to perform platform incident management and triaging team.
- Assess the impact of the platform issue and prioritize the triage by partnering with platform tenants, platform development and Hadoop vendor teams.
- Facilitate and perform root cause analysis of platform incidents towards determining permanent resolution involving Tenants and Hadoop vendor.
- Communicate with Stakeholders, leadership team on the triaging progress status.
- To troubleshoot tenant reported failures by deep diving onto client logs to identify the root cause.
- Use Autosys as scheduler to schedule, execute platform routine activities.
Hadoop Cluster Monitoring
- Monitor Hadoop cluster health and performance.
- Automate new monitoring opportunities leveraging available monitoring and alerting tools such as Prometheus, Thousand Eyes, Splunk etc.
- Perform troubleshooting on the actionable alerts and tuning to ensure optimal cluster performance.
Hadoop Cluster Configuration and tuning
- Periodically tune and configure Hadoop ecosystem components such as HDFS, YARN, Hive, HBase, Spark, HPOS etc.
- Schedule the Change request and track the approval process for implementation.
Platform Backup and Recovery
- Ensure Hadoop platform is resilient with data backup and recovery strategies in place.
- Perform emergency or scheduled failover of platform to Disaster recovery site and fall back.
Upgrades and Patch Management
- Plan and execute upgrades of Hadoop ecosystem components and patches.
- Ensure compatibility and smooth integration of new features and enhancements.
- Work with Operating System administrators and patching automation build team to schedule and execute patching.
- Perform Cluster health validation post patching and communicate with stakeholders.
- Identify automation and process improvement opportunities related to patching to implement it.
Documentation and Reporting
- Maintain comprehensive documentation of Hadoop cluster configurations & troubleshooting processes, and procedures.
- Develop and Schedule to Generate reports on cluster usage, performance metrics, and capacity utilization.
Collaboration and Support
- Ability to work both independently and in collaboration with Platform development and Platform tenant teams to troubleshoot and fix Hadoop platform issues, maintain production stability to enterprise-wide applications.
- Collaborate and consult with key technical experts, senior technology team, and Hadoop vendor to resolve complex technical issues and achieve goals.

Mandatory Skills and Qualifications:

Proven Experience in technology incident and problem management role.
Proven Experience in using monitoring tools like Prometheus, Splunk, SiteScope & thousand eyes etc.
Proven experience as a Unix Scripting
Proven experience in Hadoop administration or in a similar role.
Strong knowledge of Hadoop ecosystem and its components (Hive, HBase, Spark, Hue ).
Proven Experience is using Autosys or similar job scheduler.
Excellent problem-solving and troubleshooting skills.
Excellent communication and people skills for collaborating with cross-functional teams.