Senior Hadoop Administrator
Looking for senior resource here. This is an Ops role, one needs to know more admin, administration of Hadoop environment. Location Charlotte, NC (Onsite); local candidates preferred. Max number of submissions: 2.
In this role, you will:
- Lead complex technology initiatives towards Hadoop platform stability, automation initiatives.
- Platform Incident and Problem management
- Be part of 24/7 platform support team to perform platform incident management and triaging team.
- Assess the impact of the platform issue and prioritize the triage by partnering with platform tenants, platform development and Hadoop vendor teams.
- Facilitate and perform root cause analysis of platform incidents towards determining permanent resolution involving Tenants and Hadoop vendor.
- Communicate with Stakeholders, leadership team on the triaging progress status.
- To troubleshoot tenant reported failures by deep diving onto client logs to identify the root cause.
- Use Autosys as scheduler to schedule, execute platform routine activities.
- Hadoop Cluster Monitoring
- Monitor Hadoop cluster health and performance.
- Automate new monitoring opportunities leveraging available monitoring and alerting tools such as Prometheus, Thousand Eyes, Splunk etc.
- Perform troubleshooting on the actionable alerts and tuning to ensure optimal cluster performance.
- Hadoop Cluster Configuration and tuning
- Periodically tune and configure Hadoop ecosystem components such as HDFS, YARN, Hive, HBase, Spark, HPOS etc.
- Schedule the Change request and track the approval process for implementation.
- Platform Backup and Recovery
- Ensure Hadoop platform is resilient with data backup and recovery strategies in place.
- Perform emergency or scheduled failover of platform to Disaster recovery site and fall back.
- Upgrades and Patch Management
- Plan and execute upgrades of Hadoop ecosystem components and patches.
- Ensure compatibility and smooth integration of new features and enhancements.
- Work with Operating System administrators and patching automation build team to schedule and execute patching.
- Perform Cluster health validation post patching and communicate with stakeholders.
- Identify automation and process improvement opportunities related to patching to implement it.
- Documentation and Reporting
- Maintain comprehensive documentation of Hadoop cluster configurations & troubleshooting processes, and procedures.
- Develop and Schedule to Generate reports on cluster usage, performance metrics, and capacity utilization.
- Collaboration and Support
- Ability to work both independently and in collaboration with Platform development and Platform tenant teams to troubleshoot and fix Hadoop platform issues, maintain production stability to enterprise-wide applications.
- Collaborate and consult with key technical experts, senior technology team, and Hadoop vendor to resolve complex technical issues and achieve goals.
Mandatory Skills and Qualifications:
- Proven Experience in technology incident and problem management role.
- Proven Experience in using monitoring tools like Prometheus, Splunk, SiteScope & thousand eyes etc.
- Proven experience as a Unix Scripting
- Proven experience in Hadoop administration or in a similar role.
- Strong knowledge of Hadoop ecosystem and its components (Hive, HBase, Spark, Hue ).
- Proven Experience is using Autosys or similar job scheduler.
- Excellent problem-solving and troubleshooting skills.
- Excellent communication and people skills for collaborating with cross-functional teams.