Location: REMOTE

Salary: $130,000.00 USD Annually - $145,000.00 USD Annually

Description: Our client is currently seeking a Risk & Reliability Engineer to work with Plant Utilities (boilers, tanks, chillers, fire protection, etc.) at multiple locations in a remote, traveling position. The Engineer will ensure that project goals are achieved within Compliance and Risk Management, Asset Integrity & Reliability, Project Management, and Engineering Support. This position requires 50-60% travel, including some international travel. The Engineer will assist Audit & Inspection Teams, take their results, and, within project scopes, ensure that the company is meeting Risk & Safety compliance while also assisting with Asset Integrity & Reliability Programs to ensure the long-term viability of company equipment.

Interested candidates should contact Lee Douglas @ 610 639-6573 or send a current resume to [email protected]

This job will have the following responsibilities:
Managing and directing regulatory-required inspections of company assets, namely tanks, piping, diking/containment, fire protection systems, and boilers.
Maintaining records for inspections of company assets as per regulatory requirements and company standards.
Recommending actions to repair or restore assets to good working order and regulatory compliance.
Ensuring critical assets are in good working order.
Advising condemnation of assets that are poorly maintained, require major repair, are not suitable for service, and/or whose restoration is financially prohibitive or infeasible.
Supporting an asset lifecycle management program that ensures the reasonable, long-term reliability of property, plants, and equipment (PP&E).
Assessing operations and providing recommendations for proper maintenance of company assets.
Supporting the business team in scope and cost estimate development to provide the best long-term value within an approved project scope.
Implementing the approved project scope using the Project Work Process, including established work practices and templates.
For Risk Management derived projects, organizing and managing an integrated project team that includes resources for process engineering, accounting, receiving, engineering, procurement, construction, start up, plant support, and technical support.
Advising on the proper maintenance and utilization of company assets and unit operations.
Writing and reviewing global standard specifications, procedures, and policies.
Implementing Asset Integrity, Reliability, Compliance, and Risk Management action items in accordance with the Project Work Process.
Ensuring the business operations remain legally compliant with respect to occupational safety, environmental compliance, electrical safety, and fire protection standards.

Qualifications & Requirements:
4 Year College Degree in Engineering Discipline
5+ years' experience in Reliability plant work, project execution and/or Risk Assessment
Understanding regulations and instructing sites on the topics of electrical safety, fire protection, & spill prevention best practices in accordance with recommendations in NFPA, IFC, and IBC
Proficient in MS programs
50% of work environment will be office based and 50% will be on the plant floor with the equipment and live operation
Ability to travel overnight, up to 50%
PMI Risk Management Professional (PMI-RMP)

Contact: [email protected]

This job and many more are available through The Judge Group. Find us on the web at www.judge.com

Senior Modeling, Simulation, and Reliability Engineer

Medtronic, Santa Ana

At Medtronic you can begin a life-long career of exploration and innovation, while helping champion healthcare access and equity for all. You'll lead with purpose, breaking down barriers to innovation in a more connected, compassionate world.

A Day in the Life

Medtronic is looking for a highly motivated individual to join the Implant Performance Modeling, Simulation, and Reliability Engineering team at our Santa Ana, CA facility. As a Modeling, Simulation and Reliability Engineer, you will develop and deploy state-of-the-art capabilities to support product development teams in efficiently and predictably developing novel therapies for the Structural Heart and Aortic (SH&A) operating unit.

Your responsibilities will include:
Designing, developing, and validating new first principles and computational models to simulate and predict the performance of SH&A implants
Designing, developing, and validating advanced test methods, measurement techniques, and fixtures to evaluate the performance of SH&A implants in clinically relevant boundary and use conditions
Creating and executing integrated strategies to evaluate the performance and reliability of SH&A implants using bench testing and modeling, where appropriate, to support the entire product development lifecycle including prototyping, concept selection, design verification/validation, and field support
Contributing to the development of design solutions that address broad development problems involving class III implantable devices and accessories
Preparing submission level documentation for the FDA and other regulatory bodies to describe performance evaluation methodology and demonstrate credibility and acceptability of results
Interfacing with internal and external stakeholders to identify and translate customer/business needs into requirements and concepts
Collaborating with subject matter experts to ensure performance assessment methodologies are clinically relevant and credibly justified
Working effectively with Core and extended team members and other staff, across multiple geographic locations, to achieve goals
Developing subject matter expertise to support multiple projects in meeting business, technical, and regulatory requirements
Directing the activities of technicians and engineers to ensure project milestones are completed correctly and on schedule
Generating intellectual property (IP) and documenting invention disclosures to secure IP on therapies, procedures, and associated devices
Collaborating with internal and external stakeholders to develop material suitable for publication in external conferences and academic journals
Applying techniques from machine learning and data science to automate and accelerate existing processes or enable new capabilities
Organizing, preparing, and presenting technical information material in a clear, effective, and skillful manner to internal and external stakeholders including peers, management, and customers
Reviewing and approving document change requests for design, materials, and process documentation. Creating new documents for performance, design, and process specifications, test/modeling protocols, and reports

Must Have: Minimum Requirements
Bachelor's degree required
2-4 years of relevant work experience or advanced degree with relevant training

The ideal candidate for this position should have the following qualifications and qualities:
B.S., M.S., or Ph.D. in Biomedical, Mechanical, Aerospace, or Chemical Engineering, Physics, or Applied Mathematics, or another related field
Strong understanding of theoretical, experimental, and computational fluid dynamics. Experience working with Ansys LS-DYNA is preferred.
Strong understanding of engineering mechanics, mechanical design, and structure/properties of materials
Working knowledge of human anatomy and physiology, particularly the cardiovascular system, and the relevant physical, chemical, and biological phenomena that impact cardiovascular implants
Previous experience developing and validating test methods is preferred
Previous experience with computational model validation, including the application of the ASME V&V 40 standard, is preferred
Experience creating CAD models and professional engineering drawings (SolidWorks preferred)
Experience using common engineering software tools. MATLAB and Python experience is preferred.
Experience working with medical imaging and nondestructive testing techniques.
Strong knowledge of data analysis and statistical techniques. Experience working with Minitab or JMP is preferred. Experience with reliability engineering is preferred.
Understanding of applied engineering principles and best practices including design for reliability/manufacturability, lean six sigma, etc. Demonstrated experience applying these tools and techniques is preferred.
Understanding of the Product Development Process (PDP) and the principles of Engineering Design Control. Prior work experience within the PDP is preferred.
Strong written and verbal communication skills.
Open-minded, self-motivated, collaborative, curious, strategic, hands-on, and able to succeed in both collaborative and independent environments
Prior experience working in highly regulated industries, with experience working in the design and application of minimally invasive implantable devices being preferred

Physical Job Requirements

The above statements are intended to describe the general nature and level of work being performed by employees assigned to this position, but they are not an exhaustive list of all the required responsibilities and skills of this position. The physical demands described within the Responsibilities section of this job description are representative of those that must be met by an employee to successfully perform the essential functions of this job. Reasonable accommodations may be made to enable individuals with disabilities to perform the essential functions. For Office Roles: While performing the duties of this job, the employee is regularly required to be independently mobile. The employee is also required to interact with a computer, and communicate with peers and co-workers. Contact your manager or local HR to understand the Work Conditions and Physical requirements that may be specific to each role.

Benefits & Compensation

Medtronic offers a competitive Salary and flexible Benefits Package

A commitment to our employees lives at the core of our values. We recognize their contributions. They share in the success they help to create. We offer a wide range of benefits, resources, and competitive compensation plans designed to support you at every career and life stage.

Salary ranges for U.S (excl. PR) locations (USD): $100,000 - $151,000

The base salary range is applicable across the United States, excluding Puerto Rico and specific locations in California. The offered rate complies with federal and local regulations and may vary based on factors such as experience, certification/education, market conditions, and location. Compensation and benefits information pertains solely to candidates hired within the United States (local market compensation and benefits will apply for others).

Medtronic benefits and compensation plans

About Medtronic

We lead global healthcare technology and boldly attack the most challenging health problems facing humanity by searching out and finding solutions.

Our Mission - to alleviate pain, restore health, and extend life - unites a global team of 90,000+ passionate people. We are engineers at heart, putting ambitious ideas to work to generate real solutions for real people. From the R&D lab, to the factory floor, to the conference room, every one of us experiments, creates, builds, improves and solves. We have the talent, diverse perspectives, and guts to engineer the extraordinary.

Learn more about our business, mission, and our commitment to diversity here.

It is the policy of Medtronic to provide equal employment opportunity (EEO) to all persons regardless of age, color, national origin, citizenship status, physical or mental disability, race, religion, creed, gender, sex, sexual orientation, gender identity and/or expression, genetic information, marital status, status with regard to public assistance, veteran status, or any other characteristic protected by federal, state or local law. In addition, Medtronic will provide reasonable accommodations for qualified individuals with disabilities.

At Medtronic, most positions are posted on our career site for 3-7 days.

Staff Site Reliability Engineer (SRE)

Cribl, San Francisco

Cribl does differently. What does that mean? It means we are a serious company that doesn't take itself too seriously; and we're looking for people who love to get stuff done, and laugh a bit along the way. We're growing rapidly and looking for collaborative, curious, and motivated team members who are passionate about putting customers first. As a remote-first company we believe in empowering our employees to do their best work, wherever they are. As the data engine for IT and Security, many of the biggest names in the most demanding industries trust Cribl to solve their most pressing data needs. Ready to do the best work of your career? Join the herd and unlock your opportunity.

Why you'll love this role:

Cribl Inc is seeking a Staff Site Reliability Engineer to join our mission to unlock the value of all observability data. Cribl provides users a new level of observability, intelligence and control over their real-time data. You will join a team of technical engineers who are committed to shipping only high-quality software and enjoying all the goat gifs the internet has to offer. This role is remote and you will be part of the engineering organization, where you will contribute to our efforts to envision, create, deploy, test, and ship Cribl products. Not often do you get to be part of something that is fundamentally changing a technology. But here at Cribl we are building the next generation of software that puts our customers in full control of their observability data. If this is something that interests you, and you want to be truly at the center of the wheel helping make this work better every day, then this opportunity might be what you have been waiting for: a chance to make a real impact.

We are looking for Cloud Site Reliability Engineers and Developers at all levels at Cribl, who enjoy being in the thick of it. Fixing things at the operational side should always be the last resort, so our SRE engineers are involved from conception to design to development and all the way through production and beyond. You provide your creative input into all things Cloud, Scaling, Reliability, High Availability and much more. If reliability is your passion, and you have always had strong opinions on how to make things better and the desire to build consensus around ideas, then let's talk!

As An Active Member Of Our Team, You Will...
Engage with teams and improve service delivery and reliability across their entire lifecycle
Measure and monitor all production systems with an eye towards availability, latency and overall system health
Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
Help identify and drive down toil with creative innovation and automation
Participate in on-call responsibilities

If You Got It, We Want It
Extensive experience with enterprise scale continuous delivery environments
8+ years of experience with a DevOps or SRE job title
Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible
Experience with sustainable incident response in a blameless environment
Knowledge of cloud platforms (prefer AWS) and container + orchestration technologies
Experience with APM and Observability and related tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, Sentry, etc.
Background in Linux Systems Engineering
Experience with incident response related tools, for instance PagerDuty, FireHydrant, Blameless, etc.
Comfortable with a high level of autonomy and working with a distributed team

Preferred Qualifications
Knowledge of Cloud and application security
Strong knowledge of cloud design patterns for scale, data management, resiliency, etc.
A love for high quality and a knack for testing
Opinions about dashboards, metrics, and SLOs

Salary Range ($144,000 - $278,000)

The salary for this role is dependent on geographic location. The salary offered within the range described will be based on the individual candidate's job-related knowledge, skills, and experience. In addition to a competitive salary, Cribl also offers a generous benefits package which includes health, dental, vision, short-term disability, and life insurance, paid holidays and paid time off, a fertility treatment benefit, 401(k), equity, and eligibility for a discretionary company-wide bonus.

Bring Your Whole Self

Diversity drives innovation, enables better decisions to support our customers, and inspires change for the better. We're building a culture where differences are valued and welcomed, and we work together to bring out the best in each other. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, or any other applicable legally protected characteristics in the location in which the candidate is applying.

Interested in joining the Cribl herd? Learn more about the smartest, funniest, most passionate goats you'll ever meet at cribl.io/about-us.

Datacenter Site Reliability Engineer

Kforce Inc, Los Angeles

RESPONSIBILITIES:
Kforce is immediately adding a full-time Datacenter Site Reliability Engineer in support of our industry leading technology development client in Los Angeles, CA. Our client is seeking candidates who are driven to make positive changes to the way people live and work by creating breakthrough technology solutions.

Overview:
Data monitoring and alerting, data quality assurance and anomaly detection
Document team processes and policies, including methods of engagement and SLOs
Analyze, design, and implement solutions at the system level to remove bottlenecks and improve edge service performance
Implement monitoring and alerting to improve issue detection and response
Work in a fast-paced environment; participate in technical operations and rotations in response to performance and reliability issues
Participate in on-call rotations, responsible for resolving or escalating incoming events
Maintain and operate a Linux and Kubernetes environment

REQUIREMENTS:
Bachelor's degree or above, majoring in Computer Science or related fields, with at least 2 years of related work experience
3+ years of experience working with Unix/Linux systems from kernel to shell and beyond, with experience working with system libraries, file systems, and client-server protocols
2+ years of experience coding Python scripts for platform operations
Experience in networking technologies such as TCP/IP, BGP, DNS, etc. in a carrier-grade environment
Experience in developing and operating one or more of the following systems: OpenStack, Kubernetes, Nginx, ipvs, ELK stack, Hadoop, etc.

The pay range is the lowest to highest compensation we reasonably in good faith believe we would pay at posting for this role. We may ultimately pay more or less than this range. Employee pay is based on factors like relevant education, qualifications, certifications, experience, skills, seniority, location, performance, union contract and business needs. This range may be modified in the future.
We offer comprehensive benefits including medical/dental/vision insurance, HSA, FSA, 401(k), and life, disability & ADD insurance to eligible employees. Salaried personnel receive paid time off. Hourly employees are not eligible for paid time off unless required by law. Hourly employees on a Service Contract Act project are eligible for paid sick leave. Note: Pay is not considered compensation until it is earned, vested and determinable. The amount and availability of any compensation remains in Kforce's sole discretion unless and until paid and may be modified in its discretion consistent with the law. This job is not eligible for bonuses, incentives or commissions. Kforce is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, pregnancy, sexual orientation, gender identity, national origin, age, protected veteran status, or disability status.

Datacenter Site Reliability Engineer

Kforce Inc, Los Angeles

RESPONSIBILITIES:
Kforce is immediately adding a full-time Datacenter Site Reliability Engineer in support of our industry leading technology development client in Los Angeles, CA. Our client is seeking candidates who are driven to make positive changes to the way people live and work by creating breakthrough technology solutions.

Overview:
Data monitoring and alerting, data quality assurance and anomaly detection
Document team processes and policies, including methods of engagement and SLOs
Analyze, design, and implement solutions at the system level to remove bottlenecks and improve edge service performance
Implement monitoring and alerting to improve issue detection and response
Work in a fast-paced environment; participate in technical operations and rotations in response to performance and reliability issues
Participate in on-call rotations, responsible for resolving or escalating incoming events
Maintain and operate a Linux and Kubernetes environment

REQUIREMENTS:
Bachelor's degree or above, majoring in Computer Science or related fields, with at least 2 years of related work experience
3+ years of experience working with Unix/Linux systems from kernel to shell and beyond
Experience working with system libraries, file systems, and client-server protocols
Experience reading Python scripts for platform operations
Experience in networking technologies such as TCP/IP, BGP, DNS, etc. in a carrier-grade environment
Experience in developing and operating one or more of the following systems: OpenStack, Kubernetes, Nginx, ipvs, ELK stack, Hadoop, etc.

The pay range is the lowest to highest compensation we reasonably in good faith believe we would pay at posting for this role. We may ultimately pay more or less than this range. Employee pay is based on factors like relevant education, qualifications, certifications, experience, skills, seniority, location, performance, union contract and business needs. This range may be modified in the future.
We offer comprehensive benefits including medical/dental/vision insurance, HSA, FSA, 401(k), and life, disability & ADD insurance to eligible employees. Salaried personnel receive paid time off. Hourly employees are not eligible for paid time off unless required by law. Hourly employees on a Service Contract Act project are eligible for paid sick leave. Note: Pay is not considered compensation until it is earned, vested and determinable. The amount and availability of any compensation remains in Kforce's sole discretion unless and until paid and may be modified in its discretion consistent with the law. This job is not eligible for bonuses, incentives or commissions. Kforce is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, pregnancy, sexual orientation, gender identity, national origin, age, protected veteran status, or disability status.

Staff Site Reliability Engineer (SRE)

Cribl, San Francisco

Cribl does differently. What does that mean? It means we are a serious company that doesn't take itself too seriously; and we're looking for people who love to get stuff done, and laugh a bit along the way. We're growing rapidly and looking for collaborative, curious, and motivated team members who are passionate about putting customers first. As a remote-first company we believe in empowering our employees to do their best work, wherever they are. As the data engine for IT and Security, many of the biggest names in the most demanding industries trust Cribl to solve their most pressing data needs. Ready to do the best work of your career? Join the herd and unlock your opportunity.

Why you'll love this role:

Cribl Inc is seeking a Senior Site Reliability Engineer to join our mission to unlock the value of all observability data. Cribl provides users a new level of observability, intelligence and control over their real-time data. You will join a team of technical engineers who are committed to shipping only high-quality software and enjoying all the goat gifs the internet has to offer. This role is remote and you will be part of the engineering organization, where you will contribute to our efforts to envision, create, deploy, test, and ship Cribl products. Not often do you get to be part of something that is fundamentally changing a technology. But here at Cribl we are building the next generation of software that puts our customers in full control of their observability data. If this is something that interests you, and you want to be truly at the center of the wheel helping make this work better every day, then this opportunity might be what you have been waiting for: a chance to make a real impact.

We are looking for Cloud Site Reliability Engineers and Developers at all levels at Cribl, who enjoy being in the thick of it. Fixing things at the operational side should always be the last resort, so our SRE engineers are involved from conception to design to development and all the way through production and beyond. You provide your creative input into all things Cloud, Scaling, Reliability, High Availability and much more. If reliability is your passion, and you have always had strong opinions on how to make things better and the desire to build consensus around ideas, then let's talk!

As An Active Member Of Our Team, You Will...
Engage with teams and improve service delivery and reliability across their entire lifecycle
Measure and monitor all production systems with an eye towards availability, latency and overall system health
Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
Help identify and drive down toil with creative innovation and automation
Participate in on-call responsibilities

If You Got It, We Want It
Extensive experience with enterprise scale continuous delivery environments
5+ years of experience with a DevOps or SRE job title
Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible
Experience with sustainable incident response in a blameless environment
Knowledge of cloud platforms (prefer AWS) and container + orchestration technologies
Experience with APM and Observability and related tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, Sentry, etc.
Background in Linux Systems Engineering
Experience with incident response related tools, for instance PagerDuty, FireHydrant, Blameless, etc.
Comfortable with a high level of autonomy and working with a distributed team

Preferred Qualifications
Knowledge of Cloud and application security
Strong knowledge of cloud design patterns for scale, data management, resiliency, etc.
A love for high quality and a knack for testing
Opinions about dashboards, metrics, and SLOs

Salary Range ($152,000 - $230,500)

The salary for this role is dependent on geographic location. The salary offered within the range described will be based on the individual candidate's job-related knowledge, skills, and experience. In addition to a competitive salary, Cribl also offers a generous benefits package which includes health, dental, vision, short-term disability, and life insurance, paid holidays and paid time off, a fertility treatment benefit, 401(k), equity, and eligibility for a discretionary company-wide bonus.

Bring Your Whole Self

Diversity drives innovation, enables better decisions to support our customers, and inspires change for the better. We're building a culture where differences are valued and welcomed, and we work together to bring out the best in each other. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, or any other applicable legally protected characteristics in the location in which the candidate is applying.

Interested in joining the Cribl herd? Learn more about the smartest, funniest, most passionate goats you'll ever meet at cribl.io/about-us.

Staff Site Reliability Engineer (SRE)

Cribl, Sacramento

Cribl does differently. What does that mean? It means we are a serious company that doesn't take itself too seriously; and we're looking for people who love to get stuff done, and laugh a bit along the way. We're growing rapidly and looking for collaborative, curious, and motivated team members who are passionate about putting customers first. As a remote-first company we believe in empowering our employees to do their best work, wherever they are. As the data engine for IT and Security, many of the biggest names in the most demanding industries trust Cribl to solve their most pressing data needs. Ready to do the best work of your career? Join the herd and unlock your opportunity.

Why you'll love this role:

Cribl Inc is seeking a Senior Site Reliability Engineer to join our mission to unlock the value of all observability data. Cribl provides users a new level of observability, intelligence and control over their real-time data. You will join a team of technical engineers who are committed to shipping only high-quality software and enjoying all the goat gifs the internet has to offer. This role is remote and you will be part of the engineering organization, where you will contribute to our efforts to envision, create, deploy, test, and ship Cribl products. Not often do you get to be part of something that is fundamentally changing a technology. But here at Cribl we are building the next generation of software that puts our customers in full control of their observability data. If this is something that interests you, and you want to be truly at the center of the wheel helping make this work better every day, then this opportunity might be what you have been waiting for: a chance to make a real impact.

We are looking for Cloud Site Reliability Engineers and Developers at all levels at Cribl, who enjoy being in the thick of it. Fixing things at the operational side should always be the last resort, so our SRE engineers are involved from conception to design to development and all the way through production and beyond. You provide your creative input into all things Cloud, Scaling, Reliability, High Availability and much more. If reliability is your passion, and you have always had strong opinions on how to make things better and the desire to build consensus around ideas, then let's talk!

As An Active Member Of Our Team, You Will...
Engage with teams and improve service delivery and reliability across their entire lifecycle
Measure and monitor all production systems with an eye towards availability, latency and overall system health
Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
Help identify and drive down toil with creative innovation and automation
Participate in on-call responsibilities

If You Got It, We Want It
Extensive experience with enterprise scale continuous delivery environments
5+ years of experience with a DevOps or SRE job title
Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible
Experience with sustainable incident response in a blameless environment
Knowledge of cloud platforms (prefer AWS) and container + orchestration technologies
Experience with APM and Observability and related tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, Sentry, etc.
Background in Linux Systems Engineering
Experience with incident response related tools, for instance PagerDuty, FireHydrant, Blameless, etc.
Comfortable with a high level of autonomy and working with a distributed team

Preferred Qualifications
Knowledge of Cloud and application security
Strong knowledge of cloud design patterns for scale, data management, resiliency, etc.
A love for high quality and a knack for testing
Opinions about dashboards, metrics, and SLOs

Salary Range ($152,000 - $230,500)

The salary for this role is dependent on geographic location. The salary offered within the range described will be based on the individual candidate's job-related knowledge, skills, and experience. In addition to a competitive salary, Cribl also offers a generous benefits package which includes health, dental, vision, short-term disability, and life insurance, paid holidays and paid time off, a fertility treatment benefit, 401(k), equity, and eligibility for a discretionary company-wide bonus.

Bring Your Whole Self

Diversity drives innovation, enables better decisions to support our customers, and inspires change for the better. We're building a culture where differences are valued and welcomed, and we work together to bring out the best in each other. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, or any other applicable legally protected characteristics in the location in which the candidate is applying.

Interested in joining the Cribl herd? Learn more about the smartest, funniest, most passionate goats you'll ever meet at cribl.io/about-us.

Senior Site Reliability Engineer - Internal AI Research Clusters

Nvidia Corporation, Santa Clara

NVIDIA is the leader in AI, machine learning, and datacenter acceleration. NVIDIA is expanding that leadership into datacenter networking with Ethernet switches, NICs, and DPUs. NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI - the next era of computing. NVIDIA is a "learning machine" that constantly evolves by adapting to new opportunities that are hard to solve, that only we can tackle, and that matter to the world. This is our life's work, to amplify human imagination and intelligence. Make the choice, join our diverse team today!

As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that run demanding deep learning, high performance computing, and computationally intensive workloads. We seek an expert to identify architectural changes and/or completely new approaches for our GPU Compute Clusters. As a Site Reliability Engineer, you will help us with the strategic challenges we encounter, including: compute, networking, and storage design for large scale, high performance workloads; effective resource utilization in a heterogeneous compute environment; evolving our private/public cloud strategy; capacity modeling; and growth planning across our global computing environment.

What you'll be doing:

In this role you will be building and improving our ecosystem around GPU-accelerated computing, including developing large scale automation solutions. You will also be maintaining and building deep learning clusters at scale and supporting our researchers in running their flows on our clusters, including performance analysis and optimization of deep learning workflows.
In this role, you will design, implement, and support operational and reliability aspects of large scale distributed systems with a focus on performance at scale, real-time monitoring, logging, and alerting. Additional responsibilities include:
- Troubleshoot, diagnose, and root-cause system failures, and isolate the components and failure scenarios while working with internal and external stakeholders
- Finding and fixing problems before they occur
- Building automation for cluster bring-up and scaled-up operation
- Improving operational excellence and processes
- Write and review code, develop documentation and capacity plans, and debug the hardest problems, live, on some of the largest and most complex systems in the world

What we need to see:
- Bachelor's degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience, with a minimum of 5 years of experience designing and operating large scale compute infrastructure
- Experience analyzing and tuning performance for a variety of AI/HPC workloads, including workflows that use MPI
- Working knowledge of cluster configuration management tools such as Ansible, Puppet, and Salt, experience with advanced AI/HPC job schedulers, and ideally familiarity with schedulers such as Slurm, K8s, RTDA, or LSF
- In-depth understanding of container technologies like Docker, Singularity, Shifter, and Charliecloud
- Experience learning development languages (Python, C++, Rust, PHP/Hack, Go, Java)
- Experience configuring and running infrastructure level applications, such as Kubernetes, Terraform, MySQL, etc.
- Proactively create experiments and tooling to detect and diagnose hardware/firmware/software health issues

Ways to stand out from the crowd:
- Experience with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking
- Background with Machine Learning and Deep Learning concepts, algorithms, and models
- Familiarity with InfiniBand with IPoIB and RDMA
- Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads
- Familiarity with deep learning frameworks like PyTorch and TensorFlow

NVIDIA offers highly competitive salaries and a comprehensive benefits package. We have some of the most brilliant and talented people in the world working for us and, due to unprecedented growth, our world-class engineering teams are growing fast. If you're a creative and autonomous engineer with real passion for technology, we want to hear from you.

The base salary range is 148,000 USD - 276,000 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Datacenter Site Reliability Engineer

Kforce Inc, Los Angeles

RESPONSIBILITIES:

Kforce is immediately adding a full-time Datacenter Site Reliability Engineer in support of our industry leading technology development client in Los Angeles, CA. Our client is seeking candidates who are driven to make positive changes to the way people live and work by creating breakthrough technology solutions.

Overview:
- Data monitoring and alerting, data quality assurance, and anomaly detection
- Document team processes and policies, including methods of engagement and SLOs
- Analyze, design, and implement solutions at the system level to remove bottlenecks and improve edge service performance
- Implement monitoring and alerting to improve issue detection and response
- Work in a fast-paced environment; participate in technical operations and rotations in response to performance and reliability issues
- Participate in on-call rotations, responsible for resolving or escalating incoming events
- Maintain and operate a Linux and Kubernetes environment

REQUIREMENTS:
- Bachelor's degree or above, majoring in Computer Science or related fields, with at least 2 years of related work experience
- 3+ years of experience working with Unix/Linux systems from kernel to shell and beyond
- Experience working with system libraries, file systems, and client-server protocols
- Experience reading Python scripts for platform operations
- Experience in networking technologies such as TCP/IP, BGP, DNS, etc. in a carrier-grade environment
- Experience in developing and operating one or more of the following systems: OpenStack, Kubernetes, Nginx, ipvs, ELK stack, Hadoop, etc.

The pay range is the lowest to highest compensation we reasonably in good faith believe we would pay at posting for this role. We may ultimately pay more or less than this range. Employee pay is based on factors like relevant education, qualifications, certifications, experience, skills, seniority, location, performance, union contract and business needs. This range may be modified in the future.
We offer comprehensive benefits including medical/dental/vision insurance, HSA, FSA, 401(k), and life, disability & AD&D insurance to eligible employees. Salaried personnel receive paid time off. Hourly employees are not eligible for paid time off unless required by law. Hourly employees on a Service Contract Act project are eligible for paid sick leave. Note: Pay is not considered compensation until it is earned, vested and determinable. The amount and availability of any compensation remains in Kforce's sole discretion unless and until paid and may be modified in its discretion consistent with the law. This job is not eligible for bonuses, incentives or commissions.

Kforce is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, pregnancy, sexual orientation, gender identity, national origin, age, protected veteran status, or disability status.

By clicking "Apply Today" you agree to receive calls, AI-generated calls, text messages or emails from Kforce and its affiliates, and service providers. Note that if you choose to communicate with Kforce via text messaging the frequency may vary, and message and data rates may apply. Carriers are not liable for delayed or undelivered messages. You will always have the right to cease communicating via text by using key words such as STOP.

Datacenter Site Reliability Engineer

Kforce Inc, Los Angeles