Networking Solution Test Engineer - AI IB and Ethernet Cluster Debugging

NVIDIA
Shanghai, China | Posted 4 days ago
Full-time | IB | Ethernet | Cluster Debugging | End-to-End Verification

About the Role

We are looking for a networking test engineer with strong system‑level debugging skills to join our End‑to‑End Verification team. You will work on cutting‑edge Ethernet‑based AI clusters and own complex issues across hardware, system software and AI workloads.

In this role, you will design and review test and product requirements across the InfiniBand / Ethernet / NIC / DPU / Switch portfolio, focusing on large‑scale AI cluster behavior. You will build and maintain realistic, customer‑like testbeds that include heterogeneous hardware, OS / driver combinations and complex network fabrics.

A key responsibility is end‑to‑end cluster troubleshooting: reproducing customer scenarios, triaging across the stack, and driving issues to root cause and fix. You will read the relevant source code to identify defects, validate fixes and improve logging and instrumentation, and you will work closely with development teams to debug NCCL, RoCE/RDMA and related networking components using logs, code inspection and targeted experiments.

You will also define tests and guide the automation team in implementing robust suites that produce actionable logs, metrics and traces. You will run Regression, Performance, Functional and Scale testing, analyze the results, and provide clear, data‑driven reports to stakeholders. Finally, you will profile and benchmark deep learning training and inference workloads, correlating model‑level metrics with system and network telemetry to uncover bottlenecks.

Responsibilities

  • Design and review test and product requirements across the InfiniBand / Ethernet / NIC / DPU / Switch portfolio, focusing on large‑scale AI cluster behavior.
  • Build and maintain realistic customer‑like testbeds, including heterogeneous hardware, OS / driver combinations and complex network fabrics.
  • Own end‑to‑end cluster troubleshooting: reproduce customer scenarios, triage across the stack and drive issues to root cause and fix.
  • Read and understand relevant source code to identify defects, validate fixes and improve logging and instrumentation.
  • Collaborate closely with development teams to debug NCCL, RoCE/RDMA and related networking components using logs, code inspection and targeted experiments.
  • Define tests and guide the automation team to implement robust suites that produce actionable logs, metrics and traces.
  • Run Regression, Performance, Functional and Scale testing, analyze results and provide clear, data‑driven reports to stakeholders.
  • Profile and benchmark deep learning training and inference workloads, correlating model‑level metrics with system and network telemetry to uncover bottlenecks.

Requirements

  • B.A./B.Sc. in Computer Science, Electrical Engineering, or equivalent IT/Network/Systems experience.
  • 2+ years of hands‑on networking or system‑level testing and debugging on Linux.
  • Strong Linux networking and debugging skills (for example perf, tcpdump, ethtool, iproute2).
  • Proven production‑grade debugging experience: forming hypotheses, running experiments, and driving issues to root cause under pressure.
  • Expertise in host‑side NIC validation and tuning (offloads, queues, interrupts, firmware/driver interactions).
  • Strong knowledge of AI networking libraries (such as NCCL) and protocols (such as RoCE and RDMA), including performance and correctness debugging.
  • Ability to read and reason about source code (C/C++/Python or similar) and collaborate closely with developers on fixes.
  • Solid scripting and automation skills with Bash / Python / Ansible for setup, log collection, and experiment orchestration.
  • Fast learner, familiar with modern AI tools and workflows, able to adapt quickly.
  • Excellent analytical, problem‑solving and communication skills, with strong ownership and a collaborative mindset.

Qualifications

  • Hands‑on debugging of collective communication libraries (for example NCCL) or large‑scale LLM training / inference clusters.
  • Experience with large cluster environments (tens to thousands of GPUs or nodes), including incident response and post‑mortem analysis.
  • Deep expertise in tuning and debugging congestion control and lossless Ethernet for AI workloads (for example DCQCN, ECN, PFC).
  • Familiarity with NVIDIA networking technologies (for example BlueField / BF3, ConnectX NICs) and their software stack and diagnostics.
  • Experience debugging issues that span multiple layers (L2/L3, transport, AI frameworks) or contributing to open‑source networking / AI systems.

Equal Opportunity

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
