Sandia National Laboratories is the nation's premier science and engineering lab for national security and technology innovation, with teams of specialists focused on cutting-edge work in a broad array of areas. Some of the main reasons we love our jobs:
Challenging work with amazing impact that contributes to security, peace, and freedom worldwide
Extraordinary co-workers
Some of the best tools, equipment, and research facilities in the world
Career advancement and enrichment opportunities
Flexible work arrangements for many positions include 9/80 (work 80 hours every two weeks, with every other Friday off) and 4/10 (work 4 ten-hour days each week) compressed workweeks, part-time work, and telecommuting (a mix of onsite work and working from home)
Generous vacation, strong medical and other benefits, competitive 401k, learning opportunities, relocation assistance and amenities aimed at creating a solid work/life balance*
World-changing technologies. Life-changing careers. Learn more about Sandia at: http://www.sandia.gov
*These benefits vary by job classification.
What Your Job Will Be Like
Sandia's artificial intelligence (AI) team is building the U.S. Department of Energy's (DOE) next-generation AI Platform, an integrated scientific AI capability that delivers rapid, high-impact solutions for national security, science, and applied energy missions. The Platform is based on three pillars: Models, Infrastructure, and Data. As a Postdoctoral Appointee, you will join the Infrastructure Pillar team to design, develop, and deploy the unified compute-and-data fabric that underpins all mission workflows, from AI model training and simulation steering to real-time inference at experimental and production facilities.
We anticipate multiple hires for the Infrastructure Pillar that collectively span the set of responsibilities and skills described below. Likewise, postdoctoral appointees will be expected to work in conjunction with their Sandia mentors and teams from across Sandia and other DOE laboratories to deliver on this ambitious, fast-paced project. Importantly, we anticipate that while AI Platform development will leverage existing AI and data science tools extensively, success will also require deep technical insights, considerable innovation, research, and problem solving to address the unique needs of DOE applications. If this sounds like an exciting challenge to you, we look forward to reading your application!
Key Responsibilities
Architect and implement the hybrid compute fabric
Integrate exascale HPC systems with elastic cloud resources and specialized AI accelerator clusters (on-prem and in-cloud)
Deploy ruggedized edge servers and digital-twin infrastructure for sub-millisecond inference and real-time physics simulations
Develop software infrastructure for agentic workflows integrating modeling and simulations with real-time inference
Design secure networking and enclaves and evaluate federated training frameworks on secure networks
Enable observability, provenance, and monitoring
Provide software-defined, dynamic security enclaves for sensitive data with attested runtime and curated egress
Deploy unified logging, metrics, dashboards, and trace-analysis across cloud and on-prem environments using OpenTelemetry, Prometheus, ELK, or equivalent
Support federated identity and access control
Integrate multiple identity providers, attribute-based access controls, and allocation models for risk-shared governance
Manage enterprise licensing, token agreements, and software audits for AI and HPC frameworks
Manage the full lifecycle of the AI platform's infrastructure, including capacity planning, upgrades, documentation, and performance monitoring
Implement and enforce security best practices within container environments, including Role-Based Access Control (RBAC), secrets management, network policies, and vulnerability scanning.
On any given day, you may be called upon to:
Stand up a new GPU-accelerated cluster, configure Slurm/Kubernetes, and validate performance benchmarks
Troubleshoot cross-site data transfers over ESnet and optimize WAN throughput for a petabyte-scale lakehouse
Deploy a hardened enclave for a classified ML training job with differential-privacy egress controls
Evaluate and develop agentic tools that are optimized for exascale HPC systems
Deploy and evaluate DOE and industry training and inference frameworks on HPC systems
Collaborate with the Models team to tune network and storage parameters for distributed training jobs
Publish and present fundamental insights to laboratory and academic audiences
Present real-time infrastructure status and forecasts to stakeholders
Present prototype demos and research results to stakeholders across DOE, DoD, IC, and industry
Our AI initiative is a laboratory-wide effort. Candidates may be considered for placement in other organizations throughout the labs. The selected applicant can work a combination of onsite and offsite. The selected applicant must live within a reasonable distance for commuting to the assigned work location when necessary.
Qualifications We Require
Possess, or are pursuing, a PhD in Computer Science, Electrical Engineering, Mathematics, or a related science or engineering field; the PhD must be conferred within five years prior to employment
Ability to acquire and maintain a DOE Q clearance
Qualifications We Desire
Strong collaboration skills in dynamic, interdisciplinary teams and experience mentoring junior engineers
Excellent written and verbal communication skills for both technical and non-technical audiences
Proven ability to work and communicate effectively in a collaborative and interdisciplinary team environment.
Focus on high-performance or distributed computing
Experience with:
Container orchestration (Kubernetes, Docker) and infrastructure as code (Terraform, Ansible)
Performance evaluation of agentic and federated training frameworks on a variety of training and inference workflows
AI software infrastructure deployment on complex infrastructure
Deploying large-scale secure enclaves for CUI and RD applications
Integrating experimental facilities, robotics, or 3D-printing systems into automated AI workflows
DOE/NNSA compute and networking environments (Frontier, Aurora, Perlmutter, ESnet)
Networking: ESnet, VLANs, WAN overlays, encryption, and failover design
Observability toolchains: OpenTelemetry, Prometheus, Grafana, ELK, and automated log analytics
Secure enclave technologies and zero-trust security models
Implementing DevSecOps principles and security best practices in containerized infrastructure, including network policies, classification management, and vulnerability scanning
SIEM environments, such as Splunk, and operational management of application infrastructure services
Working in cross-lab federated teams with shared governance and risk models
Deep knowledge of performance issues related to training large AI models on HPC systems
Background in digital-twin or real-time simulation steering architectures
Ability to obtain and maintain an SCI clearance, which may require a polygraph test
You will be part of a multi-disciplinary, mission-focused team delivering the computing and data backbone for transformative AI systems. Occasional travel may be required. If you're passionate about building the software infrastructure that powers cutting-edge AI systems, we want to hear from you.
About Our Team
The Center for Computing Research (CCR) at Sandia creates technology and solutions for many of our nation's most demanding national security challenges. The Center's portfolio spans the spectrum from fundamental research to state-of-the-art applications. Our work includes computer system architecture (both hardware and software); enabling technology for modeling physical and engineering systems; and research in discrete mathematics, data analytics, cognitive modeling, and decision support materials.
Posting Duration
This posting will be open for application submissions for a minimum of seven (7) calendar days, including the 'posting date'. Sandia reserves the right to extend the posting date at any time.
Security Clearance
Sandia is required by DOE to conduct a pre-employment drug test and background review that includes checks of personal references, credit, law enforcement records, and employment/education verifications. Applicants for employment need to be able to obtain and maintain a DOE Q-level security clearance, which requires U.S. citizenship. If you hold more than one citizenship (i.e., of the U.S. and another country), your ability to obtain a security clearance may be impacted.
Applicants offered employment with Sandia are subject to a federal background investigation to meet the requirements for access to classified information or matter if the duties of the position require a DOE security clearance. Substance abuse or illegal drug use, falsification of information, criminal activity, serious misconduct or other indicators of untrustworthiness can cause a clearance to be denied or terminated by DOE, resulting in the inability to perform the duties assigned and subsequent termination of employment.
EEO
All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, or veteran status and any other protected class under state or federal law.
NNSA Requirements for MedPEDs
If you have a Medical Portable Electronic Device (MedPED), such as a pacemaker, defibrillator, drug-releasing pump, hearing aids, or diagnostic equipment and other equipment for measuring, monitoring, and recording body functions such as heartbeat and brain waves, you may be required to comply with NNSA security requirements for MedPEDs if employed by Sandia National Laboratories.
If you have a MedPED and you are selected for an on-site interview at Sandia National Laboratories, there may be additional steps necessary to ensure compliance with NNSA security requirements prior to the interview date.
Position Information
This postdoctoral position is a temporary position for up to one year, which may be renewed at Sandia's discretion up to five additional years. The PhD must have been conferred within five years prior to employment.
Individuals in postdoctoral positions may bid on regular Sandia positions as internal candidates, and in some cases may be converted to regular career positions during their term if warranted by ongoing operational needs, continuing availability of funds, and satisfactory job performance.
Sandia National Laboratories is one of the country’s largest research and engineering laboratories, employing 8,500 people at major facilities in Albuquerque, New Mexico and Livermore, California. We apply our world-class scientific and engineering creativity and expertise to comprehensive, timely, and cost-effective solutions to our nation’s greatest challenges. Please visit our website at www.sandia.gov.