From initial deployment to ongoing managed operations. Every engagement is vendor-neutral: no hardware margin, no supplier bias.
From first conversation to accepted, benchmarked, production-ready cluster.
Monthly retainer. Your cluster stays healthy without a full-time hire.
Review and rewrite your HPC RFP before it goes to tender.
We break your cluster first. Deliberately. In a controlled maintenance window.
Practical training for researchers, administrators, and IT teams.
Research labs, AI startups, pharma R&D teams, engineering simulation groups, and IT resellers delivering HPC projects.
6–11 weeks from purchase order to acceptance, depending on cluster size and site readiness.
A full-time HPC administrator costs ₹8–15L/year in salary, and it is a difficult hire to make and retain. A monthly retainer at ₹30K–60K (₹3.6–7.2L/year) provides specialist expertise on demand without the hiring risk.
University labs, smaller research institutions, and startups that have deployed HPC but don't have dedicated operations expertise in-house.
Government and academic HPC RFPs in India consistently produce underperforming systems. The root cause is almost never hardware quality — it's specification quality. Specs written by procurement teams without HPC domain expertise create perverse vendor incentives.
A reviewed and rewritten specification with measurable acceptance criteria. Vendors can't win on paper and underdeliver in production.
Most HPC clusters are accepted on HPL benchmarks tested under ideal conditions, then handed over. They are never deliberately broken. The first real failure happens in production, at the worst possible moment.
We break your cluster first. Deliberately. In a controlled maintenance window. Across 8–12 specific failure scenarios: node dropout, InfiniBand degradation, storage failure, scheduler stress, memory overcommit, login node loss. We document how your system behaves under each. We fix what's fixable. You get a resilience report before production does.
Pull a compute node mid-job. Does the job fail cleanly, hang silently, or reschedule?
Take a Lustre OST or BeeGFS chunk offline. Does the application crash, hang, or corrupt output silently?
Submit jobs exceeding requested memory. Does cgroup enforcement kill cleanly or does the node swap?
Force a port failure under thermal load. Does MPI degrade gracefully or fail catastrophically?
Submit 500 jobs simultaneously. Kill and restart slurmctld. Does the queue recover cleanly?
Take the primary login node offline. Can users reach the cluster? Is there a secondary path?
Every scenario classified: recovers automatically · requires manual intervention · causes data loss · fails silently. Delivered as a written resilience report with remediation steps. The test your vendor never ran.
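A drill of this kind lends itself to scripting. The sketch below (hypothetical helper names; the scenarios and four-way classification are taken from the report format above) shows one minimal way to record each drill's observed behaviour and render the written resilience report:

```python
from dataclasses import dataclass
from enum import Enum

class Outcome(Enum):
    AUTO_RECOVERY  = "recovers automatically"
    MANUAL         = "requires manual intervention"
    DATA_LOSS      = "causes data loss"
    SILENT_FAILURE = "fails silently"

@dataclass
class Drill:
    scenario: str      # e.g. "compute node pulled mid-job"
    observed: str      # free-text notes taken during the maintenance window
    outcome: Outcome   # classification per the four-way scheme above
    remediation: str   # fix or mitigation to carry into the report

def resilience_report(drills: list[Drill]) -> str:
    """Render drill results as a plain-text resilience report."""
    lines = ["Resilience report", "=" * 17]
    for d in drills:
        lines.append(f"- {d.scenario}: {d.outcome.value}")
        lines.append(f"    observed:    {d.observed}")
        lines.append(f"    remediation: {d.remediation}")
    return "\n".join(lines)

# Illustrative entries only; real values come from the drills themselves.
drills = [
    Drill("compute node pulled mid-job", "job requeued by scheduler",
          Outcome.AUTO_RECOVERY, "none; confirm requeue policy for long jobs"),
    Drill("Lustre OST taken offline", "application hung with no error",
          Outcome.SILENT_FAILURE, "add client-side I/O timeouts and alerting"),
]
print(resilience_report(drills))
```

The value is less in the code than in the discipline: every scenario is forced into exactly one of the four outcome categories, so nothing lands in an unexamined "it seemed fine" bucket.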
Half-day or full-day workshops. Remote via video call or on-site across India. Custom curricula available for specific workload environments (ML/AI, CFD, molecular dynamics, genomics).
Research labs post-deployment, university IT teams taking on HPC responsibility, and PhD/postdoc researchers who use HPC but weren't trained on it.
Emergency HPC incident response. We're on a call within the hour — not a ticket system. Available evenings & weekends IST.
Copyright © 2025 init6 - All Rights Reserved.