Job ID : 1573183 Date Updated : January 22nd, 2026

Minatomirai | Infrastructure Platform Engineer @ AI startup

Hiring Company	Unsung Fields Corp.
Location	Kanagawa Prefecture, Yokohama-shi Nishi-ku
Job Type	Permanent Full-time
Salary	8 million yen ~ 14 million yen

Work Style

Casual Clothing, Minimal Overtime

Job Description

Infrastructure Platform Engineer

Role overview

As an Infrastructure Engineer, you will own the GPU platform that runs production inference: cluster architecture, deployment reliability, observability, capacity management, and incident response mechanisms. Your job is to make the platform predictable and reliable—even as we scale hardware, models, tenants, and traffic patterns.

You’ll work closely with serving/runtime and gateway teams to ensure the platform enforces the right isolation, exposes the right telemetry, and supports safe changes without downtime. This role blends strong systems intuition with real production discipline: reliable rollouts, clean operational tooling, and fast incident response.

Responsibilities

Own GPU cluster architecture and operations: provisioning, node images, driver/runtime lifecycle, GPU plugin/operator lifecycle, and standardized deployment patterns for serving pools and system services.
Define and maintain the production baseline: golden node configurations, cluster hardening, upgrade paths, and “known good” compatibility matrices (drivers ↔ CUDA ↔ runtime ↔ kernel).
Build reliability into the platform: SLOs/SLIs, alerting quality, runbooks, incident tooling, and postmortems with real follow-through (automation, guardrails, and elimination of repeat incidents).
Enable safe delivery: canary deploys, progressive rollouts, rollback paths, and configuration safety (validation, guardrails, change controls, and safe defaults).
Own fleet health and maintenance workflows: node draining, GPU quarantining, automated remediation, scheduled maintenance, and safe “break-glass” procedures with auditability.
Capacity and utilization: scheduling constraints, binpacking/fragmentation management, warm pools, autoscaling primitives, and quota enforcement hooks that align with product tiers and fairness goals.
Observability: metrics/logs/tracing across gateway → serving → GPU; latency breakdowns, saturation signals, queue depth, GPU memory/compute metrics, and fleet health dashboards that help correlate customer symptoms to root causes.
Production readiness for heterogeneous environments: manage differences across hardware generations and evolving server platforms, minimizing reliability risk while improving utilization.
Security baseline: secrets management, least-privilege access, audit trails for operator actions, and secure operational workflows.
Partner with networking: topology, failure domains, load balancing, and performance-sensitive traffic paths that impact tail latency and availability.
Build operational tooling: fleet management, debugging workflows, safe admin actions, capacity tooling, and maintenance automation that reduces MTTR and improves operator efficiency.
Collaborate across teams: align rollout plans, health semantics, capacity signals, and failure handling so the entire platform behaves predictably under load.

[Employment Type]
Full-time employee
*Probationary period: 3 months

[Salary]
Annual Salary: ¥8,000,000 - ¥14,000,000
Monthly Salary: ¥666,667 - ¥1,166,667 (Monthly Base Salary: ¥666,667 - ¥1,166,667)
■Salary Increases: Available

[Working Hours]
9:00 AM - 6:00 PM (60-minute break)

[Work Location]
Queen's Tower A, 10th Floor, 2-3-1 Minatomirai, Nishi-ku, Yokohama, Kanagawa Prefecture, 220-6010
■Access: 7-minute walk from Sakuragicho Station (all lines), direct access from Minatomirai Station (Toyoko Line, Minatomirai Line)
■Non-smoking workplace
■Changes to work location: Company-designated offices
■Transfers/Secondments: None

[Holidays and Leave]

120 days off per year
Full two-day weekend
Annual paid vacation (minimum 10 days after the seventh month of employment)

[Benefits]
Partial transportation allowance (up to ¥15,000 per month)
Social insurance (health insurance, employee pension insurance, employment insurance, workers' compensation insurance)
Overtime pay: Standard overtime pay

General Requirements

Minimum Experience Level	Over 3 years
Career Level	Mid Career
Minimum English Level	Business Level
Minimum Japanese Level	None
Minimum Education Level	Bachelor's Degree
Visa Status	No permission to work in Japan required

Required Skills

Requirements

5+ years in infrastructure/SRE/platform engineering for production distributed systems.
Strong Kubernetes experience in production (or equivalent orchestration), with real ops ownership.
Experience operating GPU clusters or other high-performance compute fleets (or similarly performance-sensitive infrastructure).
Strong debugging skills across Linux, networking, and distributed systems failure modes.
Strong operational discipline: automation-first mindset, measurable reliability, careful change management, clear communication during incidents.
Willing to participate in an on-call rotation for owned systems.

Nice to have

Experience with high-throughput gateways/service meshes (e.g., Envoy), OpenTelemetry, and multi-region architectures.
Experience with Slurm/HPC-style scheduling, RDMA/IB, or performance-sensitive networking.
Experience building internal developer platforms and “golden paths” for consistent deploy/rollback workflows.
Experience managing GPU driver/runtime upgrades safely across a fleet (compatibility testing + staged rollouts).
Familiarity with observability patterns for latency-sensitive systems (request correlation, sampling strategy, high-cardinality metrics control).

Work Conditions

Industry	Software

Company Details

Company Type

Small/Medium Company (300 employees or less)
