
GameDuo

DEV Team Platform Part Leader 2026.04 — Present

DEV Team Server Developer 2025.01 — 2026.03

2025.01 — Present

Mobile game development and publishing company. Shared backend platform design for 6 live game environments (12 branches), data pipeline engineering, and common package system operations

Highlights

DEV Team Platform Part Leader

  • Newly formed platform team lacked a task-tracking and sizing framework: Designed a 7-stage workflow (BACKLOG→SIZE REVIEW→DONE), handling rules for 4 task types, transition validators, and a ticket template library. Deployed the standard operating framework immediately upon team formation, with task-type handling rules and decision criteria established at the same time.
  • Existing sizing criteria did not reflect the development time saved with Claude Code: Established a 5-dimension scoring framework and per-domain Claude Code time-discount rules (test writing ~80%, pattern replication ~70%). Agreed the discount rules by team consensus, keeping performance measurement equitable before and after Claude Code adoption.
  • AI agent productivity varied significantly across the team because context and learnings discovered by individuals did not propagate, causing repeated trial and error: Chose Git-based config-as-code over individual documentation, since docs must be re-read by humans while codified config integrates directly into the AI runtime and applies automatically. Built a team-wide AI development context platform unifying principles, coding standards, domain knowledge, agents, and hooks in one Git repo. Standardized AI context team-wide via git pull, converting individual learnings into team assets and eliminating the AI productivity gap (397 domain knowledge files accumulated within ~2 weeks).

DEV Team Server Developer

  • Marketing data query costs surged under a full-scan billing model: Evaluated partitioning but rejected it because of the Write IOPS increase, and adopted gRPC distributed reads (Storage Read API). Cut costs 82% ($6.25→$1.1 per TB).
  • Batch processing capped collection at 60 days: Migrated from batch processing to an EDA-based on-demand pipeline, distributing peak load and enabling throughput scaling. Expanded the metrics range 6x (60→360 days) and cut processing time 2h→5min.
  • Code duplication and convention inconsistency across 7 projects: Replaced manual code sync with a shared NestJS package system (10 utility modules + Game Server Kit, auto-deploy). Automated upgrades, cutting deployment time 3h→15min.
  • Marketing query latency (18s) from fragmented ad platform data (Google Ads/Meta/TikTok): Normalized the read path and tuned indexes to avoid full scans on a 100-column table. Improved performance 97% (18s→0.5s).
  • Regulatory risk from game probability disclosure requirements: Integrated a DynamicModule-based probability package across multi-game projects, plus a CDK-based audit log pipeline. Achieved 94%+ integration test coverage (12 suites, 115 tests).

Projects

Real-Time Game Chat Server (Multi-Tenant)

2026.04 ~ Present

Designed a multi-tenant real-time chat server shared across live games, validating a 6,000 CCU / 18,000 RPS capacity target through load testing

  • No capacity model existed for chat infrastructure shared across games, blocking server sizing and SLO targets: Based the capacity model on commercial chat solutions' published rate-limit/PCU specs instead of internal guesswork, defining a sizing formula for a single multi-tenant server frame. Set targets of 6,000 CCU (sustained) / 18,000 RPS / 90,000 RPS burst, which were adopted into quarterly OKRs.
  • Redis Streams WAL hit a single-threaded write bottleneck at 6K msg/s (EngineCPU 99.9%, ack p95 767ms): Distributed the WAL stream across a 4-shard cluster and compacted entries into native fields (removing Lua serialization). Chose horizontal sharding over vertical scaling, since single-threaded Redis serializes writes and cannot scale them vertically. Cut ack p95 767ms→6.6ms (~116x), reduced the WAL backlog 80,000→2,838, and reached a 99.99% accept rate at scale.
  • Messages were 100% rejected under concurrent multi-tenant load due to a missing scope key: Built a 5-tenant concurrent load-test environment and removed the rejects via hot-path log removal and message scope keys. Achieved ack p95 42ms (target <100ms) and a 0.04% error rate (target <0.1%) at 299K messages sustained over 5 minutes.
  • High CPU/memory utilization on x86 Fargate under 6,000 CCU load drove up operating cost: Migrated to ARM64 (Graviton) on measured evidence rather than assumption, comparing resource efficiency against x86 under an identical 6,000 CCU load and using multi-platform image builds to support both architectures. Reduced CPU utilization 15% and memory 54% at identical load with neutral unit pricing, establishing a basis for operating-cost savings.

Tech Rationale

Adopted a Redis Streams-based WAL (Write-Ahead Log) for message ordering, and distributed it across a 4-shard Valkey cluster instead of vertical scaling — a single-threaded Redis cannot resolve write bottlenecks by adding cores. ECS Fargate ARM64 migration was adopted only after validating resource efficiency under identical load.
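
A minimal sketch of how WAL writes could be spread over four stream keys, assuming each message is routed by a tenant-and-channel scope. The key format, the FNV-1a hash, and the names `SHARD_COUNT`, `fnv1a`, and `walStreamKey` are illustrative assumptions, not the production design. The braces are Redis Cluster hash tags, so each shard's key hashes to a single slot; whether the four tags land on four different nodes depends on the cluster's slot assignment.

```typescript
// Hypothetical shard count, matching the 4-shard cluster described above.
const SHARD_COUNT = 4;

// 32-bit FNV-1a hash: deterministic, so a given scope always maps to the same shard.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep 32 bits
  }
  return hash >>> 0;
}

// Builds the stream key for one message scope (assumed format: tenant + channel).
// Messages in the same scope share a stream, so their relative order is preserved.
function walStreamKey(tenantId: string, channelId: string): string {
  const shard = fnv1a(`${tenantId}:${channelId}`) % SHARD_COUNT;
  return `{wal:${shard}}:stream`;
}
```

Routing by scope rather than round-robin keeps ordering within a channel while letting different channels write to different shards in parallel.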

TypeScript NestJS Redis Streams Valkey (Cluster) DynamoDB AWS ECS Fargate ARM64 (Graviton) NLB Terraform k6 Grafana

Marketing Integrated Platform

2025.02 ~ Present

Unified marketing platform managing Google Ads/Meta/TikTok campaigns, creatives, and metrics in a single system, built on dedicated marketing DB separation

  • Fragmented ad platform management across Google Ads/Meta/TikTok: Delivered a unified campaign automation API covering creation, deployment, modification, and retention metrics in a single platform. Eliminated marketer console switching across the 3 ad platforms.
  • Meta asset sync performance bottleneck: Replaced per-row ORM saves, which incurred heavy transaction overhead, with bulk INSERT and batch deletes. Reduced image sync time 72% (25.7s→7.1s) and DB transaction time 95% (10~16s→0.3~0.5s).
  • High BigQuery query costs for marketing data reads: Adopted the BigQuery Storage Read API with gRPC streaming for high-performance reads. Reduced BigQuery costs 82% ($6.25→$1.1 per TB).
  • Marketing query latency (18s) from a 100+ column denormalized table: Normalized the table into main/time-series/prediction tables, tuned indexes, and added cursor pagination. Improved performance 97% (18s→0.5s).
  • Large marketing-metrics archive backfill stalled at 63s Lambda runtime and 99.8% RDS CPU, with a 15-hour ETA: Diagnosed the root cause as N+1 queries and a non-SARGable DATE() predicate (index miss) and removed both, choosing a single 30-day batch fetch with in-memory latest-version dedup over the per-row reads that were saturating RDS. Improved throughput 26x (9.3→247/min), Lambda runtime 63s→1.36s, and RDS CPU 99.8%→5~7%.
  • Marketing RCP data was spreadsheet-only, and the BigQuery aggregation table combined multiple sources, blocking unified console queries: Switched to a single normalized BigQuery source table, since combining multiple sources had produced value mismatches against the manual Excel flow, and added Criteria-Result metadata APIs to remove hardcoded display formatting. Embedded unified RCP queries into the console and replaced hardcoded currency/decimal display with BE metadata-driven formatting.
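
The backfill fix in the list above (a range predicate instead of DATE(), plus one batch fetch with in-memory dedup) can be sketched roughly as follows. The row shape, field names, and helper names are hypothetical; only the 30-day window and the "latest version wins" rule come from the description.

```typescript
// Hypothetical row shape for an archived metric; field names are assumptions.
interface MetricRow {
  campaignId: string;
  metricDate: string; // "YYYY-MM-DD"
  version: number;
}

// Half-open UTC range [start, end) covering `days` days from `startDate`.
// Comparing the raw column against these bounds (col >= start AND col < end)
// lets an index on the column be used, unlike wrapping the column in DATE().
function dateRange(startDate: string, days: number): { start: string; end: string } {
  const start = new Date(`${startDate}T00:00:00Z`);
  const end = new Date(start.getTime() + days * 24 * 60 * 60 * 1000);
  const fmt = (d: Date): string => d.toISOString().slice(0, 19).replace("T", " ");
  return { start: fmt(start), end: fmt(end) };
}

// Keeps only the highest version per (campaignId, metricDate) pair, in memory,
// so one batch query can replace a per-row "latest version" lookup.
function latestVersions(rows: MetricRow[]): MetricRow[] {
  const latest = new Map<string, MetricRow>();
  for (const row of rows) {
    const key = `${row.campaignId}|${row.metricDate}`;
    const seen = latest.get(key);
    if (!seen || row.version > seen.version) latest.set(key, row);
  }
  return Array.from(latest.values());
}
```

A caller would bind `start` and `end` as query parameters for the 30-day window and pass the fetched rows through `latestVersions` before writing.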

Tech Rationale

Reduced BigQuery costs 82% via Storage Read API migration. Designed hybrid GCP Pub/Sub → AWS Lambda/SQS pipeline for event-driven processing with Outbox pattern for non-blocking async event publishing.

AWS Lambda Apache Arrow BigQuery GCP Pub/Sub Google Ads API Meta API MySQL NestJS S3 SQS BigQuery (Storage Read API) TikTok API TypeORM TypeScript gRPC

Internal Common Library System

2025.07 ~ Present

Development and operation of NestJS utility (10 modules) + Game Server Kit (2 packages) for multi-project code consistency

  • Common code duplication across services increased maintenance cost: Designed a 10-module library system (core, repository, cache, lock, slack, crypto, smb, hash, type, iac). Unified shared code across 7 projects with standardized dependency management.
  • Repository module lacked bulk operations and audit logging: Replaced a read-then-write pattern that caused write-race duplicate-key conflicts with an atomic upsert plus automatic audit-log branching, added type narrowing via overloading and generics, and split the oversized Repository into purpose-scoped modules per SRP. Refactored 2,000+ lines into a shared Repository abstraction layer, eliminating CRUD duplication.
  • Concurrency conflicts in multi-instance environments: Abstracted ElastiCache (Redis) distributed locking into an AOP decorator, isolating lock logic from business code. Applied the decorator consistently for multi-instance concurrency control.
  • Manual library upgrades took 3 hours across 7 projects: Added workflow_dispatch + matrix workflows and changed-package CI tests on GitHub Packages. Reduced the 7-project upgrade time 3h→15min.
  • Slow CI pipeline (15m47s) due to test infrastructure inefficiency: Chose tsc-native ts-jest isolatedModules with explicit Entity types over @swc/jest, which had weaker decorator-metadata compatibility. Cut CI time 61% (15m47s→6m06s) with 81 suites/978 tests passing.
  • Game-server shared code was split across 5 branches, and one game's 17 game-data definition modules had diverged from the published package for 2+ years: Extracted the game-data definition modules and 5 submodules into a standalone package, then migrated 610 files / 310+ imports in one game and shipped 9 follow-up hotfixes (config injection, missing interface methods). Established the package system with first-game adoption and shipped the 9 hotfixes with zero incidents; game-data module updates now flow through a single npm version bump (manual migration of 610 files → 0).
  • Game-server cache logic was duplicated per project, increasing the consistency and maintenance burden: Packaged a hybrid strategy combining a local LRU (hot-path latency) and a distributed cache (cross-instance consistency) as a decorator-based AOP library. Chose hybrid over pure-distributed (network round-trips) and pure-local (cross-instance inconsistency), splitting the tag index and single-flight into separate modules per SRP. Released cache library v0.5.0 with 262 tests across 28 suites passing and 100% code-standard compliance.

Tech Rationale

Implemented ElastiCache (Redis) distributed lock as AOP decorator to separate lock logic from business code. Evaluated 5 options for Jest 30 VM isolation and adopted poolSize=2.

ElastiCache (Redis) GitHub NPM NestJS Slack API TypeORM TypeScript

CS Support Chat Server

2026.04 ~ Present

Fully isolated the customer-support (CS) chat server onto ECS from a shared deployment, and validated authentication, security, and load ceilings

  • CS chat was coupled to a shared deployment, causing resource contention and deployment coupling: Fully isolated the service onto ECS Fargate and automated deployment via Terraform IaC (S3 state), GitHub Actions OIDC, and an observability sidecar; adopted a centralized game-token verification path over per-game distributed checks. Completed the independent service cutover with modularized Terraform, an 11-panel Grafana dashboard, and 8 alerts.
  • Layer coupling and ad-hoc token lifetimes in the new CS chat domain posed security and maintenance risks: Separated responsibilities into a 4-layer Clean Architecture and introduced OAuth2 access (5min)/refresh (1h) tokens with ETag/304 polling, sending only deltas instead of full payloads per poll. Passed 39 tests across 12 suites and applied the IaC plan (+10 resources) with zero incidents.
  • A 1.0 vCPU ceiling, 40~90s cold starts, and bloated images blocked spike-traffic handling: Redesigned task resources and ALB RPS autoscaling, and cut cold start via native ARM64 builds and Dockerfile slimming with migration separation. Reduced image size 381→254MB (-127MB), build time 5→2min, and cold start by 30%; 0% failure on a k6 500-VU spike.
  • No cross-project data isolation, request rate limiting, or attachment access control: Introduced cross-project isolation (triple-validation helper), Redis-backed rate limiting (auth 5/min, refresh 10/min, message 30/min), and an attachment FK integrity migration; rejected in-memory throttling for its multi-instance bypass risk. Merged the security integration with zero type errors and established cross-org/project isolation specs.

Tech Rationale

Isolated the service onto ECS (Terraform IaC + GitHub Actions OIDC) to break resource contention and deployment coupling, and adopted a centralized game-token verification path over per-game distributed checks. Rate limiting was implemented with a Redis-backed throttler for multi-instance consistency.
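
A minimal sketch of the per-minute limits as a fixed-window counter. The source does not say which throttler or windowing algorithm is used, so the fixed window, the class name, and the key layout are assumptions; the in-memory `Map` stands in for the shared Redis counter only so the example runs on its own.

```typescript
// Per-minute limits quoted in the project notes; the key names are illustrative.
const LIMITS = { auth: 5, refresh: 10, message: 30 } as const;
type Action = keyof typeof LIMITS;

const WINDOW_MS = 60000;

// In-memory stand-in for a shared store. With Redis, the same step would be an
// INCR on the window key plus an expiry, so all instances share one counter.
// Old windows are never purged in this sketch.
class FixedWindowLimiter {
  private counters = new Map<string, number>();

  // Returns true if the request is allowed. `nowMs` is injected for testability.
  allow(action: Action, clientId: string, nowMs: number): boolean {
    const windowIndex = Math.floor(nowMs / WINDOW_MS);
    const key = `${action}:${clientId}:${windowIndex}`;
    const count = (this.counters.get(key) ?? 0) + 1;
    this.counters.set(key, count);
    return count <= LIMITS[action];
  }
}
```

Keeping the counter in a per-process `Map` in production would reproduce the multi-instance bypass the project rejected: each instance would grant its own full quota.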

TypeScript NestJS TypeORM OAuth2 AWS ECS Fargate ARM64 (Graviton) Terraform GitHub Actions OIDC Redis k6 Grafana

AWS Lambda Migration & Event-Driven Architecture

2025.06 ~ 2025.08

Resolved batch job limitations exposed by expanding marketing metrics from 60 to 360 days, and moved batch processing to serverless

  • Full serverless migration risked operational stability: Designed a hybrid architecture that keeps the API server on EC2 while moving batch/job processing to Lambda. Preserved API availability with zero impact, isolating only batch workloads to contain operational risk.
  • Batch processing was limited to a 60-day collection range with a 2-hour runtime: Established an event-driven flow with SQS + Lambda + EventBridge. Reduced batch time 2h→5min and expanded collection 6x (60→360 days).
  • Data consistency risk during event publishing: Adopted the Transactional Outbox pattern for scheduled and delayed event publishing. Ensured data consistency across distributed event processing.
  • Lambda throttling, high log costs, and build OOM issues: Applied Batch Size bulk processing, CloudWatch log optimization, and build OOM remediation. Automated deployment for the isolated Lambda workloads, cutting deployment time 3h→15min (92% reduction).
  • DB connection exhaustion during large-scale Lambda execution: Introduced RDS Proxy connection pooling. Resolved connection exhaustion and stabilized database access.

Tech Rationale

Chose Lambda to decouple batch/event workloads bound to the monolith server, enabling independent deployment. Adopted SQS for async processing to resolve scaling limitations under traffic fluctuation.

TypeScript NestJS AWS Lambda SQS SNS EventBridge RDS Proxy AWS CDK

In-Game Multi-Language Translation System

2026.02 ~ Present

Redesigned the in-game notification translation domain model — decoupling AI translation from library-based conversion into a 3-stage pipeline

  • Simplified Chinese was modeled inside the AI translation settings domain despite not being AI-supported, and settings storage/retrieval depended on the Traditional Chinese AI translation state, making independent toggling impossible without a breaking change: Extracted the derived-conversion module from the AI translation module into 9 use cases with a dedicated entity/repository/scheduler, and applied the global-inheritance pattern for effective settings. Clarified domain boundaries and enabled independent toggling of derived conversion rules.
  • Decoupling required a breaking API change: Deployed FE/BE/Gateway simultaneously with a role permission migration for a zero-downtime transition. Completed the breaking API change without incident and cut effective-settings DB queries 50% (4→2).
  • The detect cron lacked version priority, delaying translation of the latest version, and its 30-second interval further slowed response: Added version ID-based priority ordering to the detect SQL, shortened the processing interval from 30s to 10s, and raised the batch limit from 100 to 200. Eliminated latest-version translation lag; a progress UI plus total-count caching improved user visibility.
  • The detect scheduler scanned 159 versions in one shot (3.27M-row full scan), causing 60s login outages from RDS IOPS saturation: Ported the per-version loop from the AI detect module, added distributed-lock timeout/recovery, and replaced the NOT EXISTS full scan with an indexed staged lookup. Cut query time 260x (full scan → 39ms indexed lookup), removed the driving filesort, and restored login latency.

Tech Rationale

Made the responsibility boundary between AI translation and library-based derived conversion explicit in the code structure. Split the Detection → Processing (AI) → Conversion (library) pipeline into independent modules and secured concurrent request integrity with a 2-stage race guard plus a pessimistic write lock.
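
One reading of the global-inheritance pattern for effective settings is a field-by-field fallback from a per-game override to a global default. The sketch below assumes that reading; the settings shape, field names, and `effectiveSettings` helper are hypothetical and not taken from the project.

```typescript
// Hypothetical settings shape; the field names are assumptions for illustration.
interface ConversionSettings {
  enabled: boolean;
  batchLimit: number;
}

// A per-game override may set any subset of fields; unset fields inherit.
type SettingsOverride = Partial<ConversionSettings>;

// Effective settings = global defaults, overridden field by field.
function effectiveSettings(
  globalDefaults: ConversionSettings,
  override?: SettingsOverride
): ConversionSettings {
  return {
    enabled: override?.enabled ?? globalDefaults.enabled,
    batchLimit: override?.batchLimit ?? globalDefaults.batchLimit,
  };
}
```

Because `??` only falls back on `null` or `undefined`, an explicit `enabled: false` override is respected rather than replaced by the global value.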

TypeScript NestJS TypeORM MySQL opencc

Marketing Platform Audit Log System

2025.01 ~ 2025.04

Resolved delayed responses to balance issues caused by the inability to track data change history during game operations

  • Data change history could not be tracked across environments and projects: Designed a Git-like version control system with UUID-based cross-environment/project entity tracking. Ensured data consistency across 6 games.
  • Manual entity change recording was prone to omission: Applied Event Sourcing-based change tracking with an Auditable decorator + TypeORM Subscriber pattern. Automated entity change tracking with consistent logging across all game environments.
  • Multi-environment version conflicts during data merge: Developed a 3-Way Merge Engine with parent/child entity conflict detection and unique constraint handling. Enabled reliable version merging across 6 games.
  • Full-snapshot comparison for every version diff was expensive: Designed a Version Diff Engine with a dual strategy, choosing incremental or snapshot comparison based on Base Audit availability. Optimized version comparison performance with the dual strategy.
  • Entity tracking failed during migration and merge because it depended on PKs: Introduced shared identifier-based entity tracking decoupled from PKs. Ensured accurate entity tracking during migration, comparison, and merge.
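
The core rule behind a three-way merge can be sketched at field level as below. This is a generic illustration, not the engine itself: it ignores the parent/child and unique-constraint handling mentioned above, and the types and the `threeWayMerge` name are assumptions.

```typescript
type Value = string | number | null;
type Fields = Record<string, Value>;

interface MergeResult {
  merged: Fields;
  conflicts: string[]; // fields changed to different values on both sides
}

// Field-level three-way merge: a side wins a field if only that side changed it
// relative to the base; if both changed it to different values, it is a conflict.
function threeWayMerge(base: Fields, ours: Fields, theirs: Fields): MergeResult {
  const merged: Fields = {};
  const conflicts: string[] = [];
  for (const key of Object.keys({ ...base, ...ours, ...theirs })) {
    const b: Value | undefined = base[key];
    const o: Value | undefined = ours[key];
    const t: Value | undefined = theirs[key];
    let picked: Value | undefined;
    if (o === t) picked = o; // both sides agree
    else if (o === b) picked = t; // only theirs changed
    else if (t === b) picked = o; // only ours changed
    else {
      conflicts.push(key); // both changed differently
      picked = o; // keep ours as a placeholder until resolved
    }
    if (picked !== undefined) merged[key] = picked; // undefined means the field was removed
  }
  return { merged, conflicts };
}
```

Comparing against the common base is what separates "only one side changed this" from a true conflict, which a plain two-way diff cannot do.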

Tech Rationale

Adopted TypeORM EntitySubscriberInterface after dedicated analysis of subscriber behavior and constraints. Designed AOP-based approach combining Auditable decorator + Subscriber to automatically collect entity changes into a standardized audit pipeline.

TypeScript NestJS TypeORM MySQL

Cloud Data Sync System

2025.08 ~ Present

Built S3-based sync and automated DDL management system to resolve dynamic game data inconsistency across environments

  • Dynamic game data inconsistency across environments: Built S3-based cross-environment data synchronization across development/staging/production. Unified game data state across all environments.
  • Manual DDL schema management caused sync failures: Created an automated DDL management engine with dynamic PK column type resolution, column type mismatch detection with MODIFY, and automatic index creation/RENAME. Automated database schema drift detection and reconciliation.
  • Large-scale Cloud Data ingestion and S3 upload bottlenecks: Analyzed and optimized the ingestion and upload pipeline, including a TRUNCATE→DELETE change for batch operations. Reduced latency 77% (57.5s→12.9s).
  • Sync job instability caused operational issues: Implemented job separation, transitioned the scheduling approach, tuned timeouts, and introduced non-blocking processing. Improved data-layer resilience through schema optimization and safe migration strategies.
  • Four data-sync service bugs (Redis cache invalidation, DDL-SKIP metadata, copy-key deletion, and data-protection option propagation): Diagnosed and fixed all four, stabilizing data-sync operations.
  • Risk of unintended full data deletion when a protection option was not forwarded on some migration paths: Forwarded the data-protection option across the four POST migration paths that were missing it. Blocked the unintended full-deletion risk.

Tech Rationale

Applied S3 Lifecycle policies (30d Glacier IR, 90d expiry) for cost optimization. Switched from event-triggered to scheduled execution to reduce sync miss risk.
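
As a rough illustration, the lifecycle policy above could be written as the following configuration object. The shape follows the S3 bucket lifecycle configuration; the rule ID and the empty prefix (apply to every object) are assumptions, and only the 30-day and 90-day values come from the text.

```typescript
// Hypothetical lifecycle configuration: transition to Glacier Instant Retrieval
// after 30 days, expire after 90 days. Rule ID and prefix are assumptions.
const lifecycleConfiguration = {
  Rules: [
    {
      ID: "archive-then-expire",
      Status: "Enabled",
      Filter: { Prefix: "" },
      Transitions: [{ Days: 30, StorageClass: "GLACIER_IR" }],
      Expiration: { Days: 90 },
    },
  ],
};
```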

TypeScript NestJS TypeORM MySQL S3

Probability Calculation & Audit Log Analytics Pipeline

2026.02 ~ Present

Built probability calculation package and CDK-based audit log analytics infrastructure to address regulatory risk from lack of game probability verification

  • No reusable probability calculation module across 6 live game environments (12 branches): Packaged a NestJS DynamicModule with 5 probability functions and Kinesis logging as a shared probability package. Deployed it across 6 live game titles (12 branches).
  • No audit log analytics infrastructure for probability verification: Codified a CDK analytics pipeline, Kinesis→Firehose (Dynamic Partitioning + Parquet)→S3→Glue→Athena. Enabled end-to-end probability verification across the same 6 titles (12 branches).
  • Mock-based tests lacked regression confidence for infrastructure code: Replaced mocks with LocalStack + Testcontainers integration tests. Secured 94%+ coverage (12 suites/115 tests).

Tech Rationale

Codified Kinesis → Firehose (Dynamic Partitioning + Parquet) → S3 → Glue → Athena pipeline with CDK. Chose Parquet columnar format for Athena SQL cost optimization. Replaced mock-based tests with LocalStack Testcontainers for integration testing.

Contribution: Restructured initial implementation into a package. Independently handled P0 bug fixes, test stabilization (94%+ coverage), CDK infrastructure, and deployment across 6 live game environments (12 branches).

TypeScript NestJS AWS CDK Kinesis Firehose S3 Athena Glue Parquet LocalStack Testcontainers

Marketing Data Archiving & Custom Dashboard

2026.04 ~ Present

Built daily-snapshot archiving of marketing metrics (with immutability), a 4K-row custom dashboard, and alert observability

  • Displaying 4K-row marketing metrics in one dashboard caused scroll lag and cell-sync bugs: Split the work between BE (raw SQL DISTINCT JOIN avoiding ORM hydration, plus in-memory cache) and FE (visible-row lazy fetch, 200-row chunks balancing network round-trips and render latency, and per-pair caching); rejected full loading due to response size and render cost. Displayed 3,341 mapped rows (of 4,458) and stabilized response time at 573ms cold / 221ms warm.
  • Sync paths used the latest snapshot instead of the reference-date cutoff, violating archive immutability (data contamination): Redesigned the use case/repository to honor cutoff semantics, made the stale-writer guard transactional, and introduced tombstone states for sync. Rebuilt 15,333 live archives (zero contamination) and validated 287,020 archive-date records.
  • A transient outage of the log-query stack (Loki) fired 16 marketing alerts at once, flooding ops channels: After diagnosing the root cause, migrated all 16 alerts to a single analytics DB (ClickHouse) query, rewrote them via the dashboard API, and set a decommission policy for the legacy log stack. Migrated 16/16 alerts (25 min), preventing alert floods from recurring during log-stack outages.

Tech Rationale

Archives must be immutable snapshots keyed by a reference-date cutoff rather than read-time state, so integrity was enforced with a stale-writer guard and tombstone states. The large grid uses visible-range lazy fetch with chunk caching instead of full loading.
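
A minimal sketch of the visible-range lazy fetch, assuming rows are addressed by index and cached per chunk. Only the 200-row chunk size comes from the project notes; `chunksToFetch`, `chunkRange`, and the offset/limit request shape are illustrative.

```typescript
const CHUNK_SIZE = 200; // chunk size quoted in the project notes

// Returns the chunk indexes needed to cover visible rows [firstRow, lastRow],
// skipping chunks that are already cached.
function chunksToFetch(firstRow: number, lastRow: number, cached: Set<number>): number[] {
  const firstChunk = Math.floor(firstRow / CHUNK_SIZE);
  const lastChunk = Math.floor(lastRow / CHUNK_SIZE);
  const needed: number[] = [];
  for (let chunk = firstChunk; chunk <= lastChunk; chunk++) {
    if (!cached.has(chunk)) needed.push(chunk);
  }
  return needed;
}

// Row window for one chunk, e.g. to build an offset/limit or cursor request.
function chunkRange(chunk: number): { offset: number; limit: number } {
  return { offset: chunk * CHUNK_SIZE, limit: CHUNK_SIZE };
}
```

For example, with rows 150 to 450 visible and chunk 1 already cached, only chunks 0 and 2 are requested, so scrolling back over cached rows costs no extra round-trips.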

TypeScript NestJS React AWS Lambda BigQuery ClickHouse Grafana MySQL

Activities

Boosting Developer Productivity with Amazon Q Developer

Conference Talk
2025.10

Games on AWS 2025 Customer Session

Shared a case study of a single engineer scaling data pipeline capacity 270x in 10 days, measured by hourly concurrent job throughput

Technical Skills

Node.js TypeScript NestJS TypeORM gRPC OAuth2 MySQL ElastiCache (Redis) Redis Streams Valkey (Cluster) BigQuery ClickHouse Apache Arrow Athena AWS Lambda SQS SNS EventBridge Kinesis / Firehose S3 RDS Proxy AWS CDK ARM64 (Graviton) GCP Pub/Sub Ad Platform APIs (Google, Meta, TikTok) Slack API GitHub NPM Datadog