We are seeking a highly skilled and experienced Senior/Architect Data Engineer to lead the end-to-end architecture of a Databricks-centric multi-agent processing engine. This role will leverage cutting-edge technologies such as Mosaic AI, Model Serving, MLflow, Unity Catalog, Delta Lake, and Feature Store to automate decoder processes at scale. The ideal candidate will have a strong background in data engineering, cloud security, and MLOps, with a proven track record of architecting solutions on the Databricks Lakehouse platform.
Key responsibilities:
- Architecture Leadership: Own the end-to-end design and implementation of the multi-agent processing engine, integrating Mosaic AI, Model Serving, MLflow, Unity Catalog, Delta Lake, and Feature Store for scalable decoder automation.
- Data Ingestion and Processing: Design governed data ingestion, storage, and real-time processing workflows using Delta Lake, Structured Streaming, and Databricks Workflows, ensuring enterprise security and full data lineage.
- Model Lifecycle Management: Own the model lifecycle with MLflow, including experiment tracking, registry/versioning, A/B testing, drift monitoring, and automated retraining pipelines.
- Low-Latency Model Serving: Architect low-latency Model Serving endpoints with auto-scaling and confidence-based routing for sub-second agent decisioning.
- Data Governance: Establish robust data governance practices with Unity Catalog, including access control, audit trails, data quality, and compliance across all environments.
- Performance and Cost Optimization: Drive performance and cost optimization strategies, including auto-scaling, spot instance usage, and observability dashboards for reliability and efficiency.
- Production Release Strategies: Define production release strategies (blue-green deployments), monitoring and alerting mechanisms, operational runbooks, and Service Level Objectives (SLOs) for dependable operations.
- Collaboration: Partner with engineering, MLOps, and product teams to deliver human-in-the-loop workflows and dashboards using Databricks SQL and a React frontend.
- Change Management: Lead change management, training, and knowledge transfer while managing a parallel shadow processing path during ramp-up.
- Project Coordination: Plan and coordinate phased delivery, success metrics, and risk mitigation across foundation, agent development, automation, and production rollout.
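To give candidates a feel for the confidence-based routing mentioned above, here is a minimal, illustrative sketch in plain Python. The `route_prediction` helper, the `Decision` type, and the 0.9 threshold are assumptions for illustration only, not part of any Databricks or Mosaic AI API; in production this logic would sit behind a Model Serving endpoint.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    label: str
    confidence: float
    route: str  # "auto" when the model is trusted, "human_review" otherwise


def route_prediction(label: str, confidence: float, threshold: float = 0.9) -> Decision:
    """Route a model prediction by confidence.

    Predictions at or above `threshold` proceed automatically; anything
    less confident is escalated to a human-in-the-loop review queue.
    """
    route = "auto" if confidence >= threshold else "human_review"
    return Decision(label=label, confidence=confidence, route=route)
```

In practice the threshold would be tuned per agent against observed precision, so the automated path only handles decisions the model gets reliably right.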
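The drift monitoring responsibility can likewise be sketched with a standard statistic. The example below computes the Population Stability Index (PSI) between two binned distributions in pure Python; the function name and the epsilon guard are illustrative assumptions, and a real pipeline would compute the bins from Delta tables and alert via Databricks Workflows.

```python
import math


def population_stability_index(expected: list[float], actual: list[float],
                               eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions.

    `expected` and `actual` are per-bin proportions that each sum to 1.
    Common rule of thumb: PSI < 0.1 is stable, 0.1-0.25 is moderate
    drift, and > 0.25 usually warrants retraining.
    """
    psi = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)  # guard against log(0) and division by zero
        a = max(a, eps)
        psi += (a - e) * math.log(a / e)
    return psi
```

A monitoring job would compare each feature's live distribution against the training baseline and trigger the automated retraining pipeline when PSI crosses the chosen threshold.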
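Finally, the SLO responsibility reduces to simple arithmetic worth making explicit: an availability target implies a fixed error budget per window. The helper below is an illustrative sketch (the name and 30-day default are assumptions); for a 99.9% SLO over 30 days the budget works out to about 43.2 minutes.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for an availability SLO.

    For example, a 99.9% SLO over a 30-day window leaves
    30 * 24 * 60 * (1 - 0.999) = 43.2 minutes of error budget.
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)
```

Release strategies, alert thresholds, and runbook escalation paths would all be sized against this budget rather than against ad hoc uptime goals.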