Approved Experiences monogram

Engineering

AI/ML Engineer — Data Ingestion, NLP & Enterprise Readiness

Platform & NLP · Remote — US · Full-time

Overview

This role owns the ingestion layer and the NLP systems that turn raw enterprise and consumer data into usable signal — and makes sure the surrounding infrastructure is ready for sale to Fortune 500 buyers. You'll design pipelines that handle volume and variety, build models that extract meaning from unstructured text, and implement the security, identity, and compliance primitives that enterprise procurement teams actually scrutinize.

Responsibilities

  • Design and operate the data ingestion architecture: streaming and batch pipelines that pull behavioral, transactional, and third-party data into our platform with schema validation, idempotency, and observability built in.
  • Build NLP systems for entity extraction, classification, embeddings, intent detection, and semantic search across structured and unstructured sources.
  • Own enterprise readiness end-to-end: role-based access control (RBAC), single sign-on (SAML 2.0 and OIDC), SCIM directory sync, audit logging, and tenant isolation.
  • Drive SOC 2 Type II readiness in partnership with leadership — evidence collection, control implementation, policy tooling, and vendor assessments.
  • Implement data governance primitives: encryption at rest and in transit, key management, PII handling, data residency controls, and retention policies.
  • Collaborate with the CRO and customer-facing teams on security questionnaires, architecture reviews, and procurement diligence — and translate what you hear back into the roadmap.
  • Partner with the consumer-facing ML engineer to ensure ingestion outputs feed predictive models cleanly and consistently.
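To give a flavor of the first bullet, here is a minimal, illustrative sketch (all names hypothetical, not our actual codebase) of an ingestion step with schema validation and idempotency — replayed events become no-ops rather than duplicates:

```python
from dataclasses import dataclass, field


@dataclass
class IngestStore:
    """Toy sink that deduplicates records by idempotency key."""
    seen: set = field(default_factory=set)
    rows: list = field(default_factory=list)


# Hypothetical minimal schema for a behavioral event.
REQUIRED_FIELDS = {"event_id", "user_id", "event_type"}


def ingest(store: IngestStore, record: dict) -> str:
    """Validate the record's schema, then write it at most once."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        # Schema validation: reject malformed records at the edge.
        return f"rejected: missing {sorted(missing)}"
    if record["event_id"] in store.seen:
        # Idempotency: a replayed event is skipped, not double-counted.
        return "skipped: duplicate"
    store.seen.add(record["event_id"])
    store.rows.append(record)
    return "accepted"
```

In production this shape shows up inside a Kafka consumer or batch job with real schema tooling and durable dedup state; the sketch only illustrates the contract we expect pipelines to honor.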

Qualifications

  • 5+ years building production data infrastructure, with at least 2 years in an enterprise SaaS environment selling to large organizations.
  • Strong command of modern data stack tools: Kafka or equivalent streaming, dbt, Airflow or Dagster, Snowflake/BigQuery/Databricks, and object storage on AWS or GCP.
  • Hands-on NLP experience: working with transformer models, embedding stores, vector databases (pgvector, Pinecone, Weaviate), and building retrieval or classification systems in production.
  • Deep understanding of enterprise identity: OAuth 2.0, OIDC, SAML, SCIM, and the real-world quirks of integrating with Okta, Azure AD, Google Workspace, and Ping.
  • Practical experience shipping toward SOC 2 (Type I or Type II), ISO 27001, or equivalent frameworks.
  • Fluency in Python; comfortable in at least one of Go, Rust, or TypeScript for services work.
  • Strong written communication — you can write a design doc, a security response, and a commit message that explain themselves.
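On the retrieval side, the core loop is embedding similarity search — in production backed by pgvector, Pinecone, or Weaviate, but reducible to this illustrative sketch (names hypothetical):

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list, corpus: list, k: int = 2) -> list:
    """corpus is a list of (doc_id, embedding) pairs; return the k most
    similar doc_ids, ranked by cosine similarity to the query."""
    ranked = sorted(corpus, key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

A vector database replaces the brute-force sort with an approximate index (HNSW, IVF), but candidates should be comfortable reasoning at this level.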

Nice to have

  • Experience with data clean rooms, consent management platforms, or privacy-enhancing technologies.
  • Exposure to HIPAA, GDPR, or CCPA implementations.
  • Prior work on a platform that passed enterprise procurement at a Fortune 500 or global financial services firm.
  • Open-source contributions in the data or ML infrastructure space.