Key Responsibilities:
- Design, build, and optimize scalable data pipelines and ETL/ELT workflows for large, complex datasets.
- Design and implement foundational data architecture supporting identity resolution and ID graph systems.
- Develop and enhance systems supporting identity resolution and ID graph construction (data ingestion, normalization, matching, and deduplication).
- Process and unify multi-source datasets (cookies, device IDs, behavioral data, third-party and proprietary data).
- Write efficient, testable, and maintainable code using Python and SQL for large-scale data processing.
- Optimize data models, queries, and storage strategies for performance, scalability, and cost efficiency.
- Build and maintain data validation, monitoring, and alerting systems to ensure data quality and reliability.
- Troubleshoot, debug, and improve existing data pipelines and infrastructure.
- Own complex data problems end-to-end and drive their solutions from initial design through production deployment.
- Make and influence key technical decisions related to data architecture, scalability, and system design.
- Collaborate with data, platform, DevOps, and product teams to deliver scalable, production-ready solutions.
- Translate business and product requirements into practical, performant data solutions.
- Document data pipelines, systems, and workflows clearly.
- Continuously improve system performance, data quality, and pipeline resilience.
- Contribute to building new capabilities that improve how customers understand and leverage data insights.
Key Skills:
- 8 to 12+ years of hands-on experience in data engineering or large-scale data processing.
- Proven experience building and maintaining production-grade data pipelines and distributed systems.
- Demonstrated experience architecting and delivering large-scale data platforms or mission-critical data systems.
- Strong expertise in SQL and relational databases (Postgres, BigQuery, Redshift, etc.) and in Python for data processing and analysis.
- Experience with Google Cloud Platform (BigQuery, Dataflow, Pub/Sub, Cloud Storage, Cloud Functions) and/or AWS (S3, Redshift, EMR, RDS).
- Experience working with large-scale datasets (hundreds of millions to billions of records).
- Strong understanding of data modeling, partitioning, indexing, and query optimization.
- Experience with distributed data processing and parallelization techniques.
- Experience moving large volumes of data across systems and architectures.
- Familiarity with CI/CD, containerization, and orchestration tools (Docker, Kubernetes, GitHub Actions, etc.).
- Strong debugging and troubleshooting skills in complex data environments.
- Experience with version control (Git) and Agile tools (Jira, Confluence, etc.).
- Highly analytical with strong attention to detail and a data-driven mindset.
- Ability to hit the ground running, quickly understand systems, and deliver independently.
- Comfortable working in a remote, fast-paced, and collaborative environment.
- Proven ability to drive system design and implementation.
Preferred:
- Experience with identity graphs, entity resolution, or record linkage systems.
- Background in AdTech, digital identity, cookies, or audience data platforms.
- Experience with real-time or streaming data systems.
- Familiarity with data quality, observability, and monitoring frameworks.
- Experience with data visualization tools (Looker, Tableau, Power BI).
- Knowledge of data privacy, compliance, and governance considerations.
- Experience with modern data platforms such as Snowflake and Databricks.
- Exposure to AI/ML technologies, including experience working with or integrating agentic frameworks.