The employer is a decentralized, Solana-based web-scraping network that allows users to monetize their unused internet bandwidth. By installing a browser extension, users securely share bandwidth to help AI companies crawl the web for public data, receiving Points (convertible to crypto tokens) as compensation.
They also operate a massive distributed crawler, giving them unique access to high-quality public web data at global scale.
They are hiring a Research Crawling Engineer (fully remote; USA/EU, 6-hour overlap with EST).
You will join a company at the forefront of developing a web-scale crawler and knowledge graph that improves access to public web data and extends the value of AI to the people.
As a Research Crawling Engineer, you will design and operate large-scale web data acquisition systems for research and model development. Your work will span distributed systems, scraping infrastructure, and data pipelines.
This Role Involves:
- Operating at the boundary of scale and reliability
- Adapting to constantly changing web environments
- Balancing throughput, coverage, and data quality
- Owning end-to-end data acquisition pipelines
MISSIONS
- Design high-throughput, fault-tolerant systems for data collection (millions to billions of URLs/day)
- Handle anti-bot systems, rate limits, and dynamic/JS-heavy sites
- Develop pipelines for cleaning, deduplication, filtering, and normalisation
- Construct and maintain datasets for research and model training
- Monitor crawl performance, coverage, and data quality; iterate quickly
- Collaborate with research teams to align data collection with modeling needs
- Optimize infrastructure for cost, latency, and reliability
Example Projects you could work on:
- Build a distributed crawler for a continuously updated, high-quality web project
- Design a system to classify and filter billions of pages for pretraining
- Extract structured data from dynamic, JS-heavy sites at scale
- Improve deduplication and quality scoring across multimodal datasets