<aside>
💡
How to apply: email [email protected] with a short note on why we should hire you, plus any relevant links.
</aside>
Findr is seeking an experienced AI Inference Engineer to join our team and help optimize and scale the deployment of advanced machine learning models for real-time inference. At Findr, we’re building a revolutionary personalised AI search assistant that helps people effortlessly interact with and search their personal information and memories. We’re unlocking the gateway to infinite memory: we want AI to take the cognitive load off our shoulders and remember things for us.
Responsibilities
- Design and develop APIs for AI inference to serve internal and external users with high reliability and low latency.
- Benchmark, diagnose, and address bottlenecks across our inference stack to ensure optimal performance.
- Implement cutting-edge LLM inference optimizations, such as quantization, continuous batching, and memory-efficient techniques.
- Improve the observability and robustness of our systems, ensuring seamless scalability and responding effectively to outages.
- Stay up-to-date with the latest advancements in AI inference and collaborate with our research and engineering teams to apply novel techniques.
Qualifications
- Proven experience with ML systems and deep learning frameworks (e.g., PyTorch, TensorFlow, ONNX).
- Familiarity with large language model architectures and inference optimization techniques (e.g., quantization, pruning, and batching).
- Hands-on experience deploying reliable, distributed, real-time model serving systems at scale.
- Knowledge of performance profiling and optimization of inference pipelines.
- (Optional) Experience with GPU kernel programming using CUDA or familiarity with GPU architectures for machine learning workloads.
- Strong programming skills in languages like Python and C++; experience with Kubernetes and Docker is a plus.
- A passion for shipping scalable, high-quality AI products.
Our Tech Stack