AI Benchmarking & Evaluation
Evaluating agentic coding systems across SWE-bench and Terminal-bench. Structured trajectories (trajectory.md, trajectory.jsonl), Docker-based oracle verification, and evaluation rubrics for OpenCode baselines.
Hi, my name is
Role: Python Developer
Python engineer with 8 years shipping production systems across fintech and AI research — from LLM-powered annotator services and NLP pipelines to agentic-coding benchmarks on SWE-bench and Terminal-bench. I care deeply about reproducible, well-tested, container-first engineering.
I'm a Python developer based in Bangalore with eight years of experience spanning the entire project life cycle — planning, evaluation, requirements, design, development, testing, and deployment. I'm equally at home debugging a production incident, refactoring a legacy service, or designing an evaluation harness for an agentic coding system.
At Morgan Stanley I work on wealth- and investment-management tech, shipping microservices for Fixed Income, hedging, and portfolio data. Before that, I spent a year and a half at Wolters Kluwer building an annotator template on top of Flask, LangChain, and GPT-4 / GPT-4 32K, with custom spaCy and fine-tuned BERT models powering named-entity recognition and POS tagging over real business datasets.
In parallel, I contribute to AI benchmarking — evaluating agentic coding systems on SWE-bench and Terminal-bench, generating structured trajectories and patches, and validating fixes through Docker-based oracle verification. It's the kind of quiet, detail-obsessed work that makes LLM evaluation actually trustworthy.
Currently open to roles that push the frontier of LLM systems, agentic evaluation, and reproducible ML infrastructure.
A blend of applied LLM/NLP engineering, benchmarking discipline, and production backend work.
Production annotator template on top of Flask, LangChain, and Cookiecutter — shipping prompts for summarization (stuff / refine / map-reduce), QA, and translation using GPT-3.5 Turbo, GPT-4, and GPT-4 32K.
Custom spaCy NER models trained on domain data, fine-tuned BERT for POS tagging, and Recognizers-Text integrations for temporal/date extraction. Bug fixes and workflows for RDF-based semantic content enrichment.
Flask → FastAPI migrations, Kerberos → OAuth, near-real-time position polling from Aladdin into DB2, Muni/Corp bond order-submission apps, and HTTP-REST APIs for Timeseries ingestion.
Containerized, reproducible workflows with Docker. Cloud-native deployments across AWS (MQ, EC2, CloudFormation, S3) and Azure (Queue, Container Registry), integrated with Bitbucket and Git-based CI.
API and asset-health monitoring with Prometheus + Prometheus Pushgateway, dashboards in Grafana, log analysis in Kibana. Built reporting pipelines over Oracle/SQL/MySQL/PostgreSQL for risk and regulatory reporting.
A timeline of roles, from intern to manager, across fintech and AI research.
Evaluating and improving agentic coding systems across SWE-bench and Terminal-bench environments.
Generating structured trajectories (trajectory.md, trajectory.jsonl) and patches by solving real repository-level issues using Claude Code and Cursor with SpecStory tracking.
Wealth and investment management technology: microservices for Fixed Income, interest-rate hedging, and currency repatriation.
Annotator template on Flask/LangChain for semantic analysis, summarization, and extraction across diverse business domains.
Monitoring utility modules with Prometheus for industrial asset performance — anomaly and error detection via counter and gauge metrics.
Central services optimizing Transaction and Static Reference Data caches for Middle/Back Office — Risk, Finance, Operations.
SIBs (Service Independent Building Block APIs) for Real-Time Charging Control (RTCC), an online charging system running over the IP Multimedia Subsystem (IMS) core.
Development, maintenance, and bug-fixing on the Alcatel-Lucent (Nokia) Open Services Platform, the substrate for flexible deployment of Intelligent Network services.
The tools I reach for most often — battle-tested across fintech, enterprise NLP, and AI research.
KIIT University
CBSE
CBSE
I'm open to conversations about LLM systems, agentic evaluation, and reproducible ML infrastructure. The fastest way to reach me is email — I read everything.