Technical Portfolio | Flávia Gaia

Portfolio map

Choose the area that best fits the conversation

Data, audit and analytics
Documents, NLP and OCR
Search, RAG and assistants
Agents, automation and platform

Featured projects

If you only have five minutes, start here

Four projects that summarize my work, one per technical front. The full catalog follows below.

Generative AI and RAG

Generative-AI (RAG) assistant for public-sector support

I designed and shipped a production generative-AI (RAG) assistant supporting 20+ scholarship, benefit and accountability programs in the federal public sector, across two channels: an assisted-consultation app and an Outlook reply-drafting assistant.

What I built: a full RAG architecture with a curated-answer layer (anti-hallucination), a curation queue, automatic routing, per-program observability and hybrid search with model fallback.

View details

Data and analytics

Invoice audit with PySpark, Databricks and Genie

Big data analytics engineering to investigate transaction inconsistencies, with analytical tables, PySpark queries and a dashboard.

Impact: around 20% of inconsistencies detected and audit time cut in half.

See details

Documents and NLP

Contract reading and payment calendar with AI

Extraction of financial clauses from PDF contracts, expected payment calendar construction and divergence monitoring.

Impact: contracts paid on the correct dates, generating direct financial savings and tighter internal audit control.

View on GitHub | See details

Search and RAG

RAG NLP SQL with LangChain, OpenAI and SQLite

Natural language questions over a relational database, combining semantic schema retrieval with SQL generation and execution.

View on GitHub | See details

Agents and platform

MCP Docs Assistant with FastMCP and BM25 search

Read-only MCP server for querying local documentation, exposing resources, tools and prompts to MCP-compatible clients.

View on GitHub | See details

Data, audit and analytics

Quality, operational analytics, tabular machine learning and analytical indicators

This front groups PySpark, Databricks, operational analytics, inconsistency detection, territorial analysis and tabular ML work.

Synthetic invoice audit dashboard screen

Invoice audit with PySpark, Databricks and Genie

Analytical tables and PySpark queries built in Databricks notebooks to investigate transaction issues affecting invoice analysis.

Goal: structure a reliable view of inconsistencies in high-volume data.
Delivery: analytical tables, PySpark queries and views by inconsistency type.
Stack: PySpark, Databricks notebooks, analytical modeling, Genie and dashboard layer.
Technical highlight: big data analytics engineering, data quality and automated refresh.

Bank transaction audit with Random Forest

Machine learning project focused on prioritizing inconsistencies and risk signals in transactional data to support audit and analysis.

Goal: classify risk signals and prioritize transactions for review.
Delivery: supervised classification with tabular preparation and derived features.
Stack: Python, pandas, scikit-learn, matplotlib, pyarrow, joblib and unittest.
Technical highlight: supervised tabular ML and explainability for analytical support.

View on GitHub

Outlier Detection Lab for inconsistencies and anomalies

Outlier and anomaly detection lab on large datasets to identify extremes, unlikely combinations and cases worth manual review.

Goal: identify anomalies and extreme cases for review.
Delivery: comparison of statistical and unsupervised approaches.
Stack: Python, pandas, scikit-learn, robust statistics and unsupervised ML.
Technical highlight: combining statistical methods and unsupervised models for audit support.

View on GitHub
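
As an illustration of the robust-statistics side of this kind of lab, the sketch below flags extreme values with a modified z-score built on the median and the median absolute deviation (MAD). It is a minimal example under stated assumptions, not the code from the repository: the function names, the sample amounts and the 3.5 threshold are all hypothetical choices made for the demo.

```python
from statistics import median


def modified_z_scores(values):
    """Return the modified z-score of each value, using median and MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values equal the median, so MAD gives no usable scale.
        # Returning zeros is a simplification chosen for this sketch.
        return [0.0 for _ in values]
    # 0.6745 rescales MAD so scores are comparable to standard z-scores
    # when the data is roughly normal.
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(values, threshold=3.5):
    """Return the indices whose absolute modified z-score exceeds the threshold."""
    scores = modified_z_scores(values)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]


# Hypothetical transaction amounts with one obvious extreme.
amounts = [102.0, 98.5, 101.2, 99.9, 100.4, 5000.0, 97.8]
flagged = flag_outliers(amounts)
print(flagged)
```

Because the median and MAD barely move when a single extreme value is present, this kind of score tends to be more stable than a mean-and-standard-deviation z-score on audit data.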

CadÚnico profile analytics

Project inspired by sampled public microdata to analyze income, registration status, family vulnerability and territorial prioritization.

Goal: transform social microdata into territorial and managerial analysis.
Delivery: vulnerability indicators, profile views and territorial analytics.
Stack: Python, pandas, numpy, Streamlit and Plotly.
Technical highlight: social indicators and territorial prioritization.

View on GitHub

Social indicators and territorial analysis

Dashboards and analyses built from public data for territorial reading and program comparison.

Bolsa Família vs BPC by territory

Territorial comparison to understand social spending composition and program dependency.

View on GitHub

BPC judicialization and concentration map

Municipal view of benefit concentration and judicialization signals.

View on GitHub

Bolsa Família territorial evolution

Territorial follow-up combining social and operational views.

View on GitHub

Classical ML, time series and tabular experiments

Cases with regression, classification, forecasting and anomaly labs on structured datasets.

Covid-19 deaths linear regression baseline

Daily time series built from public data for an interpretable baseline model.

View on GitHub

Loan Default XGBoost

Default prediction on tabular data with a risk-oriented classification approach.

View on GitHub

Anomaly Detection Lab sklearn

Complementary anomaly lab with statistical techniques and unsupervised algorithms.

View on GitHub

Sales Forecasting GRU

Sales forecasting experiment comparing sequence architectures.

View on GitHub

Documents, NLP and OCR

Document reading, text classification, scraping and information extraction

This front brings together PDFs, OCR, classification, document consistency, NLP and text monitoring.

Contract reading, payment calendar and escalation with AI

Solution for reading contract PDFs, extracting financial clauses, building expected payment calendars and tracking divergences.

Goal: transform unstructured documents into a monitorable financial flow.
Delivery: contract reading, clause extraction and expected payment calendar generation.
Stack: Python, regex, PDF processing, analytical architecture and agent layer.
Technical highlight: contract extraction, financial reconciliation and workflow automation.

View on GitHub
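
The flow described above (clause extraction, expected calendar, divergence tracking) can be sketched with the standard library alone. This is a hedged illustration, not the project's implementation: the clause wording matched by `CLAUSE_RE`, the helper names and the sample payments are assumptions, and real contracts would need many more patterns.

```python
import calendar
import re
from datetime import date
from decimal import Decimal

# Hypothetical clause wording, e.g.
# "3 monthly installments of R$ 1,500.00, with the first due on 2024-01-31".
CLAUSE_RE = re.compile(
    r"(?P<count>\d+)\s+monthly installments of\s+R\$\s*(?P<amount>[\d,]+\.\d{2})"
    r".*?first due on\s+(?P<start>\d{4}-\d{2}-\d{2})",
    re.IGNORECASE | re.DOTALL,
)


def add_months(d, months):
    """Shift a date by whole months, clamping the day to the target month's length."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def expected_calendar(text):
    """Return [(due_date, amount)] for the first installment clause found, else []."""
    match = CLAUSE_RE.search(text)
    if match is None:
        return []
    count = int(match.group("count"))
    amount = Decimal(match.group("amount").replace(",", ""))
    start = date.fromisoformat(match.group("start"))
    return [(add_months(start, i), amount) for i in range(count)]


def divergences(expected, paid):
    """Compare expected installments with paid amounts keyed by date.

    Returns (due_date, expected_amount, paid_amount_or_None) for each mismatch.
    """
    return [(due, amt, paid.get(due)) for due, amt in expected if paid.get(due) != amt]


contract_text = (
    "The contractor shall receive 3 monthly installments of R$ 1,500.00, "
    "with the first due on 2024-01-31."
)
schedule = expected_calendar(contract_text)
payments = {date(2024, 1, 31): Decimal("1500.00"), date(2024, 2, 29): Decimal("1400.00")}
issues = divergences(schedule, payments)
```

Amounts are parsed as `Decimal` rather than `float` so that comparisons between expected and paid values are exact.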

Technical Request Document Assistant

Integrated flow for technical request reading, structured PDF field extraction and related reference retrieval.

Goal: structure document reading and reference lookup in a single interface.
Delivery: structured PDF extraction and reference retrieval in one app.
Stack: Python, reportlab, pypdf, pandas, scikit-learn and Streamlit.
Technical highlight: structured extraction, semantic retrieval and document assistant design.

View on GitHub

Engineering Document Consistency AI

Pipeline for clause extraction, semantic search, inconsistency detection and human review in a dashboard.

Goal: compare documents, retrieve relevant passages and support inconsistency review.
Delivery: document comparison and semantic retrieval of related passages.
Stack: Python, reportlab, pypdf, pandas, scikit-learn, Streamlit and Plotly.
Technical highlight: document governance, semantic comparison and review workflow.

View on GitHub

Political and Economic News Intelligence Dashboard

Web scraping project using `newspaper3k` to collect news, structure an analytical dataset, apply NLP and publish an interactive dashboard.

Goal: collect, enrich and visualize news content in an analytical structure.
Delivery: scraping, NLP and dashboard with thematic and entity views.
Stack: Python, newspaper3k, pandas, spaCy, Streamlit and Plotly.
Technical highlight: automated collection, NLP over news and executive visualization.

View on GitHub

Extraction, classification and text understanding

Cases built around tags, text classification, routing and operational reading of requests and documents.

LLM Tag Extraction Lab

Comparison between rigid baselines, fuzzy matching, few-shot prompting and human validation.

View on GitHub
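
To make the fuzzy-matching baseline concrete, here is a small sketch that maps misspelled tokens onto a controlled tag vocabulary with `difflib.get_close_matches` from the Python standard library. The vocabulary, the sample sentence and the 0.8 cutoff are assumptions for illustration only; the lab itself may use a different matcher.

```python
import re
from difflib import get_close_matches


def extract_tags(text, vocabulary, cutoff=0.8):
    """Return vocabulary tags whose spelling is close to some token in the text."""
    found = []
    for token in re.findall(r"[a-z]+", text.lower()):
        # n=1 keeps only the best candidate above the similarity cutoff.
        match = get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        if match and match[0] not in found:
            found.append(match[0])
    return found


# Hypothetical tag vocabulary and noisy input.
vocabulary = ["invoice", "refund", "delivery"]
tags = extract_tags("Customer asked about a refnd for the late delivry", vocabulary)
print(tags)
```

A rigid baseline would require exact token matches and miss both tags here, which is the kind of gap such a comparison is meant to expose.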

Maintenance Request Classification

Supervised classification for routing maintenance requests based on text and operational attributes.

View on GitHub

Ticket Classification Pipeline

Classification pipeline for tickets and queue organization by category and priority.

View on GitHub

Fake News Detection

Binary text classification with PyTorch and a sequence-based model.

View on GitHub

OCR, legal and document automation

Projects focused on OCR, structured extraction, automatic form filling and legal workflows.

Document Auto Fill OCR

OCR pipeline for field extraction and automatic pre-filling from photos and scans.

View on GitHub

Processo Judicial OCR

OCR over legal documents, producing structured output that supports operational analysis.

View on GitHub

Judicial Settlement MVP

MVP for settlement evaluation with OCR, enrichment and explainable structuring.

View on GitHub

Invoice Processing UiPath

Document automation for accounts payable with OCR and operational routines.

View on GitHub

Search, RAG and assistants

Context retrieval, ranking and evidence-grounded answer systems

This front brings together retrieval experiments, assisted generation, document Q&A and hybrid search pipelines.

Generative-AI (RAG) assistant for public-sector support

I designed and shipped, end to end, a production Retrieval-Augmented Generation system supporting 20+ scholarship, benefit and accountability programs in the federal public sector. It works across two channels: an assisted-consultation app for the team and an Outlook-integrated assistant that summarizes messages and drafts replies, always reviewed by a person before sending.

What I built: the full RAG architecture, from data to interface, focused on accuracy and reliability.
Anti-hallucination: a curated-answer layer before generation and a curation queue (review and publish), ensuring accuracy on amounts, deadlines and legislation.
Routing and human-in-the-loop: automatic routing to the responsible area and mandatory review of drafts before sending.
Observability: per-program performance dashboards, interaction telemetry and unanswered-question mapping for continuous improvement.
Retrieval: hybrid vector search (dense and sparse), resilient model fallback with circuit breaker and confidence scoring.
Platform: role-based access control, an internal knowledge base and a team chat.
Stack: Python, Streamlit, Databricks (Apps, Vector Search, Delta Lake, SQL Warehouse), BGE-large embeddings and LLMs with fallback.

View presentation | Internal project (source not public)
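
Since the source of this project is not public, the following is only a generic sketch of the "model fallback with circuit breaker" idea mentioned in the retrieval item above. Every name here (`CircuitBreaker`, `generate_with_fallback`, the two stand-in model functions) and every threshold is hypothetical; it shows the pattern, not the production code.

```python
import time


class CircuitBreaker:
    """Opens after max_failures consecutive errors; allows a retry after reset_after seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        """Return True when the primary model may be called."""
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial call through (half-open state).
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def generate_with_fallback(prompt, primary, fallback, breaker):
    """Try the primary model unless the breaker is open; otherwise use the fallback."""
    if breaker.allow():
        try:
            answer = primary(prompt)
        except Exception:
            breaker.record_failure()
        else:
            breaker.record_success()
            return answer, "primary"
    return fallback(prompt), "fallback"


calls = {"primary": 0}


def flaky_primary(prompt):
    """Stand-in for a primary model endpoint that is currently down."""
    calls["primary"] += 1
    raise TimeoutError("primary model unavailable")


def small_fallback(prompt):
    """Stand-in for a secondary model."""
    return f"[fallback] {prompt}"


breaker = CircuitBreaker(max_failures=2)
results = [
    generate_with_fallback("What is the deadline?", flaky_primary, small_fallback, breaker)
    for _ in range(3)
]
```

In this run the primary is attempted twice, the breaker opens, and the third request goes straight to the fallback without waiting on a failing endpoint.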

RAG NLP SQL with LangChain, OpenAI and SQLite

Python application that answers natural language questions over a SQL database by combining schema-aware retrieval and SQL generation.

Goal: allow natural language analytics over relational data.
Delivery: app for question answering over SQL with semantic context and assisted query generation.
Stack: Python, LangChain, SQLAlchemy, SQLite, Streamlit and BM25Retriever.
Technical highlight: RAG applied beyond documents, using schema retrieval to ground SQL generation over structured data.

View on GitHub
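
The pattern of retrieving relevant schema before generating SQL can be sketched with `sqlite3` alone. This is an assumption-laden illustration rather than the repository's code: the token-overlap retriever is a toy stand-in for a real retriever such as BM25, `generate_sql` is a hypothetical placeholder for the LLM call, and the tables and data are invented for the demo.

```python
import re
import sqlite3


def tokenize(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def load_schema(conn):
    """Return {table_name: CREATE TABLE statement} read from sqlite_master."""
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    return {name: sql for name, sql in rows}


def retrieve_tables(question, schema, top_k=1):
    """Rank tables by how many question tokens appear in their DDL."""
    q = tokenize(question)
    ranked = sorted(schema, key=lambda t: len(q & tokenize(schema[t])), reverse=True)
    return ranked[:top_k]


def generate_sql(question, ddl_context):
    """Hypothetical placeholder for the LLM call; returns a fixed query for the demo."""
    return "SELECT SUM(amount) FROM invoices WHERE status = 'paid'"


def answer(conn, question):
    schema = load_schema(conn)
    tables = retrieve_tables(question, schema)
    context = "\n".join(schema[t] for t in tables)
    sql = generate_sql(question, context)
    # Simplistic guardrail for the sketch: refuse anything that is not a SELECT.
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return tables, conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE invoices (id INTEGER PRIMARY KEY, amount REAL, status TEXT);
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    INSERT INTO invoices (amount, status) VALUES (100.0, 'paid'), (50.0, 'open'), (25.5, 'paid');
    """
)
tables, rows = answer(conn, "What is the total amount of paid invoices?")
```

Passing only the retrieved DDL to the generation step keeps the prompt small and limits the tables the model can reference, which matters once a database has many tables.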

Search Performance Assistant for retrieval evaluation

Python application to study document retrieval with TF-IDF, vector indexing, fallback behavior and evidence-based assistant responses.

Goal: explore retrieval in a transparent and comparable way.
Delivery: ingestion, indexing, retrieval and visual explanation of search results.
Stack: Python, scikit-learn, TF-IDF, FAISS, cosine similarity, Tkinter and unittest.
Technical highlight: retrieval evaluation and traceable ranking explanation.

View on GitHub
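
A compact way to see how TF-IDF retrieval with a traceable explanation works is to implement it directly. The sketch below is written from scratch for illustration (the project itself lists scikit-learn and FAISS); the documents, function names and the choice to return matched terms as the "explanation" are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf(tokens, idf):
    """Weight raw term counts by IDF, ignoring terms unseen at indexing time."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def build_index(docs):
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF, so terms present in every document keep a positive weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return idf, [tfidf(toks, idf) for toks in tokenized]


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, idf, vectors, top_k=3):
    """Return (doc_index, score, matched_terms) for documents with a positive score."""
    qv = tfidf(tokenize(query), idf)
    ranked = sorted(((cosine(qv, v), i) for i, v in enumerate(vectors)), reverse=True)
    return [
        (i, round(score, 3), sorted(set(qv) & set(vectors[i])))
        for score, i in ranked[:top_k]
        if score > 0
    ]


# Hypothetical mini-corpus.
docs = [
    "Reset your password from the account settings page",
    "Invoices are issued on the first business day of the month",
    "Contact support to recover a locked account",
]
idf, vectors = build_index(docs)
hits = search("how do I reset my password", idf, vectors)
```

Returning the overlapping terms next to each score is one simple way to make a ranking auditable: a reviewer can see which words drove the match.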

Release Notes Generation Assistant

Python app for assisted release note generation based on release context, pull requests, similarity retrieval and evaluation metrics.

Goal: organize release context and PRs for assisted note generation.
Delivery: release note generation with retrieval and thematic rules.
Stack: Python, scikit-learn, TF-IDF, cosine similarity, Tkinter and unittest.
Technical highlight: product-oriented retrieval pipeline with reproducible evaluation.

View on GitHub

Educational and document assistants

Repositories focused on Q&A, study material organization and internal-knowledge-style assistance.

Academic Paper RAG Search

Question answering over academic papers and technical chapters with evidence-based retrieval.

View on GitHub

Educational RAG Assistant

Educational assistant answering questions over chapters, articles, notes and FAQs.

View on GitHub

Syllabus to Study Guide RAG

Pipeline that turns course material into study guides, summaries and review questions with citations.

View on GitHub

Student Support Copilot

Copilot for academic rules, administrative questions and next-step guidance.

View on GitHub

Retrieval, ranking and search experiments

Experiments around hybrid search, ranking and retrieval pipelines across different domains.

Visual Product Complaint Retrieval

Complaint-oriented retrieval and multimodal search in a product context.

View on GitHub

Hybrid Ranking Product Search

Product search combining different ranking strategies in the same pipeline.

View on GitHub

Hybrid Ranking Support Search

Hybrid ranking system for support tickets and knowledge bases.

View on GitHub
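
One common way to combine rankings from different strategies is reciprocal rank fusion (RRF). The sketch below shows that technique as an example of hybrid ranking in general; it is not claimed to be the method used in these repositories, and the document ids and the constant `k=60` are illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids; each list contributes 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a keyword ranker and a vector ranker.
keyword_ranking = ["faq-12", "ticket-7", "faq-3"]
vector_ranking = ["ticket-7", "faq-3", "faq-12"]
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

RRF only needs rank positions, so it avoids having to calibrate keyword scores against vector similarities.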

PDF to RAG Rechunking

Chunking and rechunking experiments to improve retrieval quality in document pipelines.

View on GitHub
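
As a baseline for chunking experiments like these, a fixed-size word window with overlap is a typical starting point. The function below is a minimal sketch with hypothetical parameter names and defaults; rechunking strategies in the repository may be more elaborate (for example, splitting by headings or sentences).

```python
def chunk_words(text, size=50, overlap=10):
    """Split text into chunks of up to `size` words, repeating `overlap` words between chunks."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once a chunk reaches the end, to avoid a trailing fragment of pure overlap.
        if start + size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(sample, size=5, overlap=2)
```

Varying `size` and `overlap` and re-running the same retrieval evaluation is a simple way to compare chunking settings on equal terms.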

Agents, automation and platform

Tooling for agents, workflows, MCP, MLOps and product delivery

This front combines MCP servers, agent-driven automations, observability and product/platform work.

MCP Docs Assistant

Read-only MCP server for local markdown documentation with resources, tools and prompt support.

Goal: expose local documentation in a format consumable by MCP clients.
Delivery: read-only MCP server with catalog, search and retrieval.
Stack: Python, FastMCP, rank-bm25, frontmatter parsing and markdown.
Technical highlight: MCP design, BM25 retrieval and agent-ready documentation access.

View on GitHub
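
For readers unfamiliar with BM25, the scoring behind this kind of documentation search can be sketched in plain Python. The project lists the rank-bm25 package; the version below is a from-scratch illustration instead, with a non-negative IDF variant, hypothetical class and parameter names, and invented sample pages.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Index:
    """Minimal Okapi BM25 index over a non-empty list of documents."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_freqs = [Counter(toks) for toks in self.doc_tokens]
        self.avgdl = sum(len(toks) for toks in self.doc_tokens) / len(docs)
        n = len(docs)
        df = Counter(term for toks in self.doc_tokens for term in set(toks))
        # IDF variant with +1 inside the log, which keeps weights non-negative.
        self.idf = {t: math.log((n - c + 0.5) / (c + 0.5) + 1) for t, c in df.items()}

    def score(self, query, idx):
        freqs = self.doc_freqs[idx]
        doc_len = len(self.doc_tokens[idx])
        # Length normalization: longer-than-average documents are penalized.
        norm = self.k1 * (1 - self.b + self.b * doc_len / self.avgdl)
        total = 0.0
        for term in tokenize(query):
            f = freqs.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def search(self, query, top_k=3):
        """Return indices of the best-scoring documents, best first."""
        scored = sorted(
            ((self.score(query, i), i) for i in range(len(self.doc_tokens))),
            reverse=True,
        )
        return [i for score, i in scored[:top_k] if score > 0]


# Hypothetical documentation pages.
pages = [
    "Install the server with pip and start it from the command line",
    "Configure search settings in the YAML frontmatter of each markdown page",
    "Troubleshooting: the server fails to start when the port is in use",
]
index = BM25Index(pages)
best = index.search("configure frontmatter")
```

A lexical ranker like this needs no embeddings or external services, which suits a read-only server over local markdown files.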

MCP SQL Analytics Server

MCP server for SQL analytics exposing structured tools for schema inspection and analytical querying.

Goal: enable structured data exploration by agents in a controlled environment.
Delivery: MCP tools for inspection and SQL analytics.
Stack: Python, MCP, SQL analytics and modular tool design.
Technical highlight: agent-oriented data access and tool-based analytics.

View on GitHub

Curriculo Site built with Codex and vibe coding

Personal website and technical portfolio with bilingual pages, booking pages and production publishing on a custom domain.

Goal: consolidate a professional presence on a custom domain.
Delivery: bilingual site with home, portfolio, booking flow and production deployment.
Stack: HTML, CSS, JavaScript, GitHub, Vercel and manual technical content curation.
Technical highlight: AI-assisted prototyping, content structuring and shipping.

View on GitHub

Agents, automation and workflows

Multi-agent setups, agent-driven automation, routing and HITL workflows.

AI Support Triage with HITL

Support triage workflow with human approval, retrieval and automated routing.

View on GitHub

Candidate Screening Workflow n8n

Candidate screening flow with automation and staged evaluation logic.

View on GitHub

Learning Path Agents

Agents for learning-path organization and recommendation.

View on GitHub

Market Intelligence CrewAI

Agent structure for market briefing, synthesis and intelligence workflows.

View on GitHub

Credit and domain-specific agents

Repositories focused on credit, service, fraud prevention and business insights.

Credit Analysis Agent

Supports profile assessment, risk analysis and decision review in credit workflows.

View on GitHub

Customer Service Agent

Structured service workflow built with PydanticAI.

View on GitHub

Fraud Prevention Agent

Fraud signal monitoring and support for prevention-oriented workflows.

View on GitHub

Portfolio Risk Agent

Risk monitoring, alerting and portfolio-oriented analysis.

View on GitHub

MLOps, observability and cloud labs

Repositories for serving, monitoring, feature pipelines, Vertex AI, Kubeflow and cloud experimentation.

ML Model Serving Observability

Model observability with metrics, Prometheus, Grafana and operational monitoring.

View on GitHub

Feature Store Pipeline Metaflow

Versioned and reproducible feature pipeline for training and scoring with Metaflow.

View on GitHub

Vertex AI and Kubeflow Labs

Training pipelines and benchmarks with Vertex AI, Kubeflow and recommendation/computer vision workloads.

View on GitHub

Cloud repositories

Umbrella repositories for GCP, AWS and Azure experiments organized by platform.

GCP

Projects organized by solution type, stack and technical focus
