GSoC Project Ideas

Below are some ideas for directions in which we could jointly push Polypheny forward. Please treat them as starting points for your proposal. If you have other ideas, we would be thrilled to hear them. Do not hesitate to contact us beforehand to get feedback on what you plan to do.

Simply copying and pasting one of the ideas will not work. On the other hand, creating an entirely new project idea without first consulting the mentors might be difficult as well.

LLM-Assisted Query Planning & Plan Explain Copilot

Polypheny already has sophisticated planning/optimization capabilities. This project adds an LLM-powered “copilot” that can: explain plans in plain language, detect common inefficiencies (wrong join order, missing predicates, bad cardinality estimates), and propose safe rewrites (e.g., predicate pushdown, join reordering hints, projection pruning). A key goal is to keep the system robust: the LLM should not directly change execution behavior, but instead generate suggestions that Polypheny can validate (via planner rules, cost checks, or A/B plan comparison) before applying.

Expected outcome: A plan-copilot UI/API that explains query plans and proposes validated optimizations (rewrites/hints) with measurable improvements on benchmark workloads.

Difficulty: hard
Size: large (~350 hours)
Skills: Java, LLM integration
Mentor: Yiming, Marco

Driver for PHP, NodeJS, Ruby, …

Currently, there is a JDBC driver, a C++ driver, a .NET driver, a Go driver, and a Python connector for Polypheny. In this project, support for another language or framework shall be added. This project is explicitly intended for developers who have experience interacting with databases in a specific language or framework. Feel free to link references to your experience with that language or framework in your proposal.

Expected outcome: A driver for a not yet supported programming language or framework that allows querying Polypheny from that language or framework.

Difficulty: medium
Size: medium (~175 hours)
Skills: Good knowledge of the programming language
Mentor: Yiming, Martin

Notebook “Python-to-Query” Refactoring Assistant

Polypheny Notebooks often mix Python data manipulation (e.g., pandas-like operations) with SQL/relational queries. This project builds a tool that detects DataFrame-style transformations in notebook cells and proposes an equivalent Polypheny query (or a sequence of queries) that can be executed closer to the data (pushdown), improving reproducibility and performance. The assistant should also support “result equivalence checks” on sampled data to build trust, and provide a UX that lets users accept, edit, or reject suggestions.
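The "result equivalence check" mentioned above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a description of existing Polypheny functionality: an in-memory SQLite database stands in for Polypheny, and the table `orders`, its columns, and all function names are hypothetical. The idea is to run the original Python logic and the candidate query on the same sample and compare the results as multisets, ignoring row order and small floating-point differences.

```python
import sqlite3
from collections import Counter


def normalize_row(row, float_digits=6):
    """Round floats so tiny numeric differences do not break the comparison."""
    return tuple(round(v, float_digits) if isinstance(v, float) else v for v in row)


def results_equivalent(rows_a, rows_b, float_digits=6):
    """Compare two result sets as multisets (row order is ignored)."""
    a = Counter(normalize_row(r, float_digits) for r in rows_a)
    b = Counter(normalize_row(r, float_digits) for r in rows_b)
    return a == b


def python_version(sample):
    """Hypothetical notebook cell: total amount per city for amounts > 10."""
    totals = {}
    for city, amount in sample:
        if amount > 10:
            totals[city] = totals.get(city, 0.0) + amount
    return list(totals.items())


def query_version(sample, sql):
    """Run the candidate query on the same sample (SQLite as a stand-in)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (city TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", sample)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


SAMPLE = [("Basel", 12.5), ("Basel", 20.0), ("Zurich", 5.0), ("Bern", 11.0)]
CANDIDATE_SQL = "SELECT city, SUM(amount) FROM orders WHERE amount > 10 GROUP BY city"
```

A suggestion would only be offered to the user if `results_equivalent(python_version(SAMPLE), query_version(SAMPLE, CANDIDATE_SQL))` holds. Agreement on a sample builds trust but does not prove equivalence on the full data, and rounding-based comparison is only an approximation, so the UX should still let users inspect and reject a suggestion.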

Expected outcome: A notebook feature that suggests query-based replacements for common Python data-wrangling patterns, with correctness checks and an interactive UX.

Difficulty: hard
Size: large (~350 hours)
Skills: TypeScript/Angular, Java, LLM integration
Mentor: Marco, David

Natural Language Interface for Polypheny’s Workflow Engine

With the next release, Polypheny will introduce a workflow engine for modeling ETL (Extract, Transform, Load) workflows. This project aims to simplify workflow creation by developing a Natural Language Processing (NLP) interface, allowing users to describe operations in plain language. The system will interpret user input, generate corresponding workflow configurations, and suggest optimizations, making ETL design more accessible to non-experts. The project involves building a robust NLP model to process ETL-related instructions, mapping them to workflow components, and ensuring seamless integration with Polypheny’s engine. Handling ambiguities, refining user input, and maintaining accuracy will be key challenges.

Expected outcome: A functional NLP-powered interface that enables users to configure ETL workflows using natural language.

Difficulty: hard
Size: large (~350 hours)
Skills: Natural Language Processing (NLP), Java
Mentor: David

Embedding Management and Vector Similarity Search

To better support AI workloads (RAG, semantic search, hybrid retrieval), Polypheny could offer first-class management of embeddings and vector similarity queries. The project would implement a vector type (or compatible representation), similarity operators (cosine/dot/L2), and at least one index strategy (exact + optional approximate). It should integrate cleanly with Polypheny’s multimodel world and expose vector search through existing interfaces and/or notebooks.
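For orientation, the three similarity operators and an exact (brute-force) top-k search can be sketched in a few lines. This is a sketch for illustration only: the function names are hypothetical, vectors are plain Python lists, and an actual implementation inside Polypheny would be written in Java and integrated with its type system and query interfaces.

```python
import heapq
import math


def dot(a, b):
    """Dot product; both vectors must have the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean (L2) distance; smaller means more similar."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; undefined for zero vectors."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot(a, b) / norm


def exact_top_k(query, vectors, k):
    """Exact search: score every stored vector and keep the k best keys."""
    scored = ((cosine_similarity(query, vec), key) for key, vec in vectors.items())
    return [key for _, key in heapq.nlargest(k, scored)]
```

The exact search scans all vectors, so its cost grows linearly with the number of stored embeddings. That is the baseline an optional approximate index would be measured against, both for speed and for recall.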

Expected outcome: Vector data support + similarity querying (and optionally indexing) integrated into Polypheny, enabling end-to-end RAG-style pipelines without leaving the DB.

Difficulty: medium-hard
Size: large (~350 hours)
Skills: Java
Mentor: Martin

Tamper-Evident Audit Logging + Security Telemetry for Queries

This project implements a robust auditing pipeline: capture security-relevant events (logins, token usage, privilege checks, schema changes, query executions), store them in a tamper-evident form (e.g., hash-chained records), and expose a queryable view for admins. Optionally, add simple detection rules (rate anomalies, repeated denied accesses, suspicious patterns) to surface security issues early—without needing a full SIEM.
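The hash-chaining idea can be illustrated with a short sketch. It assumes SHA-256 over a canonical JSON encoding of each event, and all names (`append_event`, `verify_chain`, the record layout) are hypothetical and chosen for illustration only. Each record stores the hash of its predecessor, so modifying or removing a record in the middle invalidates every later link.

```python
import hashlib
import json

# Hypothetical starting value for the chain (assumption, not a Polypheny constant).
GENESIS_HASH = "0" * 64


def record_hash(prev_hash, event):
    """Hash the previous record's hash together with a canonical event encoding."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log, event):
    """Append an event, linking it to the hash of the last record."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    log.append({
        "event": event,
        "prev_hash": prev_hash,
        "hash": record_hash(prev_hash, event),
    })


def verify_chain(log):
    """Return the index of the first inconsistent record, or None if intact."""
    prev_hash = GENESIS_HASH
    for index, record in enumerate(log):
        expected = record_hash(prev_hash, record["event"])
        if record["prev_hash"] != prev_hash or record["hash"] != expected:
            return index
        prev_hash = record["hash"]
    return None
```

A chain like this only detects tampering if the most recent hash is also anchored somewhere an attacker cannot rewrite (for example, periodically exported or signed). Otherwise the tail of the log could be truncated or the whole chain recomputed. How to anchor the head would be part of the project design.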

Expected outcome: A tamper-evident audit log subsystem + admin-facing querying/visualization hooks for security telemetry.

Difficulty: medium-hard
Size: medium (~175 hours)
Skills: Java
Mentor: Martin, Marc

CouchDB-like HTTP Query Interface

CouchDB is a popular document-oriented database system. It features an HTTP query interface for querying and manipulating data. The idea of this project is to build a query interface for Polypheny that adheres to the specification of the CouchDB query API. This would make it possible to seamlessly replace a CouchDB database with Polypheny or to use applications written for CouchDB with Polypheny.
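To give a feeling for the kind of mapping involved: CouchDB's `_find` endpoint accepts a JSON "selector" with operators such as `$eq`, `$gt`, `$and`, and `$or`. The sketch below translates a small subset of such a selector into a parameterized SQL-style condition. It is only an illustration under assumptions: the function names are hypothetical, nested fields and most operators are not handled, and an actual interface might map selectors to Polypheny's internal algebra or its document query language rather than to SQL text.

```python
import re

# Subset of comparison operators and their SQL counterparts.
COMPARISON_OPS = {
    "$eq": "=", "$ne": "<>", "$gt": ">", "$gte": ">=", "$lt": "<", "$lte": "<=",
}
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _column(field):
    """Accept only simple field names and quote them as identifiers."""
    if not _IDENT.fullmatch(field):
        raise ValueError(f"unsupported field name: {field!r}")
    return f'"{field}"'


def selector_to_where(selector):
    """Translate a selector subset into (condition, parameters)."""
    clauses, params = [], []
    for field, condition in selector.items():
        if field in ("$and", "$or"):
            if not condition:
                raise ValueError(f"{field} requires at least one sub-selector")
            parts = [selector_to_where(sub) for sub in condition]
            joiner = " AND " if field == "$and" else " OR "
            clauses.append("(" + joiner.join(f"({sql})" for sql, _ in parts) + ")")
            for _, sub_params in parts:
                params.extend(sub_params)
        elif isinstance(condition, dict):
            for op, value in condition.items():
                if op not in COMPARISON_OPS:
                    raise ValueError(f"unsupported operator: {op!r}")
                clauses.append(f"{_column(field)} {COMPARISON_OPS[op]} ?")
                params.append(value)
        else:
            # A bare value is an implicit equality test.
            clauses.append(f"{_column(field)} = ?")
            params.append(condition)
    # An empty selector matches every document.
    return (" AND ".join(clauses) or "1 = 1"), params
```

Values are passed as parameters rather than interpolated into the condition, and field names are validated, because selectors arrive from untrusted HTTP clients.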

Expected outcome: A new query interface that allows retrieving data managed by Polypheny using the CouchDB query syntax.

Difficulty: medium-hard
Size: large (~350 hours)
Skills: Java
Mentor: Isabel