Welcome to Snowman

Have you ever gotten a newsletter twice? Probably. Finding duplicates in data is a difficult problem, and many matching solutions exist, each with its own approach and tradeoffs. But finding duplicates is not the only hard part: choosing a deduplication solution for a given dataset, or even just comparing two deduplication solutions, is difficult as well. Our benchmark aims to solve exactly this challenge.

Snowman is developed as part of a bachelor's project in collaboration with SAP SE.

For a quick first impression, jump directly into the analyses overview.

Usage

To use our benchmark, see the Basic usage section for details.

You can also use Snowman from your code or create a shared instance of Snowman.

Contributing

If you intend to contribute to our project, take a look at the Development section, which explains how to get started.

Licenses

A complete list of all dependencies and their individual licenses is available in the Licenses JSON.