[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/supabase/supabase/blob/master/examples/ai/semantic_text_deduplication.ipynb)

# Semantic Text Deduplication

In this example we'll use PostgreSQL and pgvector's similarity search, via the `vecs` library, to identify near-duplicate snippets of text.

Our task is to improve IMDB movie reviews by making sure each review on the site is substantive and original.
To achieve that, we'll identify and remove any reviews that are near duplicates of others.

## Install Dependencies

In [1]:
!pip install -qU vecs datasets sentence_transformers flupy tqdm

## Load the Dataset

First, we load the IMDB dataset using the `datasets` library.
It contains the text of 25,000 movie reviews.

In [3]:
from datasets import load_dataset

data = load_dataset("imdb", split="train")
data

Found cached dataset imdb (/Users/oliverrice/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)


Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

In [4]:
# Look at an example review
data["text"][5]

"I would put this at the top of my list of films in the category of unwatchable trash! There are films that are bad, but the worst kind are the ones that are unwatchable but you are suppose to like them because they are supposed to be good for you! The sex sequences, so shocking in its day, couldn't even arouse a rabbit. The so called controversial politics is strictly high school sophomore amateur night Marxism. The film is self-consciously arty in the worst sense of the term. The photography is in a harsh grainy black and white. Some scenes are out of focus or taken from the wrong angle. Even the sound is bad! And some people call this art?<br /><br />"

## Embedding Model

Next, we can use the `sentence-transformers/all-MiniLM-L6-v2` model to create a 384-dimensional text embedding that represents the
semantic meaning of each review. These embeddings are what we'll use for near-duplicate detection.

In [5]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
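
As a rough illustration of how two embeddings can be compared, the sketch below computes cosine distance in plain Python on toy vectors. It assumes cosine distance is the measure in use (the notebook never sets one explicitly), and `cosine_distance` is a hypothetical helper, not part of `vecs` or `sentence_transformers`.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real 384-dimensional embeddings
identical = cosine_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
orthogonal = cosine_distance([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_distance([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])

print(identical, orthogonal, opposite)
```

Under this measure, vectors pointing the same way have a distance near 0, so a smaller distance means two reviews are closer in meaning.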

## Initialize the Vecs Collection

The [`vecs`](https://supabase.github.io/vecs/api/) library provides a Pythonic interface to PostgreSQL and pgvector.
A collection in `vecs` maps 1:1 with a PostgreSQL table.

First you will need to establish a connection to your database. You can find the Postgres connection string in the [Database Settings](https://supabase.com/dashboard/project/_/settings/database) of your Supabase project.

> **Note:** SQLAlchemy requires the connection string to start with `postgresql://` (instead of `postgres://`). Remember to change the scheme after copying the string from the dashboard.

> **Note:** You must use the "connection pooling" string (domain ending in `*.pooler.supabase.com`) with Google Colab since Colab does not support IPv6.

This will also work with any other Postgres provider that supports pgvector.
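
The scheme rename described in the note above can be done in code rather than by hand. This is a minimal sketch: `to_sqlalchemy_url` is a hypothetical helper, and the sample string is a placeholder, not a real credential.

```python
def to_sqlalchemy_url(connection_string: str) -> str:
    """Rewrite a leading `postgres://` scheme to `postgresql://`."""
    legacy_prefix = "postgres://"
    if connection_string.startswith(legacy_prefix):
        return "postgresql://" + connection_string[len(legacy_prefix):]
    # Already in the expected form (or some other scheme): leave untouched
    return connection_string


# Placeholder value in the shape the dashboard provides
copied = "postgres://postgres:password@localhost:5431/db"
DB_CONNECTION = to_sqlalchemy_url(copied)
print(DB_CONNECTION)
```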

In [7]:
import vecs

# Substitute your connection string here
DB_CONNECTION = "postgresql://postgres:password@localhost:5431/db"

# create vector store client
vx = vecs.create_client(DB_CONNECTION)

# create a PostgreSQL/pgvector table named "reviews" to contain the review embeddings
reviews = vx.get_or_create_collection(name="reviews", dimension=384)

## Create Embeddings for Each Review

Now we can iterate over the dataset in batches, producing an embedding for each review.

In [16]:
from typing import List, Dict, Tuple
from flupy import flu
import numpy as np
from tqdm import tqdm


batch_size = 50

records: List[Tuple[str, np.ndarray, Dict]] = []

# Iterate over the dataset in chunks
for chunk_ix, chunk in tqdm(flu(data['text']).chunk(batch_size).enumerate()):

    # Create embeddings for current chunk
    embedding_chunk = model.encode(chunk)

    # Enumerate the embeddings and create a record to insert into the database
    for row_ix, (text, embedding) in enumerate(zip(chunk, embedding_chunk)):
        record_id = chunk_ix * batch_size + row_ix
        records.append((f"{record_id}", embedding, {"text": text}))

500it [07:06,  1.17it/s]
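
The record IDs above are computed as `chunk_ix * batch_size + row_ix`, which means each ID equals the review's position in the dataset (25,000 reviews in batches of 50 gives the 500 iterations shown in the progress output). The sketch below checks that property on a small toy list. `chunked` is a hypothetical stand-in for `flu(...).chunk(...)`, written in plain Python so the example has no dependencies.

```python
from typing import Dict, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


texts = [f"review {i}" for i in range(7)]  # toy data, not the IMDB reviews
batch_size = 3

id_to_text: Dict[str, str] = {}
for chunk_ix, chunk in enumerate(chunked(texts, batch_size)):
    for row_ix, text in enumerate(chunk):
        record_id = chunk_ix * batch_size + row_ix
        id_to_text[f"{record_id}"] = text

# Every ID maps back to the text at the same index in the original list
assert all(id_to_text[str(i)] == texts[i] for i in range(len(texts)))
```

This is why a review can later be looked up by its dataset index.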


## Insert the Embeddings into Postgres

In [17]:
reviews.upsert(records)

## Index the Collection

Indexing the collection creates an index on the vector column in Postgres that significantly improves performance
of similarity queries.

In [22]:
reviews.create_index()

## Search for Near Duplicates

Finally, we can enumerate each review, search for its most similar
reviews, and display them when the results are near duplicates. We could then prune the near-duplicate reviews
to make sure our readers see a new and interesting opinion with each review they choose to read.

In [28]:
for ix, text in tqdm(enumerate(data['text'])):

    # Load the next row from the dataset
    query_results = reviews.fetch(ids=[f'{ix}'])
    
    (query_id, query_embedding, query_meta) = query_results[0]

    # Retrieve the original text from the row's metadata
    query_text = query_meta["text"]

    # To keep the output quick to read, we'll restrict reviews to < 500 characters
    # In a real-world application you would not include this restriction
    if len(query_text) < 500:

        # Query the review embeddings for the most similar 5 reviews
        top_5 = reviews.query(
            query_vector=query_embedding,
            limit=5,
            include_metadata=True,
            include_value=True,
        )

        # For each result
        for result_id, result_distance, result_meta in top_5[1:]:
            
            result_text = result_meta["text"]

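            if (
                # Since our query embedding is in the collection, the nearest result
                # is always itself with a distance of 0. top_5[1:] skips that record,
                # and the 0.01 lower bound also drops exact copies. We then review
                # any others with a distance < 0.17
                0.01 < abs(result_distance) < 0.17
                and len(result_text) < 500
                # IDs are strings, so this comparison is lexicographic; it still
                # ensures each pair is reported at most once
                and query_id < result_id
            ):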
                print(
                    "query_id:", query_id,
                    "\t", "result_id:", result_id,
                    "\t", "distance", round(result_distance, 4),
                    "\n\n", "Query Text",
                    "\n\n", query_meta["text"],
                    "\n\n", "Result Text",
                    "\n\n", result_meta["text"],
                    "\n", "-" * 80
                )

824it [00:05, 187.11it/s]

query_id: 817 	 result_id: 823 	 distance 0.1672 

 Query Text 

 This has to be, by far, the absolute worst movie I have seen in the last 20 years. When I saw that Michael Madsen was in it I figured it couldn't be too bad a movie since he has been in some pretty decent films, and he was a pretty fair actor. WRONG! No one should waste their time on this film. I fast forwarded through 80 percent of it and I don't feel that I missed a thing. 

 Result Text 

 I agree totally with the last commenter this could be the worst movie ever made .I too had to fast forward through most of this movie. Michael Madsen must have done this movie as a favor to someone.The picture quality is grainy all the way through .And what little plot there is,is just plain stupid .I give this movie a 1 out of 10 if I could give it a lower score I would .Don't waste your time on this movie or you'll regret it. 
 --------------------------------------------------------------------------------


8675it [00:47, 166.43it/s]

query_id: 8663 	 result_id: 8669 	 distance 0.1532 

 Query Text 

 Rodney Dangerfield is a great. He has done a lot of great works. But this one....is awful. The whole plot is whack. It could have been much better. The jokes in the movie aren't funny....their stupid. This was very not so hilarious. He can do much better than this. 

 Result Text 

 As much as I love Rodney Dangerfield, this was a terrible movie. The plot was kind of a holistic rip off of various movies, but unfortunately they forgot to rip off any good jokes. In addition it was annoying and boring and that's being kind. If you're looking for a good laugh, rent a copy of Private Parts. 
 --------------------------------------------------------------------------------


10522it [00:58, 164.98it/s]

query_id: 10486 	 result_id: 5437 	 distance 0.1634 

 Query Text 

 Aside for being classic in the aspect of its cheesy lines and terrible acting, this film should never be watched unless you are looking for a good cure for your insomnia. I can't imagine anyone actually thinking this was a "good movie." 

 Result Text 

 Aside from the horrendous acting and the ridiculous and ludicrous plot, this movie wasn't too bad. Unfortunately, that doesn't leave much movie not to suck. Do not waste your time on this film, even if you find yourself suffering from insomnia, as I did. Watch an infomercial instead. 
 --------------------------------------------------------------------------------


16405it [01:36, 97.07it/s] 

query_id: 16401 	 result_id: 16424 	 distance 0.1583 

 Query Text 

 This has to be the funniest stand up comedy I have ever seen. Eddie Izzard is a genius, he picks in Brits, Americans and everyone in between. His style is completely natural and completely hilarious. I doubt that anyone could sit through this and not laugh their a** off. Watch, enjoy, it's funny. 

 Result Text 

 Until I saw this special on HBO, I had never heard of Eddie Izzard. I sure am glad that I have now! He is one of the funniest comedians I have ever seen! Rarely has a comedian immersed himself so completely in his craft then Eddie. I could not stop laughing for the entire show. If you like to laugh you HAVE to see this special! 
 --------------------------------------------------------------------------------
query_id: 16401 	 result_id: 16408 	 distance 0.1663 

 Query Text 

 This has to be the funniest stand up comedy I have ever seen. Eddie Izzard is a genius, he picks in Brits, Americans and everyone i

18953it [01:54, 101.67it/s]

query_id: 18936 	 result_id: 18942 	 distance 0.1325 

 Query Text 

 World At War is perhaps the greatest documentary series of all time. The historical research is virtually flawless. Even after a quarter century, it is the most accurate and definitive documentary about WW2. An invaluable historical work that includes interviews with some of the most important and fascinating figures from the war. I highly recommend it as a learning experience. 

 Result Text 

 The world at war is one of the best documentaries about world war 2. <br /><br />The 24 episodes cover the war and what it was like in the countries involved in it. The first episode tells us how the Hitler came to power, and how he was able to build up one of the strongest armies in the world. They also fucus on the military actions taken during the war, and the holocaust. One of the strongest and best documentaries ever made. All of you must watch this. Perfection! 10/10<br /><br /> 
 ---------------------------------------

21236it [02:09, 144.35it/s]

query_id: 21202 	 result_id: 2618 	 distance 0.1586 

 Query Text 

 I think Hollow Point is a funny film with some good moments I have never seen before in action movies. Well,both Tia Carrere and Thomas Ian Griffith aren't so good in acting, but Tia Carrere is nice and good looking girl, isn't it? But Donald Sutherland is superb in his role so-so mad gangster. 

 Result Text 

 Hollow point is an alright movie worth a half price rental or if nothing else is on a good time waster with no thought required. There are the requisite explosions and hammy acting and pretty ladies. A pretty good cast with Donald Sutherland, John Lithgow, and the lovely Tia Carrere. This cast plus a light hearted touch make for a not a great movie but a fun one..on a scale of one to ten ..a 4 
 --------------------------------------------------------------------------------


25000it [02:38, 157.92it/s]
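
The pruning step mentioned earlier is not implemented in this notebook. One possible approach, sketched below, keeps the first review of each reported pair and marks the other for removal. `ids_to_prune` is a hypothetical helper and this greedy policy is an assumption, not the notebook's method. The pairs are the `query_id` / `result_id` values printed above.

```python
from typing import Iterable, Set, Tuple


def ids_to_prune(pairs: Iterable[Tuple[str, str]]) -> Set[str]:
    """Greedily pick one review to drop from each near-duplicate pair."""
    to_delete: Set[str] = set()
    for query_id, result_id in pairs:
        # If the query review is already marked for removal, keep the result
        if query_id not in to_delete:
            to_delete.add(result_id)
    return to_delete


# (query_id, result_id) pairs taken from the output above
pairs = [
    ("817", "823"),
    ("8663", "8669"),
    ("10486", "5437"),
    ("16401", "16424"),
    ("16401", "16408"),
    ("18936", "18942"),
    ("21202", "2618"),
]

duplicates = ids_to_prune(pairs)
print(sorted(duplicates, key=int))

# Assumption: with a live collection, the marked records could then be removed
# with something like reviews.delete(ids=list(duplicates)); not run here.
```

In practice a human review of the flagged pairs is likely worthwhile before deleting anything, since several of the matches above are reviews of the same film rather than copies of each other.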
