Files
HKSingleParty/99_references/supabase-examples/ai/semantic_text_deduplication.ipynb
2025-05-28 09:55:51 +08:00

502 lines
19 KiB
Plaintext

{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "4f3b89c3",
"metadata": {},
"source": [
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/supabase/supabase/blob/master/examples/ai/semantic_text_deduplication.ipynb)"
]
},
{
"cell_type": "markdown",
"id": "2a1f181d-feeb-4b29-aabc-67a75234b92c",
"metadata": {},
"source": [
"# Semantic Text Deduplication\n",
"\n",
"In this example we'll use PostgreSQL + pgvectors similarity search using the `vecs` library to identify near duplicate snippets of text.\n",
"\n",
"Our task is to improve IMDB movie reviews by making sure each review on the site is substantive and original.\n",
"To achieve that, we'll identify and remove any reviews that are near duplicates of others."
]
},
{
"cell_type": "markdown",
"id": "3cab93f5-10d0-47c5-9f4e-64921461e7e2",
"metadata": {},
"source": [
"## Install Dependencies"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "a41bc3e4-ea52-43aa-9239-a431b49f029e",
"metadata": {},
"outputs": [],
"source": [
"!pip install -qU vecs datasets sentence_transformers flupy tqdm"
]
},
{
"cell_type": "markdown",
"id": "4dbd176f-3b4e-4d41-a72d-1e1affe6ecae",
"metadata": {},
"source": [
"## Load the Dataset\n",
"\n",
"First we load the IMBD dataset using the datasets library.\n",
"It contains the text of 25000 movie reviews."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "dc6b0bc2-b95f-4190-bf77-fa2dc57fc247",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Found cached dataset imdb (/Users/oliverrice/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)\n"
]
},
{
"data": {
"text/plain": [
"Dataset({\n",
" features: ['text', 'label'],\n",
" num_rows: 25000\n",
"})"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from datasets import load_dataset\n",
"\n",
"data = load_dataset(\"imdb\", split=\"train\")\n",
"data"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "da15eff4-932c-4e0c-b938-ba188af62b63",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"I would put this at the top of my list of films in the category of unwatchable trash! There are films that are bad, but the worst kind are the ones that are unwatchable but you are suppose to like them because they are supposed to be good for you! The sex sequences, so shocking in its day, couldn't even arouse a rabbit. The so called controversial politics is strictly high school sophomore amateur night Marxism. The film is self-consciously arty in the worst sense of the term. The photography is in a harsh grainy black and white. Some scenes are out of focus or taken from the wrong angle. Even the sound is bad! And some people call this art?<br /><br />\""
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Look at an example review\n",
"data[\"text\"][5]"
]
},
{
"cell_type": "markdown",
"id": "955b2700-8242-40eb-ac3f-d479a0312693",
"metadata": {},
"source": [
"## Embedding Model\n",
"\n",
"Next, we can use the `sentence-transformers/all-MiniLM-L6-v2` model to create a 384 dimensional text embedding that represents the\n",
"semantic meaning of each review. These embeddings are what we'll use for near-duplicate detection."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "ce9bc85b-e844-407c-a0ad-ccf6af3c8866",
"metadata": {},
"outputs": [],
"source": [
"from sentence_transformers import SentenceTransformer\n",
"\n",
"model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')"
]
},
{
"cell_type": "markdown",
"id": "9e82862d-440a-4f66-9ed7-0eaa6a0f4062",
"metadata": {},
"source": [
"## Initialize the Vecs Collection\n",
"\n",
"The [`vecs`](https://supabase.github.io/vecs/api/) library wraps a pythonic interface around PostgreSQL and pgvector.\n",
"A collection in `vecs` maps 1:1 with a PostgreSQL table.\n",
"\n",
"First you will need to establish a connection to your database. You can find the Postgres connection string in the [Database Settings](https://supabase.com/dashboard/project/_/settings/database) of your Supabase project.\n",
"\n",
"> **Note:** SQLAlchemy requires the connection string to start with `postgresql://` (instead of `postgres://`). Don't forget to rename this after copying the string from the dashboard.\n",
"\n",
"> **Note:** You must use the \"connection pooling\" string (domain ending in `*.pooler.supabase.com`) with Google Colab since Colab does not support IPv6.\n",
"\n",
"This will also work with any other Postgres provider that supports pgvector."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "d2771545-d209-4ceb-a222-ed139a4620f2",
"metadata": {},
"outputs": [],
"source": [
"import vecs\n",
"\n",
"# Substitute your connection string here\n",
"DB_CONNECTION = \"postgresql://postgres:password@localhost:5431/db\"\n",
"\n",
"# create vector store client\n",
"vx = vecs.create_client(DB_CONNECTION)\n",
"\n",
"# create a PostgreSQL/pgvector table named \"reviews\" to contain the review embeddings\n",
"reviews = vx.get_or_create_collection(name=\"reviews\", dimension=384)"
]
},
{
"cell_type": "markdown",
"id": "9ae030b4-cfd2-43bc-802f-e7ac4007d2ad",
"metadata": {},
"source": [
"## Create Embeddings for Each Review\n",
"\n",
"Now we can iterate over the dataset, producing embeddings for the reviews"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "bd134310-9da1-4448-8358-9fc491c98e1e",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"500it [07:06, 1.17it/s]\n"
]
}
],
"source": [
"from typing import List, Dict, Tuple\n",
"from flupy import flu\n",
"import numpy as np\n",
"from tqdm import tqdm\n",
"\n",
"\n",
"batch_size = 50\n",
"\n",
"records: List[Tuple[str, np.ndarray, Dict]] = []\n",
"\n",
"# Iterate over the dataset in chunks\n",
"for chunk_ix, chunk in tqdm(flu(data['text']).chunk(batch_size).enumerate()):\n",
"\n",
" # Create embeddings for current chunk\n",
" embedding_chunk = model.encode(chunk)\n",
"\n",
" # Enumerate the embeddings and create a record to insert into the database\n",
" for row_ix, (text, embedding) in enumerate(zip(chunk, embedding_chunk)):\n",
" record_id = chunk_ix * batch_size + row_ix\n",
" records.append((f\"{record_id}\", embedding, {\"text\": text}))"
]
},
{
"cell_type": "markdown",
"id": "6ef285d7-dc7b-4576-ab2b-bf0a4ba06de4",
"metadata": {},
"source": [
"## Insert the Embeddings into Postgres"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "6156e7f9-b78a-4ab9-8002-b09ec8716be2",
"metadata": {},
"outputs": [],
"source": [
"reviews.upsert(records)"
]
},
{
"cell_type": "markdown",
"id": "aa095a53-dd9c-4a3c-93bf-c54708c67765",
"metadata": {},
"source": [
"## Index the Collection\n",
"\n",
"Indexing the collection creates an index on the vector column in Postgres that significantly improves performance\n",
"of similarity queries."
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "6b96d3af-5592-4fbc-81ab-f77b4228ccaa",
"metadata": {},
"outputs": [],
"source": [
"reviews.create_index()"
]
},
{
"cell_type": "markdown",
"id": "371631e6-4995-484f-9d76-40ab5e7b2e16",
"metadata": {},
"source": [
"## Search for Near Duplicates\n",
"\n",
"Finally we can enumerate each review, searching for the most similar\n",
"reviews and displaying them if the results are near duplicates. We could then prune out the near-duplicate reviews\n",
"to make sure our viewers see a new and interesting opinion with each review they choose to read."
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "d2a5c699-7c04-48ad-9033-a310bbb0dffc",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"824it [00:05, 187.11it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"query_id: 817 \t result_id: 823 \t distance 0.1672 \n",
"\n",
" Query Text \n",
"\n",
" This has to be, by far, the absolute worst movie I have seen in the last 20 years. When I saw that Michael Madsen was in it I figured it couldn't be too bad a movie since he has been in some pretty decent films, and he was a pretty fair actor. WRONG! No one should waste their time on this film. I fast forwarded through 80 percent of it and I don't feel that I missed a thing. \n",
"\n",
" Result Text \n",
"\n",
" I agree totally with the last commenter this could be the worst movie ever made .I too had to fast forward through most of this movie. Michael Madsen must have done this movie as a favor to someone.The picture quality is grainy all the way through .And what little plot there is,is just plain stupid .I give this movie a 1 out of 10 if I could give it a lower score I would .Don't waste your time on this movie or you'll regret it. \n",
" --------------------------------------------------------------------------------\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"8675it [00:47, 166.43it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"query_id: 8663 \t result_id: 8669 \t distance 0.1532 \n",
"\n",
" Query Text \n",
"\n",
" Rodney Dangerfield is a great. He has done a lot of great works. But this one....is awful. The whole plot is whack. It could have been much better. The jokes in the movie aren't funny....their stupid. This was very not so hilarious. He can do much better than this. \n",
"\n",
" Result Text \n",
"\n",
" As much as I love Rodney Dangerfield, this was a terrible movie. The plot was kind of a holistic rip off of various movies, but unfortunately they forgot to rip off any good jokes. In addition it was annoying and boring and that's being kind. If you're looking for a good laugh, rent a copy of Private Parts. \n",
" --------------------------------------------------------------------------------\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"10522it [00:58, 164.98it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"query_id: 10486 \t result_id: 5437 \t distance 0.1634 \n",
"\n",
" Query Text \n",
"\n",
" Aside for being classic in the aspect of its cheesy lines and terrible acting, this film should never be watched unless you are looking for a good cure for your insomnia. I can't imagine anyone actually thinking this was a \"good movie.\" \n",
"\n",
" Result Text \n",
"\n",
" Aside from the horrendous acting and the ridiculous and ludicrous plot, this movie wasn't too bad. Unfortunately, that doesn't leave much movie not to suck. Do not waste your time on this film, even if you find yourself suffering from insomnia, as I did. Watch an infomercial instead. \n",
" --------------------------------------------------------------------------------\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"16405it [01:36, 97.07it/s] "
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"query_id: 16401 \t result_id: 16424 \t distance 0.1583 \n",
"\n",
" Query Text \n",
"\n",
" This has to be the funniest stand up comedy I have ever seen. Eddie Izzard is a genius, he picks in Brits, Americans and everyone in between. His style is completely natural and completely hilarious. I doubt that anyone could sit through this and not laugh their a** off. Watch, enjoy, it's funny. \n",
"\n",
" Result Text \n",
"\n",
" Until I saw this special on HBO, I had never heard of Eddie Izzard. I sure am glad that I have now! He is one of the funniest comedians I have ever seen! Rarely has a comedian immersed himself so completely in his craft then Eddie. I could not stop laughing for the entire show. If you like to laugh you HAVE to see this special! \n",
" --------------------------------------------------------------------------------\n",
"query_id: 16401 \t result_id: 16408 \t distance 0.1663 \n",
"\n",
" Query Text \n",
"\n",
" This has to be the funniest stand up comedy I have ever seen. Eddie Izzard is a genius, he picks in Brits, Americans and everyone in between. His style is completely natural and completely hilarious. I doubt that anyone could sit through this and not laugh their a** off. Watch, enjoy, it's funny. \n",
"\n",
" Result Text \n",
"\n",
" This is another gem of a stand up show from Eddie Izzard . You cannot fail to laugh at the wide range of topics he talks about. He even takes the piss out of his American audiance at times and most of them didnt even realise it! A must see for anybody who likes comedians. 9 out of 10. \n",
" --------------------------------------------------------------------------------\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"18953it [01:54, 101.67it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"query_id: 18936 \t result_id: 18942 \t distance 0.1325 \n",
"\n",
" Query Text \n",
"\n",
" World At War is perhaps the greatest documentary series of all time. The historical research is virtually flawless. Even after a quarter century, it is the most accurate and definitive documentary about WW2. An invaluable historical work that includes interviews with some of the most important and fascinating figures from the war. I highly recommend it as a learning experience. \n",
"\n",
" Result Text \n",
"\n",
" The world at war is one of the best documentaries about world war 2. <br /><br />The 24 episodes cover the war and what it was like in the countries involved in it. The first episode tells us how the Hitler came to power, and how he was able to build up one of the strongest armies in the world. They also fucus on the military actions taken during the war, and the holocaust. One of the strongest and best documentaries ever made. All of you must watch this. Perfection! 10/10<br /><br /> \n",
" --------------------------------------------------------------------------------\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"21236it [02:09, 144.35it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"query_id: 21202 \t result_id: 2618 \t distance 0.1586 \n",
"\n",
" Query Text \n",
"\n",
" I think Hollow Point is a funny film with some good moments I have never seen before in action movies. Well,both Tia Carrere and Thomas Ian Griffith aren't so good in acting, but Tia Carrere is nice and good looking girl, isn't it? But Donald Sutherland is superb in his role so-so mad gangster. \n",
"\n",
" Result Text \n",
"\n",
" Hollow point is an alright movie worth a half price rental or if nothing else is on a good time waster with no thought required. There are the requisite explosions and hammy acting and pretty ladies. A pretty good cast with Donald Sutherland, John Lithgow, and the lovely Tia Carrere. This cast plus a light hearted touch make for a not a great movie but a fun one..on a scale of one to ten ..a 4 \n",
" --------------------------------------------------------------------------------\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"25000it [02:38, 157.92it/s]\n"
]
}
],
"source": [
"for ix, text in tqdm(enumerate(data['text'])):\n",
"\n",
" # Load the next row from the dataset\n",
" query_results = reviews.fetch(ids=[f'{ix}'])\n",
" \n",
" (query_id, query_embedding, query_meta) = query_results[0]\n",
"\n",
" # Retrieve the original text from the row's metadata\n",
" query_text = query_meta[\"text\"]\n",
"\n",
" # To keep the output easy to read quickly, we'll restrict reviews to < 500 characters\n",
" # In the real-world you would not include this restriction\n",
" if len(query_text) < 500:\n",
"\n",
" # Query the review embeddings for the most similar 5 reviews\n",
" top_5 = reviews.query(\n",
" query_vector=query_embedding,\n",
" limit = 5,\n",
" include_metadata= True,\n",
" include_value=True\n",
" )\n",
"\n",
" # For each result\n",
" for result_id, result_distance, result_meta in top_5[1:]:\n",
" \n",
" result_text = result_meta[\"text\"]\n",
"\n",
" if (\n",
" # Since our query embedding is in the collection, the nearest result\n",
" # is always itself with a distance of 0. We exclude that record and \n",
" # review any others with a distance < 0.17\n",
" 0.01 < abs(result_distance) < 0.17\n",
" and len(result_text) < 500\n",
" and query_id < result_id\n",
" ):\n",
" print(\n",
" \"query_id:\", query_id,\n",
" \"\\t\", \"result_id:\", result_id,\n",
" \"\\t\", \"distance\", round(result_distance, 4),\n",
" \"\\n\\n\", \"Query Text\",\n",
" \"\\n\\n\", query_meta[\"text\"],\n",
" \"\\n\\n\", \"Result Text\",\n",
" \"\\n\\n\", result_meta[\"text\"],\n",
" \"\\n\", \"-\" * 80\n",
" )"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}