Semantic search asks “how similar in meaning is this phrase to another phrase?”
cool cats is equivalent in meaning to chill felines, and also to groovy kitties, suave Maine Coons and sphynxes with style
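Under the hood this is just vector geometry: phrases are embedded as vectors, and similar meanings point in similar directions. A minimal sketch using made-up 3-dimensional “embeddings” (real models use hundreds of dimensions; the numbers here are purely illustrative):

```python
import math


def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings: phrases about stylish cats point in a similar direction,
# an unrelated phrase points elsewhere
cool_cats = [0.9, 0.8, 0.1]
chill_felines = [0.8, 0.9, 0.2]
tax_returns = [0.1, 0.2, 0.9]

print(cosine_similarity(cool_cats, chill_felines))  # high
print(cosine_similarity(cool_cats, tax_returns))    # low
```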
Why is this data suitable for ‘semantic’ but not ‘keyword’ search?
GitHub stars are public and common across languages/stacks/specialisms
GitHub repos are text rich data
GitHub repos are a ‘weak standard’
Topics are free text
Descriptions are free text
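Concretely, each starred repo arrives as a record whose most useful fields are free text. A sketch of the shape (field names follow the GitHub GraphQL naming used later in the talk, but the exact record shape and values here are illustrative assumptions):

```python
repo = {
    "name": "polars",
    "url": "https://github.com/pola-rs/polars",
    "description": "Dataframes powered by a multithreaded, vectorized query engine, written in Rust",
    "topics": ["dataframe-library", "dataframes", "arrow", "out-of-core"],
    "stargazerCount": 25000,  # illustrative number
}

# The free-text fields are what make semantic search possible
free_text = " ".join([repo["description"], *repo["topics"]])
print(free_text)
```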
An Intuitive Example
Polars vs Pandas

Pandas topics: data-analysis, flexible, alignment, python, data-science
Polars topics: dataframe-library, dataframe, dataframes, arrow, python, out-of-core

Pandas description: “Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more”
Polars description: “Dataframes powered by a multithreaded, vectorized query engine, written in Rust”
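This is exactly where keyword search falls down: Pandas never tags itself with “dataframe”, so a keyword filter on the topics finds only Polars, even though both are dataframe libraries. A toy comparison:

```python
pandas_topics = ["data-analysis", "flexible", "alignment", "python", "data-science"]
polars_topics = ["dataframe-library", "dataframe", "dataframes", "arrow", "python", "out-of-core"]

# Exact keyword match on the topic "dataframe"
keyword_hits = [
    name
    for name, topics in [("pandas", pandas_topics), ("polars", polars_topics)]
    if "dataframe" in topics
]
print(keyword_hits)  # only polars matches, even though pandas is also a dataframe library
```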
The prepare_documents function is responsible for reading in the JSON files and creating Document objects

```python
def prepare_documents(
    repo_contents_dir: str = "./repo_content",
) -> List[Document]:
    """
    Prepare the documents for ingestion into the vectorstore
    """
    ...
    return documents
```
prepare_documents
Get the file paths for each json file for each repo
```python
def prepare_documents(
    repo_contents_dir: str = "./repo_content",
) -> List[Document]:
    """
    Prepare the documents for ingestion into the vectorstore
    """
    file_paths = []
    for file in os.listdir(repo_contents_dir):
        file_paths.append(os.path.join(repo_contents_dir, file))
    ...
    return documents
```
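The elided body reads each JSON file and builds a Document from its free-text fields. A self-contained sketch of that step (the Document stand-in and the field names are my assumptions, not the real starpilot code):

```python
import json
import os
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:  # stand-in for langchain_core.documents.Document
    page_content: str
    metadata: dict = field(default_factory=dict)


def prepare_documents(repo_contents_dir: str = "./repo_content") -> List[Document]:
    documents = []
    for file in os.listdir(repo_contents_dir):
        with open(os.path.join(repo_contents_dir, file)) as f:
            repo = json.load(f)
        # Free-text fields become the searchable content;
        # everything else is kept as filterable metadata
        content = " ".join([repo.get("description") or "", *repo.get("topics", [])])
        documents.append(
            Document(page_content=content, metadata={"name": repo.get("name")})
        )
    return documents
```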
prepare_documents is used to load the Document objects into `Chroma`

```python
from langchain_community.vectorstores import Chroma
from langchain_openai.embeddings import OpenAIEmbeddings

Chroma.from_documents(
    documents=utils.prepare_documents(),
    embedding=OpenAIEmbeddings(model="text-embedding-3-large"),
    persist_directory="./vectorstore-chroma",
)
```
Using the prepare_documents function
Chroma.from_documents also needs an embedding object and a persist_directory
Accessing the Vector Store
Semantic similarity search
Semantic search accepts a query string and returns the most relevant Document objects
```python
def shoot(
    vectorstore_path: str,
    query: str,
    k: int,
    method: SearchMethods = SearchMethods.similarity,
) -> List[Document]:
    """
    Create a retriever from a vectorstore and query it
    """
```
Semantic similarity search
Create a retriever object from the vector store and query it
```python
def shoot(
    vectorstore_path: str,
    query: str,
    k: int,
    method: SearchMethods = SearchMethods.similarity,
) -> List[Document]:
    """
    Create a retriever from a vectorstore and query it
    """
    retriever = Chroma(
        persist_directory=vectorstore_path,
        embedding_function=OpenAIEmbeddings(model="text-embedding-3-large"),
    ).as_retriever(
        search_type=method,
        search_kwargs={"k": k},
    )
    return retriever.get_relevant_documents(query)
```
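SearchMethods itself isn't shown in the slides; a plausible definition is a string enum whose values mirror the search_type strings accepted by the retriever (this is an assumed reconstruction, not the actual starpilot source):

```python
from enum import Enum


class SearchMethods(str, Enum):
    # values mirror the search_type options accepted by .as_retriever()
    similarity = "similarity"
    similarity_score_threshold = "similarity_score_threshold"
    mmr = "mmr"  # maximal marginal relevance

print(SearchMethods.similarity.value)
```

Subclassing str means the enum member can be passed anywhere a plain search_type string is expected.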
Self Querying
Self-querying is a way to pre-filter the results of a semantic search by metadata fields
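Concretely, an LLM decomposes the user's natural-language question into a semantic query plus a metadata filter. The shape it produces looks roughly like this (a hand-written illustration, not captured model output):

```python
user_question = "data frame packages with 100 stars or more"

structured_query = {
    "query": "data frame",                   # goes to semantic similarity search
    "filter": 'gte("stargazerCount", 100)',  # pre-filters on metadata
}
print(structured_query)
```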
````python
def astrologer(
    query: str,
    k: Optional[int] = typer.Option(
        4, help="Number of results to fetch from the vectorstore"
    ),
) -> List[Document]:
    """
    A self-query of the vectorstore that allows the user to search for a repo
    while filtering by attributes

    Example:
    ```
    starpilot astrologer "What can I use to build a web app with Python?"
    starpilot astrologer "Suggest some Rust machine learning crates"
    ```
    """
    metadata_field_info = [
        AttributeInfo(
            name="languages",
            description="the programming languages of a repo. Example: ['python', 'R', 'Rust']",
            type="string",
        ),
        AttributeInfo(
            name="name",
            description="the name of a repository. Example: 'langchain'",
            type="string",
        ),
        AttributeInfo(
            name="topics",
            description="the topics a repository is tagged with. Example: ['data-science', 'machine-learning', 'web-development', 'tidyverse']",
            type="string",
        ),
        AttributeInfo(
            name="url",
            description="the url of a repository on GitHub",
            type="string",
        ),
        AttributeInfo(
            name="stargazerCount",
            description="the number of stars a repository has on GitHub",
            type="number",
        ),
    ]

    document_content_description = "content describing a repository on GitHub"

    prompt = get_query_constructor_prompt(
        document_content_description,
        metadata_field_info,
        examples=[
            (
                "Python machine learning repos",
                {
                    "query": "machine learning",
                    "filter": 'eq("primaryLanguage", "Python")',
                },
            ),
            (
                "Rust Dataframe crates",
                {"query": "data frame", "filter": 'eq("primaryLanguage", "Rust")'},
            ),
            (
                "What R packages do time series analysis",
                {"query": "time series", "filter": 'eq("primaryLanguage", "R")'},
            ),
            (
                "data frame packages with 100 stars or more",
                {
                    "query": "data frame",
                    "filter": 'gte("stargazerCount", 100)',
                },
            ),
        ],
        allowed_comparators=[
            Comparator.EQ,
            Comparator.NE,
            Comparator.GT,
            Comparator.GTE,
            Comparator.LT,
            Comparator.LTE,
        ],
    )

    llm = ChatOpenAI(model="gpt-3.5-turbo")

    output_parser = StructuredQueryOutputParser.from_components()

    query_constructor = prompt | llm | output_parser

    vectorstore = Chroma(
        persist_directory="./vectorstore-chroma",
        embedding_function=OpenAIEmbeddings(model="text-embedding-3-large"),
    )

    retriever = SelfQueryRetriever(
        query_constructor=query_constructor,
        vectorstore=vectorstore,
        structured_query_translator=ChromaTranslator(),
        search_kwargs={"k": k},
    )

    results = retriever.invoke(query)

    return results
````
The astrologer function proceeds in four steps:

1. Create a list of AttributeInfo objects that describe the metadata fields
2. Construct the query prompt with get_query_constructor_prompt, which returns a BasePromptTemplate object
3. Use LangChain Expression Language to chain the prompt, LLM, and output parser into a query constructor
4. Create a SelfQueryRetriever object from the query constructor and the Chroma vector store
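The ChromaTranslator's job is to turn that intermediate filter syntax into Chroma's native where clause. Roughly like this (the operator mapping shown is my understanding of Chroma's filter syntax, not output captured from the library):

```python
# intermediate filter:           eq("primaryLanguage", "Rust")
# Chroma where clause (roughly):
where = {"primaryLanguage": {"$eq": "Rust"}}

# and a numeric comparator:      gte("stargazerCount", 100)
where_stars = {"stargazerCount": {"$gte": 100}}

print(where, where_stars)
```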
Demo: the astrologer function
Evaluation
What I liked
LangChain is “extensively” documented
LangChain has company-backed videos and tutorials
LangChain is a suitable orchestration framework
It’s like scikit-learn for LLMs
What I didn’t like
Heavily OOP
Reading the source code is still necessary
kwargs are often silently ignored and undocumented
‘Wrappers’ and ‘Connectors’ don’t implement the full API
Moves fast and breaks things (@v0.1.*)
Observability is secondary
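The kwargs complaint above is easy to reproduce in miniature: a function that accepts **kwargs happily swallows a misspelled option instead of raising, so a typo silently changes behaviour (a toy example, not LangChain code):

```python
def as_retriever(**kwargs):
    # only 'search_kwargs' is actually read; anything else is ignored
    search_kwargs = kwargs.get("search_kwargs", {"k": 4})
    return search_kwargs


# typo: 'serch_kwargs' is silently dropped and the default k=4 is used
result = as_retriever(serch_kwargs={"k": 10})
print(result)  # {'k': 4}
```

A keyword-only signature without **kwargs would have raised a TypeError at the call site instead.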
Conclusion
Outcome for developers
starpilot is available now: DaveParr/starpilot
Star it, fork it, PR it!
LangChain is also available now: langchain/langchain
This talk is also available now: DaveParr/starpilot-presentation
Outcome for me
I have a working prototype
I have a better understanding of the LangChain library
I’m translating this understanding into another, similar project for Magic: The Gathering, now underway