starpilot

A tool to suggest GitHub repos from your user’s stars

Introduction

Goals for the talk

  • Share an AI project I’ve been working on
  • Share some of the design decisions I made to build it
  • Share some of the experience I had building it
  • Share some of the code I wrote to build it

Level Set

🙌 🙌🏻 🙌🏼 🙌🏽 🙌🏾 🙌🏿 🙌🏿

  • Retrieval Augmented Generation
  • Vector Embedding
  • Vector Store
  • Vector Similarity

Level Set

  • Retrieval Augmented Generation
    • Using a language model supported by a vector store
  • Vector Embedding
    • Converting text into a list of numbers to represent semantic meaning
  • Vector Store
    • Database optimized for storing embedding vectors
  • Vector Similarity
    • Quantifying the similarity between embedding vectors
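
A minimal sketch of vector similarity, using cosine similarity as one common measure. The three-dimensional vectors are made up for illustration; real embedding models return hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for this example only.
cats = [0.9, 0.1, 0.0]
felines = [0.8, 0.2, 0.1]
tractors = [0.0, 0.1, 0.9]

print(cosine_similarity(cats, felines))   # close to 1: similar direction
print(cosine_similarity(cats, tractors))  # close to 0: unrelated direction
```

A score near 1 means the vectors point the same way, a score near 0 means they are unrelated.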

Level Set

🙌 🙌🏻 🙌🏼 🙌🏽 🙌🏾 🙌🏿 🙌🏿

Who is comfortable reading Python?

The Project

Goals for an AI project

  • As a Data Scientist
  • I want to build something novel with AI
  • So that I can learn and experiment with the technology
  • And maybe make something useful for other people

The Project is starpilot

A tool to suggest GitHub repos from your user’s stars based on semantic similarity

Demo 1

Why is ‘semantic similarity’ important?

  • Keyword search tests “how close a string is to another string”
    • Exact Match: cool cats != chill felines
    • Partial Match: cool cat partially fits cooler cats
  • Semantic search tests “how similar in meaning is the phrase to another phrase”
    • cool cats is equivalent in meaning to chill felines, and also to groovy kitties, suave Maine Coon and sphynx with style
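
The keyword half of this comparison can be sketched with the standard library alone. `difflib.SequenceMatcher` is only one way to score partial matches, and is an assumption here rather than what any particular search engine does.

```python
from difflib import SequenceMatcher


def exact_match(a: str, b: str) -> bool:
    """True only when the two strings are identical."""
    return a == b


def partial_match(a: str, b: str) -> float:
    """Character-overlap ratio in [0, 1]; it knows nothing about meaning."""
    return SequenceMatcher(None, a, b).ratio()


print(exact_match("cool cats", "chill felines"))    # False
print(partial_match("cool cat", "cooler cats"))     # high: many shared characters
print(partial_match("cool cats", "chill felines"))  # low, despite the same meaning
```

The last line is the gap semantic search fills: the strings share few characters but mean the same thing.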

An Intuitive Example

Polars vs Pandas

  • pandas topics
    • data-analysis
    • flexible
    • alignment
    • python
    • data-science
  • Polars topics
    • dataframe-library
    • dataframe
    • dataframes
    • arrow
    • python
    • out-of-core

Polars vs Pandas

pandas: Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

Polars: Dataframes powered by a multithreaded, vectorized query engine, written in Rust

Polars vs Pandas

Embedding Example

  • “polars module”
  • “pandas package”
  • “panda breeding”
  • “polar environment”

Embedding Example

flowchart LR

subgraph strings
pm("polars module")
pp("pandas package")
pb("panda breeding")
pe("polar environment")
end

subgraph embedder
embed([function])

pm --> embed
pp --> embed
pb --> embed
pe --> embed
end

subgraph vector
pmv("0.24, 0.12")
ppv("0.91, 0.83")
pbv("0.83, 0.12")
pev("0.91, 0.24")

embed --> pmv
embed --> ppv
embed --> pbv
embed --> pev
end

Embedding Example

Embedding Example

First Experiment

Solution Requirements

Solution Architecture

Outcome

  • IT’S ALIVE
    • Data gets read
    • Read data gets embedded
    • Embedded data gets stored
    • Embedded data gets queried

Outcome

  • IT STINKS
    • Reading data is punishingly slow
      • 200 stars in 30 minutes
    • Querying data is basic
      • Semantic search works, but is broad
      • ‘DataFrames for Python’ returns Pandas, and Tidyverse in R, and DataTables in JavaScript

Refinement

Refined Version

flowchart LR

LLM([LLM])
GH([GitHub Repo])
raw(raw_GH/*.json)
prepped(enhanced_GH/*.json)
style prepped fill:red
docs_in(Documents)
vec_doc(Vector Document)
DB[(Chroma)]
self_query(Self Query)
vec_q(Vector Embedding)
db_filter(Filter Value)
cli([cli])
q(query)
docs_out(Document)
a(suggestions)

subgraph 0 Resources
    GH
    DB
    LLM
end

subgraph 1 ETL
    GH--->raw
    raw-->prepped
    prepped-->docs_in
    docs_in-->LLM
    LLM-->vec_doc
    vec_doc-->DB
end

subgraph 2 Execution
    LLM-->self_query
    self_query-->db_filter
    style self_query fill:red
    self_query-->vec_q
    db_filter-->DB
    style db_filter fill:red
    vec_q --> DB
    cli-->q
    q -->LLM
    DB-->docs_out
    docs_out-->a
end

What this gets us

  • Faster Data reads from GitHub
    • 20x
  • Enhanced data preprocessing
    • Tags, Stars, Primary Language
  • Selective data load
    • ‘Popular’ repos
  • Self Querying
    • Language specific results
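
A sketch of what a selective load of ‘popular’ repos could look like. The cut-off of 100 stars and the `select_popular` helper are hypothetical; the slides do not say how popularity is decided.

```python
from typing import Iterable

# Hypothetical cut-off: the talk does not define what counts as 'popular'.
MIN_STARS = 100


def select_popular(repos: Iterable[dict], min_stars: int = MIN_STARS) -> list[dict]:
    """Keep only repo records whose stargazerCount meets the threshold."""
    return [r for r in repos if r.get("stargazerCount", 0) >= min_stars]


# Made-up records using the same field names as the example JSON.
repos = [
    {"name": "big-project", "stargazerCount": 77908},
    {"name": "weekend-hack", "stargazerCount": 3},
    {"name": "no-count"},
]

print([r["name"] for r in select_popular(repos)])
```

Filtering before embedding means fewer embedding calls and less noise in the vector store.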

Example Data

Example JSON in

{
    "name": "langchain",
    "nameWithOwner": "langchain-ai/langchain",
    "url": "https://github.com/langchain-ai/langchain",
    "homepageUrl": "https://python.langchain.com",
    "description": "\ud83e\udd9c\ud83d\udd17 Build context-aware reasoning applications",
    "stargazerCount": 77908,
    "primaryLanguage": "Python",
    "languages": [
        "Python",
        "Makefile",
        "HTML",
        "Dockerfile",
        "TeX",
        "JavaScript",
        "Shell",
        "XSLT",
        "Jupyter Notebook",
        "MDX"
    ],
    "owner": "langchain-ai",
    "content": "langchain \ud83e\udd9c\ud83d\udd17 Build context-aware reasoning applications Python"
}
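
In this record the content field looks like the name, description and primary language joined by spaces. That is an inference from the single example above, not something the slides state, and `build_content` is a hypothetical helper.

```python
import json


def build_content(record: dict) -> str:
    """Join the text fields that describe a repo, skipping any that are missing."""
    parts = [
        record.get("name"),
        record.get("description"),
        record.get("primaryLanguage"),
    ]
    return " ".join(part for part in parts if part)


# A cut-down copy of the example record; json.loads decodes the emoji escapes.
raw = r'''{
    "name": "langchain",
    "description": "\ud83e\udd9c\ud83d\udd17 Build context-aware reasoning applications",
    "primaryLanguage": "Python"
}'''
record = json.loads(raw)

print(build_content(record))
```

This string is what gets embedded, so everything the search should "understand" about a repo has to be in it.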

The prepare_documents function

prepare_documents

  • The prepare_documents function is responsible for reading in the JSON files and creating Document objects
def prepare_documents(
    repo_contents_dir: str = "./repo_content",
) -> List[Document]:
    """
    Prepare the documents for ingestion into the vectorstore
    """
    ...
    return documents

prepare_documents

Get the file path of the JSON file for each repo

def prepare_documents(
    repo_contents_dir: str = "./repo_content",
) -> List[Document]:
    """
    Prepare the documents for ingestion into the vectorstore
    """

    file_paths = []
    for file in os.listdir(repo_contents_dir):
        file_paths.append(os.path.join(repo_contents_dir, file))

    ...

    return documents

prepare_documents

Load the Document objects from the JSON files

def prepare_documents(
    repo_contents_dir: str = "./repo_content",
) -> List[Document]:
    ...
    documents = []
    for file_path in track(file_paths, description="Loading documents..."):
        logger.debug("Loading document", file=file_path)
        loader = JSONLoader(
            file_path,
            jq_schema=".",
            content_key="content",
            metadata_func=_metadata_func,
            text_content=False,
        )
        if (loaded_document := loader.load())[0].page_content != "":
            documents.extend(loaded_document)
    ...
    return documents
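
JSONLoader needs LangChain and the jq package installed. To see the same skip-if-empty logic in isolation, here is a standard-library sketch. `load_records` is a hypothetical stand-in that returns plain dicts rather than Document objects.

```python
import json
import tempfile
from pathlib import Path


def load_records(repo_contents_dir: str) -> list[dict]:
    """Read every *.json file in a directory, skipping records with no content."""
    records = []
    for path in sorted(Path(repo_contents_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        # Mirrors the page_content != "" check: empty text is not worth embedding.
        if record.get("content"):
            records.append(record)
    return records


# Demo against a throwaway directory holding two made-up records.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "full.json").write_text(
        json.dumps({"name": "a", "content": "a repo"}), encoding="utf-8"
    )
    Path(tmp, "empty.json").write_text(
        json.dumps({"name": "b", "content": ""}), encoding="utf-8"
    )
    loaded = load_records(tmp)

print([r["name"] for r in loaded])
```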

prepare_documents

The metadata_func argument tells JSONLoader how to extract metadata from each JSON record

def prepare_documents(
    repo_contents_dir: str = "./repo_content",
) -> List[Document]:
    ...
    documents = []
    for file_path in track(file_paths, description="Loading documents..."):
        logger.debug("Loading document", file=file_path)
        loader = JSONLoader(
            file_path,
            jq_schema=".",
            content_key="content",
            metadata_func=_metadata_func,
            text_content=False,
        )
        if (loaded_document := loader.load())[0].page_content != "":
            documents.extend(loaded_document)
    ...
    return documents

prepare_documents

The _metadata_func function copies selected fields from the JSON record into the Document metadata

def prepare_documents(
    repo_contents_dir: str = "./repo_content",
) -> List[Document]:

    def _metadata_func(record: dict, metadata: dict) -> dict:
        metadata["url"] = record.get("url")
        metadata["name"] = record.get("name")
        metadata["stargazerCount"] = record["stargazerCount"]
        if (primary_language := record.get("primaryLanguage")) is not None:
            metadata["primaryLanguage"] = primary_language
        if (description := record.get("description")) is not None:
            metadata["description"] = description
        if (topics := record.get("topics")) is not None:
            metadata["topics"] = " ".join(topics)
        if (languages := record.get("languages")) is not None:
            metadata["languages"] = " ".join(languages)

        return metadata

    ...

    return documents
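
To see what the metadata looks like for one record, here is a compact standalone equivalent run against a cut-down copy of the example JSON. The starting `{"source": ...}` dict is a hypothetical stand-in for whatever the loader passes in. Joining the lists into strings is presumably there because the vector store expects scalar metadata values; that is my reading, not something the slides state.

```python
def metadata_func(record: dict, metadata: dict) -> dict:
    """Copy selected repo fields into the metadata dict."""
    metadata["url"] = record.get("url")
    metadata["name"] = record.get("name")
    metadata["stargazerCount"] = record["stargazerCount"]
    # Optional scalar fields are copied only when present.
    for key in ("primaryLanguage", "description"):
        if (value := record.get(key)) is not None:
            metadata[key] = value
    # Optional list fields are flattened into one space-separated string.
    for key in ("topics", "languages"):
        if (values := record.get(key)) is not None:
            metadata[key] = " ".join(values)
    return metadata


record = {
    "name": "langchain",
    "url": "https://github.com/langchain-ai/langchain",
    "stargazerCount": 77908,
    "primaryLanguage": "Python",
    "languages": ["Python", "Makefile", "HTML"],
}

metadata = metadata_func(record, {"source": "langchain.json"})
print(metadata)
```

The example record has no topics key, so no topics entry appears in the result.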

Using the prepare_documents function

prepare_documents is used to load the Document objects into Chroma

from langchain_community.vectorstores import Chroma
from langchain_openai.embeddings import OpenAIEmbeddings

Chroma.from_documents(
    documents=utils.prepare_documents(),
    embedding=OpenAIEmbeddings(model="text-embedding-3-large"),
    persist_directory="./vectorstore-chroma",
)

Using the prepare_documents function

Chroma.from_documents also needs an embedding object and a persist_directory

from langchain_community.vectorstores import Chroma
from langchain_openai.embeddings import OpenAIEmbeddings

Chroma.from_documents(
    documents=utils.prepare_documents(),
    embedding=OpenAIEmbeddings(model="text-embedding-3-large"),
    persist_directory="./vectorstore-chroma",
)

Accessing the Vector Store

Semantic similarity search

Create a retriever object from the vector store and query it

def shoot(
    vectorstore_path: str,
    k: int,
    # query comes before method: parameters without defaults must precede those with defaults
    query: str,
    method: SearchMethods = SearchMethods.similarity,
) -> List[Document]:
    """
    Create a retriever from a vectorstore and query it
    """
    retriever = Chroma(
        persist_directory=vectorstore_path,
        embedding_function=OpenAIEmbeddings(
            model="text-embedding-3-large"
        ),
    ).as_retriever(
        search_type=method,
        search_kwargs={
            "k": k,
        },
    )
    return retriever.get_relevant_documents(query)
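
Conceptually, a similarity retriever with k results ranks the stored vectors against the query vector and keeps the best k. A minimal sketch of that idea follows. It is not how Chroma is implemented, and the two-dimensional vectors and names are invented.

```python
import heapq
import math


def normalise(vec: list[float]) -> list[float]:
    """Scale a non-zero vector to unit length so a dot product equals cosine similarity."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]


def top_k(query_vec: list[float], store: dict[str, list[float]], k: int) -> list[str]:
    """Return the names of the k stored vectors most similar to the query."""
    q = normalise(query_vec)
    scored = (
        (sum(a * b for a, b in zip(q, normalise(vec))), name)
        for name, vec in store.items()
    )
    return [name for _, name in heapq.nlargest(k, scored)]


# Hypothetical store: names mapped to made-up embeddings.
store = {
    "pandas": [0.9, 0.1],
    "polars": [0.8, 0.3],
    "bamboo": [0.1, 0.9],
}

print(top_k([1.0, 0.2], store, k=2))
```

Real vector stores use approximate indexes to avoid scoring every vector, but the ranking idea is the same.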

Self Querying

Self Querying

Self-querying uses an LLM to turn a natural-language question into a semantic query plus a metadata filter, so search results are pre-filtered by metadata fields

def astrologer(
    query: str,
    k: Optional[int] = typer.Option(
        4, help="Number of results to fetch from the vectorstore"
    ),
) -> List[Document]:
    """
    A self-query of the vectorstore that allows the user to search for a repo while filtering by attributes

    Example:
    ```
    starpilot astrologer "What can I use to build a web app with Python?"
    starpilot astrologer "Suggest some Rust machine learning crates"
    ```

    """
    metadata_field_info = [
        AttributeInfo(
            name="languages",
            description="the programming languages of a repo. Example: ['python', 'R', 'Rust']",
            type="string",
        ),
        AttributeInfo(
            name="name",
            description="the name of a repository. Example: 'langchain'",
            type="string",
        ),
        AttributeInfo(
            name="topics",
            description="the topics a repository is tagged with. Example: ['data-science', 'machine-learning', 'web-development', 'tidyverse']",
            type="string",
        ),
        AttributeInfo(
            name="url",
            description="the url of a repository on GitHub",
            type="string",
        ),
        AttributeInfo(
            name="stargazerCount",
            description="the number of stars a repository has on GitHub",
            type="number",
        ),
    ]

    document_content_description = "content describing a repository on GitHub"

    prompt = get_query_constructor_prompt(
        document_content_description,
        metadata_field_info,
        examples=[
            (
                "Python machine learning repos",
                {
                    "query": "machine learning",
                    "filter": 'eq("primaryLanguage", "Python")',
                },
            ),
            (
                "Rust Dataframe crates",
                {"query": "data frame", "filter": 'eq("primaryLanguage", "Rust")'},
            ),
            (
                "What R packages do time series analysis",
                {"query": "time series", "filter": 'eq("primaryLanguage", "R")'},
            ),
            (
                "data frame packages with 100 stars or more",
                {
                    "query": "data frame",
                    "filter": 'gte("stargazerCount", 100)',
                },
            ),
        ],
        allowed_comparators=[
            Comparator.EQ,
            Comparator.NE,
            Comparator.GT,
            Comparator.GTE,
            Comparator.LT,
            Comparator.LTE,
        ],
    )

    llm = ChatOpenAI(model="gpt-3.5-turbo",)

    output_parser = StructuredQueryOutputParser.from_components()

    query_constructor = prompt | llm | output_parser

    vectorstore = Chroma(
        persist_directory="./vectorstore-chroma",
        embedding_function=OpenAIEmbeddings(model="text-embedding-3-large"),
    )

    retriever = SelfQueryRetriever(
        query_constructor=query_constructor,
        vectorstore=vectorstore,
        structured_query_translator=ChromaTranslator(),
        search_kwargs={"k": k},
    )

    results = retriever.invoke(query)

    return results
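
The filter half of a self-query can be sketched without an LLM or a vector store. The `Filter` class and `apply_filter` helper are hypothetical, the comparator names match the ones allowed above, and the star counts are made up.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable

# Comparator names as they appear in the filter strings, e.g. gte("stargazerCount", 100).
COMPARATORS: dict[str, Callable[[Any, Any], bool]] = {
    "eq": operator.eq,
    "ne": operator.ne,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


@dataclass
class Filter:
    comparator: str
    attribute: str
    value: Any


def apply_filter(repos: list[dict], flt: Filter) -> list[dict]:
    """Keep repos whose metadata satisfies the filter; repos missing the attribute are dropped."""
    compare = COMPARATORS[flt.comparator]
    return [
        r for r in repos
        if flt.attribute in r and compare(r[flt.attribute], flt.value)
    ]


# Made-up metadata; the star counts are illustrative only.
repos = [
    {"name": "pandas", "primaryLanguage": "Python", "stargazerCount": 40000},
    {"name": "polars", "primaryLanguage": "Rust", "stargazerCount": 25000},
    {"name": "dplyr", "primaryLanguage": "R", "stargazerCount": 4000},
]

python_only = apply_filter(repos, Filter("eq", "primaryLanguage", "Python"))
popular = apply_filter(repos, Filter("gte", "stargazerCount", 10000))
print([r["name"] for r in python_only])
print([r["name"] for r in popular])
```

In the real retriever the LLM produces the filter from the question, and the semantic search then runs only over the repos that pass it.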

Self Querying

Create a list of AttributeInfo objects that describe the metadata fields

def astrologer(
    query: str,
    k: Optional[int] = typer.Option(
        4, help="Number of results to fetch from the vectorstore"
    ),
) -> List[Document]:
    ...
    metadata_field_info = [
        AttributeInfo(
            name="languages",
            description="the programming languages of a repo. Example: ['python', 'R', 'Rust']",
            type="string",
        ),
        AttributeInfo(
            name="name",
            description="the name of a repository. Example: 'langchain'",
            type="string",
        ),
        AttributeInfo(
            name="topics",
            description="the topics a repository is tagged with. Example: ['data-science', 'machine-learning', 'web-development', 'tidyverse']",
            type="string",
        ),
        AttributeInfo(
            name="url",
            description="the url of a repository on GitHub",
            type="string",
        ),
        AttributeInfo(
            name="stargazerCount",
            description="the number of stars a repository has on GitHub",
            type="number",
        ),
    ]
    ...

    return results

Self Querying

Construct the query prompt with get_query_constructor_prompt, which returns a BasePromptTemplate object

def astrologer(
    query: str,
    k: Optional[int] = typer.Option(
        4, help="Number of results to fetch from the vectorstore"
    ),
) -> List[Document]:
    ...
    document_content_description = "content describing a repository on GitHub"

    prompt = get_query_constructor_prompt(
        document_content_description,
        metadata_field_info,
        examples=[
            (
                "Python machine learning repos",
                {
                    "query": "machine learning",
                    "filter": 'eq("primaryLanguage", "Python")',
                },
            ),
            (
                "Rust Dataframe crates",
                {"query": "data frame", "filter": 'eq("primaryLanguage", "Rust")'},
            ),
            (
                "What R packages do time series analysis",
                {"query": "time series", "filter": 'eq("primaryLanguage", "R")'},
            ),
            (
                "data frame packages with 100 stars or more",
                {
                    "query": "data frame",
                    "filter": 'gte("stargazerCount", 100)',
                },
            ),
        ],
        allowed_comparators=[
            Comparator.EQ,
            Comparator.NE,
            Comparator.GT,
            Comparator.GTE,
            Comparator.LT,
            Comparator.LTE,
        ],
    )
    ...

    return results

Self Querying

Use LangChain Expression Language (LCEL) to chain the prompt, LLM and output parser into a query constructor

def astrologer(
    query: str,
    k: Optional[int] = typer.Option(
        4, help="Number of results to fetch from the vectorstore"
    ),
) -> List[Document]:
    ...
    llm = ChatOpenAI(model="gpt-3.5-turbo",)

    output_parser = StructuredQueryOutputParser.from_components()

    query_constructor = prompt | llm | output_parser
    ...

    return results
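
The pipe syntax in prompt | llm | output_parser is ordinary Python operator overloading. A toy sketch of the idea follows. The `Step` class is hypothetical and far simpler than LangChain's real runnables, and the three steps are stand-ins that call no model.

```python
from typing import Any, Callable


class Step:
    """A callable unit that can be chained to another Step with the | operator."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # a | b builds a new Step that runs a, then feeds its output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)


# Stand-ins for the three real components; no API calls are made.
prompt = Step(lambda question: f"Extract query and filter from: {question}")
llm = Step(lambda text: text.upper())
output_parser = Step(lambda text: {"raw": text})

chain = prompt | llm | output_parser
print(chain.invoke("Rust dataframe crates"))
```

Each step only needs to accept the previous step's output, which is why the three components can be swapped independently.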

Self Querying

Create a SelfQueryRetriever object from the QueryConstructor object and the Chroma object

def astrologer(
    query: str,
    k: Optional[int] = typer.Option(
        4, help="Number of results to fetch from the vectorstore"
    ),
) -> List[Document]:
    ...
    vectorstore = Chroma(
        persist_directory="./vectorstore-chroma",
        embedding_function=OpenAIEmbeddings(model="text-embedding-3-large"),
    )

    retriever = SelfQueryRetriever(
        query_constructor=query_constructor,
        vectorstore=vectorstore,
        structured_query_translator=ChromaTranslator(),
        search_kwargs={"k": k},
    )

    results = retriever.invoke(query)

    return results

Demo astrologer function

Evaluation

What I liked

  • LangChain is “extensively” documented
  • LangChain has company-backed videos and tutorials
  • LangChain is a suitable orchestration framework
    • It’s like scikit-learn for LLMs

What I didn’t like

  • Heavily OOP
  • Reading the source code is still necessary
    • kwargs are often silently ignored and undocumented
    • ‘Wrappers’ and ‘Connectors’ don’t implement the full API
  • Moves fast and breaks things (@v0.1.*)
  • Observability is secondary

Conclusion

Outcome for developers

  • starpilot is available now: DaveParr/starpilot
    • Star it, fork it, PR it!
  • LangChain is also available now: langchain-ai/langchain
  • This talk is also also available now: DaveParr/starpilot-presentation

Outcome for me

  • I have a working prototype
  • I have a better understanding of the LangChain library
  • I’m applying this understanding to a similar project for Magic: The Gathering, now underway

Thanks