`starpilot`

A tool to suggest GitHub repos from your user's stars

Introduction

Goals for the talk

Share an AI project I’ve been working on
Share some of the design decisions I made to build it
Share some of the experience I had building it
Share some of the code I wrote to build it

Level Set

🙌 🙌🏻 🙌🏼 🙌🏽 🙌🏾 🙌🏿 🙌🏿

Retrieval Augmented Generation
Vector Embedding
Vector Store
Vector Similarity

Level Set

Retrieval Augmented Generation
- Using a language model supported by a vector store
Vector Embedding
- Converting text into a list of numbers to represent semantic meaning
Vector Store
- Database optimized for storing embedding vectors
Vector Similarity
- Quantifying the similarity between embedding vectors
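
To make "vector similarity" concrete, here is a minimal sketch of cosine similarity, one common way to compare embedding vectors. The three-dimensional vectors are invented for illustration; real embedding models return hundreds or thousands of dimensions, and the variable names are assumptions, not part of starpilot.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, made up for this example only.
cool_cats = [0.9, 0.1, 0.3]
chill_felines = [0.8, 0.2, 0.4]
tax_return = [0.1, 0.9, 0.0]

similar = cosine_similarity(cool_cats, chill_felines)
dissimilar = cosine_similarity(cool_cats, tax_return)
print(f"cool cats vs chill felines: {similar:.2f}")
print(f"cool cats vs tax return:    {dissimilar:.2f}")
```

A score near 1 means the vectors point in almost the same direction; a score near 0 means they are close to unrelated.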

Level Set

🙌 🙌🏻 🙌🏼 🙌🏽 🙌🏾 🙌🏿 🙌🏿

Who is comfortable reading Python?

The Project

Goals for an AI project

As a Data Scientist
I want to build something novel with AI
So that I can learn and experiment with the technology
And maybe make something useful for other people

The Project is `starpilot`

A tool to suggest GitHub repos from your user's stars based on semantic similarity

Demo 1

Why is ‘semantic similarity’ important?

Keyword search tests “how close a string is to another string”
- Exact Match: cool cats != chill felines
- Partial Match: cool cat partially fits cooler cats
Semantic search tests “how similar in meaning is the phrase to another phrase”
- cool cats is equivalent in meaning to chill felines, and also to groovy kitties, suave Maine Coon and Sphynx with style
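
The keyword side of this comparison can be sketched with the standard library alone. This is an illustration of exact and partial string matching, not code from starpilot; `exact_match` and `partial_match_score` are hypothetical helper names.

```python
from difflib import SequenceMatcher


def exact_match(a: str, b: str) -> bool:
    """Exact keyword match, ignoring case."""
    return a.casefold() == b.casefold()


def partial_match_score(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; it knows nothing about meaning."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


print(exact_match("cool cats", "chill felines"))
print(partial_match_score("cool cat", "cooler cats"))
print(partial_match_score("cool cats", "chill felines"))
```

The partial match scores "cool cat" against "cooler cats" highly because the characters overlap, while "chill felines" scores poorly despite meaning the same thing. Closing that gap is what semantic search is for.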

Why is this data suitable for ‘semantic’ but not ‘keyword’ search?

GitHub stars are public and common across languages/stacks/specialisms
GitHub repos are text rich data
GitHub repos are a ‘weak standard’
- Topics are free text
- Descriptions are free text

An Intuitive Example

Polars vs Pandas

data-analysis
flexible
alignment
python
data-science

dataframe-library
dataframe
dataframes
arrow
python
out-of-core

Polars vs Pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

Dataframes powered by a multithreaded, vectorized query engine, written in Rust

Polars vs Pandas

Embedding Example

“polars module”
“pandas package”
“panda breeding”
“polar environment”

Embedding Example

flowchart LR

subgraph strings
pm("polars module")
pp("pandas package")
pb("panda breeding")
pe("polar environment")
end

subgraph embedder
embed([function])

pm --> embed
pp --> embed
pb --> embed
pe --> embed
end

subgraph vector
pmv("0.24, 0.12")
ppv("0.91, 0.83")
pbv("0.83, 0.12")
pev("0.91, 0.24")

embed --> pmv
embed --> ppv
embed --> pbv
embed --> pev
end

Embedding Example

First Experiment

Solution Requirements

Solution Architecture

Similarity Search

flowchart LR

subgraph 0 Resources
    GH([GitHub Repo])
    DB[(Chroma)]
    LLM([LLM])
end

subgraph 1 ETL
    raw(JSON)
    docs_in(Docs)
    vec(Embedding)

    GH--->raw
    raw-->docs_in
    docs_in-->LLM
    LLM-->vec
    vec-->DB
end

subgraph 2 Execution
    cli([cli])
    q(query)
    LLM([LLM])
    q_vec(Query Vector)
    docs_out(Document)
    a(suggestions)

    cli-->q
    q -->LLM
    LLM -->q_vec
    q_vec -->DB
    DB-->docs_out
    docs_out-->a
end

Outcome

IT’S ALIVE
- Data gets read
- Read data gets embedded
- Embedded data gets stored
- Embedded data gets queried

Outcome

IT STINKS
- Reading data is punishingly slow
  - 200 stars in 30 minutes
- Querying data is basic
  - Semantic search works, but is broad
  - ‘DataFrames for Python’ returns Pandas, and Tidyverse in R, and DataTables in JavaScript

Refined Version

flowchart LR

LLM([LLM])
GH([GitHub Repo])
raw(raw_GH/*.json)
prepped(enhanced_GH/*.json)
style prepped fill:red
docs_in(Documents)
vec_doc(Vector Document)
DB[(Chroma)]
self_query(Self Query)
vec_q(Vector Embedding)
db_filter(Filter Value)
cli([cli])
q(query)
docs_out(Document)
a(suggestions)

subgraph 0 Resources
    GH
    DB
    LLM
end

subgraph 1 ETL
    GH--->raw
    raw-->prepped
    prepped-->docs_in
    docs_in-->LLM
    LLM-->vec_doc
    vec_doc-->DB
end

subgraph 2 Execution
    LLM-->self_query
    self_query-->db_filter
    style self_query fill:red
    self_query-->vec_q
    db_filter-->DB
    style db_filter fill:red
    vec_q --> DB
    cli-->q
    q -->LLM
    DB-->docs_out
    docs_out-->a
end

What this gets us

Faster Data reads from GitHub
- 20x
Enhanced data preprocessing
- Tags, Stars, Primary Language
Selective data load
- ‘Popular’ repos
Self Querying
- Language specific results

Example Data

Example JSON in

{
    "name": "langchain",
    "nameWithOwner": "langchain-ai/langchain",
    "url": "https://github.com/langchain-ai/langchain",
    "homepageUrl": "https://python.langchain.com",
    "description": "\ud83e\udd9c\ud83d\udd17 Build context-aware reasoning applications",
    "stargazerCount": 77908,
    "primaryLanguage": "Python",
    "languages": [
        "Python",
        "Makefile",
        "HTML",
        "Dockerfile",
        "TeX",
        "JavaScript",
        "Shell",
        "XSLT",
        "Jupyter Notebook",
        "MDX"
    ],
    "owner": "langchain-ai",
    "content": "langchain \ud83e\udd9c\ud83d\udd17 Build context-aware reasoning applications Python"
}
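
In the example above, the `content` field looks like the repo name, description and primary language joined with spaces. The sketch below reproduces that, but it is inferred from this single record; `build_content` is a hypothetical helper, not necessarily how starpilot builds the field.

```python
import json

# A trimmed copy of the example record, kept as raw JSON so the
# \uXXXX escapes are decoded the same way as when reading the file.
raw = r"""
{
    "name": "langchain",
    "description": "\ud83e\udd9c\ud83d\udd17 Build context-aware reasoning applications",
    "primaryLanguage": "Python",
    "content": "langchain \ud83e\udd9c\ud83d\udd17 Build context-aware reasoning applications Python"
}
"""
repo = json.loads(raw)


def build_content(record: dict) -> str:
    """Join the text-rich fields into one string to embed (assumed recipe)."""
    parts = [
        record.get("name"),
        record.get("description"),
        record.get("primaryLanguage"),
    ]
    # Skip fields that are missing or empty so no stray spaces appear.
    return " ".join(part for part in parts if part)


print(build_content(repo) == repo["content"])
```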

The `prepare_documents` function

`prepare_documents`

The prepare_documents function reads the JSON files and creates Document objects

def prepare_documents(
    repo_contents_dir: str = "./repo_content",
) -> List[Document]:
    """
    Prepare the documents for ingestion into the vectorstore
    """
    ...
    return documents

`prepare_documents`

Get the file paths for each json file for each repo

def prepare_documents(
    repo_contents_dir: str = "./repo_content",
) -> List[Document]:
    """
    Prepare the documents for ingestion into the vectorstore
    """

    file_paths = []
    for file in os.listdir(repo_contents_dir):
        file_paths.append(os.path.join(repo_contents_dir, file))

    ...

    return documents

`prepare_documents`

Load the Document objects from the json files

def prepare_documents(
    repo_contents_dir: str = "./repo_content",
) -> List[Document]:
    ...
    documents = []
    for file_path in track(file_paths, description="Loading documents..."):
        logger.debug("Loading document", file=file_path)
        loader = JSONLoader(
            file_path,
            jq_schema=".",
            content_key="content",
            metadata_func=_metadata_func,
            text_content=False,
        )
        if (loaded_document := loader.load())[0].page_content != "":
            documents.extend(loaded_document)
    ...
    return documents

`prepare_documents`

The metadata_func argument tells JSONLoader how to extract metadata from each JSON file

def prepare_documents(
    repo_contents_dir: str = "./repo_content",
) -> List[Document]:
    ...
    documents = []
    for file_path in track(file_paths, description="Loading documents..."):
        logger.debug("Loading document", file=file_path)
        loader = JSONLoader(
            file_path,
            jq_schema=".",
            content_key="content",
            metadata_func=_metadata_func,
            text_content=False,
        )
        if (loaded_document := loader.load())[0].page_content != "":
            documents.extend(loaded_document)
    ...
    return documents

`prepare_documents`

The _metadata_func function is used to extract metadata from the json file

def prepare_documents(
    repo_contents_dir: str = "./repo_content",
) -> List[Document]:

    def _metadata_func(record: dict, metadata: dict) -> dict:
        metadata["url"] = record.get("url")
        metadata["name"] = record.get("name")
        metadata["stargazerCount"] = record["stargazerCount"]
        if (primary_language := record.get("primaryLanguage")) is not None:
            metadata["primaryLanguage"] = primary_language
        if (description := record.get("description")) is not None:
            metadata["description"] = description
        if (topics := record.get("topics")) is not None:
            metadata["topics"] = " ".join(topics)
        if (languages := record.get("languages")) is not None:
            metadata["languages"] = " ".join(languages)

        return metadata

    ...

    return documents
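
To see what `_metadata_func` produces, here is a standalone copy of it run against a trimmed version of the example record. The starting `metadata` dict is a stand-in for whatever the loader passes in, and its `source` value is made up. Joining the lists into strings is presumably needed because Chroma metadata values are scalars.

```python
def _metadata_func(record: dict, metadata: dict) -> dict:
    """Copy selected fields from the JSON record into the document metadata."""
    metadata["url"] = record.get("url")
    metadata["name"] = record.get("name")
    metadata["stargazerCount"] = record["stargazerCount"]
    if (primary_language := record.get("primaryLanguage")) is not None:
        metadata["primaryLanguage"] = primary_language
    if (description := record.get("description")) is not None:
        metadata["description"] = description
    if (topics := record.get("topics")) is not None:
        metadata["topics"] = " ".join(topics)
    if (languages := record.get("languages")) is not None:
        metadata["languages"] = " ".join(languages)
    return metadata


# Trimmed version of the example record; note it has no "topics" key.
record = {
    "name": "langchain",
    "url": "https://github.com/langchain-ai/langchain",
    "stargazerCount": 77908,
    "primaryLanguage": "Python",
    "languages": ["Python", "Makefile", "HTML"],
}

# Hypothetical starting metadata; the path is invented for illustration.
metadata = _metadata_func(record, {"source": "repo_content/langchain.json"})
print(metadata)
```

Optional fields that are absent from the record, such as `topics` here, are left out of the metadata rather than stored as `None`.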

Using the `prepare_documents` function

prepare_documents is used to load the Document objects into `Chroma`

from langchain_community.vectorstores import Chroma
from langchain_openai.embeddings import OpenAIEmbeddings

Chroma.from_documents(
    documents=utils.prepare_documents(),
    embedding=OpenAIEmbeddings(model="text-embedding-3-large"),
    persist_directory="./vectorstore-chroma",
)

Using the `prepare_documents` function

Chroma.from_documents also needs an embedding object and a persist_directory

from langchain_community.vectorstores import Chroma
from langchain_openai.embeddings import OpenAIEmbeddings

Chroma.from_documents(
    documents=utils.prepare_documents(),
    embedding=OpenAIEmbeddings(model="text-embedding-3-large"),
    persist_directory="./vectorstore-chroma",
)

Accessing the Vector Store

Semantic similarity search

Semantic search accepts a query string and returns the most relevant Document objects

def shoot(
    vectorstore_path: str,
    k: int,
    query: str,
    method: SearchMethods = SearchMethods.similarity,
) -> List[Document]:
    """
    Create a retriever from a vectorstore and query it
    """

Semantic similarity search

Create a retriever object from the vector store and query it

def shoot(
    vectorstore_path: str,
    k: int,
    query: str,
    method: SearchMethods = SearchMethods.similarity,
) -> List[Document]:
    """
    Create a retriever from a vectorstore and query it
    """
    retriever = Chroma(
        persist_directory=vectorstore_path,
        embedding_function=OpenAIEmbeddings(
            model="text-embedding-3-large"
        ),
    ).as_retriever(
        search_type=method,
        search_kwargs={
            "k": k,
        },
    )
    return retriever.get_relevant_documents(query)
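
Conceptually, the retriever embeds the query and returns the `k` stored vectors closest to it. The sketch below shows that idea as a brute-force search over an in-memory dict. It is not how Chroma is implemented, and the two-dimensional vectors are invented for illustration.

```python
import heapq
import math


def top_k(query_vec: list[float], store: dict[str, list[float]], k: int) -> list[tuple[float, str]]:
    """Return the k (score, name) pairs with the highest cosine similarity."""

    def cosine(a: list[float], b: list[float]) -> float:
        # Assumes non-zero vectors of equal length.
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    scored = ((cosine(query_vec, vec), name) for name, vec in store.items())
    return heapq.nlargest(k, scored)


# Hypothetical embeddings for three repos and one query.
store = {
    "pandas": [0.9, 0.8],
    "polars": [0.8, 0.9],
    "flask": [0.1, 0.9],
}
query_vec = [1.0, 0.8]

for score, name in top_k(query_vec, store, k=2):
    print(f"{name}: {score:.3f}")
```

A real vector store typically replaces the linear scan with an approximate nearest-neighbour index, which is what keeps queries fast as the collection grows.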

Self Querying

Self-querying is a way to pre-filter the results of a semantic search by metadata fields

def astrologer(
    query: str,
    k: Optional[int] = typer.Option(
        4, help="Number of results to fetch from the vectorstore"
    ),
) -> List[Document]:
    """
    A self-query of the vectorstore that allows the user to search for a repo while filtering by attributes

    Example:
    ```
    starpilot astrologer "What can I use to build a web app with Python?"
    starpilot astrologer "Suggest some Rust machine learning crates"
    ```

    """
    metadata_field_info = [
        AttributeInfo(
            name="languages",
            description="the programming languages of a repo. Example: ['python', 'R', 'Rust']",
            type="string",
        ),
        AttributeInfo(
            name="name",
            description="the name of a repository. Example: 'langchain'",
            type="string",
        ),
        AttributeInfo(
            name="topics",
            description="the topics a repository is tagged with. Example: ['data-science', 'machine-learning', 'web-development', 'tidyverse']",
            type="string",
        ),
        AttributeInfo(
            name="url",
            description="the url of a repository on GitHub",
            type="string",
        ),
        AttributeInfo(
            name="stargazerCount",
            description="the number of stars a repository has on GitHub",
            type="number",
        ),
    ]

    document_content_description = "content describing a repository on GitHub"

    prompt = get_query_constructor_prompt(
        document_content_description,
        metadata_field_info,
        examples=[
            (
                "Python machine learning repos",
                {
                    "query": "machine learning",
                    "filter": 'eq("primaryLanguage", "Python")',
                },
            ),
            (
                "Rust Dataframe crates",
                {"query": "data frame", "filter": 'eq("primaryLanguage", "Rust")'},
            ),
            (
                "What R packages do time series analysis",
                {"query": "time series", "filter": 'eq("primaryLanguage", "R")'},
            ),
            (
                "data frame packages with 100 stars or more",
                {
                    "query": "data frame",
                    "filter": 'gte("stargazerCount", 100)',
                },
            ),
        ],
        allowed_comparators=[
            Comparator.EQ,
            Comparator.NE,
            Comparator.GT,
            Comparator.GTE,
            Comparator.LT,
            Comparator.LTE,
        ],
    )

    llm = ChatOpenAI(model="gpt-3.5-turbo",)

    output_parser = StructuredQueryOutputParser.from_components()

    query_constructor = prompt | llm | output_parser

    vectorstore = Chroma(
        persist_directory="./vectorstore-chroma",
        embedding_function=OpenAIEmbeddings(model="text-embedding-3-large"),
    )

    retriever = SelfQueryRetriever(
        query_constructor=query_constructor,
        vectorstore=vectorstore,
        structured_query_translator=ChromaTranslator(),
        search_kwargs={"k": k},
    )

    results = retriever.invoke(query)

    return results
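
The effect of the structured query can be sketched without LangChain: the LLM turns a question such as "Rust Dataframe crates" into a search string plus a metadata filter, and the filter narrows the candidates before the similarity search runs. Everything below is a simplified stand-in; `Filter`, `apply_filter` and the star counts are hypothetical.

```python
import operator
from dataclasses import dataclass

# Map the comparator names used in the prompt examples to Python operators.
COMPARATORS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


@dataclass
class Filter:
    comparator: str
    attribute: str
    value: object


def apply_filter(docs: list[dict], flt: Filter) -> list[dict]:
    """Keep only documents whose metadata satisfies the filter."""
    compare = COMPARATORS[flt.comparator]
    return [
        doc
        for doc in docs
        if flt.attribute in doc and compare(doc[flt.attribute], flt.value)
    ]


# Made-up metadata records; the star counts are not real figures.
docs = [
    {"name": "pandas", "primaryLanguage": "Python", "stargazerCount": 40000},
    {"name": "polars", "primaryLanguage": "Rust", "stargazerCount": 20000},
    {"name": "tinyframe", "primaryLanguage": "Rust", "stargazerCount": 12},
]

rust_only = apply_filter(docs, Filter("eq", "primaryLanguage", "Rust"))
popular = apply_filter(docs, Filter("gte", "stargazerCount", 100))
print([doc["name"] for doc in rust_only])
print([doc["name"] for doc in popular])
```

Only the documents that survive the filter are ranked by semantic similarity, which is how a query for Rust crates avoids returning Python or R libraries.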

Self Querying

Create a list of AttributeInfo objects that describe the metadata fields

def astrologer(
    query: str,
    k: Optional[int] = typer.Option(
        4, help="Number of results to fetch from the vectorstore"
    ),
) -> List[Document]:
    ...
    metadata_field_info = [
        AttributeInfo(
            name="languages",
            description="the programming languages of a repo. Example: ['python', 'R', 'Rust']",
            type="string",
        ),
        AttributeInfo(
            name="name",
            description="the name of a repository. Example: 'langchain'",
            type="string",
        ),
        AttributeInfo(
            name="topics",
            description="the topics a repository is tagged with. Example: ['data-science', 'machine-learning', 'web-development', 'tidyverse']",
            type="string",
        ),
        AttributeInfo(
            name="url",
            description="the url of a repository on GitHub",
            type="string",
        ),
        AttributeInfo(
            name="stargazerCount",
            description="the number of stars a repository has on GitHub",
            type="number",
        ),
    ]
    ...
    return results

Self Querying

Construct the query prompt using get_query_constructor_prompt, which returns a BasePromptTemplate object

def astrologer(
    query: str,
    k: Optional[int] = typer.Option(
        4, help="Number of results to fetch from the vectorstore"
    ),
) -> List[Document]:
    ...
    document_content_description = "content describing a repository on GitHub"

    prompt = get_query_constructor_prompt(
        document_content_description,
        metadata_field_info,
        examples=[
            (
                "Python machine learning repos",
                {
                    "query": "machine learning",
                    "filter": 'eq("primaryLanguage", "Python")',
                },
            ),
            (
                "Rust Dataframe crates",
                {"query": "data frame", "filter": 'eq("primaryLanguage", "Rust")'},
            ),
            (
                "What R packages do time series analysis",
                {"query": "time series", "filter": 'eq("primaryLanguage", "R")'},
            ),
            (
                "data frame packages with 100 stars or more",
                {
                    "query": "data frame",
                    "filter": 'gte("stargazerCount", 100)',
                },
            ),
        ],
        allowed_comparators=[
            Comparator.EQ,
            Comparator.NE,
            Comparator.GT,
            Comparator.GTE,
            Comparator.LT,
            Comparator.LTE,
        ],
    )
    ...
    return results

Self Querying

Use LangChain Expression Language in a chain to make a QueryConstructor object

def astrologer(
    query: str,
    k: Optional[int] = typer.Option(
        4, help="Number of results to fetch from the vectorstore"
    ),
) -> List[Document]:
    ...
    llm = ChatOpenAI(model="gpt-3.5-turbo",)

    output_parser = StructuredQueryOutputParser.from_components()

    query_constructor = prompt | llm | output_parser
    ...
    return results

Self Querying

Create a SelfQueryRetriever object from the QueryConstructor object and the Chroma object

def astrologer(
    query: str,
    k: Optional[int] = typer.Option(
        4, help="Number of results to fetch from the vectorstore"
    ),
) -> List[Document]:
    ...
    vectorstore = Chroma(
        persist_directory="./vectorstore-chroma",
        embedding_function=OpenAIEmbeddings(model="text-embedding-3-large"),
    )

    retriever = SelfQueryRetriever(
        query_constructor=query_constructor,
        vectorstore=vectorstore,
        structured_query_translator=ChromaTranslator(),
        search_kwargs={"k": k},
    )

    results = retriever.invoke(query)

    return results

Demo `astrologer` function

Evaluation

What I liked

LangChain is “extensively” documented
LangChain has company-backed videos and tutorials
Langchain is a suitable orchestration framework
- It’s like scikit-learn for LLMs

What I didn’t like

Heavily OOP
Reading the source code is still necessary
- kwargs are often silently ignored and undocumented
- ‘Wrappers’ and ‘Connectors’ don’t implement the full API
Moves fast and breaks things (@v0.1.*)
Observability is secondary

heavily OOP
Must read source code to understand exactly what is and isn’t supported for each class
- Entirely unclear when kwargs are actually needed, or when they are silently ignored
- ‘Wrappers’ and ‘Connectors’ don’t implement the full API
Moves fast and breaks things (@v0.1.*)
- Documentation is not always up to date and often conflicting
- Huge structural changes (@v0.2.*) split into langchain and langchain_community
Observability is secondary
- LangChain is a company after all
- They are investing a lot into the langchain ecosystem, and AI/LLM community as a whole
- They do, however, still need to be financially sustainable, and they are attempting that by monetizing observability for production AI applications through LangSmith
- I have mixed feelings about that: it matters that LangChain doesn’t become a vast voluntary project that burns out its maintainers and lets the framework fail, but they are taking a very product-centric route to avoid that.
- As a side effect, I feel that developer logging and observability suffer.

Conclusion

Outcome for developers

starpilot is available now: DaveParr/starpilot
- Star it, fork it, PR it!
LangChain is also available now: langchain-ai/langchain
This talk is also also available now: DaveParr/starpilot-presentation

Outcome for me

I have a working prototype
I have a better understanding of the LangChain library
I’m applying this understanding to a similar project, now underway, for Magic: The Gathering

Thanks
