Azure Cosmos DB NoSQL
This notebook shows you how to leverage this integrated vector database to store documents in collections, create indexes, and perform vector search queries using approximate nearest neighbor search with distance metrics such as COS (cosine distance), L2 (Euclidean distance), and IP (inner product) to locate documents close to the query vectors.
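To make these metrics concrete, here is a small NumPy sketch of how each one compares two vectors (illustrative only; Cosmos DB computes distances server-side):

import numpy as np

# Two toy vectors; real embeddings have hundreds or thousands of dimensions
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.0])

cosine_distance = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # COS
euclidean_distance = np.linalg.norm(a - b)  # L2
inner_product = np.dot(a, b)  # IP (larger means more similar)

print(cosine_distance, euclidean_distance, inner_product)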
Azure Cosmos DB is the database that powers OpenAI's ChatGPT service. It offers single-digit millisecond response times, automatic and instant scalability, along with guaranteed speed at any scale.
Azure Cosmos DB for NoSQL now offers vector indexing and search in preview. This feature is designed to handle high-dimensional vectors, enabling efficient and accurate vector search at any scale. You can now store vectors directly in documents alongside your data. This means that each document in your database can contain not only traditional schema-free data, but also high-dimensional vectors as additional properties. This colocation of data and vectors allows for efficient indexing and searching, as the vectors are stored in the same logical unit as the data they represent. It simplifies data management and AI application architectures while improving the efficiency of vector-based operations.
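For example, a stored item might look like the following (a hypothetical document; the text and embedding property names match the policies configured later in this notebook):

# A hypothetical Cosmos DB item: schema-free data and its vector in one document
item = {
    "id": "doc-1",
    "text": "GPT-4 Technical Report ...",
    "metadata": {"source": "https://arxiv.org/pdf/2303.08774.pdf", "page": 0},
    "embedding": [0.013, -0.024, 0.041],  # truncated for display; real vectors have e.g. 1536 dimensions
}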
Please refer to the Azure Cosmos DB for NoSQL vector search documentation for more details.
Sign up for lifetime free access to get started today.
%pip install --upgrade --quiet azure-cosmos langchain-openai langchain-community pypdf
Note: you may need to restart the kernel to use updated packages.
OPENAI_API_KEY = ""
OPENAI_API_TYPE = "azure"
OPENAI_API_VERSION = "2024-07-01-preview"
OPENAI_API_BASE = ""
OPENAI_EMBEDDINGS_MODEL_NAME = "text-embedding-3-small"
OPENAI_EMBEDDINGS_MODEL_DEPLOYMENT = "text-embedding-3-small"
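If you prefer not to hardcode credentials in the notebook, you can read them from environment variables instead, e.g.:

import os

# Optional alternative: pull the secrets from the environment instead of hardcoding them
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
OPENAI_API_BASE = os.environ.get("OPENAI_API_BASE", "")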
Insert Data
from langchain_community.document_loaders import PyPDFLoader
# Load the PDF
loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
data = loader.load()
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
docs = text_splitter.split_documents(data)
print(docs[0])
page_content='GPT-4 Technical Report
OpenAI∗
Abstract
We report the development of GPT-4, a large-scale, multimodal model which can
accept image and text inputs and produce text outputs. While less capable than
humans in many real-world scenarios, GPT-4 exhibits human-level performance
on various professional and academic benchmarks, including passing a simulated
bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-
based model pre-trained to predict the next token in a document. The post-training
alignment process results in improved performance on measures of factuality and
adherence to desired behavior. A core component of this project was developing
infrastructure and optimization methods that behave predictably across a wide
range of scales. This allowed us to accurately predict some aspects of GPT-4’s
performance based on models trained with no more than 1/1,000th the compute of
GPT-4.
1 Introduction' metadata={'source': 'https://arxiv.org/pdf/2303.08774.pdf', 'page': 0}
Creating the Azure Cosmos DB NoSQL Vector Search
indexing_policy = {
    "indexingMode": "consistent",  # index items synchronously as they are written
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [{"path": '/"_etag"/?'}],
    # diskANN builds an approximate nearest neighbor index over the vector property
    "vectorIndexes": [{"path": "/embedding", "type": "diskANN"}],
    # a full text index on /text enables the full text and hybrid searches below
    "fullTextIndexes": [{"path": "/text"}],
}
vector_embedding_policy = {
"vectorEmbeddings": [
{
"path": "/embedding",
"dataType": "float32",
"distanceFunction": "cosine",
"dimensions": 1536,
}
]
}
full_text_policy = {
"defaultLanguage": "en-US",
"fullTextPaths": [{"path": "/text", "language": "en-US"}],
}
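The policies above pair cosine distance with a diskANN index. Cosmos DB for NoSQL also supports other distance functions and vector index types; the sketch below shows one alternative combination (consult the Azure documentation for the currently supported options):

# A sketch of an alternative configuration: Euclidean distance with a quantizedFlat index
alternative_vector_embedding_policy = {
    "vectorEmbeddings": [
        {
            "path": "/embedding",
            "dataType": "float32",
            "distanceFunction": "euclidean",
            "dimensions": 1536,
        }
    ]
}
alternative_vector_index = {"path": "/embedding", "type": "quantizedFlat"}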
from azure.cosmos import CosmosClient, PartitionKey
from langchain_community.vectorstores.azure_cosmos_db_no_sql import (
AzureCosmosDBNoSqlVectorSearch,
)
from langchain_openai import OpenAIEmbeddings
from pydantic import SecretStr
# Placeholders: replace with your Azure Cosmos DB account endpoint and key
HOST = "AZURE_COSMOS_DB_ENDPOINT"
KEY = "AZURE_COSMOS_DB_KEY"
cosmos_client = CosmosClient(HOST, KEY)
database_name = "langchain_python_db_notebook"
container_name = "langchain_python_container"
partition_key = PartitionKey(path="/id")
cosmos_container_properties = {"partition_key": partition_key}
openai_embeddings = OpenAIEmbeddings(
    deployment=OPENAI_EMBEDDINGS_MODEL_DEPLOYMENT,
    model=OPENAI_EMBEDDINGS_MODEL_NAME,
    chunk_size=1,
    # Pass the key defined above; wrapping the literal string "OPENAI_API_KEY" would fail
    openai_api_key=SecretStr(OPENAI_API_KEY),
)
# Insert the documents into Azure Cosmos DB NoSQL together with their embeddings
vector_search = AzureCosmosDBNoSqlVectorSearch.from_documents(
documents=docs,
embedding=openai_embeddings,
cosmos_client=cosmos_client,
database_name=database_name,
container_name=container_name,
vector_embedding_policy=vector_embedding_policy,
full_text_policy=full_text_policy,
indexing_policy=indexing_policy,
cosmos_container_properties=cosmos_container_properties,
cosmos_database_properties={},
vector_search_fields={"text_field": "text", "embedding_field": "embedding"},
full_text_search_enabled=True,
)
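Once the store exists, you can keep inserting documents through the standard LangChain VectorStore interface; a minimal sketch (with a made-up extra document):

from langchain_core.documents import Document

# Hypothetical extra document; its embedding is computed and stored automatically
extra_docs = [
    Document(page_content="A short note about GPT-4 compute.", metadata={"page": 99})
]
vector_search.add_documents(extra_docs)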
Vector Search
# Perform a similarity search between the embedding of the query and the embeddings of the documents
query = "What were the compute requirements for training GPT 4"
results = vector_search.similarity_search(query)
print(results[0].page_content)
performance based on models trained with no more than 1/1,000th the compute of
GPT-4.
1 Introduction
This technical report presents GPT-4, a large multimodal model capable of processing image and
text inputs and producing text outputs. Such models are an important area of study as they have the
potential to be used in a wide range of applications, such as dialogue systems, text summarization,
and machine translation. As such, they have been the subject of substantial interest and progress in
recent years [1–34].
One of the main goals of developing such models is to improve their ability to understand and generate
natural language text, particularly in more complex and nuanced scenarios. To test its capabilities
in such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In
these evaluations it performs quite well and often outscores the vast majority of human test takers.
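Under the hood, queries like this translate into Cosmos DB SQL using the VectorDistance system function. A rough sketch of the kind of query issued (the exact query LangChain builds may differ):

# Illustrative only: roughly what a vector search query looks like in Cosmos DB SQL
sql_sketch = """
SELECT TOP 5 c.text, VectorDistance(c.embedding, @queryEmbedding) AS similarityScore
FROM c
ORDER BY VectorDistance(c.embedding, @queryEmbedding)
"""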
Vector Search with Score
query = "What were the compute requirements for training GPT 4"
results = vector_search.similarity_search_with_score(
query=query,
k=5,
)
# Display results
for i in range(0, len(results)):
print(f"Result {i+1}: ", results[i][0].json())
print(f"Score {i+1}: ", results[i][1])
print("\n")
Result 1: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"5a9a248f-6885-4e07-8321-e416ecd01556"},"page_content":"performance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction\nThis technical report presents GPT-4, a large multimodal model capable of processing image and\ntext inputs and producing text outputs. Such models are an important area of study as they have the\npotential to be used in a wide range of applications, such as dialogue systems, text summarization,\nand machine translation. As such, they have been the subject of substantial interest and progress in\nrecent years [1–34].\nOne of the main goals of developing such models is to improve their ability to understand and generate\nnatural language text, particularly in more complex and nuanced scenarios. To test its capabilities\nin such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In\nthese evaluations it performs quite well and often outscores the vast majority of human test takers.","type":"Document"}
Score 1: 0.642735520879037
Result 2: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":1,"id":"64788ac7-2665-4987-994f-0086d701c909"},"page_content":"safety considerations above against the scientific value of further transparency.\n3 Predictable Scaling\nA large focus of the GPT-4 project was building a deep learning stack that scales predictably. The\nprimary reason is that for very large training runs like GPT-4, it is not feasible to do extensive\nmodel-specific tuning. To address this, we developed infrastructure and optimization methods that\nhave very predictable behavior across multiple scales. These improvements allowed us to reliably\npredict some aspects of the performance of GPT-4 from smaller models trained using 1,000×–\n10,000×less compute.\n3.1 Loss Prediction\nThe final loss of properly-trained large language models is thought to be well approximated by power\nlaws in the amount of compute used to train the model [41, 42, 2, 14, 15].\nTo verify the scalability of our optimization infrastructure, we predicted GPT-4’s final loss on our","type":"Document"}
Score 2: 0.6270494557311032
Result 3: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"9846e748-87fc-4f17-a8cb-9c11699d6158"},"page_content":"GPT-4 Technical Report\nOpenAI∗\nAbstract\nWe report the development of GPT-4, a large-scale, multimodal model which can\naccept image and text inputs and produce text outputs. While less capable than\nhumans in many real-world scenarios, GPT-4 exhibits human-level performance\non various professional and academic benchmarks, including passing a simulated\nbar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-\nbased model pre-trained to predict the next token in a document. The post-training\nalignment process results in improved performance on measures of factuality and\nadherence to desired behavior. A core component of this project was developing\ninfrastructure and optimization methods that behave predictably across a wide\nrange of scales. This allowed us to accurately predict some aspects of GPT-4’s\nperformance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction","type":"Document"}
Score 3: 0.6231760505455314
Result 4: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":28,"id":"ec2d87d9-6cf4-4f42-8759-1f55c31ecd2b"},"page_content":"overall GPT-4 training budget. When mixing in data from these math benchmarks, a portion of the\ntraining data was held back, so each individual training example may or may not have been seen by\nGPT-4 during training.\nWe conducted contamination checking to verify the test set for GSM-8K is not included in the training\nset (see Appendix D). We recommend interpreting the performance results reported for GPT-4\nGSM-8K in Table 2 as something in-between true few-shot transfer and full benchmark-specific\ntuning.\nF Multilingual MMLU\nWe translated all questions and answers from MMLU [ 49] using Azure Translate. We used an\nexternal model to perform the translation, instead of relying on GPT-4 itself, in case the model had\nunrepresentative performance for its own translations. We selected a range of languages that cover\ndifferent geographic regions and scripts, we show an example question taken from the astronomy","type":"Document"}
Score 4: 0.5950017702893886
Result 5: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":2,"id":"ea46ce2e-4c73-4a3f-8098-8ec6bb664f12"},"page_content":"Observed\nPrediction\ngpt-4\n100p 10n 1µ 100µ 0.01 1\nCompute1.02.03.04.05.06.0Bits per wordOpenAI codebase next word predictionFigure 1. Performance of GPT-4 and smaller models. The metric is final loss on a dataset derived\nfrom our internal codebase. This is a convenient, large dataset of code tokens which is not contained in\nthe training set. We chose to look at loss because it tends to be less noisy than other measures across\ndifferent amounts of training compute. A power law fit to the smaller models (excluding GPT-4) is\nshown as the dotted line; this fit accurately predicts GPT-4’s final loss. The x-axis is training compute\nnormalized so that GPT-4 is 1.\nObserved\nPrediction\ngpt-4\n1µ 10µ 100µ 0.001 0.01 0.1 1\nCompute012345– Mean Log Pass RateCapability prediction on 23 coding problems\nFigure 2. Performance of GPT-4 and smaller models. The metric is mean log pass rate on a subset of\nthe HumanEval dataset. A power law fit to the smaller models (excluding GPT-4) is shown as the dotted","type":"Document"}
Score 5: 0.586029243650397
Vector Search with Filtering
from langchain_community.vectorstores.azure_cosmos_db_no_sql import (
Condition,
CosmosDBQueryType,
PreFilter,
)
query = "What were the compute requirements for training GPT 4"
pre_filter = PreFilter(
conditions=[
Condition(property="metadata.page", operator="$eq", value=0),
]
)
results = vector_search.similarity_search_with_score(
query=query,
k=5,
pre_filter=pre_filter,
)
# Display results
for i in range(0, len(results)):
print(f"Result {i+1}: ", results[i][0].json())
print(f"Score {i+1}: ", results[i][1])
print("\n")
Result 1: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"4b0034fa-0d0e-46b3-9385-0582511eb28f"},"page_content":"performance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction\nThis technical report presents GPT-4, a large multimodal model capable of processing image and\ntext inputs and producing text outputs. Such models are an important area of study as they have the\npotential to be used in a wide range of applications, such as dialogue systems, text summarization,\nand machine translation. As such, they have been the subject of substantial interest and progress in\nrecent years [1–34].\nOne of the main goals of developing such models is to improve their ability to understand and generate\nnatural language text, particularly in more complex and nuanced scenarios. To test its capabilities\nin such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In\nthese evaluations it performs quite well and often outscores the vast majority of human test takers.","type":"Document"}
Score 1: 0.642735520879037
Result 2: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"87ec78d5-26e9-4eae-afd3-07935f706230"},"page_content":"GPT-4 Technical Report\nOpenAI∗\nAbstract\nWe report the development of GPT-4, a large-scale, multimodal model which can\naccept image and text inputs and produce text outputs. While less capable than\nhumans in many real-world scenarios, GPT-4 exhibits human-level performance\non various professional and academic benchmarks, including passing a simulated\nbar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-\nbased model pre-trained to predict the next token in a document. The post-training\nalignment process results in improved performance on measures of factuality and\nadherence to desired behavior. A core component of this project was developing\ninfrastructure and optimization methods that behave predictably across a wide\nrange of scales. This allowed us to accurately predict some aspects of GPT-4’s\nperformance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction","type":"Document"}
Score 2: 0.6231760505455314
Result 3: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"83fa9b0b-d25d-4b7f-bf79-0d39bf2a1033"},"page_content":"these evaluations it performs quite well and often outscores the vast majority of human test takers.\nFor example, on a simulated bar exam, GPT-4 achieves a score that falls in the top 10% of test takers.\nThis contrasts with GPT-3.5, which scores in the bottom 10%.\nOn a suite of traditional NLP benchmarks, GPT-4 outperforms both previous large language models\nand most state-of-the-art systems (which often have benchmark-specific training or hand-engineering).\nOn the MMLU benchmark [ 35,36], an English-language suite of multiple-choice questions covering\n57 subjects, GPT-4 not only outperforms existing models by a considerable margin in English, but\nalso demonstrates strong performance in other languages. On translated variants of MMLU, GPT-4\nsurpasses the English-language state-of-the-art in 24 of 26 languages considered. We discuss these\nmodel capability results, as well as model safety improvements and results, in more detail in later\nsections.","type":"Document"}
Score 3: 0.5690217547521872
Result 4: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"430bc9df-336b-4002-b67c-df76381131ad"},"page_content":"model capability results, as well as model safety improvements and results, in more detail in later\nsections.\nThis report also discusses a key challenge of the project, developing deep learning infrastructure and\noptimization methods that behave predictably across a wide range of scales. This allowed us to make\npredictions about the expected performance of GPT-4 (based on small runs trained in similar ways)\nthat were tested against the final run to increase confidence in our training.\nDespite its capabilities, GPT-4 has similar limitations to earlier GPT models [ 1,37,38]: it is not fully\nreliable (e.g. can suffer from “hallucinations”), has a limited context window, and does not learn\n∗Please cite this work as “OpenAI (2023)\". Full authorship contribution statements appear at the end of the\ndocument. Correspondence regarding this technical report can be sent to gpt4-report@openai.comarXiv:2303.08774v6 [cs.CL] 4 Mar 2024","type":"Document"}
Score 4: 0.5253629308670477
Full Text Search
query = "What were the compute requirements for training GPT 4"
pre_filter = PreFilter(
conditions=[
Condition(
property="text",
operator="$full_text_contains_any",
value="What were the compute requirements for training GPT 4",
),
],
)
results = vector_search.similarity_search_with_score(
query=query,
k=5,
query_type=CosmosDBQueryType.FULL_TEXT_SEARCH,
pre_filter=pre_filter,
)
# Display results
for i in range(0, len(results)):
print(f"Result {i+1}: ", results[i][0].json())
print("\n")
Result 1: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"87ec78d5-26e9-4eae-afd3-07935f706230"},"page_content":"GPT-4 Technical Report\nOpenAI∗\nAbstract\nWe report the development of GPT-4, a large-scale, multimodal model which can\naccept image and text inputs and produce text outputs. While less capable than\nhumans in many real-world scenarios, GPT-4 exhibits human-level performance\non various professional and academic benchmarks, including passing a simulated\nbar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-\nbased model pre-trained to predict the next token in a document. The post-training\nalignment process results in improved performance on measures of factuality and\nadherence to desired behavior. A core component of this project was developing\ninfrastructure and optimization methods that behave predictably across a wide\nrange of scales. This allowed us to accurately predict some aspects of GPT-4’s\nperformance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction","type":"Document"}
Result 2: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"4b0034fa-0d0e-46b3-9385-0582511eb28f"},"page_content":"performance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction\nThis technical report presents GPT-4, a large multimodal model capable of processing image and\ntext inputs and producing text outputs. Such models are an important area of study as they have the\npotential to be used in a wide range of applications, such as dialogue systems, text summarization,\nand machine translation. As such, they have been the subject of substantial interest and progress in\nrecent years [1–34].\nOne of the main goals of developing such models is to improve their ability to understand and generate\nnatural language text, particularly in more complex and nuanced scenarios. To test its capabilities\nin such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In\nthese evaluations it performs quite well and often outscores the vast majority of human test takers.","type":"Document"}
Result 3: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"83fa9b0b-d25d-4b7f-bf79-0d39bf2a1033"},"page_content":"these evaluations it performs quite well and often outscores the vast majority of human test takers.\nFor example, on a simulated bar exam, GPT-4 achieves a score that falls in the top 10% of test takers.\nThis contrasts with GPT-3.5, which scores in the bottom 10%.\nOn a suite of traditional NLP benchmarks, GPT-4 outperforms both previous large language models\nand most state-of-the-art systems (which often have benchmark-specific training or hand-engineering).\nOn the MMLU benchmark [ 35,36], an English-language suite of multiple-choice questions covering\n57 subjects, GPT-4 not only outperforms existing models by a considerable margin in English, but\nalso demonstrates strong performance in other languages. On translated variants of MMLU, GPT-4\nsurpasses the English-language state-of-the-art in 24 of 26 languages considered. We discuss these\nmodel capability results, as well as model safety improvements and results, in more detail in later\nsections.","type":"Document"}
Result 4: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"430bc9df-336b-4002-b67c-df76381131ad"},"page_content":"model capability results, as well as model safety improvements and results, in more detail in later\nsections.\nThis report also discusses a key challenge of the project, developing deep learning infrastructure and\noptimization methods that behave predictably across a wide range of scales. This allowed us to make\npredictions about the expected performance of GPT-4 (based on small runs trained in similar ways)\nthat were tested against the final run to increase confidence in our training.\nDespite its capabilities, GPT-4 has similar limitations to earlier GPT models [ 1,37,38]: it is not fully\nreliable (e.g. can suffer from “hallucinations”), has a limited context window, and does not learn\n∗Please cite this work as “OpenAI (2023)\". Full authorship contribution statements appear at the end of the\ndocument. Correspondence regarding this technical report can be sent to gpt4-report@openai.comarXiv:2303.08774v6 [cs.CL] 4 Mar 2024","type":"Document"}
Result 5: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":1,"id":"fa3aa45c-509b-4577-b86f-fff102df69ec"},"page_content":"from experience. Care should be taken when using the outputs of GPT-4, particularly in contexts\nwhere reliability is important.\nGPT-4’s capabilities and limitations create significant and novel safety challenges, and we believe\ncareful study of these challenges is an important area of research given the potential societal impact.\nThis report includes an extensive system card (after the Appendix) describing some of the risks we\nforesee around bias, disinformation, over-reliance, privacy, cybersecurity, proliferation, and more.\nIt also describes interventions we made to mitigate potential harms from the deployment of GPT-4,\nincluding adversarial testing with domain experts, and a model-assisted safety pipeline.\n2 Scope and Limitations of this Technical Report\nThis report focuses on the capabilities, limitations, and safety properties of GPT-4. GPT-4 is a\nTransformer-style model [ 39] pre-trained to predict the next token in a document, using both publicly","type":"Document"}
Full Text Search with BM25 Ranking
Instead of filtering on exact term matches, this query type ranks documents by BM25 relevance to the supplied search text.
query = "What were the compute requirements for training GPT 4"
full_text_rank_filter = [
{
"search_field": "text",
"search_text": "What were the compute requirements for training GPT 4",
}
]
results = vector_search.similarity_search_with_score(
query=query,
k=5,
query_type=CosmosDBQueryType.FULL_TEXT_RANK,
full_text_rank_filter=full_text_rank_filter,
)
# Display results
for i in range(0, len(results)):
print(f"Result {i+1}: ", results[i][0].json())
print("\n")
Result 1: {"id":null,"metadata":{"id":"f81d994b-bd4e-4471-905f-841ac529584d"},"page_content":"the HumanEval dataset. A power law fit to the smaller models (excluding GPT-4) is shown as the dotted\nline; this fit accurately predicts GPT-4’s performance. The x-axis is training compute normalized so that\nGPT-4 is 1.\n3","type":"Document"}
Result 2: {"id":null,"metadata":{"id":"b8117761-b5ec-473d-a818-dd5f7dda75ac"},"page_content":"safety considerations above against the scientific value of further transparency.\n3 Predictable Scaling\nA large focus of the GPT-4 project was building a deep learning stack that scales predictably. The\nprimary reason is that for very large training runs like GPT-4, it is not feasible to do extensive\nmodel-specific tuning. To address this, we developed infrastructure and optimization methods that\nhave very predictable behavior across multiple scales. These improvements allowed us to reliably\npredict some aspects of the performance of GPT-4 from smaller models trained using 1,000×–\n10,000×less compute.\n3.1 Loss Prediction\nThe final loss of properly-trained large language models is thought to be well approximated by power\nlaws in the amount of compute used to train the model [41, 42, 2, 14, 15].\nTo verify the scalability of our optimization infrastructure, we predicted GPT-4’s final loss on our","type":"Document"}
Result 3: {"id":null,"metadata":{"id":"96549f60-a72e-42fe-9dec-772cfe0ddd32"},"page_content":"Observed\nPrediction\ngpt-4\n100p 10n 1µ 100µ 0.01 1\nCompute1.02.03.04.05.06.0Bits per wordOpenAI codebase next word predictionFigure 1. Performance of GPT-4 and smaller models. The metric is final loss on a dataset derived\nfrom our internal codebase. This is a convenient, large dataset of code tokens which is not contained in\nthe training set. We chose to look at loss because it tends to be less noisy than other measures across\ndifferent amounts of training compute. A power law fit to the smaller models (excluding GPT-4) is\nshown as the dotted line; this fit accurately predicts GPT-4’s final loss. The x-axis is training compute\nnormalized so that GPT-4 is 1.\nObserved\nPrediction\ngpt-4\n1µ 10µ 100µ 0.001 0.01 0.1 1\nCompute012345– Mean Log Pass RateCapability prediction on 23 coding problems\nFigure 2. Performance of GPT-4 and smaller models. The metric is mean log pass rate on a subset of\nthe HumanEval dataset. A power law fit to the smaller models (excluding GPT-4) is shown as the dotted","type":"Document"}
Result 4: {"id":null,"metadata":{"id":"98f12229-18ee-4407-96f8-b64e3c99aac5"},"page_content":"which measures the ability to synthesize Python functions of varying complexity. We successfully\npredicted the pass rate on a subset of the HumanEval dataset by extrapolating from models trained\nwith at most 1,000×less compute (Figure 2).\nFor an individual problem in HumanEval, performance may occasionally worsen with scale. Despite\nthese challenges, we find an approximate power law relationship −EP[log(pass _rate(C))] = α∗C−k\n2In addition to the accompanying system card, OpenAI will soon publish additional thoughts on the social\nand economic implications of AI systems, including the need for effective regulation.\n2","type":"Document"}
Result 5: {"id":null,"metadata":{"id":"8c5a8136-9944-44f7-90d6-575b922ff5cf"},"page_content":"Unsupervised Multitask Learners,” 2019.\n[23]G. C. Bowker and S. L. Star, Sorting Things Out . MIT Press, Aug. 2000.\n[24]L. Weidinger, J. Uesato, M. Rauh, C. Griffin, P.-S. Huang, J. Mellor, A. Glaese, M. Cheng,\nB. Balle, A. Kasirzadeh, C. Biles, S. Brown, Z. Kenton, W. Hawkins, T. Stepleton, A. Birhane,\nL. A. Hendricks, L. Rimell, W. Isaac, J. Haas, S. Legassick, G. Irving, and I. Gabriel, “Taxonomy\nof Risks posed by Language Models,” in 2022 ACM Conference on Fairness, Accountability,\nand Transparency , FAccT ’22, (New York, NY, USA), pp. 214–229, Association for Computing\nMachinery, June 2022.\n72","type":"Document"}
Hybrid Search
Hybrid search combines vector similarity with BM25 full-text ranking, fusing the two result lists using Reciprocal Rank Fusion (RRF).
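Roughly, the underlying Cosmos DB SQL fuses the two rankings with the RRF system function; an illustrative sketch (the exact query LangChain builds may differ):

# Illustrative only: hybrid ranking in Cosmos DB SQL via Reciprocal Rank Fusion
hybrid_sql_sketch = """
SELECT TOP 5 c.text
FROM c
ORDER BY RANK RRF(
    FullTextScore(c.text, ["compute", "training", "GPT"]),
    VectorDistance(c.embedding, @queryEmbedding)
)
"""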
query = "What were the compute requirements for training GPT 4"
full_text_rank_filter = [
{
"search_field": "text",
"search_text": "What were the compute requirements for training GPT 4",
}
]
results = vector_search.similarity_search_with_score(
query=query,
k=5,
query_type=CosmosDBQueryType.HYBRID,
full_text_rank_filter=full_text_rank_filter,
)
# Display results
for i in range(0, len(results)):
print(f"Result {i+1}: ", results[i][0].json())
print(f"Score {i+1}: ", results[i][1])
print("\n")
Result 1: {"id":null,"metadata":{"id":"3ae8615a-5fd3-4543-8d94-b01687904e02"},"page_content":"Figure 11: Results on IF evaluations across GPT3.5, GPT3.5-Turbo, GPT-4-launch\n98","type":"Document"}
Score 1: 0.5545045822126439
Result 2: {"id":null,"metadata":{"id":"f81d994b-bd4e-4471-905f-841ac529584d"},"page_content":"the HumanEval dataset. A power law fit to the smaller models (excluding GPT-4) is shown as the dotted\nline; this fit accurately predicts GPT-4’s performance. The x-axis is training compute normalized so that\nGPT-4 is 1.\n3","type":"Document"}
Score 2: 0.5529193759066282
Result 3: {"id":null,"metadata":{"id":"b8117761-b5ec-473d-a818-dd5f7dda75ac"},"page_content":"safety considerations above against the scientific value of further transparency.\n3 Predictable Scaling\nA large focus of the GPT-4 project was building a deep learning stack that scales predictably. The\nprimary reason is that for very large training runs like GPT-4, it is not feasible to do extensive\nmodel-specific tuning. To address this, we developed infrastructure and optimization methods that\nhave very predictable behavior across multiple scales. These improvements allowed us to reliably\npredict some aspects of the performance of GPT-4 from smaller models trained using 1,000×–\n10,000×less compute.\n3.1 Loss Prediction\nThe final loss of properly-trained large language models is thought to be well approximated by power\nlaws in the amount of compute used to train the model [41, 42, 2, 14, 15].\nTo verify the scalability of our optimization infrastructure, we predicted GPT-4’s final loss on our","type":"Document"}
Score 3: 0.6270494557311032
Result 4: {"id":null,"metadata":{"id":"4b0034fa-0d0e-46b3-9385-0582511eb28f"},"page_content":"performance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction\nThis technical report presents GPT-4, a large multimodal model capable of processing image and\ntext inputs and producing text outputs. Such models are an important area of study as they have the\npotential to be used in a wide range of applications, such as dialogue systems, text summarization,\nand machine translation. As such, they have been the subject of substantial interest and progress in\nrecent years [1–34].\nOne of the main goals of developing such models is to improve their ability to understand and generate\nnatural language text, particularly in more complex and nuanced scenarios. To test its capabilities\nin such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In\nthese evaluations it performs quite well and often outscores the vast majority of human test takers.","type":"Document"}
Score 4: 0.642735520879037
Result 5: {"id":null,"metadata":{"id":"89751c93-55eb-4497-ac51-64b07368fab9"},"page_content":"Preliminary results on a narrow set of academic vision benchmarks can be found in the GPT-4 blog\npost [ 65]. We plan to release more information about GPT-4’s visual capabilities in follow-up work.\n8","type":"Document"}
Score 5: 0.5310937141136488
Hybrid Search with Filtering
query = "What were the compute requirements for training GPT 4"
pre_filter = PreFilter(
conditions=[
Condition(
property="text",
operator="$full_text_contains_any",
value="What were the compute requirements for training GPT 4",
),
Condition(property="metadata.page", operator="$eq", value=0),
],
logical_operator="$and",
)
full_text_rank_filter = [
{
"search_field": "text",
"search_text": "What were the compute requirements for training GPT 4",
}
]
results = vector_search.similarity_search_with_score(
    query=query,
    k=5,
    query_type=CosmosDBQueryType.HYBRID,
    pre_filter=pre_filter,  # apply the filter defined above
    full_text_rank_filter=full_text_rank_filter,
)
# Display results
for i in range(0, len(results)):
print(f"Result {i+1}: ", results[i][0].json())
print(f"Score {i+1}: ", results[i][1])
print("\n")
Result 1: {"id":null,"metadata":{"id":"3ae8615a-5fd3-4543-8d94-b01687904e02"},"page_content":"Figure 11: Results on IF evaluations across GPT3.5, GPT3.5-Turbo, GPT-4-launch\n98","type":"Document"}
Score 1: 0.5545045822126439
Result 2: {"id":null,"metadata":{"id":"f81d994b-bd4e-4471-905f-841ac529584d"},"page_content":"the HumanEval dataset. A power law fit to the smaller models (excluding GPT-4) is shown as the dotted\nline; this fit accurately predicts GPT-4’s performance. The x-axis is training compute normalized so that\nGPT-4 is 1.\n3","type":"Document"}
Score 2: 0.5529193759066282
Result 3: {"id":null,"metadata":{"id":"b8117761-b5ec-473d-a818-dd5f7dda75ac"},"page_content":"safety considerations above against the scientific value of further transparency.\n3 Predictable Scaling\nA large focus of the GPT-4 project was building a deep learning stack that scales predictably. The\nprimary reason is that for very large training runs like GPT-4, it is not feasible to do extensive\nmodel-specific tuning. To address this, we developed infrastructure and optimization methods that\nhave very predictable behavior across multiple scales. These improvements allowed us to reliably\npredict some aspects of the performance of GPT-4 from smaller models trained using 1,000×–\n10,000×less compute.\n3.1 Loss Prediction\nThe final loss of properly-trained large language models is thought to be well approximated by power\nlaws in the amount of compute used to train the model [41, 42, 2, 14, 15].\nTo verify the scalability of our optimization infrastructure, we predicted GPT-4’s final loss on our","type":"Document"}
Score 3: 0.6270494557311032
Result 4: {"id":null,"metadata":{"id":"4b0034fa-0d0e-46b3-9385-0582511eb28f"},"page_content":"performance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction\nThis technical report presents GPT-4, a large multimodal model capable of processing image and\ntext inputs and producing text outputs. Such models are an important area of study as they have the\npotential to be used in a wide range of applications, such as dialogue systems, text summarization,\nand machine translation. As such, they have been the subject of substantial interest and progress in\nrecent years [1–34].\nOne of the main goals of developing such models is to improve their ability to understand and generate\nnatural language text, particularly in more complex and nuanced scenarios. To test its capabilities\nin such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In\nthese evaluations it performs quite well and often outscores the vast majority of human test takers.","type":"Document"}
Score 4: 0.642735520879037
Result 5: {"id":null,"metadata":{"id":"89751c93-55eb-4497-ac51-64b07368fab9"},"page_content":"Preliminary results on a narrow set of academic vision benchmarks can be found in the GPT-4 blog\npost [ 65]. We plan to release more information about GPT-4’s visual capabilities in follow-up work.\n8","type":"Document"}
Score 5: 0.5310937141136488
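Finally, the store plugs into the rest of LangChain as a retriever, which is how you would typically wire it into a RAG chain:

# Wrap the vector store as a retriever for use in chains
retriever = vector_search.as_retriever(search_kwargs={"k": 5})
docs = retriever.invoke("What were the compute requirements for training GPT 4")
print(docs[0].page_content)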
Related
- Vector store conceptual guide
- Vector store how-to guides