I am using the RecursiveCharacterTextSplitter from LangChain to split Python files. In doing so, I lose track of which chunk belongs to which file. How can I keep this information and map each chunk back to its file name afterwards?

import os

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter


def index_repo(repo_url):
    os.environ['OPENAI_API_KEY'] = ""

    contents = []
    fileextensions = [".py"]

    print('cloning repo')
    repo_dir = get_repo(repo_url)  # get_repo is not shown here; it returns the path that is walked below
    print(repo_dir)

    # read every matching file into memory
    for dirpath, dirnames, filenames in os.walk(repo_dir):
        for file in filenames:
            if file.endswith(tuple(fileextensions)):
                try:
                    with open(os.path.join(dirpath, file), "r", encoding="utf-8") as f:
                        contents.append(f.read())
                except Exception:
                    # skip files that cannot be opened or decoded
                    pass

    # chunk the files
    text_splitter = RecursiveCharacterTextSplitter.from_language(language=Language.PYTHON, chunk_size=5000, chunk_overlap=0)
    texts = text_splitter.create_documents(contents)

    return texts

alpa

1 Answer

create_documents(texts: List[str], metadatas: Optional[List[dict]] = None) → List[Document]

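Put the file information in metadatas and pass it to create_documents. Build one dict per file (for example, holding the file name), in the same order as texts. Every chunk produced from texts[i] then carries a copy of metadatas[i] in its metadata attribute.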
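
As a rough sketch of that idea, the loop from the question could collect a metadata dict alongside each file's contents. The helper names collect_sources and split_with_sources and the "source" key are made up for this illustration and are not part of LangChain. The import path assumes a LangChain version that still exposes langchain.text_splitter.

```python
import os


def collect_sources(repo_dir, extensions=(".py",)):
    """Return two parallel lists: file contents and one metadata dict per file."""
    contents, metadatas = [], []
    for dirpath, _dirnames, filenames in os.walk(repo_dir):
        for name in sorted(filenames):
            if not name.endswith(tuple(extensions)):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", encoding="utf-8") as f:
                    text = f.read()
            except (OSError, UnicodeDecodeError):
                continue  # skip files that cannot be read as UTF-8
            contents.append(text)
            # "source" is an arbitrary key chosen for this sketch
            metadatas.append({"source": os.path.relpath(path, repo_dir)})
    return contents, metadatas


def split_with_sources(repo_dir):
    # Imported lazily so collect_sources can be used without LangChain installed.
    from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

    contents, metadatas = collect_sources(repo_dir)
    splitter = RecursiveCharacterTextSplitter.from_language(
        language=Language.PYTHON, chunk_size=5000, chunk_overlap=0
    )
    # Every chunk cut from contents[i] receives a copy of metadatas[i].
    return splitter.create_documents(contents, metadatas=metadatas)
```

Each returned Document should then expose the originating file through doc.metadata["source"], so chunks can be grouped or filtered by file after splitting.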

Xiaomin Wu