How to assign code chunks to a file after TextSplitter (LangChain)?

Question

I am using the RecursiveCharacterTextSplitter from LangChain to split Python files. In doing so, I lose the information about which chunk belongs to which file. How can I keep track of this and assign the individual chunks to a file name afterwards?

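import os

# Import path as of LangChain in 2023; newer releases expose these names
# in the langchain_text_splitters package.
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter


def index_repo(repo_url):

    os.environ['OPENAI_API_KEY'] = ""

    contents = []
    fileextensions = (".py",)

    print('cloning repo')
    repo_dir = get_repo(repo_url)  # helper not shown in the question

    print(repo_dir)

    for dirpath, dirnames, filenames in os.walk(repo_dir):
        for file in filenames:
            if file.endswith(fileextensions):
                try:
                    with open(os.path.join(dirpath, file), "r", encoding="utf-8") as f:
                        # only the text is stored, so the file name is lost here
                        contents.append(f.read())
                except Exception:
                    # skip files that cannot be read or decoded
                    pass

    # chunk the files
    text_splitter = RecursiveCharacterTextSplitter.from_language(language=Language.PYTHON, chunk_size=5000, chunk_overlap=0)
    texts = text_splitter.create_documents(contents)

    return texts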

score 0 · Answer 1 · answered Aug 31 '23 at 01:49

create_documents(texts: List[str], metadatas: Optional[List[dict]] = None) → List[Document]

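Put the file information in metadatas and pass it to create_documents. The list must line up with texts: the dict at index i is attached to every chunk produced from the text at index i, so each resulting Document carries its file name in its metadata attribute.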
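A minimal sketch of how that could look for the loop in the question, assuming the splitter is built exactly as shown there. The helper names collect_sources and split_with_sources and the "source" metadata key are illustrative choices, not part of the original code or required by LangChain.

```python
import os


def collect_sources(repo_dir, extensions=(".py",)):
    """Read matching files and return parallel lists of texts and metadata dicts."""
    contents, metadatas = [], []
    for dirpath, _dirnames, filenames in os.walk(repo_dir):
        for name in sorted(filenames):
            if not name.endswith(tuple(extensions)):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", encoding="utf-8") as f:
                    text = f.read()
            except (OSError, UnicodeDecodeError):
                continue  # skip unreadable files, as the question's loop does
            # Append to both lists together so index i always describes the same file.
            contents.append(text)
            # "source" is an assumed key name; any key works.
            metadatas.append({"source": os.path.relpath(path, repo_dir)})
    return contents, metadatas


def split_with_sources(text_splitter, repo_dir):
    """Split every file; each chunk receives the metadata dict of its file."""
    contents, metadatas = collect_sources(repo_dir)
    return text_splitter.create_documents(contents, metadatas=metadatas)
```

With the text_splitter from the question, docs = split_with_sources(text_splitter, repo_dir) should return Document objects whose file can be read back as doc.metadata["source"]. Skipping an unreadable file before either list is appended to keeps the two lists the same length, which is what create_documents relies on.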

answered Aug 31 '23 at 01:49

Xiaomin Wu
