How to convert LangChain documents back to strings?

Question

I have built a splitter function with the LangChain library that splits a series of Python files. At another point in the code I need to convert these documents back into Python code, but I do not know how to do this.

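import os

# Import path as of LangChain in 2023; newer releases expose these in langchain_text_splitters
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter


def index_repo(repo_url):

    os.environ['OPENAI_API_KEY'] = ""

    contents = []
    file_names = []
    file_extensions = (".py",)

    print('cloning repo')
    repo_dir = get_repo(repo_url)  # get_repo is defined elsewhere in my code

    for dirpath, dirnames, filenames in os.walk(repo_dir):
        for file in filenames:
            if file.endswith(file_extensions):
                path = os.path.join(dirpath, file)
                try:
                    with open(path, "r", encoding="utf-8") as f:
                        text = f.read()
                except (OSError, UnicodeDecodeError) as e:
                    # Report unreadable files instead of silently ignoring them
                    print(f'skipping {path}: {e}')
                    continue
                # Append both together so contents and file_names stay aligned
                contents.append(text)
                file_names.append(path)

    # chunk the files
    text_splitter = RecursiveCharacterTextSplitter.from_language(
        language=Language.PYTHON, chunk_size=5000, chunk_overlap=0
    )
    texts = text_splitter.create_documents(contents)

    return texts, file_names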

score 0 · Answer 1 · answered Sep 03 '23 at 05:51

Try replacing this:

    texts = text_splitter.create_documents(contents)

With this:

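    texts = [chunk for content in contents for chunk in text_splitter.split_text(content)]

The code you provided uses the create_documents method, which returns a list of Document objects; each Document has two attributes, page_content (a string) and metadata (a dictionary). The split_text method returns plain strings instead, so each chunk from the RecursiveCharacterTextSplitter becomes an item in your texts list. Note that split_text takes a single string rather than a list, which is why it is called once per file here. If you would rather keep the Document objects, the text of each chunk is available as its page_content attribute.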
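If you already have Document objects and just want the strings back, something like the sketch below may be enough. It only relies on each document having page_content and metadata attributes, as LangChain's Document does. So that it runs without the library installed, it uses a small stand-in Document class; the helper names documents_to_strings and group_by_source and the "source" metadata key are my own assumptions, not part of LangChain.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in mirroring LangChain's Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def documents_to_strings(docs):
    """Return the raw text of each chunk, in the original order."""
    return [doc.page_content for doc in docs]


def group_by_source(docs, key="source"):
    """Collect chunk texts per file, assuming metadata[key] holds the file path."""
    grouped = defaultdict(list)
    for doc in docs:
        grouped[doc.metadata.get(key, "<unknown>")].append(doc.page_content)
    return dict(grouped)


# Example chunks, as they might look after splitting two files
docs = [
    Document("import os", {"source": "a.py"}),
    Document("def f():\n    return 1", {"source": "a.py"}),
    Document("print('hi')", {"source": "b.py"}),
]

strings = documents_to_strings(docs)  # one plain string per chunk
by_file = group_by_source(docs)       # chunk strings grouped by originating file
```

With the real splitter, create_documents accepts an optional metadatas argument (one dictionary per input text), so you could pass something like metadatas=[{"source": name} for name in file_names], provided file_names lines up one-to-one with contents. Keep in mind that joining the chunks of a file may not reproduce the original source exactly, since the splitter can strip whitespace at chunk boundaries.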

Hope this helps!
