2

My default assumption was that the chunk_size parameter would set a ceiling on the size of the chunks/splits that come out of the split_text method, but that's clearly not right:

from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

chunk_size = 6
chunk_overlap = 2

c_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

text = 'abcdefghijklmnopqrstuvwxyz'

print(c_splitter.split_text(text))

prints: ['abcdefghijklmnopqrstuvwxyz'], i.e. one single chunk that is much larger than chunk_size=6.

I understand that it didn't split the text into chunks because it never encountered the separator. But then what is chunk_size actually doing?

I checked the documentation page for langchain.text_splitter.CharacterTextSplitter but did not find an answer to this question. I also asked the "mendable" chat-with-langchain-docs search functionality, which answered: "The chunk_size parameter of the CharacterTextSplitter determines the maximum number of characters in each chunk of text." That is not true, as the code sample above shows.

desertnaut
Max Power

2 Answers

5

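CharacterTextSplitter will only split on separator (which is '\n\n' by default). chunk_size is the maximum size of a chunk built by merging the pieces between separators, so it only takes effect where splitting is possible. If a string starts with n characters, has a separator, and has m more characters before the next separator, then the first chunk will have size n if chunk_size < n + m + len(separator); otherwise the two pieces are merged into one chunk. A piece that contains no separator is never cut, even when it is longer than chunk_size.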

Your example string has no matching separators so there's nothing to split on.
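The rule above can be sketched in plain Python. This is a simplified model written for illustration, not LangChain's actual implementation: the function name split_and_merge is hypothetical, and chunk_overlap is ignored.

```python
def split_and_merge(text, separator="\n\n", chunk_size=6):
    """Simplified model of separator-based splitting (overlap ignored).

    Hypothetical helper for illustration; not LangChain's code.
    """
    # Step 1: split only where the separator occurs.
    pieces = [p for p in text.split(separator) if p]

    # Step 2: merge adjacent pieces while the merged chunk fits in chunk_size.
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # Adding this piece would exceed chunk_size, so close the chunk.
            chunks.append(separator.join(current))
            current = []
        # A single piece is always accepted, even if it exceeds chunk_size.
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# No separator in the text: one oversized chunk, as in the question.
no_sep = split_and_merge("abcdefghijklmnopqrstuvwxyz", chunk_size=6)

# With a separator present, chunk_size limits how many pieces are merged.
with_sep = split_and_merge("ab cd ef gh", separator=" ", chunk_size=6)

print(no_sep)    # ['abcdefghijklmnopqrstuvwxyz']
print(with_sep)  # ['ab cd', 'ef gh']
```

In this model chunk_size only decides when to stop merging pieces; it never cuts inside a piece, which is why the 26-character string comes back whole.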

DMcC
-4

The chunk_size parameter in the CharacterTextSplitter class sets the intended maximum number of characters in each chunk when splitting a text into smaller chunks.

In the JavaScript version shown below, the default value is 1000, measured in characters rather than tokens. The parameter is set when creating an instance of the CharacterTextSplitter class and controls the size of the final documents, but only at separator boundaries. To illustrate how the parameter is used, here is an example:

import { CharacterTextSplitter } from "langchain/text_splitter";
const text = "This is a sample text to be split into smaller chunks.";
const splitter = new CharacterTextSplitter({
  chunkSize: 10,
});

const output = await splitter.createDocuments([text]);

In this example, chunkSize is set to 10. Because the sample text contains no '\n\n' (the default separator), it is not actually cut into 10-character chunks: chunkSize limits chunk length only where the separator occurs. The createDocuments method splits the text and returns a list of documents containing the resulting chunks.
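A quick plain-Python check of this example, as a stand-in for the JavaScript snippet and not a call into LangChain: assuming the default separator is '\n\n' as stated in the other answer, the sample text has nothing to split on. Fixed-width slicing is shown only for contrast, as what a hard 10-character ceiling would look like.

```python
text = "This is a sample text to be split into smaller chunks."

# Separator-only splitting: the text contains no "\n\n", so it stays whole.
pieces = text.split("\n\n")

# Fixed-width slicing: every chunk has at most 10 characters.
fixed = [text[i:i + 10] for i in range(0, len(text), 10)]

print(len(pieces), [len(p) for p in pieces])
print(fixed)
```

The first result is a single piece longer than 10 characters, while the second is what "chunks of 10 characters each" would actually require.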

desertnaut