Here is a solution that uses Polars to perform the search task you describe. Subject to some caveats, it will parallelize the task across all cores of your machine ... and it will be blazingly fast. (I'll provide a benchmark at the end.)
Caveats
Support will likely be limited. Searching a single-line text file for thousands of strings is a rather atypical use of the Polars DataFrame query engine, so help from the Polars user community may be hard to come by.
Polars is an in-memory query engine. As such, your large, single-line text file must fit in memory. If your text file is too large to fit comfortably in RAM on your computing platform, you may need to apply this algorithm in batches.
Upgrade to Polars 0.13.40 or later. We'll be using the new extract_all method.
Be sure that your CPU cooling is adequate. This algorithm will push all cores on your CPU to 100%, potentially for an extended period of time, depending on the capabilities of your machine, the size of your text file, and the number of search items. (Indeed, to develop this algorithm, I had to apply long-delayed BIOS & BMC updates to ensure adequate fan control, which delayed this response.)
Performance is not guaranteed to be optimized for this particular use, nor necessarily representative of the performance expected for more typical uses of Polars.
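On the batching caveat above: one possible approach (a sketch of my own, not part of the measured solution) is to split the text into overlapping chunks and run the search on each chunk in turn. The overlap must be at least as long as the regex window (the context characters plus the longest search term), so that a match straddling a chunk boundary is not lost. The chunking helper itself is plain Python:

```python
def overlapping_chunks(text, chunk_size, overlap):
    """Split `text` into chunks of `chunk_size` characters, where each
    chunk repeats the last `overlap` characters of the previous chunk,
    so matches spanning a boundary are not lost."""
    step = chunk_size - overlap
    return [text[i : i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

# Each chunk can then be loaded into its own one-row DataFrame and searched.
chunks = overlapping_chunks("abcdefghij", chunk_size=4, overlap=2)
# chunks -> ['abcd', 'cdef', 'efgh', 'ghij']
```

Matches found in the overlapping regions will appear twice, so a de-duplication pass on the combined results may be needed.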
The Algorithm
Example Text File and Search Terms
It's easiest to explain an algorithm and result if we have some example data.
As our text file, I'll use The Expressions page from the Polars API Documentation. Using a browser, you can save the HTML file to your hard drive.
For our search items, I've selected a few commonly used Polars Expressions and terms. (Later, for benchmarking, we'll use a far larger text file and a far longer list of search terms.)
Notice that in the Python open statement, I am decoding the contents of the file and loading the resulting string into a Polars DataFrame in a column named txt. (In particular, I am not loading raw bytes into the DataFrame.)
import polars as pl
search_terms = [
"select",
"with_column",
"DataFrame",
"explode",
"(E|e)xpression",
"(P|p)olars",
]
with open(r"Expressions — Polars documentation.html", "r", encoding="utf-8") as my_file:
txt_df = pl.DataFrame({"txt": [my_file.read()]})
print(txt_df)
shape: (1, 1)
┌─────────────────────────────────────┐
│ txt │
│ --- │
│ str │
╞═════════════════════════════════════╡
│ logo <https://pola-rs.github.io/... │
└─────────────────────────────────────┘
What we get is a one-row, one-column Polars DataFrame with the entire text file loaded as a string.
The (List of) Expressions
Expressions are the heart of Polars. Mastering the use of Expressions is the key to shocking performance in Polars.
For our use, we'll generate a list of Expressions, one Expression for each search item. Each Expression will contain a regex pattern that will include the search item along with 10 characters before and after the search item. (You can change this to suit your needs.)
Each Expression will create a column ('col_1', 'col_2', etc...), so we will get one column for each search term.
Note that the extract_all expression returns a list of all regex matches for each search term. This differs from the code in your question, which returns only the first match rather than all matches.
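To illustrate that first-match-versus-all-matches distinction with plain Python (a stand-in illustration, not the exact code from the question): re.search stops at the first match, while re.findall, like extract_all, returns every match.

```python
import re

text = "polars select ... polars select"
pattern = r"polars \w+"

# First match only, as in the question's approach.
first = re.search(pattern, text).group()   # 'polars select'

# All matches, matching the behavior of extract_all.
all_matches = re.findall(pattern, text)    # ['polars select', 'polars select']
```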
expr_list = [
pl.col("txt")
.str.extract_all(r".{10}" + search_term + r".{10}")
.alias("col_" + str(col_num))
for col_num, search_term in enumerate(search_terms)
]
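One practical caveat of my own, not covered above: extract_all treats each search term as a regex, which is why patterns like (E|e)xpression work. A literal term that happens to contain regex metacharacters (dots, parentheses, brackets) should be escaped before being embedded in the pattern, e.g. with Python's re.escape:

```python
import re

# A hypothetical literal term containing regex metacharacters...
term = "DataFrame.select(...)"

# ...must be escaped before being embedded in the window pattern,
# otherwise the dots and parentheses are interpreted as regex syntax.
pattern = r".{10}" + re.escape(term) + r".{10}"
# re.escape(term) -> 'DataFrame\\.select\\(\\.\\.\\.\\)'
```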
Running the List of Expressions
Here is the code to run the search, using our expressions above, along with the 3,356 matches found for the search items.
result = (
txt_df.with_columns(expr_list)
.select(pl.exclude("txt"))
.melt()
.select(["value"])
.with_column(pl.Series(values=search_terms).alias("search_item"))
.explode("value")
)
print(result)
shape: (3356, 2)
┌────────────────────────────┬─────────────┐
│ value ┆ search_item │
│ --- ┆ --- │
│ str ┆ str │
╞════════════════════════════╪═════════════╡
│ DataFrame.select_at_idx.ht ┆ select │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ DataFrame.select_at_idx.ht ┆ select │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ pulation/ selection <#mani ┆ select │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ pi/polars.select.html#pola ┆ select │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ rence/api/polars.internals ┆ (P|p)olars │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ian.html> polars.select <h ┆ (P|p)olars │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ github.io/polars/py-polars ┆ (P|p)olars │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ rence/api/polars.select.ht ┆ (P|p)olars │
└────────────────────────────┴─────────────┘
Let's take it in steps.
The algorithm first runs the list of Expressions. The result is a one-row DataFrame. Each Expression in our list creates a column that contains a list of all matches for each search item. Notice that the second search item, "with_column" (col_1), found no matches on the Polars Expressions API page. (I purposely chose "with_column" as a search item to show what happens when no match is found.)
txt_df.with_columns(expr_list)
shape: (1, 7)
┌──────────────────────────────────────┬─────────────────────────────────────┬───────────┬─────────────────────────────────────┬─────────────────────────────────┬─────────────────────────────────────┬─────────────────────────────────────┐
│ txt ┆ col_0 ┆ col_1 ┆ col_2 ┆ col_3 ┆ col_4 ┆ col_5 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ list[str] ┆ list[str] ┆ list[str] ┆ list[str] ┆ list[str] ┆ list[str] │
╞══════════════════════════════════════╪═════════════════════════════════════╪═══════════╪═════════════════════════════════════╪═════════════════════════════════╪═════════════════════════════════════╪═════════════════════════════════════╡
│ logo <https://pola-rs.github.io/... ┆ ["DataFrame.select_at_idx.ht", "... ┆ null ┆ [" o polars.DataFrame.write_csv"... ┆ ["lars.Expr.explode.html#pola"] ┆ ["e used as expression and somet... ┆ ["github.io/polars/py-polars", "... │
└──────────────────────────────────────┴─────────────────────────────────────┴───────────┴─────────────────────────────────────┴─────────────────────────────────┴─────────────────────────────────────┴─────────────────────────────────────┘
In the next step, we will use melt to convert this "wide" format DataFrame of many columns (and only one row) into a "long" format DataFrame. This will put the list of results for each search term in its own row.
(
txt_df.with_columns(expr_list)
.select(pl.exclude("txt"))
.melt()
)
shape: (6, 2)
┌──────────┬─────────────────────────────────────┐
│ variable ┆ value │
│ --- ┆ --- │
│ str ┆ list[str] │
╞══════════╪═════════════════════════════════════╡
│ col_0 ┆ ["DataFrame.select_at_idx.ht", "... │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ col_1 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ col_2 ┆ [" o polars.DataFrame.write_csv"... │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ col_3 ┆ ["lars.Expr.explode.html#pola"] │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ col_4 ┆ ["e used as expression and somet... │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ col_5 ┆ ["github.io/polars/py-polars", "... │
└──────────┴─────────────────────────────────────┘
Next, we'll replace the variable column with our actual search terms, which is more helpful.
(
txt_df.with_columns(expr_list)
.select(pl.exclude("txt"))
.melt()
.select(["value"])
.with_column(pl.Series(values=search_terms).alias("search_item"))
)
shape: (6, 2)
┌─────────────────────────────────────┬────────────────┐
│ value ┆ search_item │
│ --- ┆ --- │
│ list[str] ┆ str │
╞═════════════════════════════════════╪════════════════╡
│ ["DataFrame.select_at_idx.ht", "... ┆ select │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ with_column │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [" o polars.DataFrame.write_csv"... ┆ DataFrame │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["lars.Expr.explode.html#pola"] ┆ explode │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["e used as expression and somet... ┆ (E|e)xpression │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["github.io/polars/py-polars", "... ┆ (P|p)olars │
└─────────────────────────────────────┴────────────────┘
And finally, we'll use the explode method to place each item in each list on its own row.
(
txt_df.with_columns(expr_list)
.select(pl.exclude("txt"))
.melt()
.select(["value"])
.with_column(pl.Series(values=search_terms).alias("search_item"))
.explode("value")
)
shape: (3356, 2)
┌────────────────────────────┬─────────────┐
│ value ┆ search_item │
│ --- ┆ --- │
│ str ┆ str │
╞════════════════════════════╪═════════════╡
│ DataFrame.select_at_idx.ht ┆ select │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ DataFrame.select_at_idx.ht ┆ select │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ pulation/ selection <#mani ┆ select │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ pi/polars.select.html#pola ┆ select │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ rence/api/polars.internals ┆ (P|p)olars │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ian.html> polars.select <h ┆ (P|p)olars │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ github.io/polars/py-polars ┆ (P|p)olars │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ rence/api/polars.select.ht ┆ (P|p)olars │
└────────────────────────────┴─────────────┘
Benchmarking
So, for all this work, how fast is it?
For benchmarking, I chose the "enwik9" file, a 1 GB file often used for benchmarking procedures involving text files (e.g., file compression, text searching, etc.) According to the description at the link, the "enwik9" file is "the first 10^9 bytes of the English Wikipedia dump on Mar. 3, 2006".
For search terms, I chose the 1,969 unique names of US counties (e.g., "Broward County", "Palm Beach County").
My computing platform is a 32-core Threadripper Pro (with plenty of RAM). If an algorithm can be parallelized, then a Threadripper Pro will reveal this.
I ran the algorithm using the code you provided in your question and the code above.
The code in your question: 439 seconds
The Polars code above: 54 seconds
For reference, here is the result of the Polars code (154,833 matches for all search terms):
shape: (154833, 2)
┌─────────────────────────────────────┬───────────────┐
│ value ┆ search_item │
│ --- ┆ --- │
│ str ┆ str │
╞═════════════════════════════════════╪═══════════════╡
│ nty]] # [[Cuming County, Nebrask... ┆ Cuming County │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ (East) *[[Cuming County, Nebrask... ┆ Cuming County │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ theast *[[Cuming County, Nebrask... ┆ Cuming County │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ es === *[[Cuming County, Nebrask... ┆ Cuming County │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [Category:Roseau County, Minneso... ┆ Roseau County │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ated in [[Roseau County, Minneso... ┆ Roseau County │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [Category:Roseau County, Minneso... ┆ Roseau County │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hlighting Roseau County.png</tit... ┆ Roseau County │
└─────────────────────────────────────┴───────────────┘
It's not an apples-to-apples comparison. For example, the Polars code above extracts all matches for each regex pattern, versus just the first match.
If nothing else, hopefully this answer provides some idea of the shockingly fast performance that we Polars users enjoy.