Here is a solution that uses Polars to perform the search task you describe. Subject to some caveats, it will parallelize the task across all cores of your machine ... and it will be blazingly fast. (I'll provide a benchmark at the end.)
Caveats
Support will likely be limited. Searching a single-line text file for thousands of strings is a rather atypical use of the Polars DataFrame query engine, so help from the Polars user community may be hard to come by.
Polars is an in-memory query engine. As such, your large, single-line text file must fit in memory. If your text file is too large to fit comfortably in RAM on your computing platform, you may need to apply this algorithm in batches.
Upgrade to Polars 0.13.40 or later. We'll be using the new extract_all method.
Be sure that your CPU cooling is adequate. This algorithm will push all cores on your CPU to 100%, potentially for an extended period of time, depending on the capabilities of your machine, the size of your text file, and the number of search items. (Indeed, to develop this algorithm, I had to apply long-delayed BIOS & BMC updates to ensure adequate fan control, which delayed this response.)
Performance is not guaranteed to be optimized for this particular use, nor necessarily representative of the performance expected for more typical uses of Polars.
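On the batching caveat above: one possible approach (a sketch of my own, not part of the measured solution) is to split the text into overlapping chunks and run the search on each chunk in turn. The overlap must be at least as long as the regex window (the context characters plus the longest search term), so that a match straddling a chunk boundary is not lost. The chunking helper itself is plain Python:

```python
def overlapping_chunks(text, chunk_size, overlap):
    """Split `text` into chunks of `chunk_size` characters, where each
    chunk repeats the last `overlap` characters of the previous chunk,
    so matches spanning a boundary are not lost."""
    step = chunk_size - overlap
    return [text[i : i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

# Each chunk can then be loaded into its own one-row DataFrame and searched.
chunks = overlapping_chunks("abcdefghij", chunk_size=4, overlap=2)
# chunks -> ['abcd', 'cdef', 'efgh', 'ghij']
```

Matches found in the overlapping regions will appear twice, so a de-duplication pass on the combined results may be needed.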
The Algorithm
Example Text File and Search Terms
It's easiest to explain an algorithm and result if we have some example data.
As our text file, I'll use The Expressions page from the Polars API Documentation. Using a browser, you can save the HTML file to your hard drive.
For our search items, I've selected a few commonly used Polars Expressions and terms. (Later, for benchmarking, we'll use a far larger text file and a far longer list of search terms.)
Notice that in the Python open statement, I am decoding the contents of the file and loading the resulting string into a Polars DataFrame in a column named txt. (In particular, I am not loading raw bytes into the DataFrame.)
import polars as pl
search_terms = [
"select",
"with_column",
"DataFrame",
"explode",
"(E|e)xpression",
"(P|p)olars",
]
with open(r"Expressions — Polars documentation.html", "r", encoding="utf-8") as my_file:
txt_df = pl.DataFrame({"txt": [my_file.read()]})
print(txt_df)
shape: (1, 1)
┌─────────────────────────────────────┐
│ txt │
│ --- │
│ str │
╞═════════════════════════════════════╡
│ logo <https://pola-rs.github.io/... │
└─────────────────────────────────────┘
What we get is a one-row, one-column Polars DataFrame with the entire text file loaded as a string.
The (List of) Expressions
Expressions are the heart of Polars. Mastering the use of Expressions is the key to shocking performance in Polars.
For our use, we'll generate a list of Expressions, one Expression for each search item. Each Expression will contain a regex pattern that will include the search item along with 10 characters before and after the search item. (You can change this to suit your needs.)
Each Expression will create a column ('col_1', 'col_2', etc...), so we will get one column for each search term.
Note that the extract_all expression returns a list of all regex matches for each search term. This differs from the code in your question, which returns only the first match rather than all matches.
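To illustrate that first-match-versus-all-matches distinction with plain Python (a stand-in illustration, not the exact code from the question): re.search stops at the first match, while re.findall, like extract_all, returns every match.

```python
import re

text = "polars select ... polars select"
pattern = r"polars \w+"

# First match only, as in the question's approach.
first = re.search(pattern, text).group()   # 'polars select'

# All matches, matching the behavior of extract_all.
all_matches = re.findall(pattern, text)    # ['polars select', 'polars select']
```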
expr_list = [
pl.col("txt")
.str.extract_all(r".{10}" + search_term + r".{10}")
.alias("col_" + str(col_num))
for col_num, search_term in enumerate(search_terms)
]
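One practical caveat of my own, not covered above: extract_all treats each search term as a regex, which is why patterns like (E|e)xpression work. A literal term that happens to contain regex metacharacters (dots, parentheses, brackets) should be escaped before being embedded in the pattern, e.g. with Python's re.escape:

```python
import re

# A hypothetical literal term containing regex metacharacters...
term = "DataFrame.select(...)"

# ...must be escaped before being embedded in the window pattern,
# otherwise the dots and parentheses are interpreted as regex syntax.
pattern = r".{10}" + re.escape(term) + r".{10}"
# re.escape(term) -> 'DataFrame\\.select\\(\\.\\.\\.\\)'
```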
Running the List of Expressions
Here is the code to run the search, using our expressions above, along with the 3,356 matches found for the search items.
result = (
txt_df.with_columns(expr_list)
.select(pl.exclude("txt"))
.melt()
.select(["value"])
.with_column(pl.Series(values=search_terms).alias("search_item"))
.explode("value")
)
print(result)
shape: (3356, 2)
┌────────────────────────────┬─────────────┐
│ value ┆ search_item │
│ --- ┆ --- │
│ str ┆ str │
╞════════════════════════════╪═════════════╡
│ DataFrame.select_at_idx.ht ┆ select │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ DataFrame.select_at_idx.ht ┆ select │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ pulation/ selection <#mani ┆ select │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ pi/polars.select.html#pola ┆ select │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ rence/api/polars.internals ┆ (P|p)olars │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ian.html> polars.select <h ┆ (P|p)olars │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ github.io/polars/py-polars ┆ (P|p)olars │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ rence/api/polars.select.ht ┆ (P|p)olars │
└────────────────────────────┴─────────────┘
Let's take it in steps.
The algorithm first runs the list of Expressions. The result is a one-row DataFrame. Each Expression in our list creates a column that contains a list of all matches for each search item. Notice that the second search item, "with_column" (col_1), found no matches on the Polars Expressions API page. (I purposely chose "with_column" as a search item to show what happens when no match is found.)
txt_df.with_columns(expr_list)
shape: (1, 7)
┌──────────────────────────────────────┬─────────────────────────────────────┬───────────┬─────────────────────────────────────┬─────────────────────────────────┬─────────────────────────────────────┬─────────────────────────────────────┐
│ txt ┆ col_0 ┆ col_1 ┆ col_2 ┆ col_3 ┆ col_4 ┆ col_5 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ list[str] ┆ list[str] ┆ list[str] ┆ list[str] ┆ list[str] ┆ list[str] │
╞══════════════════════════════════════╪═════════════════════════════════════╪═══════════╪═════════════════════════════════════╪═════════════════════════════════╪═════════════════════════════════════╪═════════════════════════════════════╡
│ logo <https://pola-rs.github.io/... ┆ ["DataFrame.select_at_idx.ht", "... ┆ null ┆ [" o polars.DataFrame.write_csv"... ┆ ["lars.Expr.explode.html#pola"] ┆ ["e used as expression and somet... ┆ ["github.io/polars/py-polars", "... │
└──────────────────────────────────────┴─────────────────────────────────────┴───────────┴─────────────────────────────────────┴─────────────────────────────────┴─────────────────────────────────────┴─────────────────────────────────────┘
In the next step, we will use melt to convert this "wide" format DataFrame of many columns (and only one row) into a "long" format DataFrame. This will put the list of results for each search term in its own row.
(
txt_df.with_columns(expr_list)
.select(pl.exclude("txt"))
.melt()
)
shape: (6, 2)
┌──────────┬─────────────────────────────────────┐
│ variable ┆ value │
│ --- ┆ --- │
│ str ┆ list[str] │
╞══════════╪═════════════════════════════════════╡
│ col_0 ┆ ["DataFrame.select_at_idx.ht", "... │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ col_1 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ col_2 ┆ [" o polars.DataFrame.write_csv"... │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ col_3 ┆ ["lars.Expr.explode.html#pola"] │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ col_4 ┆ ["e used as expression and somet... │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ col_5 ┆ ["github.io/polars/py-polars", "... │
└──────────┴─────────────────────────────────────┘
Next, we'll replace the variable column with our actual search terms, which is more helpful.
(
txt_df.with_columns(expr_list)
.select(pl.exclude("txt"))
.melt()
.select(["value"])
.with_column(pl.Series(values=search_terms).alias("search_item"))
)
shape: (6, 2)
┌─────────────────────────────────────┬────────────────┐
│ value ┆ search_item │
│ --- ┆ --- │
│ list[str] ┆ str │
╞═════════════════════════════════════╪════════════════╡
│ ["DataFrame.select_at_idx.ht", "... ┆ select │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ with_column │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [" o polars.DataFrame.write_csv"... ┆ DataFrame │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["lars.Expr.explode.html#pola"] ┆ explode │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["e used as expression and somet... ┆ (E|e)xpression │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["github.io/polars/py-polars", "... ┆ (P|p)olars │
└─────────────────────────────────────┴────────────────┘
And finally, we'll use the explode method to place each item in each list on its own row.
(
txt_df.with_columns(expr_list)
.select(pl.exclude("txt"))
.melt()
.select(["value"])
.with_column(pl.Series(values=search_terms).alias("search_item"))
.explode("value")
)
shape: (3356, 2)
┌────────────────────────────┬─────────────┐
│ value ┆ search_item │
│ --- ┆ --- │
│ str ┆ str │
╞════════════════════════════╪═════════════╡
│ DataFrame.select_at_idx.ht ┆ select │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ DataFrame.select_at_idx.ht ┆ select │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ pulation/ selection <#mani ┆ select │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ pi/polars.select.html#pola ┆ select │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ rence/api/polars.internals ┆ (P|p)olars │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ian.html> polars.select <h ┆ (P|p)olars │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ github.io/polars/py-polars ┆ (P|p)olars │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ rence/api/polars.select.ht ┆ (P|p)olars │
└────────────────────────────┴─────────────┘
Benchmarking
So, for all this work, how fast is it?
For benchmarking, I chose the "enwik9" file, a 1 GB file often used for benchmarking procedures involving text files (e.g., file compression, text searching, etc.) According to the description at the link, the "enwik9" file is "the first 10^9 bytes of the English Wikipedia dump on Mar. 3, 2006".
For search terms, I chose the 1,969 unique names of US counties (e.g., "Broward County", "Palm Beach County").
My computing platform is a 32-core Threadripper Pro (with plenty of RAM). If an algorithm can be parallelized, then a Threadripper Pro will reveal this.
I ran the algorithm using the code you provided in your question and the code above.
The code in your question: 439 seconds
The Polars code above: 54 seconds
For reference, here is the result of the Polars code (154,833 matches for all search terms):
shape: (154833, 2)
┌─────────────────────────────────────┬───────────────┐
│ value ┆ search_item │
│ --- ┆ --- │
│ str ┆ str │
╞═════════════════════════════════════╪═══════════════╡
│ nty]] # [[Cuming County, Nebrask... ┆ Cuming County │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ (East) *[[Cuming County, Nebrask... ┆ Cuming County │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ theast *[[Cuming County, Nebrask... ┆ Cuming County │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ es === *[[Cuming County, Nebrask... ┆ Cuming County │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [Category:Roseau County, Minneso... ┆ Roseau County │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ated in [[Roseau County, Minneso... ┆ Roseau County │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [Category:Roseau County, Minneso... ┆ Roseau County │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hlighting Roseau County.png</tit... ┆ Roseau County │
└─────────────────────────────────────┴───────────────┘
It's not an apples-to-apples comparison. For example, the Polars code above extracts all matches for each regex pattern, versus just the first match.
If nothing else, hopefully this answer provides some idea of the shockingly fast performance that we Polars users enjoy.