About a year ago, I was joining an existing team at work, and, while digging through the codebase I was to be working on, I stumbled upon this block of code in utilities.py:
from typing import Set

# TODO: There is doubtless a better way to do this
def remove_lines_containing_strings(haystack: str, needles: Set[str]) -> str:
    parts = haystack.split('\n')
    result = []
    for part in parts:
        was_seen = False
        for needle in needles:
            if needle in part:
                was_seen = True
        if not was_seen:
            result.append(part)
    return '\n'.join(result)
A quick git blame showed that the function had been written by a non-Python engineer at the beginning of the
project, and no one had the time or was confident enough with their Python to refactor it. I, being the most
experienced Python engineer, decided to take a crack at it.
That one comment
There is doubtless a better way to do this
had me sufficiently nerd-sniped. I ended up spending the rest of the day on it, and even got other engineers
in on the problem.
Understanding the Problem
The function, as the name and type hints imply, was meant to take a newline-delimited blob of text
and remove the lines that contained any of a set of patterns.
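To make the contract concrete, here is a tiny sketch of the expected behavior (the example strings are mine, not from the original codebase):

    text = "keep me\ndrop me please\nkeep me too"
    remove_lines_containing_strings(text, {'drop'})
    # returns 'keep me\nkeep me too'; the middle line contained a needle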
First Pass Solution
Python-Expert-Me’s first thought was that this is a simple nested list comprehension: a one-liner.
def remove_lines_containing_strings(haystack: str, needles: Set[str]) -> str:
    return "\n".join([line for line in haystack.split('\n') if not any(needle in line for needle in needles)])
There: simple, a single line, does the job, passes all of the tests. Job done. Here is where my Python hubris
wanted to show off, and performance-test the solution to demonstrate how much I had helped by replacing
a for loop with a list comprehension; because everybody knows that list comprehensions are optimized by
the compiler.
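As an aside, if you want to see what the compiler actually emits for a comprehension, the dis module will disassemble one for you; this only shows the bytecode, it proves nothing about speed:

    import dis

    # Disassemble a list comprehension expression to inspect the bytecode CPython compiles it to
    dis.dis('[x for x in y]')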
The Performance Test
I set up the test in a Jupyter notebook so I could have quick access to %timeit, built a test case that
was sufficiently hard, and tested some of the edge cases:
gettysburg_address = """Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.
Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.
But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow -- this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us -- that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion -- that we here highly resolve that these dead shall not have died in vain -- that this nation, under God, shall have a new birth of freedom -- and that government of the people, by the people, for the people, shall not perish from the earth."""
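And the needles, a mix of frequent hits and guaranteed misses (this is the same set used by the test harness at the end of the post):

    remove_words = {'Four', 'nation', 'men', 'great', 'remember', 'foo', 'bar', 'fizzbuzz', 'variable'}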
So we have the Gettysburg Address and a few words whose lines we want to remove. Some of
the words match a lot of lines, some of them never match; I wanted to make sure all of the edge cases were covered.
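Each timing below came from a notebook cell along these lines (a sketch; it assumes the function under test and the inputs above are defined in the session):

    %timeit remove_lines_containing_strings(gettysburg_address, remove_words)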
So I benchmark the original:
original
29.6 µs ± 672 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Then, to show off, I benchmark my nested list comprehension:
list comprehension
52.6 µs ± 893 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Wait, what? It took almost 2x as long! List comprehensions are supposed to be way faster!
Use a generator comprehension
Okay, maybe my problem was that I used a list comprehension when I should have used a generator expression (I didn't need the []):
def remove_lines_containing_strings(haystack: str, needles: Set[str]) -> str:
    return '\n'.join(line for line in haystack.split('\n') if not any(needle in line for needle in needles))
Result:
generator comprehension
53.5 µs ± 1.85 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Okay, that was even slower (although statistically the same, given that margin of error).
Let's try filter
Okay, maybe using the built-in filter function would help.
def remove_lines_containing_strings(haystack: str, needles: Set[str]) -> str:
    return '\n'.join(filter(lambda part: not any(needle in part for needle in needles), haystack.split('\n')))
Still a one-liner. I'm not a big fan of that lambda being in there, but let's see how it does:
filter
58.7 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Slower still.
What about set Intersection
Sets are supposed to be fast, right? Let's try that:
def remove_lines_containing_strings(haystack: str, needles: Set[str]) -> str:
    return '\n'.join(line for line in haystack.split('\n') if not (set(line.split(' ')) & needles))
Please be fast…
set intersection
58.7 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The slowest yet.
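Worth flagging before moving on: this version also quietly changes the semantics. It matches whole space-delimited tokens instead of substrings, so it only agrees with the original when every needle shows up as a standalone word. A quick illustration (my own toy example):

    line = 'that nation, under God'
    print('nation' in line)                  # True: substring match, what the original does
    print('nation' in set(line.split(' ')))  # False: the token here is 'nation,' with a comma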
This is about the time I started breaking down the problem and trying to figure
out why the procedural code was outperforming my compiler-optimized comprehensions. The answer
came in the form of short-circuiting. The loops of the comprehension are as follows:
An outer loop, looping over every line
An inner loop, looping over every needle
If any of the needles is found, you can stop and remove that line. The list comprehensions were not allowing
that short-circuiting to occur. Looking at the original, there is additional short-circuiting we can do:
Short Circuiting
def remove_lines_containing_strings(haystack: str, needles: Set[str]) -> str:
    parts = haystack.split('\n')
    result = []
    for part in parts:
        was_seen = False
        for needle in needles:
            if needle in part:
                was_seen = True
                break
        if not was_seen:
            result.append(part)
    return '\n'.join(result)
By adding the break, we allow the inner loop to short-circuit.
inner-for break
27.3 µs ± 793 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
We got faster! It ended up being longer, but that is fine. Let's see if we can get even faster.
Flipping the Loops
In theory, most of the time we run this, the list of needles is going to be shorter than the list
of lines. So let's try flipping the loops:
def remove_lines_containing_strings(haystack: str, needles: Set[str]) -> str:
    parts = haystack.split('\n')
    for needle in needles:
        for i in range(len(parts) - 1, -1, -1):
            if needle in parts[i]:
                del parts[i]
    return "\n".join(parts)
Here we loop over the needles, then over the list of parts in reverse order (we will be removing indices as we go,
so reverse order is safe). This way, as we go through the needles, the list of lines gets shorter, and the iterations
should shrink over time:
flipped loops
31.5 µs ± 852 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Not faster, but still a much better solution than the list comprehensions.
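The reverse iteration is the subtle part, so here is a toy demonstration (not from the original post) of why deleting while walking backward is safe while walking forward is not:

    parts = ['a', 'b', 'b', 'c']
    for i in range(len(parts) - 1, -1, -1):  # walk backward so deletions never shift unvisited indices
        if parts[i] == 'b':
            del parts[i]
    print(parts)  # ['a', 'c']; a forward walk would shift elements after each del and skip the second 'b'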
Pure Generator Solution
Generators are supposed to be fast and take up less memory, and, if implemented properly, we can add short-circuiting to them:
def remove_lines_containing_strings(haystack: str, needles: Set[str]) -> str:
    def filter_by(lines, needles):
        for line in lines:
            seen = False
            for needle in needles:
                if needle in line:
                    seen = True
                    break
            if not seen:
                yield line
    return "\n".join(filter_by(haystack.split('\n'), needles))
Here we implement a pure generator as a filter using our short-circuiting code. The benefit is that
we should use less memory iterating over the generator:
pure generator
26.1 µs ± 469 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
There we go. ~12% faster than the original. Not bad. Can we go faster?
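The memory claim is easy to sanity-check, for what it's worth: a generator holds only a suspended frame, while a list materializes every kept line (a quick check of my own, not a benchmark from the post):

    import sys

    lines = ['line number {}'.format(i) for i in range(10000)]
    kept_list = [line for line in lines if 'foo' not in line]  # materializes every kept line
    kept_gen = (line for line in lines if 'foo' not in line)   # holds only a suspended frame
    print(sys.getsizeof(kept_list))  # grows with the number of kept lines
    print(sys.getsizeof(kept_gen))   # small and constant regardless of input size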
Pre-Culling Needles
One of the things I noticed is that, because we are doing O(n_lines × n_needles) work, reducing the number of iterations
is the key to performance in this function. If a needle never shows up in the haystack at all, we can remove it up front, reducing
the iterations. Let's do that before our flipped loops so the outer loop over needles has fewer passes:
def remove_lines_containing_strings(haystack: str, needles: Set[str]) -> str:
    found_needles = [needle for needle in needles if needle in haystack]
    parts = haystack.split('\n')
    for needle in found_needles:
        for i in range(len(parts) - 1, -1, -1):
            if needle in parts[i]:
                del parts[i]
    return "\n".join(parts)
Now we have fewer needles, and the list of lines should be getting shorter each time:
precull flipped loops
26.2 µs ± 606 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Significantly faster than the non-preculled version, and almost as fast as our generator solution.
Preculled Generator
Which leads to the question: can we breed our two fastest methods to create one even faster?
def remove_lines_containing_strings(haystack: str, needles: Set[str]) -> str:
    found_needles = [needle for needle in needles if needle in haystack]
    if not found_needles:
        return haystack
    def __gen(lines, needles):
        for line in lines:
            seen = False
            for needle in needles:
                if needle in line:
                    seen = True
                    break
            if not seen:
                yield line
    return "\n".join(__gen(haystack.split('\n'), found_needles))
We also add a shortcut for when no needles are found at all, because why do any work then? Then we create a private
generator that short-circuits, much like our pure generator, but pass it the pre-culled list of needles.
precull generator
23.6 µs ± 360 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
And there it is, our fastest solution.
What are we Really Optimizing?
So, after a day and 3 other engineers pitching in solutions, even some betting on which method would be
the fastest, we ended up going with the first-pass list comprehension one-liner. The difference in performance was minuscule,
and in our system's use case the needles were a set of 2 elements and we were never parsing strings longer
than 20 lines. The ratio of performance gained to lines of code in the codebase was not worth all of the optimizations. The
list comprehension was easy to read, used a familiar pattern, and performed well enough.
However, it was fun walking through optimizing the code and figuring out better ways to solve the same problem, as well
as addressing some of the complexity concerns and seeing how much performance we could squeeze out.
For those of you who want to try, I put together a performance-testing harness for this problem: simply add
methods to the class and it will validate the results and time them for you.
from typing import List
import timeit
import hashlib


class RemoveLinesWorkset:
    gettysburg_address = """Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.
Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.
But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow -- this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us -- that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion -- that we here highly resolve that these dead shall not have died in vain -- that this nation, under God, shall have a new birth of freedom -- and that government of the people, by the people, for the people, shall not perish from the earth."""

    remove_words = {'Four', 'nation', 'men', 'great', 'remember', 'foo', 'bar', 'fizzbuzz', 'variable'}
    result_hash = '6292b1703fd369e05ae6d9e49ef2fe38'
    result_length = 970

    def get_tries(self) -> List[str]:
        """ Get list of try functions, prefixed with try_* """
        return [methodname for methodname in dir(self) if methodname[:4] == 'try_']

    def validate_methods(self):
        """ Make sure all methods are returning the correct data. """
        for method in self.get_tries():
            result = getattr(self, method)(self.gettysburg_address, self.remove_words)
            hashed = hashlib.md5(result.encode("utf-8")).hexdigest()
            if not result or len(result) != self.result_length or hashed != self.result_hash:
                print("'{}()' did not return the correct result set: {} is not {}".format(method, hashed, self.result_hash))
                return False
        return True

    def time_methods(self):
        """ Use timeit to time each method for performance """
        if not self.validate_methods():
            return None
        for method in self.get_tries():
            print(method)
            test_case = "t = RemoveLinesWorkset()\nt.{}(t.gettysburg_address, t.remove_words)".format(method)
            timer = timeit.Timer(test_case, setup="from __main__ import RemoveLinesWorkset")
            print("100,000 loops, averaged {:.3f} µs per loop\n".format(timer.timeit(number=100000) / .1))

    def try_original(self, haystack, needles):
        parts = haystack.split('\n')
        result = []
        for part in parts:
            was_seen = False
            for needle in needles:
                if needle in part:
                    was_seen = True
            if not was_seen:
                result.append(part)
        return '\n'.join(result)

    def try_list_comprehension(self, haystack, needles):
        return '\n'.join([line for line in haystack.split('\n') if not any(needle in line for needle in needles)])

    def try_generator_comprehension(self, haystack, needles):
        return '\n'.join(line for line in haystack.split('\n') if not any(needle in line for needle in needles))

    def try_filter_lambda(self, haystack, needles):
        return '\n'.join(filter(lambda part: not any(needle in part for needle in needles), haystack.split('\n')))

    def try_for_break(self, haystack, needles):
        parts = haystack.split('\n')
        result = []
        for part in parts:
            was_seen = False
            for needle in needles:
                if needle in part:
                    was_seen = True
                    break
            if not was_seen:
                result.append(part)
        return '\n'.join(result)

    def try_reversed_loops(self, haystack, needles):
        parts = haystack.split('\n')
        for needle in needles:
            for i in range(len(parts) - 1, -1, -1):
                if needle in parts[i]:
                    del parts[i]
        return "\n".join(parts)

    def try_pure_generator(self, haystack, needles):
        def filter_by(lines, needles):
            for line in lines:
                seen = False
                for needle in needles:
                    if needle in line:
                        seen = True
                        break
                if not seen:
                    yield line
        return "\n".join(filter_by(haystack.split('\n'), needles))

    def try_culled_reverse_loops(self, haystack, needles):
        found_needles = [needle for needle in needles if needle in haystack]
        parts = haystack.split('\n')
        for needle in found_needles:
            for i in range(len(parts) - 1, -1, -1):
                if needle in parts[i]:
                    del parts[i]
        return "\n".join(parts)

    def try_culled_generator(self, haystack, needles):
        found_needles = [needle for needle in needles if needle in haystack]
        if not found_needles:
            return haystack
        def __gen(lines, needles):
            for line in lines:
                seen = False
                for needle in needles:
                    if needle in line:
                        seen = True
                        break
                if not seen:
                    yield line
        return "\n".join(__gen(haystack.split('\n'), found_needles))


RemoveLinesWorkset().time_methods()