Remove Lines Containing Strings


About a year ago, I joined an existing team at work, and while digging through the codebase I would be working on, I stumbled upon this block of code in utilities.py:

# TODO: There is doubtless a better way to do this
def remove_lines_containing_strings(haystack: str, needles: Set[str]) -> str:
    parts = haystack.split('\n')
    result = []
    for part in parts:
        was_seen = False
        for needle in needles:
            if needle in part:
                was_seen = True
        if not was_seen:
            result.append(part)
    return '\n'.join(result)

After a quick git blame, I found that the function had been written by a non-Python engineer at the beginning of the project, and no one had since had the time, or been confident enough in their Python, to refactor it. I, being the most experienced Python engineer on the team, decided to take a crack at it.

That one comment

There is doubtless a better way to do this

had me sufficiently nerd sniped. I ended up spending the rest of the day on it, and even got other engineers in on the problem.

Understanding the Problem

The function, as the name and type hints imply, was meant to take a newline-delimited blob of text and remove the lines that contained any of a set of patterns.
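For example (a minimal illustration of the contract, not a snippet from the codebase):

text = "keep this\ndrop this line\nkeep that too"
remove_lines_containing_strings(text, {'drop', 'missing'})
# -> 'keep this\nkeep that too'

A line is dropped if it contains any needle as a substring; everything else is kept in order and rejoined with newlines.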

First Pass Solution

Python-Expert-Me’s first thought was that this is a simple nested list comprehension: a one-liner

def remove_lines_containing_strings(haystack: str, needles: Set[str]) -> str:
    return "\n".join([line for line in haystack.split('\n') if not any(needle in line for needle in needles)])

There: simple, a single line, does the job, passes all of the tests. Job done. Here is where my Python hubris kicked in, and I wanted to performance test the solution to show how much I had helped by replacing a for loop with a list comprehension; because everybody knows that list comprehensions are optimized by the compiler.

The Performance Test

I set up the test in a Jupyter notebook so I could have quick access to %timeit, built a test case that was sufficiently hard, and made sure to cover some of the edge cases:

gettysburg_address = """Four score and seven years ago 
our fathers brought forth on this continent,
a new nation,
conceived in Liberty,
and dedicated to the proposition that
all men are created equal.

Now we are engaged in a great civil war,
testing whether that nation,
or any nation so conceived and so dedicated,
can long endure.
We are met on a great battle-field of that war.
We have come to dedicate a portion of that field,
as a final resting place for those who
here gave their lives that that nation might live.
It is altogether fitting
and proper that we should do this.

But, in a larger sense,
we can not dedicate --
we can not consecrate --
we can not hallow -- this ground.
The brave men, living and dead,
who struggled here, have consecrated it,
far above our poor power to add or detract.
The world will little note,
nor long remember what we say here,
but it can never forget what they did here.
It is for us the living,
rather, to be dedicated here
to the unfinished work which they who fought here
have thus far so nobly advanced.
It is rather for us to be here dedicated
to the great task remaining before us --
that from these honored dead we take increased devotion
to that cause for which they gave the last full measure of devotion --
that we here highly resolve that these dead shall not have died in vain --
that this nation, under God, shall have a new birth of freedom --
and that government of the people, by the people, for the people,
shall not perish from the earth."""


remove_words = {'Four', 'nation', 'men', 'great', 'remember', 'foo', 'bar', 'fizzbuzz'}

So we have the Gettysburg Address, and a few words whose containing lines we want removed. Some of the words match a lot, some of them never match; I wanted to make sure all of the edge cases were covered.
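Each candidate below was timed with a cell along these lines (reconstructed for illustration; the exact invocations were not kept):

# In a Jupyter cell, after defining the candidate function:
%timeit remove_lines_containing_strings(gettysburg_address, remove_words)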

So I benchmarked the original:

original
29.6 µs ± 672 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Then, to show off, I benchmarked my nested list comprehension:

list comprehension
52.6 µs ± 893 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Wait, what? It took almost 2x as long! List comprehensions are supposed to be way faster!

Use a Generator Comprehension

Okay, maybe my problem was that I had a list comprehension when I should have used a generator (I didn't need the []):

def remove_lines_containing_strings(haystack: str, needles: Set[str]) -> str:
    return '\n'.join(line for line in haystack.split('\n') if not any(needle in line for needle in needles))

Result:

generator comprehension
53.5 µs ± 1.85 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Okay, that was even slower (although statistically the same, given that margin of error).

Let's Try filter

Okay, maybe using the built-in filter would help.

def remove_lines_containing_strings(haystack: str, needles: Set[str]) -> str:
    return '\n'.join(filter(lambda part: not any(needle in part for needle in needles), haystack.split('\n')))

Still a one-liner; I'm not a big fan of that lambda being in there, but let's see how it does:

filter
58.7 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Slower still.

What About Set Intersection?

Sets are supposed to be fast, right? Let's try that:

def remove_lines_containing_strings(haystack: str, needles: Set[str]) -> str:
    # Note: this matches whole space-separated words only, not arbitrary substrings
    return '\n'.join(line for line in haystack.split('\n') if not (set(line.split(' ')) & needles))

Please be fast…

set intersection
58.7 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Tied with filter for the slowest yet.

This is about the time when I started breaking down the problem, trying to figure out why the procedural code was outperforming my compiler-optimized comprehensions. The answer came down to short circuiting. The loops of the comprehension are as follows:

  • An outer loop, looping over every line
  • An inner loop, looping over every needle

If any of the needles is found, you can stop checking and remove that line. The comprehensions were not allowing that short circuiting to occur. Looking at the original, there is additional short circuiting we can do:

Short Circuiting

def remove_lines_containing_strings(haystack: str, needles: Set[str]) -> str:
    parts = haystack.split('\n')
    result = []
    for part in parts:
        was_seen = False
        for needle in needles:
            if needle in part:
                was_seen = True
                break
        if not was_seen:
            result.append(part)
    return '\n'.join(result)

By adding the break, we allow the inner loop to short circuit.

inner-for break
27.3 µs ± 793 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

We got faster! The code ended up longer, but that is fine; let's see if we can get even faster.

Flipping the Loops

In theory, most of the time we run this, the set of needles is going to be shorter than the list of lines. So let's try flipping the loops:

def remove_lines_containing_strings(haystack: str, needles: Set[str]) -> str:
    parts = haystack.split('\n')
    for needle in needles:
        for i in range(len(parts) - 1, -1, -1):
            if needle in parts[i]:
                del parts[i]
    return "\n".join(parts)

Here we loop over the needles, then over the list of parts in reverse order (we will be removing indices as we go, so reverse order is safe). This way, as we work through the needles, the list of lines gets shorter, and the iterations should shrink over time:

flipped loops
31.5 µs ± 852 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Not faster, but still a much better solution than the list comprehensions.
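As an aside, the reverse iteration is what makes the in-place deletes safe. A toy demonstration (hypothetical data, not part of the benchmark):

parts = ['a', 'x', 'x', 'b']
for i in range(len(parts) - 1, -1, -1):
    if 'x' in parts[i]:
        del parts[i]
print(parts)  # ['a', 'b']
# A forward loop would shift the second 'x' into the index just checked,
# letting it survive, and then walk past the shrunken end of the list.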

Pure Generator Solution

Generators are supposed to be fast and take up less memory, and, if implemented properly, we can add short circuiting to them:

def remove_lines_containing_strings(haystack: str, needles: Set[str]) -> str:
    def filter_by(lines, needles):
        for line in lines:
            seen = False
            for needle in needles:
                if needle in line:
                    seen = True
                    break
            if not seen:
                yield line
    return "\n".join(filter_by(haystack.split('\n'), needles))

Here we implement a pure generator as a filter, using our short-circuiting code. The benefit is that we should be able to use less memory while iterating:

pure generator
26.1 µs ± 469 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

There we go: roughly 12% faster than the original. Not bad. Can we go faster?

Pre-Culling Needles

One of the things I noticed is that, because we are doing O(n_lines × n_needles) work, reducing the number of iterations is the key to performance in this function. If a needle never shows up anywhere in the haystack, we can drop it up front, reducing the iterations. Let's do that before our flipped loops, so the outer loop over needles has fewer passes:

def remove_lines_containing_strings(haystack: str, needles: Set[str]) -> str:
    found_needles = [needle for needle in needles if needle in haystack]
    parts = haystack.split('\n')
    for needle in found_needles:
        for i in range(len(parts) - 1, -1, -1):
            if needle in parts[i]:
                del parts[i]
    return "\n".join(parts)

Now we have fewer needles, and the list of lines should still be getting shorter each time:

precull flipped loops
26.2 µs ± 606 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Significantly faster than the non-preculled version, and almost as fast as our generator solution.
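To see the culling in action on our test data (just an illustration of the precull step):

found = [w for w in remove_words if w in gettysburg_address]
# leaves 'Four', 'nation', 'men', 'great', 'remember' (set order varies);
# 'foo', 'bar', and 'fizzbuzz' never appear, so three outer loops vanish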

Preculled Generator

Which leads to the question: can we breed our two fastest methods to create one even faster?

def remove_lines_containing_strings(haystack: str, needles: Set[str]) -> str:
    found_needles = [needle for needle in needles if needle in haystack]
    if not found_needles:
        return haystack
    def __gen(lines, needles):
        for line in lines:
            seen = False
            for needle in needles:
                if needle in line:
                    seen = True
                    break
            if not seen:
                yield line
    return "\n".join(__gen(haystack.split('\n'), found_needles))

I also added a shortcut for when no needles are found at all, because why do any work then? We create a private generator that short circuits, much like our pure generator, but pass it the preculled list of needles.

precull generator
23.6 µs ± 360 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

And there it is, our fastest solution.

What Are We Really Optimizing?

So after a day, with three other engineers pitching in solutions and even some betting on which method would be the fastest, we ended up going with the list comprehension one-liner. The difference in performance was minuscule, and the use case in our system had needles set to two elements while the strings we parsed never ran longer than 20 lines. The ratio of performance gained to lines of code added was not worth the optimizations. The comprehension was easy to read, used a familiar pattern, and performed well enough.

However, it was fun walking through optimizing the code and figuring out better ways to solve the same problem, as well as addressing some of the complexity concerns and seeing how much performance we could squeeze out.

For those of you who want to try, I put together a performance testing harness for this problem: add methods to the class and it will validate the results and time them for you.

from typing import List
import timeit
import hashlib


class RemoveLinesWorkset:
    gettysburg_address = """Four score and seven years ago
our fathers brought forth on this continent,
a new nation,
conceived in Liberty,
and dedicated to the proposition that
all men are created equal.

Now we are engaged in a great civil war,
testing whether that nation,
or any nation so conceived and so dedicated,
can long endure.
We are met on a great battle-field of that war.
We have come to dedicate a portion of that field,
as a final resting place for those who
here gave their lives that that nation might live.
It is altogether fitting
and proper that we should do this.

But, in a larger sense,
we can not dedicate --
we can not consecrate --
we can not hallow -- this ground.
The brave men, living and dead,
who struggled here, have consecrated it,
far above our poor power to add or detract.
The world will little note,
nor long remember what we say here,
but it can never forget what they did here.
It is for us the living,
rather, to be dedicated here
to the unfinished work which they who fought here
have thus far so nobly advanced.
It is rather for us to be here dedicated
to the great task remaining before us --
that from these honored dead we take increased devotion
to that cause for which they gave the last full measure of devotion --
that we here highly resolve that these dead shall not have died in vain --
that this nation, under God, shall have a new birth of freedom --
and that government of the people, by the people, for the people,
shall not perish from the earth."""

    remove_words = {'Four', 'nation', 'men', 'great', 'remember', 'foo', 'bar', 'fizzbuzz', 'variable'}
    result_hash = '6292b1703fd369e05ae6d9e49ef2fe38'
    result_length = 970

    def get_tries(self) -> List[str]:
        """
        Get list of try functions, prefixed with try_*
        """
        return [methodname for methodname in dir(self) if methodname[:4] == 'try_']

    def validate_methods(self):
        """
        Make sure all methods are returning the correct data.
        """
        for method in self.get_tries():
            result = getattr(self, method)(self.gettysburg_address, self.remove_words)
            hashed = hashlib.md5(result.encode("utf-8")).hexdigest()
            if not result or len(result) != self.result_length or hashed != self.result_hash:
                print("'{}()' did not return the correct result set: {} is not {}".format(method, hashed, self.result_hash))
                return False
        return True

    def time_methods(self):
        """
        Use timeit to time each method for performance
        """
        if not self.validate_methods():
            return None
        for method in self.get_tries():
            print(method)
            test_case = "t = RemoveLinesWorkset()\nt.{}(t.gettysburg_address, t.remove_words)".format(method)
            timer = timeit.Timer(test_case, setup="from __main__ import RemoveLinesWorkset")
            # total seconds for 100,000 runs divided by 0.1 == mean microseconds per loop
            print("100,000 loops, averaged {:.3f} µs per loop\n".format(timer.timeit(number=100000) / .1))

    def try_original(self, haystack, needles):
        parts = haystack.split('\n')
        result = []
        for part in parts:
            was_seen = False
            for needle in needles:
                if needle in part:
                    was_seen = True
            if not was_seen:
                result.append(part)
        return '\n'.join(result)

    def try_list_comprehension(self, haystack, needles):
        return '\n'.join([line for line in haystack.split('\n') if not any(needle in line for needle in needles)])

    def try_generator_comprehension(self, haystack, needles):
        return '\n'.join(line for line in haystack.split('\n') if not any(needle in line for needle in needles))

    def try_filter_lambda(self, haystack, needles):
        return '\n'.join(filter(lambda part: not any(needle in part for needle in needles), haystack.split('\n')))

    def try_for_break(self, haystack, needles):
        parts = haystack.split('\n')
        result = []
        for part in parts:
            was_seen = False
            for needle in needles:
                if needle in part:
                    was_seen = True
                    break
            if not was_seen:
                result.append(part)
        return '\n'.join(result)

    def try_reversed_loops(self, haystack, needles):
        parts = haystack.split('\n')
        for needle in needles:
            for i in range(len(parts) - 1, -1, -1):
                if needle in parts[i]:
                    del parts[i]
        return "\n".join(parts)

    def try_pure_generator(self, haystack, needles):
        def filter_by(lines, needles):
            for line in lines:
                seen = False
                for needle in needles:
                    if needle in line:
                        seen = True
                        break
                if not seen:
                    yield line
        return "\n".join(filter_by(haystack.split('\n'), needles))

    def try_culled_reverse_loops(self, haystack, needles):
        found_needles = [needle for needle in needles if needle in haystack]
        parts = haystack.split('\n')
        for needle in found_needles:
            for i in range(len(parts) - 1, -1, -1):
                if needle in parts[i]:
                    del parts[i]
        return "\n".join(parts)

    def try_culled_generator(self, haystack, needles):
        found_needles = [needle for needle in needles if needle in haystack]
        if not found_needles:
            return haystack
        def __gen(lines, needles):
            for line in lines:
                seen = False
                for needle in needles:
                    if needle in line:
                        seen = True
                        break
                if not seen:
                    yield line
        return "\n".join(__gen(haystack.split('\n'), found_needles))


RemoveLinesWorkset().time_methods()
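Plugging in a new attempt is just a matter of defining another try_-prefixed method. Here is a hypothetical regex-based entry (not one of the approaches we tried that day), just to show the harness picking it up:

import re

def try_regex(self, haystack, needles):
    # One alternation pattern, so the regex engine scans each line a single time
    pattern = re.compile('|'.join(re.escape(needle) for needle in needles))
    return '\n'.join(line for line in haystack.split('\n') if not pattern.search(line))

RemoveLinesWorkset.try_regex = try_regex
RemoveLinesWorkset().time_methods()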