Meitham


Self Ramblings

Stuff that matters to me


A Pure Python GNU Find Like Utility

written on Saturday, January 5, 2013

Christmas is a great time to catch up on projects. I used the holiday to hack together this find tool. The kids were of course jumping all over me, but I somehow managed to get something working.

GNU Command Line Style

Okay, I must admit, I was never a big fan of the GNU find utility. I have always thought it was ported from the BSD project as-is and did not follow the uniform option style of the other GNU tools. Unlike those tools, its long-named options are prefixed with one hyphen (BSD style) rather than the two specified by the GNU standards, and that never fit in my head, or at least it wasn't easy to memorise.

So I kept finding myself typing

find --type f --name lablabla --exec do_something ;

And the tool would yell back at me, saying I should be using single hyphens.

find -type f -name lablabla -exec do_something ;

However, my opinion that a single uniform GNU interface works for all has changed since Python switched over from optparse to argparse in Python 2.7. PEP 389 summarises the reasons, all of which I agree with, for allowing characters other than - as an argument prefix.

A Pure Python Find

So I decided to challenge myself and implement the find command in Python. I knew it would not be hard to support all the standard find options, but I thought it would be great if this tool could also support find tests that handle file metadata. For a start, let's handle image EXIF details, something like

find ~/Pictures -iname \*.jpg -make Canon -print-tag 'Exif.Image.Model' \;

And this should scan my pictures directory looking for all the jpg images that were taken by a Canon camera and print out the model of the camera for each image.

I wanted this code to support plugins, though I was not sure what the best Pythonic plugin architecture was. Surprisingly, the top Google hit is a Stack Overflow question, yet none of the answers mentioned the setuptools entry points plugin framework.

Anyway, following the entry points approach I was up and running in no time.

Implementation

History has never been my strong suit, and that cost me here: I should have made the effort to read up on the history of find first. Instead I typed man find and started implementing the options one by one, assuming there was only one canonical find implementation. Well, I was wrong. I was on a Mac (OS X), so I was apparently reading the man page of the original BSD implementation rather than the GNU one. Anyway, I am sticking with my approach for now, until I find a good reason to switch to the GNU one.

The differences between GNU and BSD are mostly about terminology. GNU divides the arguments into tests and actions, whereas BSD labels everything as primaries. Let's create an abstract Primary:

class Primary(object):
    """This will be extended by all primaries
    """
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

A primary object has to support a single interface: it must be callable. It takes a context and returns a context back, updated if the primary wishes.

A context is a mapping that is passed from one primary to the next. If a primary does not return the context, the file is disqualified. In the figure below, primary1 processes the file and, if it returns a context, that context gets passed to primary2. primary2 applies some_arg to the file and, if its conditions are met, the context is passed on to primary3, and so on until primaryN.

/media/pyfind-context.png

A basic primary is the NameMatchPrimary, which is equivalent to the -name primary in the find command.

import fnmatch

class NameMatchPrimary(Primary):
    """Compares a file name against a pattern
    similar to `find -name arg1`
    """
    def __call__(self, context):
        filename = context['filename']
        # the argument given to the flag on the command line
        pattern = context['args']
        if fnmatch.fnmatch(filename, pattern):
            return context
        # no match: fall through and implicitly return None

If a primary returns None, that is a failure and this file will be ignored and not passed through to the next primary.
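To make the chaining concrete, here is a minimal sketch of how a driver could feed one file through a list of primaries. The run_chain function and the UpperCasePrimary class are hypothetical names made up for this illustration, and setting context['args'] before each call is an assumption based on how NameMatchPrimary reads its pattern; pyfind's real driver may look different.

```python
import fnmatch


class Primary(object):
    """Base class: keyword arguments become attributes."""
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)


class NameMatchPrimary(Primary):
    """Pass the context on only if the file name matches the pattern."""
    def __call__(self, context):
        if fnmatch.fnmatch(context['filename'], context['args']):
            return context


class UpperCasePrimary(Primary):
    """Hypothetical primary that updates the context before passing it on."""
    def __call__(self, context):
        context['shout'] = context['filename'].upper()
        return context


def run_chain(chain, filename):
    """Hypothetical driver: feed one file through (primary, args) pairs.

    Returns the final context, or None as soon as a primary rejects the file.
    """
    context = {'filename': filename}
    for primary, args in chain:
        context['args'] = args          # assumption: args travel in the context
        context = primary(context)
        if context is None:             # a primary disqualified the file
            return None
    return context


chain = [(NameMatchPrimary(), '*.jpg'), (UpperCasePrimary(), None)]
accepted = run_chain(chain, 'cat.jpg')
rejected = run_chain(chain, 'notes.txt')
```

Here accepted is the updated context for cat.jpg, while rejected is None because the name test fails and the second primary never runs.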

I also have a central mapping between primaries and user flags, set as below

primaries_map = {
        'name': NameMatchPrimary(),
        'print': PrintPrimary(),
        'print_context': PrintContext(),
        'exec': ExecPrimary(),
}

Python's argparse makes parsing options really easy. By default it stores the parsed arguments as attributes on a Namespace object, which does not record the order in which they were given. So I added a custom action that collects the arguments into a list. An OrderedDict would not help here because some arguments are allowed to be repeated.

import argparse

class PrimaryAction(argparse.Action):
    """An Action that collects arguments in the order they appear at the shell
    """
    def __call__(self, parser, namespace, values, option_string=None):
        # create the list the first time any primary flag is seen
        if 'primaries' not in namespace:
            setattr(namespace, 'primaries', [])
        namespace.primaries.append((self.dest, values))
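A quick way to see the effect is to parse a command line where a flag is repeated. This is only an illustrative sketch: the -type option is added here purely for the demonstration and is not part of the parser shown in this post.

```python
import argparse


class PrimaryAction(argparse.Action):
    """Collect (dest, values) pairs in command line order."""
    def __call__(self, parser, namespace, values, option_string=None):
        if 'primaries' not in namespace:
            setattr(namespace, 'primaries', [])
        namespace.primaries.append((self.dest, values))


parser = argparse.ArgumentParser()
parser.add_argument('-name', dest='name', action=PrimaryAction)
# hypothetical second flag, only here to show the ordering
parser.add_argument('-type', dest='type', action=PrimaryAction)

args = parser.parse_args(['-name', '*.jpg', '-type', 'f', '-name', '*.png'])
ordered = args.primaries
# ordered == [('name', '*.jpg'), ('type', 'f'), ('name', '*.png')]
```

Both -name occurrences survive, in the order they were typed, which is exactly what a plain attribute per option (or an OrderedDict) could not give us.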

The arguments are generated using this simple snippet.

import argparse
import os

def cli_args():
    """Builds the core argument parser
    """
    parser = argparse.ArgumentParser(description="extensible pure python "
            "gnu find like tool.")
    # the starting directory defaults to the current working directory
    parser.add_argument('path', action='store', nargs='?', default=os.getcwd())
    parser.add_argument('--verbose', '-v', action='count')
    parser.add_argument('-name', dest='name', action=PrimaryAction)
    return parser
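With the ordered list of (flag, value) pairs and a flag-to-primary mapping in hand, the main loop can be sketched roughly as below. This is an assumption about how the pieces fit together rather than pyfind's actual code: the find function, the 'dirname' context key and the CollectPrimary class (a stand-in for PrintPrimary that records paths instead of printing them) are all hypothetical.

```python
import fnmatch
import os


class NameMatchPrimary(object):
    """Pass the context on only if the file name matches the pattern."""
    def __call__(self, context):
        if fnmatch.fnmatch(context['filename'], context['args']):
            return context


class CollectPrimary(object):
    """Hypothetical stand-in for PrintPrimary: records paths it is given."""
    def __init__(self):
        self.seen = []

    def __call__(self, context):
        self.seen.append(os.path.join(context['dirname'], context['filename']))
        return context


def find(path, primaries, primaries_map):
    """Walk `path` and run every file through the primaries, in order."""
    for dirname, _subdirs, filenames in os.walk(path):
        for filename in filenames:
            context = {'dirname': dirname, 'filename': filename}
            for dest, values in primaries:
                context['args'] = values
                context = primaries_map[dest](context)
                if context is None:
                    break  # rejected; move on to the next file
```

Calling find(path, args.primaries, primaries_map) would then apply the primaries to each file in the order they were given on the command line.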

Plugins Support

Extensions are supported through setuptools entry points. A plugin has to provide a cli_args entry point that takes an ArgumentParser object, adds more arguments (or deletes or modifies existing ones) and returns it.

The plugin also needs to provide an entry point named primaries that defines the mapping between the arguments and the primaries, like the primaries_map object discussed earlier. Below is an example from the exivfind project.

setup(
    name='exivfind',
    ...
    entry_points="""
        [pygnutools.plugin]
        primaries=exivfind:primaries_map
        cli_args=exivfind:cli_args
    """,
)
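On the plugin side, the two objects those entry points refer to could look roughly like the sketch below. This is not the actual exivfind code: MakePrimary is a hypothetical primary, and the assumption that the camera make is already available under context['make'] is made up to keep the example free of any EXIF library.

```python
import argparse


class PrimaryAction(argparse.Action):
    """Collect (dest, values) pairs in command line order."""
    def __call__(self, parser, namespace, values, option_string=None):
        if 'primaries' not in namespace:
            setattr(namespace, 'primaries', [])
        namespace.primaries.append((self.dest, values))


class MakePrimary(object):
    """Hypothetical primary: keep files whose camera make matches.

    Assumes something stored the make under context['make']; a real
    plugin would read it from the image's EXIF data instead.
    """
    def __call__(self, context):
        if context.get('make', '').lower() == context['args'].lower():
            return context


def cli_args(parser):
    """Plugin hook: register the plugin's flags on the host parser."""
    parser.add_argument('-imake', dest='imake', action=PrimaryAction)
    return parser


# Plugin hook: maps each flag's dest to the primary that implements it.
primaries_map = {'imake': MakePrimary()}
```

The host only needs to call cli_args(parser) and merge primaries_map into its own mapping; the plugin never has to know how the files are walked.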

The cli_args() function we defined earlier has the following lines at the end to support extensions

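# add plugins
# (needs: from pkg_resources import iter_entry_points)
for plugin in iter_entry_points(group='pygnutools.plugin', name='cli_args'):
    # each plugin receives the parser and returns it with its own flags added
    parser = plugin.load()(parser)
return parser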

The exivfind plugin supplies pyfind with EXIF primaries that can qualify files based on their EXIF details. You can do things like:

pyfind ~/Pictures -iname \*.jpg -imake canon -print-tag 'Exif.Image.Model' \;

The source code is available on GitHub, and patches and features are most welcome.

This entry was tagged argparse, cli, exif, gnu and python