Building the raidgrep query language
Some time ago I started working on raidgrep, which is a tool similar to ripgrep and find. It recursively searches a folder of special files and filters them based on some criteria that can be specified on the command line.
The files in question are .evtc files, which are produced by a third-party addon to the MMORPG Guild Wars 2 — arcdps. Those files contain a lot of metadata about a given fight in the game, and (initially most importantly) also contain the account and character names of all participating players. Therefore, the idea of raidgrep — and the reason for its name — was to search through the log files for a specific player name.
Over time, other useful filter criteria emerged: for example, filtering logs by whether the fight succeeded, or by the boss's name. All of those (some of them documented in various GitLab issues: #2, #3) were added as command line flags:
FLAGS:
    [...]
    -v, --invert-match    Invert the regular expression (show fights that do not match)
    [...]

OPTIONS:
    -a, --younger <after>        Only show logs that are younger than the given time
    -b, --older <before>         Only show logs that are older than the given time
    -e, --bosses <bosses>        Only show logs from the given encounters [default: *]
    -f, --fields <field>         The fields which should be searched [default: account,character]
    -o, --outcome <outcome>      Only display fights with the given outcome [default: *]
    [...]
    -w, --weekdays <weekdays>    Only show logs from the given weekdays [default: *]

ARGS:
    <EXPR>    The regular expression to search for
Both the documentation and the implementation were a bit wonky. For example, --invert-match only applied to the regular expression, while --bosses had custom logic to allow "inverting" its value: --bosses !deimos, for example, stood for all bosses except Deimos.
This was powered by a custom "comma separated list" parser that had special support for handling "*" and "!..." style values to mean "all" and "all except for ...", respectively:
impl<T> FromStr for CommaSeparatedList<T>
where
    T: FromStr + Variants + Clone + Hash + Eq + fmt::Debug,
{
    type Err = ParseError<T::Err>;

    fn from_str(input: &str) -> Result<Self, Self::Err> {
        if input == "*" {
            Ok(CommaSeparatedList { values: T::variants().collect() })
        } else if input.starts_with(NEGATOR) {
            let no_csl = CommaSeparatedList::from_str(&input[1..])?;
            let all_values = T::variants().collect::<HashSet<_>>();
            Ok(CommaSeparatedList {
                values: all_values.difference(&no_csl.values).cloned().collect(),
            })
        } else {
            let parts = input.split(DELIMITER);
            let values = parts
                .map(FromStr::from_str)
                .collect::<Result<HashSet<_>, _>>()
                .map_err(ParseError::Underlying)?;
            Ok(CommaSeparatedList { values })
        }
    }
}
Medium yikes to my old me!
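For reference, the behaviour of that parser can be reproduced in a small standalone sketch. The boss list and function name here are made up for illustration; this is not code from raidgrep:

```rust
use std::collections::HashSet;

// Stand-ins for the real constants (hypothetical values).
const NEGATOR: char = '!';
const DELIMITER: char = ',';

// A toy "all known variants" list, mirroring the Variants trait.
fn all_bosses() -> HashSet<&'static str> {
    ["deimos", "qadim", "dhuum"].iter().copied().collect()
}

// Mirrors the three branches of the real from_str:
// "*" means "all", "!..." means "all except ...", anything else is a plain list.
fn parse_bosses(input: &str) -> HashSet<&'static str> {
    if input == "*" {
        all_bosses()
    } else if let Some(rest) = input.strip_prefix(NEGATOR) {
        let excluded = parse_bosses(rest);
        all_bosses().difference(&excluded).copied().collect()
    } else {
        input
            .split(DELIMITER)
            .filter_map(|part| all_bosses().get(part).copied())
            .collect()
    }
}
```

So --bosses '!deimos' expands to every known boss except Deimos, which is exactly the special-casing the new filter language replaces.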
In anticipation of even more filters (like #4) and a better understanding of what is actually happening, I decided to revamp the interface first and thought about the raidgrep filtering language.
The RFL
The main inspiration for RFL was the find command line utility. find lets you locate files based on simple "tests" such as the file name (-name), the file type (-type), the permissions (-perm) and many more. Furthermore, it supports expression grouping with parentheses and boolean operators such as -a for "and" and -o for "or". Given the similarity to what I was looking for, this seemed like a good fit!
Apart from that, I honestly did not put all too much thought into RFL. Since the tests in find start with a -, the predicates in RFL start with a - as well. There's no practical reason for that; it just makes them look like regular command line switches. I did decide against using -and/-or in favor of just and/or, simply because I thought it'd look a bit weird if every word on a line had a - prefix. Additionally, it makes it easy to tell predicates and operators apart.
At one point, I considered arbitrary binary operators and some sort of "global variables". That would mean that you could, for example, write name == "Dunje" for an exact match, name =~ "Dunje" for a regular expression, name in ["Dunje"] for a list, and so on. I scrapped that idea because it seemed to become too complex: it opened up a lot of different possibilities that would need to be supported, along with some sort of runtime checks to make sure the predicates are sensible. That is definitely beyond the scope of a simple find-like language; it would've ended up more like a SQL WHERE clause.
One thing that I did add, however, was quantifiers. The reason is that a lot of the predicates apply to the whole file ("log"), but some apply only to parts of it (such as predicates pertaining to a single player). In order to use a player-predicate on a log, you need to wrap it in either all(player: ...) (which succeeds if the player-predicate matches all players in the file) or any(player: ...) (which succeeds if the player-predicate matches at least one player). The syntax was inspired by Python's all and any functions, as well as SQL's WHERE EXISTS clause.
In the end though, a lot of it is also just bikeshedding (such as using &, && or and as an operator), so I didn't want to spend too much time thinking about it. The gitlab issue about raidgrep filtering language contains some more examples and thoughts, but without further ado, here are some expressions that work in the new engine:
- any(player: -name "Peter Parker"): Shows logs in which any player has the name "Peter Parker".
- all(player: -name "Dunje" or -name "xyoz"): Shows logs in which all players are either called "Dunje" or "xyoz".
- any(player: -name "Gellalli") and -boss Qadim: Shows logs in which a player called "Gellalli" is participating and the boss is called "Qadim".
- -success: Shows all logs that were successful.
- not -wipe: Shows all logs that were successful, this time by excluding logs that were unsuccessful ("wipe").
Similarly, some expressions which are erroneous and will not work:
- -name "Peter Parker": -name is a player predicate and needs either any or all surrounding it.
- any(player: all(player: -name "Peter Parker")): all is a log-predicate and cannot be nested inside an any predicate.
You get the idea.
Dynamic abstract filtering trees
In order to support those arbitrary filters, raidgrep needed its filtering engine adjusted. The old engine looked like this:
fn search_log(entry: &DirEntry, opt: &Opt) -> Result<Option<LogResult>> {
    let mut stream = /* some code omitted */;
    let partial = evtclib::raw::parser::parse_partial_file(&mut stream)?;

    let early_ok =
        filters::filter_name(&partial, opt) != opt.invert && filters::filter_boss(&partial, opt);
    if !early_ok {
        return Ok(None);
    }

    let raw = evtclib::raw::parser::finish_parsing(partial, &mut stream)?;
    let parsed = evtclib::process(&raw).ok();
    let log = if let Some(e) = parsed {
        e
    } else {
        debug!("log file cannot be parsed: {:?}", entry.path());
        return Ok(None);
    };
    let info = extract_info(entry, &log);

    let take_log = filters::filter_outcome(&info, opt)
        && filters::filter_weekday(&info, opt)
        && filters::filter_time(&info, opt)
        && filters::filter_guilds(&info, opt);

    if take_log {
        Ok(Some(info))
    } else {
        Ok(None)
    }
}
The filters are all hardcoded and executed at pre-defined places. The reason for that is that some filters can give a result early in the file parsing process (think after parsing the header), while other filters need the whole file to be processed before they can make a decision. For example, the boss filter (for --bosses) can already throw away logs after checking the header, but the outcome filter needs to process the whole file.
This process saves a lot of time, because the header is orders of magnitude smaller than the rest of the file. Skipping a file early in the filtering process is something that I don't want to give up, even with "more flexible" filters. However, not all filters are able to make a decision early, so before we design something like a Filter trait, we first need to define a small helper enum for an enhanced Bool (no, seriously, three-valued logic is a thing):
#[derive(Debug, Copy, Clone, PartialEq, Eq, FromPrimitive)]
pub enum Inclusion {
    /// The log should be included.
    Include = 1,
    /// The state is yet unknown.
    Unknown = 0,
    /// The log should be excluded.
    Exclude = -1,
}
Here, Inclusion::Unknown works like SQL's NULL. We need a third state because we have to distinguish between "we can already safely exclude this log" (so that and can short-circuit early) and "we can already safely include this log" (so that or can short-circuit early). If we represented Unknown by mapping it to either Include or Exclude, we would lose this extra information.
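The AndFilter further down combines early results with & on Inclusion values, so Inclusion needs the bitwise operators implemented with three-valued semantics. One way to do that (a self-contained sketch, an assumption rather than necessarily raidgrep's actual implementation) is Kleene logic: with the numeric values above, "and" is the minimum, "or" the maximum, and "not" flips the sign. Here the variants are listed in ascending order and Ord is derived so min/max can be used directly:

```rust
use std::ops;

#[derive(Debug, Copy, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub enum Inclusion {
    Exclude = -1,
    Unknown = 0,
    Include = 1,
}

// Kleene three-valued logic: "and" is the minimum of the two values,
// "or" is the maximum, "not" swaps Include and Exclude.
impl ops::BitAnd for Inclusion {
    type Output = Inclusion;
    fn bitand(self, rhs: Inclusion) -> Inclusion {
        self.min(rhs)
    }
}

impl ops::BitOr for Inclusion {
    type Output = Inclusion;
    fn bitor(self, rhs: Inclusion) -> Inclusion {
        self.max(rhs)
    }
}

impl ops::Not for Inclusion {
    type Output = Inclusion;
    fn not(self) -> Inclusion {
        match self {
            Inclusion::Include => Inclusion::Exclude,
            Inclusion::Unknown => Inclusion::Unknown,
            Inclusion::Exclude => Inclusion::Include,
        }
    }
}
```

Note how Unknown "absorbs" a definitive answer only when the other side already decides the result: Exclude & Unknown is Exclude, but Include & Unknown stays Unknown.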
Armed with our super-boolean, we can now define a Filter-trait:
pub trait Filter<Early, Late>: Send + Sync + fmt::Debug {
    fn filter_early(&self, _: &Early) -> Inclusion {
        Inclusion::Unknown
    }

    fn filter(&self, _: &Late) -> bool;
}
This trait is generic so that we can re-use it for both log-filters/log-predicates and player-filters/player-predicates: the underlying logic is the same, the only thing that changes is what the filter operates on.
Since filters can be combined using and/or/not, I implemented those by overloading the bitwise operators:
struct AndFilter<E, L>(Box<dyn Filter<E, L>>, Box<dyn Filter<E, L>>);

impl<E, L> Filter<E, L> for AndFilter<E, L> {
    fn filter_early(&self, partial_evtc: &E) -> Inclusion {
        let lhs = self.0.filter_early(partial_evtc);
        // Short circuit behaviour
        if lhs == Inclusion::Exclude {
            Inclusion::Exclude
        } else {
            lhs & self.1.filter_early(partial_evtc)
        }
    }

    fn filter(&self, log: &L) -> bool {
        self.0.filter(log) && self.1.filter(log)
    }
}

impl<E: 'static, L: 'static> ops::BitAnd<Box<dyn Filter<E, L>>> for Box<dyn Filter<E, L>> {
    type Output = Box<dyn Filter<E, L>>;

    fn bitand(self, rhs: Box<dyn Filter<E, L>>) -> Self::Output {
        Box::new(AndFilter(self, rhs))
    }
}
(OrFilter and NotFilter are left as an exercise for the reader.) The benefit of keeping Filter generic is that we get and behaviour for free, for both player-predicates and log-predicates. We can also see that the short-circuiting behaviour is implemented here.
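If you want to check your work on that exercise, here is one possible shape of NotFilter. This is a self-contained sketch with minimal stand-ins for the article's types, not the actual raidgrep code; the key point is that negating an early result must keep Unknown as Unknown, since not knowing the answer is unaffected by negation:

```rust
use std::{fmt, ops};

// Minimal stand-ins so the sketch compiles on its own.
#[derive(Debug, Copy, Clone, PartialEq, Eq)]
pub enum Inclusion {
    Include,
    Unknown,
    Exclude,
}

pub trait Filter<Early, Late>: Send + Sync + fmt::Debug {
    fn filter_early(&self, _: &Early) -> Inclusion {
        Inclusion::Unknown
    }
    fn filter(&self, _: &Late) -> bool;
}

struct NotFilter<E, L>(Box<dyn Filter<E, L>>);

// Manual Debug impl to avoid the derive's unnecessary E: Debug, L: Debug bounds.
impl<E, L> fmt::Debug for NotFilter<E, L> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "NotFilter({:?})", self.0)
    }
}

impl<E, L> Filter<E, L> for NotFilter<E, L> {
    fn filter_early(&self, early: &E) -> Inclusion {
        // Flip definite answers, but Unknown stays Unknown.
        match self.0.filter_early(early) {
            Inclusion::Include => Inclusion::Exclude,
            Inclusion::Unknown => Inclusion::Unknown,
            Inclusion::Exclude => Inclusion::Include,
        }
    }

    fn filter(&self, late: &L) -> bool {
        !self.0.filter(late)
    }
}

impl<E: 'static, L: 'static> ops::Not for Box<dyn Filter<E, L>> {
    type Output = Box<dyn Filter<E, L>>;

    fn not(self) -> Self::Output {
        Box::new(NotFilter(self))
    }
}

// A trivial filter to exercise the negation.
#[derive(Debug)]
struct Always(bool);

impl Filter<(), ()> for Always {
    fn filter(&self, _: &()) -> bool {
        self.0
    }
}
```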
Since we mostly deal in trait objects, and typing Filter<..., ...> every time is cumbersome and error-prone, I took the liberty to enable #![feature(trait_alias)] and defined two aliases:
// PartialEvtc is the "header", LogResult is the completely parsed file.
pub trait LogFilter = Filter<PartialEvtc, LogResult>;
// Agent is the player data that we can extract from the "header", Player
// is the data from the completely parsed file.
pub trait PlayerFilter = Filter<Agent, Player>;
This allows us to use Box<dyn LogFilter> and Box<dyn PlayerFilter> as well as &dyn LogFilter and &dyn PlayerFilter as shorthands.
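As a side note, trait_alias is a nightly-only feature. On stable Rust, a similar shorthand can be approximated with a subtrait plus a blanket impl. A simplified sketch with made-up stand-in types (Send/Sync bounds omitted for brevity):

```rust
use std::fmt;

// Stand-ins for the real types, for illustration only.
#[derive(Debug)]
pub struct PartialEvtc;
#[derive(Debug)]
pub struct LogResult;

pub trait Filter<Early, Late>: fmt::Debug {
    fn filter(&self, _: &Late) -> bool;
}

// Stable approximation of `trait LogFilter = Filter<PartialEvtc, LogResult>`:
// a subtrait with a blanket impl, so every matching Filter implements it
// automatically.
pub trait LogFilter: Filter<PartialEvtc, LogResult> {}
impl<T: Filter<PartialEvtc, LogResult> + ?Sized> LogFilter for T {}

// Any concrete filter can now be used behind a Box<dyn LogFilter>.
#[derive(Debug)]
struct SuccessFilter;

impl Filter<PartialEvtc, LogResult> for SuccessFilter {
    fn filter(&self, _: &LogResult) -> bool {
        true
    }
}
```

One caveat: unlike a true trait alias, Box<dyn LogFilter> and Box<dyn Filter<PartialEvtc, LogResult>> are then distinct types, so this is only an approximation.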
The key to using a PlayerFilter on logs is the function fn all(player_filter: Box<dyn PlayerFilter>) -> Box<dyn LogFilter>, which matches a log if all included players match the given player filter:
pub fn all(player_filter: Box<dyn PlayerFilter>) -> Box<dyn LogFilter> {
    /* implementation omitted */
}

pub fn any(player_filter: Box<dyn PlayerFilter>) -> Box<dyn LogFilter> {
    !all(!player_filter)
}
We get any for free by making use of De Morgan's laws and our already-implemented filter negation.
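The identity being used here, any(p) == !all(!p), can be spot-checked with ordinary iterators. This small helper is purely illustrative (not raidgrep code) and derives "any" from "all" the same way:

```rust
// De Morgan for quantifiers: "some element matches" is the same as
// "not every element fails to match".
fn any_via_all<T>(items: &[T], pred: impl Fn(&T) -> bool) -> bool {
    !items.iter().all(|item| !pred(item))
}
```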
Besides that, translating the existing filters to the new variant is very straightforward and the result can be seen in src/filters/log.rs. The new search_log looks like this:
fn search_log(entry: &DirEntry, filter: &dyn LogFilter) -> Result<Option<LogResult>> {
    let mut stream = /* ... */;
    let partial = evtclib::raw::parser::parse_partial_file(&mut stream)?;

    let early_ok = filter.filter_early(&partial);
    if early_ok == Inclusion::Exclude {
        return Ok(None);
    }

    let raw = evtclib::raw::parser::finish_parsing(partial, &mut stream)?;
    let parsed = evtclib::process(&raw).ok();
    let log = if let Some(e) = parsed {
        e
    } else {
        debug!("log file cannot be parsed: {:?}", entry.path());
        return Ok(None);
    };
    let info = extract_info(entry, &log);

    let take_log = filter.filter(&info);
    if take_log {
        Ok(Some(info))
    } else {
        Ok(None)
    }
}
No more hardcoded filters, and yet we retain the ability to bail early before wasting processing time! With our filtering trees in place, it is time to expose that flexibility to the user by allowing them to use RFL expressions.
Parsing RFL
Even though we have dynamic filters now, we didn't gain a lot. Their construction is still hardwired to the command line arguments, so there's no way to use their full power and expressiveness. This is about to change: Our goal is to write a function which takes in a string (&str), usually supplied by the user, and transforms it into our Filter (or rather, a Box<dyn LogFilter>).
The goal of the parsing however should not just be to work for well-formed input, it should also allow us to give good error messages in case the syntax of the input is wrong. We want to point the user to where they made a mistake instead of just saying "invalid filter" and moving on.
Before we dive into building a parser, though, there are some questions to ask and answer:
First of all, what should our parsing code produce? It is common that the parser outputs some sort of abstract syntax tree (AST), which is a representation of the input as a structured object. This allows the computer to do further analysis, such as optimizations or type checking, before producing the output value. For raidgrep, I have decided to not go through that intermediate step: Our Box<dyn LogFilter> is pretty much already like an AST, and we don't really need a lot of optimizations. Furthermore, it would lead to some duplication of information, as we'd probably end up having a lot of AST nodes that represent things we can already represent with Filter.
Secondly, how fine grained should the parser be? One extreme would be to parse predicates as some sort of (name, argument) value, and then decide at a later time (such as the filter runtime) whether we even have a predicate called name. This would be similar to how programming languages usually parse function calls: foo() and bar() usually end up going through the same parser routine. And it's also needed, considering that the user can usually define their own functions, and the parser author cannot predict all of the function names that people will ever use. Although that also means that the parser would accept things like -success "foobar", even though -success doesn't take any parameter — leading to a "runtime error".
For RFL, I've decided to keep the parser very strict: It knows about all available predicates and whether they take any arguments, as they are hard-coded in the grammar. While this prevents us from having dynamic predicates (such as user-defined ones), it makes the error handling a lot easier. Invalid input like all(player: all(player: ...)) doesn't need a separate checking step, because we can encode the difference between player-predicates and log-predicates directly in the parser.
This also has the benefit that we never try to use something that we parsed as a player-predicate as a log filter, or vice versa, leveraging Rust's type system here as well.
The final question is how to write the parser code. I did not want to write the parser by hand, so I opted for using LALRPOP instead. I've used LALRPOP before (quite some time ago) and was happy with how it worked, so it seemed like a good fit for this small language. It also provides functionality to reduce code duplication — for example, we can use its macros functionality to avoid having to define grammars for and/or/not for both player and log predicates:
Disjunction<T>: T = {
    <a:Disjunction<T>> "or" <b:Conjunction<T>> => a | b,
    Conjunction<T>,
}

Conjunction<T>: T = {
    <a:Conjunction<T>> "and"? <b:Negation<T>> => a & b,
    Negation<T>,
}

Negation<T>: T = {
    "not" <Negation<T>> => ! <>,
    "!" <Negation<T>> => ! <>,
    T,
}
On the left side is what we parse, and on the right side is what we output. You can see that we're building the final Filter by combining the inner filters with our overloaded operators, and since Filter itself (and the operator overloads) are generic, this works for both PlayerFilter and LogFilter. It's all coming together!
We can now plug concrete types into the T placeholder in the snippet above to get our PlayerFilter (the LogFilter works similarly):
PlayerFilter: Box<dyn filters::player::PlayerFilter> = {
    Disjunction<PlayerPredicate>,
}

PlayerPredicate: Box<dyn filters::player::PlayerFilter> = {
    "-character" <Regex> => filters::player::NameFilter::new(SearchField::Character, <>),
    "-account" <Regex> => filters::player::NameFilter::new(SearchField::Account, <>),
    "-name" <Regex> =>
        filters::player::NameFilter::new(SearchField::Account, <>.clone())
            | filters::player::NameFilter::new(SearchField::Character, <>),
    "(" <PlayerFilter> ")",
}
With a small helper function, we're almost good to go:
pub fn parse_logfilter<'a>(
    input: &'a str,
) -> Result<Box<dyn filters::log::LogFilter>, ParseError<usize, Token<'a>, FError>> {
    grammar::LogFilterParser::new().parse(input)
}
Hooking it up
We now have a way to dynamically combine filters and to parse a given string into such a dynamic filter. The only thing that's missing now is to remove the old command line arguments and use our new parser.
This part is not yet as nice as it could be. The reason for that is that the shell already does some argument parsing and splitting, so we might expect something like raidgrep -player "foo bar" to work. However, our parser (currently) also does its own splitting and tokenization, so we have to pass the whole argument as one big string: raidgrep '-player "foo bar"'. Not optimal, but changing that would probably require a custom lexer that can deal with pre-inserted breaks, or some other "hacky" workarounds.
Another problem is that we still have some flags which are not part of the filter parsing, and as such, are not handled by our parser. For example, --debug or --dir should continue to work, which means we'll stick with structopt for argument parsing for now.
That, however, means that structopt sees -player as another parameter that it tries to parse, and it outputs a nice error message:
error: Found argument '-p' which wasn't expected, or isn't valid in this context
In order to avoid that, we have to prefix the filter expression with --. Maybe that is an argument against having each predicate start with a -; a different character could be used instead. Admittedly, that is something I didn't think of beforehand, but it's not unfixable.
Aftermath
raidgrep now has a powerful filter language that allows users to express more than the old command-line-flag-based approach could. I'm looking to expand the list of predicates even further, now that there's a nice and clean way to combine them. I'll also see whether the filter language should be changed slightly (like removing the - prefix) to make it easier to use, or whether the glue code between structopt and the parser can be improved to overcome some of the problems. Furthermore, I hope to provide better error messages, maybe even with suggestions about what the error was (such as "player predicate in place of a log predicate"), instead of the automatically generated "Unexpected ..." from LALRPOP, as well as readline-powered completion in the REPL.
As for working with LALRPOP, it was quite nice. It's easy to get started with and yet quite powerful, for example the macro system is great to use. However, the tutorial/documentation is lacking a bit, and a lot of the functionality has to be learned by reading other examples, such as defining fallible actions with =>? or capturing the location of the token with @L (for better error reporting). There also seem to be some small bugs regarding the importing of items from parents of super, as LALRPOP adds some inner modules as well. While it tries to insert more super as needed, the resulting code failed to compile — so I resorted to re-importing the items in the main mod.rs, and then importing them from there in the grammar.
The RFL functionality is currently available on the new-filters branch, and I hope to merge it to master soon, after polishing some of the rough edges!