Importer development guide

Introduction to import process & data flow

When a set of files is imported into the database using the pepys-import command-line interface, the following process is carried out:

  1. The command-line interface connects to the database (using the details in the Configuration file) and loads all of the available importers (both those provided with pepys-import, and any local importers in the folder specified in the configuration file).

  2. The File Processor finds all the files in the specified path, and processes these files in turn.

  3. The File Processor looks through all the available importers and asks each importer in turn whether it can process the file (based on the file extension, name, first line and contents, in turn). It keeps a list of all importers that can work on that specific file (as multiple importers can import a single file).

  4. Each importer in the list is run separately on the file, and this is repeated for each file.

  5. When an importer wants to add data to the database, it adds measurement objects (such as State or Contact) to a list of measurements that are stored temporarily (i.e. not inserted directly into the database).

  6. If all importers on the file complete with no errors, then the measurements are inserted into the database. Otherwise, an output file is written with the errors and no database insertion takes place.

The vast majority of this process is managed by the code in pepys-import itself: the only part that an importer author needs to write themselves is the actual parsing code (step 4), which creates measurements in the temporary storage area (step 5).

The Importer class

Various examples of importer code are shown in the documentation below. A relatively simple importer is the ETracImporter (importers/e_trac_importer.py), and it may be useful to read its source code alongside the following documentation.

A pepys-import importer is implemented as a class which inherits from the Importer base class. This class must implement various methods, which are used to find out which files this importer can process and then to actually do the importing. These are defined, with comments, in the base Importer class (pepys_import/file/importer.py). A summary of the methods which must be implemented is below:

  • __init__ - Initialises the class. Must call the super().__init__ method to set various key class parameters.

  • can_load_this_type - Checks the file extension (‘suffix’) provided and returns True if it can import that type of file, otherwise returns False.

  • can_load_this_filename - Checks the filename and returns True if it can import a file with that filename, otherwise returns False.

  • can_load_this_header - Checks the first line of the file and returns True if it can import a file with that header, otherwise returns False.

  • can_load_this_file - Checks the entire file contents and returns True if it can import that file, otherwise returns False.

  • _load_this_line - Imports data from a single line in a file. Either this method or _load_this_file must be overridden.

  • _load_this_file - Imports data from the entire file.

Initialisation

The __init__ method must be defined, taking the arguments listed below, and the super().__init__() method must be called.

  • name - Full name for the parser. Used where space isn’t a problem, so can be reasonably long.

  • short_name - Short name for the parser. Used in error lists and other places, so should be relatively short.

  • validation_level - Defines the level of validation required for this parser. Must be one of pepys_import.core.validators.constants.NONE_LEVEL, pepys_import.core.validators.constants.BASIC_LEVEL, or pepys_import.core.validators.constants.ENHANCED_LEVEL.

  • default_privacy - (Optional) Defines the hardcoded privacy level of all data coming from this importer. Defaults to None, which means the user is asked for the privacy.

A complete example is:

def __init__(
    self,
    name="Example File Format Importer",
    validation_level=constants.ENHANCED_LEVEL,
    short_name="Example Importer",
):
    super().__init__(name, validation_level, short_name)

Parser check methods

The File Processor checks each of the can_load_this_x methods to determine whether the importer can load a specific file. They are checked in the following order:

  • Type

  • Filename

  • Header

  • File

If it is impossible to tell whether a file can be loaded based on the information provided, then these methods should just return True. For example, the GPXImporter can import files with a .gpx extension, so it has some logic in the can_load_this_type method, but just returns True for all other checking methods.
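A sketch following that description is shown below (the GPXImporter’s actual implementation may differ slightly in its details):

def can_load_this_type(self, suffix):
    # Only interested in GPX files, so check the extension
    return suffix.upper() == ".GPX"

def can_load_this_filename(self, filename):
    # Nothing useful can be determined from the filename alone
    return True

def can_load_this_header(self, header):
    return True

def can_load_this_file(self, file_contents):
    return True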

Loading methods

Many importers operate on a line-by-line basis (for example, a CSV-style importer which reads each line separately) but some parsers need access to the whole file at one time (for example, an XML-style importer which needs to parse the whole document in one operation). To accommodate these two styles of parsing, there are two methods. One, and only one, of these methods should be implemented by importer classes.

_load_this_line() is passed a single line and should process the line and create any measurement objects that are needed. Code in the implementation of the _load_this_file() method in the base Importer class is responsible for iterating over the lines in the file and calling _load_this_line() once for each line. Alternatively, _load_this_file() itself can be implemented in the subclass, and this is passed an entire file which it can process in any way it wants.

It is possible to process files where the data for one measurement instance is split across multiple lines by implementing _load_this_line() - data should be stored in instance variables when it appears, and then once all data is available a measurement object should be created. For an example of this see the NMEAImporter in importers/nmea_importer.py.
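A rough sketch of this multi-line pattern is shown below. The file format, field names and the parse_timestamp helper are all hypothetical, the instance variables would need to be initialised in __init__, and the exact method signature should be checked against the base Importer class (pepys_import/file/importer.py):

def _load_this_line(self, data_store, line_number, line, datafile, change_id):
    # Hypothetical two-line format: a TIME line followed by a POS line
    if line.text.startswith("TIME"):
        # Store the timestamp until the matching position line arrives
        self.last_timestamp = self.parse_timestamp(line.text)
        return

    if line.text.startswith("POS") and self.last_timestamp is not None:
        # All of the data for this measurement is now available, so a State
        # can be created here (see 'Creating measurement objects' below),
        # after which the stored timestamp should be reset
        ...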

Tokens and highlighting

A typical implementation of _load_this_line() takes the line provided, splits it by some delimiter and then processes each separate token in the line. You might expect to use standard Python functions like split to do this, but pepys-import also provides a more fully-featured way of dealing with tokens and lines. This allows the export of highlighted HTML files showing which parts of the file were used to extract each individual field in the created measurements, and tracks the extractions in the database to help understand data provenance.

This token highlighting and database recording does come with a performance cost, however. For high-volume file types that are tightly structured, with little room for misinterpretation, the overhead may not be justified. You can configure the level of extraction that will take place by calling self.set_highlighting_level() in the __init__ method. Three different values can be passed to this function: HighlightLevel.NONE turns off all extraction and highlighting, HighlightLevel.HTML records extractions to an HTML file but not to the database, and HighlightLevel.DATABASE records to both an HTML file and the database. The default is HighlightLevel.HTML.

Similarly, it may be worth capturing full extraction data in the early stages of developing or maintaining the parser for a new file format, with the level of extraction recording reduced as the parser matures.
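For example (a minimal sketch; the module from which HighlightLevel is imported should be checked in the pepys-import source):

def __init__(
    self,
    name="Example File Format Importer",
    validation_level=constants.BASIC_LEVEL,
    short_name="Example Importer",
):
    super().__init__(name, validation_level, short_name)
    # Turn off all extraction recording and highlighting for this
    # high-volume, tightly-structured format
    self.set_highlighting_level(HighlightLevel.NONE)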

The line parameter passed to _load_this_line() is actually a Line object, from which a list of tokens can be produced using the tokens() method. This takes a regular expression specifying which parts of the line should be extracted as tokens (see below).

The text of each Token/Line object can be accessed as Token.text, so the following two code examples are equivalent:

# Basic Python
# (default separator is any whitespace)
tokens = line.split()
print(tokens[1])

# Pepys-import methods
tokens = line.tokens(line.WHITESPACE_TOKENISER)
print(tokens[1].text)

The argument to tokens() is a regular expression giving the parts of the line to extract as tokens (not the parts of the line to use as a separator). Various useful regular expressions are defined on the Line class, including WHITESPACE_TOKENISER and CSV_TOKENISER. As an example, the WHITESPACE_TOKENISER is \S+, which matches one or more non-whitespace characters: this is exactly what defines a token when the separator is whitespace.

Tokens and Lines also have a record() method, which is used to record that a particular extraction was performed using that token. This will then be represented as a highlighted region in the output HTML. The record method takes arguments of the name of the parser (should always be set to self.name), the name of the field that was extracted and the parsed value. It also takes an optional units argument, for use when units cannot be explicitly specified (see the Setting measurement object properties with units section). For example:

comp_name_token = tokens[18]
vessel_name = comp_name_token.text.upper()
comp_name_token.record(self.name, "vessel name", vessel_name)

The combine_tokens() function can be used to combine two disparate tokens into one new token, so that one extraction can be recorded from disparate data in the file. For example:

combine_tokens(long_degrees_token, lat_degrees_token).record(
    self.name, "location", state.location, "decimal degrees")

Once token extractions have been recorded using the record method, the recorded information must be linked to the appropriate measurement object (State/Contact/Comment etc) and prepared for saving to the database. This can be done using the Datafile.flush_extracted_tokens method, which should be called once all the data has been loaded for a specific measurement object. Usually this will be at the end of the _load_this_line() method, or at the end of a loop inside the _load_this_file() method - but for complex importers it may be elsewhere.
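For example, at the end of a line-by-line importer, once the measurement object has been created and all of its fields recorded (a minimal sketch):

# All extractions for this measurement have been recorded, so link them
# to the measurement object and prepare them for saving to the database
datafile.flush_extracted_tokens()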

Creating measurement objects

Both _load_this_line() and _load_this_file() are passed a Datafile object by the File Processor. This can be used to create measurement objects using the create_state(), create_contact() or create_comment() methods. These methods require arguments for the data_store (provided by the File Processor), platform, sensor, timestamp and the parser_name (always self.short_name).

The platform can be obtained from data_store.get_platform and the sensor from platform.get_sensor. The timestamp must be parsed from the file.

Where the value for a field is missing, Pepys will use command-line interaction to get this data from the current user. It will also ensure that supporting metadata is determined (such as the nationality for a new platform).

In most circumstances, a file will only contain information on the name of the platform (e.g. DOLPHIN) and not the nationality or identifier. As the combination of all three of these is required to uniquely identify a platform, we have to ‘resolve’ it via user interaction to establish which platform this name refers to. Doing this repeatedly for every line of a file would require a lot of user interaction, and a particular platform name in a file is likely to refer to the same platform throughout the file. Therefore, we can use the self.get_cached_platform method instead of the data_store.get_platform method. This tries to look in a cache which links platform names from the file to actual Platform entries, and if it can’t find the name in the cache then it resolves it as usual. The cache is reset every time a new file is processed - meaning the user should only have to interact to identify the platform once per platform, per file.

Similarly, a self.get_cached_sensor method exists, which can be used to cache the sensor associated with a platform, to reduce the need for user interaction to select the sensor for each row of the data. This should be used for files where there is only one sensor per platform in the file - for example, when the sensor is a location sensor used to produce the location data which are being imported into State objects, and is not listed in the file. However, this should not be used for files where the sensors are specifically listed in the file - for example, those that import Contact objects.

For example:

# Get the platform given the vessel_name (a variable parsed from the file earlier in the code)
platform = self.get_cached_platform(
    data_store, platform_name=vessel_name, change_id=change_id
)

sensor = self.get_cached_sensor(
    data_store=data_store,
    sensor_name=None,
    sensor_type=None,
    platform_id=platform.platform_id,
    change_id=change_id,
)

# Create the actual state object
state = datafile.create_state(data_store, platform, sensor, timestamp, self.short_name)
# Set the privacy
state.privacy = privacy.privacy_id

# Now we can set state properties
state.location = ...

As metadata (such as a platform or sensor) is created, it is added directly to the database, to streamline re-loading the file if parsing later fails.

However, once a measurement (state, contact, or comment) is created using the relevant datafile.create_x method, it is added to the list of pending measurements stored in the File Processor - therefore nothing else needs to be done to ‘commit’ the measurement, and the function body can finish immediately after assigning all of the measurement properties. Once all of the relevant parsers have run on a file and the data validation tests pass, the data is submitted to the database.

Setting measurement object properties with units

All properties on measurement objects (states, contacts or comments) can be set in the standard Python way as state.speed = <variable>. However, properties that have units associated with them (for example, state.speed, state.elevation, contact.bearing, contact.orientation) must be set to a value with associated units. These are provided through the pint Python package, which provides a unit_registry (available as pepys_import.core.formats.unit_registry) containing objects for all units. A value can be given units by multiplying it by a unit object, for example:

# 5 knots
speed = 5 * unit_registry.knot

# 157 degrees
bearing = 157 * unit_registry.degree

The resulting value is a Quantity (an object type provided by pint) which stores both the numerical value along with the unit, and provides helpful methods for converting to different units.
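For example, a Quantity can be converted to other units using pint’s to() method:

# 5 knots, as a pint Quantity
speed = 5 * unit_registry.knot

# Convert to metres per second - the result is another Quantity
speed_in_mps = speed.to(unit_registry.metre / unit_registry.second)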

In the context of an importer, code like the following may be used:

state.elevation = float(altitude_token.text) * unit_registry.metre

A more robust approach is given by using the conversion functions provided in pepys_import/utils/unit_utils.py, such as convert_absolute_angle() and convert_speed(). These take a string value and a unit, and attempt to convert the string value to a float (storing an error if this fails) and then set the relevant unit (either the unit passed to the function, or the default unit for values that are only ever provided in a single unit, such as angles).

Setting location properties

Some of the most common errors in this sort of data processing are errors in storing location data - for example, mixing up latitude and longitude, or errors parsing values given in ‘degrees, minutes, seconds’ (DMS) format. Therefore, pepys-import has a verbose but clear way of setting location values using the Location class.

The Location class stores a latitude and longitude in decimal degrees, but allows setting these values in both decimal degrees and DMS. The class should be initialised as:

location = Location(errors=self.errors, error_type=self.error_type)

as this allows any errors from parsing locations to be reported (see the Storing errors section). Then methods like set_latitude_decimal_degrees() or set_longitude_dms() can be used to set values based on values in the input file. Finally, the state.location property should be set to the instance of the Location class. For example:

location = Location(errors=self.errors, error_type=self.error_type)
location.set_latitude_decimal_degrees(lat_degrees_token.text)
location.set_longitude_decimal_degrees(long_degrees_token.text)

state.location = location

Storing errors

Various errors can be found while parsing a document - for example, a missing field, or an invalid numerical value. These need to be stored so that they can be reported to the user - and used to stop the actual import to the database occurring if there are errors.

Errors should be stored in the self.errors list, where each item in the list is a dictionary where the key is the ‘error type’ and the value is a string describing the error. The ‘error type’ is automatically defined by the class initialisation to be a string of the form "{self.short_name} - Parsing error on {basename}", and this is stored in self.error_type, so errors can be added as in the example below:

self.errors.append(
    {
        self.error_type: f"Line {line_number}. Error in Date format '{time_token.text}'."
        "Should be HH:mm:ss"
    }
)

Parser development environment and testing

Ensure development environment is set up correctly

The deployed version of pepys-import contains all the development dependencies, except for a local install of PostgreSQL. To check everything is set up correctly, run all the tests _excluding_ those that require PostgreSQL by following these steps:

  1. Open a Windows Command Prompt

  2. Change to the directory containing the pepys-import installation

  3. Run cd bin to change to the bin directory

  4. Run set_paths.bat to set up the relevant paths to Python and its dependencies

  5. Run cd .. to change back to the main directory

  6. Run python -m pytest tests/ -m "not postgres" to run all tests excluding the PostgreSQL tests

Start developing parser

Obtain or create an example file in the new format - ideally one that covers as many of the variations in the format as possible.

Start writing a new Importer subclass using the instructions above. Name the importer file format_importer.py (for example, gpx_importer.py) and place it in a directory of your choice (you will set the configuration to allow pepys-import to pick up this importer in the How to deploy a parser section below).

Create test for new parser

Create a new file called test_load_FORMAT.py. This can be located anywhere you want, but a sensible place to put it would be in the same folder as the new importer.

Copy into this file the basic test structure from one of the existing test files (for example tests/test_load_gpx.py) and edit to load the sample data file (change DATA_PATH) and to register the new Importer subclass (the processor.register_importer line). Also change the definition of self.store to be a DataStore backed by a local SQLite database file, for example:

self.store = DataStore("", "", "", 0, "test.db", db_type="sqlite")
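A minimal sketch of such a test file is shown below. The class, importer and data file names are hypothetical, and the FileProcessor constructor arguments and process() call should be copied from an existing test such as tests/test_load_gpx.py:

import unittest

from pepys_import.core.store.data_store import DataStore
from pepys_import.file.file_processor import FileProcessor

from example_importer import ExampleImporter  # hypothetical importer module

DATA_PATH = "example_data.txt"  # hypothetical sample data file


class TestLoadExample(unittest.TestCase):
    def setUp(self):
        # A DataStore backed by a local SQLite database file, as described above
        self.store = DataStore("", "", "", 0, "test.db", db_type="sqlite")
        self.store.initialise()

    def test_process_example_data(self):
        processor = FileProcessor(archive=False)
        processor.register_importer(ExampleImporter())

        # Process the sample file (check the exact arguments against an
        # existing test)
        processor.process(DATA_PATH, self.store, False)

        # Assertions on the number of States, Platforms and Datafiles
        # imported would go here once the importer is working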

Comment out the actual assert statements in the test. You can then use this test while developing the new importer, by running it as:

python -m pytest test_load_blah.py -s

This will show all stdout output, so you can see debugging print statements that you may have used in the importer. If you want to debug at a particular point in the importer then add a line containing breakpoint() at the relevant location, and run pytest with the --pdb flag.
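For example:

python -m pytest test_load_blah.py -s --pdb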

The data will be imported into a SQLite database called test.db in the same folder as the Importer Python file, and this can be viewed using a tool such as SQLite Studio.

Once the importer is working, uncomment the assert statements and update the tests for the number of states, platforms and datafiles that should be added, plus add some tests for specific values that should be present in the imported data.

How to deploy a parser

Set the parsers configuration option in the [local] section of the pepys-import Configuration file to point to a directory to hold custom local importers. Place the new importer in this folder: it should now be picked up by pepys-import.
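For example, the [local] section of the Configuration file might contain something like the following (the path is purely illustrative):

[local]
parsers = C:\PepysCustomImporters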

To test this, run the pepys-import import CLI with the --db test.db option, to do an import to a local SQLite database.

Skipping a line in a file

The return statement in _load_this_line() returns from the function which is processing that specific line, and then moves on to the next line - therefore it is equivalent to the continue statement in a standard for loop.

If you want to skip a specific line then a line_number argument is passed to _load_this_line(), so this can be done by:

# Skip the header line (can do this with any if statement)
if line_number == 1:
    return