Configuration
1. PIISA configurations
All steps in the processing chain in a PIISA framework are meant to be configurable. Such configuration can be provided by three means:
- command-line tools can provide modifiers as arguments
- object constructors can also accept some arguments as modifiers
- configuration files can integrate most of the configuration capabilities
Those configuration files can be written in either YAML or JSON, and have three sources:
- packages contain their own local configurations as resource files; those act as default configurations
- package-level configuration files can be supplied at object construction time
- global configuration files (containing aggregated information for several packages) can also be supplied at object construction time
2. Configuration formats
The base syntax for those files is either YAML or JSON. The content of a configuration file is a dictionary, whose fields depend on the specific configuration.
There are, however, two standardized fields:
format
: this is a compulsory field, whose value is a string that indicates the type of configuration held by the dictionary (i.e. the configuration section; it is typically a package + module identifier).
name
: a string giving a name to this configuration. This is optional; if it is not present and the configuration is loaded from a file, the framework will automatically use the filename as configuration name.
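As an illustration of how these two fields might be handled, the following sketch loads a JSON or YAML file, checks that format is present, and fills in name from the file name when it is missing. The function load_config_file is hypothetical and is not part of any PIISA package. The sketch also assumes that the fallback name keeps the file extension, which is not specified here. YAML support relies on the third-party PyYAML package.

```python
import json
from pathlib import Path


def load_config_file(path):
    """Load a JSON or YAML configuration file and check its standard fields.

    Hypothetical helper, for illustration only.
    """
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if path.suffix.lower() in (".yaml", ".yml"):
        import yaml  # PyYAML (third-party), needed only for YAML files
        config = yaml.safe_load(text)
    else:
        config = json.loads(text)

    if not isinstance(config, dict):
        raise ValueError(f"{path}: a configuration must be a dictionary")
    if not isinstance(config.get("format"), str):
        raise ValueError(f"{path}: missing compulsory 'format' field")

    # Assumption: the fallback name is the file name, extension included
    config.setdefault("name", path.name)
    return config
```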
2.1. Package level
A package-level configuration file has the structure of a dictionary. The format field for such a file has the general shape piisa:config:<module>:<section>:v1, where
- piisa:config is a fixed prefix
- <module> identifies which module this configuration is for
- <section> identifies the section in the module that is to be configured
- v1 is the format version string.
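The structure of the format string can be made concrete with a small parser. This is only a sketch: the names ConfigFormat and parse_format are hypothetical, and it assumes that neither the module nor the section contains a colon.

```python
from typing import NamedTuple

CONFIG_PREFIX = "piisa:config"


class ConfigFormat(NamedTuple):
    """Components of a package-level format string (hypothetical type)."""
    module: str
    section: str
    version: str


def parse_format(tag: str) -> ConfigFormat:
    """Split 'piisa:config:<module>:<section>:<version>' into its parts."""
    parts = tag.split(":")
    # Expect exactly: piisa, config, module, section, version
    if len(parts) != 5 or ":".join(parts[:2]) != CONFIG_PREFIX:
        raise ValueError(f"not a package-level format tag: {tag!r}")
    return ConfigFormat(*parts[2:])
```

For instance, parse_format("piisa:config:mymodule:mysection:v1") yields the module mymodule, the section mysection and the version v1 (both names are made up for the example).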
2.2. Full file
A global configuration file simply contains a config field with a list of package configurations, i.e. it is a list of dictionaries, each one with its format key. It carries configuration for all (or many) PIISA modules from different packages. This makes it possible to encapsulate in a single file a configuration for the whole PIISA toolchain.
A full file also contains a global format key, whose value is piisa:config:full:v1.
Note that in a full configuration it is possible to have more than one configuration section with the same format tag; they will be combined (later fields with the same name will override/update previous fields).
The general shape is thus as follows:
{
  "format": "piisa:config:full:v1",
  "config": [
    {
      "format": "piisa:config:<module1>:<name1>:v1",
      ...config for module1/name1
    },
    {
      "format": "piisa:config:<module1>:<name2>:v1",
      ...config for module1/name2
    },
    {
      "format": "piisa:config:<module2>:<name>:v1",
      ...config for module2/name
    }
  ]
}
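The rule for combining sections that share a format tag can be sketched as follows. This is not the framework's actual code: combine_sections is a hypothetical name, and the sketch assumes a shallow merge in which a later top-level field replaces an earlier one with the same name.

```python
FULL_FORMAT = "piisa:config:full:v1"


def combine_sections(full_config):
    """Group the sections of a full configuration by their format tag.

    Sections with the same tag are merged in order of appearance, so
    fields from later sections override those from earlier ones.
    """
    if full_config.get("format") != FULL_FORMAT:
        raise ValueError("not a full configuration")
    combined = {}
    for section in full_config.get("config", []):
        tag = section["format"]
        # Assumption: shallow update, nested values are replaced whole
        combined.setdefault(tag, {}).update(section)
    return combined
```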
3. Default configurations
Some examples of provided default configuration files are:
- The loader.json file in the pii-preprocess package maps file extensions to file types, and for each type defines a loader to read that document type
- A placeholder.json file in the pii-transform package defines the dummy substitution values for the placeholder policy
- The pii-extract-plg-transformers plugin contains a configuration file to define the models to load and the mapping of model entities to PIISA entities
- The pii-extract-plg-presidio plugin contains a configuration file to map Presidio entities to PIISA entities
4. Custom configurations
The default files can be replaced at execution time by custom configurations. Additionally, other aspects of the processing flow can also be modified with additional configurations:
- A task configuration file can be used to define additional PII detection tasks, perhaps coming from custom external code (defining the external tasks by specifying their class paths). There is a small example available.
- A plugins.json file can be used to define the plugins to load and to provide custom arguments to the loader (by default the PIISA system loads all the plugins it can detect)
5. Dynamic configurations
When using the APIs provided by the PIISA packages, many objects take a config argument in the constructor. This argument can be a single config element or a list of elements (which will be merged). Each config element, in turn, can be
* a path to a configuration file
* a dictionary containing a live configuration object, created on the fly
This live, dynamic configuration is a dictionary, indexed by PIISA section:
* The key is the PIISA module package the configuration is for, as a <module>:<name>:v1 string (i.e. the same string as the format field in configuration files, without the piisa:config: prefix)
* The value contains a standard configuration for that PIISA module
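To show how file-style sections relate to these dynamic keys, the sketch below converts a list of section dictionaries into a dictionary indexed by the shortened key. The helper to_dynamic_config is hypothetical. Dropping the format field from each value is also an assumption, made because the key already carries that information.

```python
PREFIX = "piisa:config:"


def to_dynamic_config(sections):
    """Index file-style sections by '<module>:<name>:v1' keys."""
    dynamic = {}
    for section in sections:
        tag = section["format"]
        if not tag.startswith(PREFIX):
            raise ValueError(f"unexpected format tag: {tag!r}")
        key = tag[len(PREFIX):]  # strip the fixed 'piisa:config:' prefix
        # Assumption: the value holds every field except 'format'
        body = {k: v for k, v in section.items() if k != "format"}
        dynamic.setdefault(key, {}).update(body)
    return dynamic
```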
This is an example of such a dynamic configuration, in this case containing a custom configuration for the pii-transform package:
from pii_data.types import PiiEnum
from pii_transform.defs import FMT_CONFIG_TRANSFORM

# Dynamic configuration: the key is the section, the value its configuration
config = {
    FMT_CONFIG_TRANSFORM: {
        "default": "annotate",
        "policy": {
            PiiEnum.CREDIT_CARD.name: "synthetic",
            PiiEnum.GOV_ID.name: "label"
        }
    }
}
If used in an object constructor, a custom config will be merged with the default config for the module (if it exists), overriding any matching fields and adding the new ones.
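The merge behaviour described above can be approximated with a short function. This is a sketch under assumptions rather than the PIISA implementation: merge_config is a hypothetical name, and the merge is shallow within each section, so a nested value such as a policy dictionary is replaced as a whole instead of being merged key by key.

```python
import copy


def merge_config(default, custom):
    """Overlay a custom dynamic configuration on top of the defaults.

    Both arguments are dictionaries indexed by '<module>:<name>:v1'.
    Matching fields are overridden and new fields are added.
    """
    merged = copy.deepcopy(default)  # leave the default config untouched
    for section, fields in custom.items():
        # Assumption: shallow update of each section's top-level fields
        merged.setdefault(section, {}).update(fields)
    return merged
```

A section that appears only in the custom configuration is simply added to the result, which matches the case where no default config exists for the module.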