tods.data_processing package

Submodules

tods.data_processing.CategoricalToBinary module

class tods.data_processing.CategoricalToBinary.CategoricalToBinary(*args, **kwds)

Bases: d3m.primitive_interfaces.transformer.TransformerPrimitiveBase

A primitive which converts the distinct values present in a column to a binary representation, with each distinct value getting its own column.

metadata

Primitive’s metadata. Available as a class attribute.

logger

Primitive’s logger. Available as a class attribute.

hyperparams

Hyperparams passed to the constructor.

random_seed

Random seed passed to the constructor.

docker_containers

A dict mapping Docker image keys from primitive’s metadata to (named) tuples containing container’s address under which the container is accessible by the primitive, and a dict mapping exposed ports to ports on that address.

volumes

A dict mapping volume keys from primitive’s metadata to file and directory paths where downloaded and extracted files are available to the primitive.

temporary_directory

An absolute path to a temporary directory a primitive can use to store any files for the duration of the current pipeline run phase. Directory is automatically cleaned up after the current pipeline run phase finishes.

Parameters
  • use_columns (Set) – A set of column indices to force primitive to operate on. If any specified column cannot be parsed, it is skipped.

  • exclude_columns (Set) – A set of column indices to not operate on. Applicable only if “use_columns” is not provided.

  • return_result (Enumeration) – Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.

  • use_semantic_types (Bool) – Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe.

  • add_index_columns (Bool) – Also include primary index columns if input data has them. Applicable only if “return_result” is set to “new”.

  • error_on_no_input (Bool) – Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking, set this to False.

  • return_semantic_type (Enumeration[str]) – Decides what semantic type to attach to the generated attributes.

produce(*, inputs: d3m.container.pandas.DataFrame, timeout: float = None, iterations: int = None) → d3m.primitive_interfaces.base.CallResult[d3m.container.pandas.DataFrame]
Parameters
  • inputs – Container DataFrame. The inputs of shape [num_inputs, …].

  • timeout – A maximum time this primitive should take to produce outputs during this method call, in seconds.

  • iterations – How many internal iterations the primitive should do.

Returns

Container DataFrame with a binary version of each selected column added: a form of one-hot encoding in which each distinct value gets its own column named “column name_category value”, for all the columns passed in the list while building the pipeline. The outputs of shape [num_inputs, …] are wrapped inside CallResult.

Return type

d3m.primitive_interfaces.base.CallResult[d3m.container.pandas.DataFrame]
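The column naming described above can be illustrated with plain pandas. This is only a sketch of the idea, not the primitive's implementation: the helper name categorical_to_binary and the sample data are hypothetical, and pandas.get_dummies stands in for whatever the primitive does internally.

```python
import pandas as pd


def categorical_to_binary(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Append one 0/1 column per distinct value of each selected column."""
    # With prefix_sep="_" the new columns are named "<column name>_<category value>".
    dummies = pd.get_dummies(df[columns], columns=columns, prefix_sep="_", dtype=int)
    # Keep the original columns and append the binary ones.
    return pd.concat([df, dummies], axis=1)


# Hypothetical sample data.
frame = pd.DataFrame({"value": [1.0, 2.5, 3.0], "status": ["ok", "fail", "ok"]})
encoded = categorical_to_binary(frame, ["status"])
print(encoded.columns.tolist())
```

Here the original columns are kept and the binary columns are appended, which roughly corresponds to appending results rather than replacing the originals.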

tods.data_processing.ColumnFilter module

class tods.data_processing.ColumnFilter.ColumnFilter(*args, **kwds)

Bases: d3m.primitive_interfaces.transformer.TransformerPrimitiveBase

A primitive that filters out columns of the wrong shape in a DataFrame (specifically, columns generated by some feature analysis primitives).

metadata

Primitive’s metadata. Available as a class attribute.

logger

Primitive’s logger. Available as a class attribute.

hyperparams

Hyperparams passed to the constructor.

random_seed

Random seed passed to the constructor.

docker_containers

A dict mapping Docker image keys from primitive’s metadata to (named) tuples containing container’s address under which the container is accessible by the primitive, and a dict mapping exposed ports to ports on that address.

volumes

A dict mapping volume keys from primitive’s metadata to file and directory paths where downloaded and extracted files are available to the primitive.

temporary_directory

An absolute path to a temporary directory a primitive can use to store any files for the duration of the current pipeline run phase. Directory is automatically cleaned up after the current pipeline run phase finishes.

produce(*, inputs: d3m.container.pandas.DataFrame, timeout: float = None, iterations: int = None) → d3m.primitive_interfaces.base.CallResult[d3m.container.pandas.DataFrame]

Process the testing data.

Parameters
  • inputs – Container DataFrame. The inputs of shape [num_inputs, …].

  • timeout – A maximum time this primitive should take to produce outputs during this method call, in seconds.

  • iterations – How many internal iterations the primitive should do.

Returns

Container DataFrame after column filtering. The outputs of shape [num_inputs, …] are wrapped inside CallResult.

Return type

d3m.primitive_interfaces.base.CallResult[d3m.container.pandas.DataFrame]

tods.data_processing.ContinuityValidation module

class tods.data_processing.ContinuityValidation.ContinuityValidation(*args, **kwds)

Bases: d3m.primitive_interfaces.transformer.TransformerPrimitiveBase

Check whether the series data is consistent in its time interval, and provide processing if it is not.

metadata

Primitive’s metadata. Available as a class attribute.

logger

Primitive’s logger. Available as a class attribute.

hyperparams

Hyperparams passed to the constructor.

random_seed

Random seed passed to the constructor.

docker_containers

A dict mapping Docker image keys from primitive’s metadata to (named) tuples containing container’s address under which the container is accessible by the primitive, and a dict mapping exposed ports to ports on that address.

volumes

A dict mapping volume keys from primitive’s metadata to file and directory paths where downloaded and extracted files are available to the primitive.

temporary_directory

An absolute path to a temporary directory a primitive can use to store any files for the duration of the current pipeline run phase. Directory is automatically cleaned up after the current pipeline run phase finishes.

Parameters
  • continuity_option (enumeration) –

    Choose ablation or imputation.

    ablation: delete some rows and increase the timestamp interval to keep the timestamps consistent.

    imputation: linearly impute the absent timestamps to keep the timestamps consistent.

  • interval (float) – Only used in imputation; gives the timestamp interval. ‘interval’ should be an integral multiple of ‘timestamp’, or ‘timestamp’ should be an integral multiple of ‘interval’.

produce(*, inputs: d3m.container.pandas.DataFrame, timeout: float = None, iterations: int = None) → d3m.primitive_interfaces.base.CallResult[d3m.container.pandas.DataFrame]
Parameters
  • inputs – Container DataFrame. The inputs of shape [num_inputs, …].

  • timeout – Default. A maximum time this primitive should take to produce outputs during this method call, in seconds.

  • iterations – Default. How many internal iterations the primitive should do.

Returns

Container DataFrame with consistent timestamps. The outputs of shape [num_inputs, …] are wrapped inside CallResult.

Return type

d3m.primitive_interfaces.base.CallResult[d3m.container.pandas.DataFrame]
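The two continuity options can be sketched with pandas and NumPy as follows. This is a minimal illustration under stated assumptions, not the primitive's code: the function names impute and ablate, the "timestamp" column name, and the step argument are hypothetical, timestamps are assumed to be unique numeric values, and the source does not say how the enlarged interval for ablation is chosen.

```python
import numpy as np
import pandas as pd


def impute(df: pd.DataFrame, interval: float, time_col: str = "timestamp") -> pd.DataFrame:
    """Insert the missing timestamps on a regular grid and fill values linearly."""
    indexed = df.set_index(time_col)
    start, stop = indexed.index.min(), indexed.index.max()
    # Half an interval of slack keeps the last grid point despite float rounding.
    grid = np.arange(start, stop + interval / 2, interval)
    # Interpolate over the union so rows that are off the grid still inform the fill.
    union = indexed.index.union(pd.Index(grid))
    filled = indexed.reindex(union).interpolate(method="index")
    return filled.loc[grid].rename_axis(time_col).reset_index()


def ablate(df: pd.DataFrame, step: float, time_col: str = "timestamp") -> pd.DataFrame:
    """Keep only rows whose timestamp lies on a coarser regular grid (step is assumed)."""
    ratio = (df[time_col] - df[time_col].min()) / step
    on_grid = np.isclose(ratio, np.round(ratio))
    return df[on_grid].reset_index(drop=True)


# Hypothetical series with timestamps 3.0 and 5.0 missing.
series = pd.DataFrame(
    {"timestamp": [0.0, 1.0, 2.0, 4.0, 6.0], "value": [0.0, 10.0, 20.0, 40.0, 60.0]}
)
imputed = impute(series, interval=1.0)
ablated = ablate(series, step=2.0)
print(imputed)
print(ablated)
```

Imputation adds rows and keeps the finer interval, while ablation drops rows and settles on a coarser one.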

tods.data_processing.DatasetToDataframe module

tods.data_processing.DuplicationValidation module

class tods.data_processing.DuplicationValidation.DuplicationValidation(*args, **kwds)

Bases: d3m.primitive_interfaces.transformer.TransformerPrimitiveBase

Check whether the series data contains duplicate rows for a single timestamp, and provide processing if duplication exists.

metadata

Primitive’s metadata. Available as a class attribute.

logger

Primitive’s logger. Available as a class attribute.

hyperparams

Hyperparams passed to the constructor.

random_seed

Random seed passed to the constructor.

docker_containers

A dict mapping Docker image keys from primitive’s metadata to (named) tuples containing container’s address under which the container is accessible by the primitive, and a dict mapping exposed ports to ports on that address.

volumes

A dict mapping volume keys from primitive’s metadata to file and directory paths where downloaded and extracted files are available to the primitive.

temporary_directory

An absolute path to a temporary directory a primitive can use to store any files for the duration of the current pipeline run phase. Directory is automatically cleaned up after the current pipeline run phase finishes.

Parameters
  • keep_option (enumeration) – When dropping rows, choose to keep the first one or calculate the average.

produce(*, inputs: d3m.container.pandas.DataFrame, timeout: float = None, iterations: int = None) → d3m.primitive_interfaces.base.CallResult[d3m.container.pandas.DataFrame]
Parameters
  • inputs – Container DataFrame. The inputs of shape [num_inputs, …].

  • timeout – Default. A maximum time this primitive should take to produce outputs during this method call, in seconds.

  • iterations – Default. How many internal iterations the primitive should do.

Returns

Container DataFrame after dropping the duplicates. The outputs of shape [num_inputs, …] are wrapped inside CallResult.

Return type

d3m.primitive_interfaces.base.CallResult[d3m.container.pandas.DataFrame]
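The two ways of resolving duplicated timestamps can be sketched in pandas as below. The helper name resolve_duplicates, the option strings "first" and "average", and the "timestamp" column name are assumptions made for this illustration; the actual enumeration values of keep_option are not given above.

```python
import pandas as pd


def resolve_duplicates(
    df: pd.DataFrame, keep_option: str = "first", time_col: str = "timestamp"
) -> pd.DataFrame:
    """Collapse rows that share a timestamp into a single row."""
    if keep_option == "first":
        # Keep the first row seen for each timestamp and drop the rest.
        return df.drop_duplicates(subset=time_col, keep="first").reset_index(drop=True)
    if keep_option == "average":
        # Replace each group of duplicates by the mean of its numeric columns.
        return df.groupby(time_col, as_index=False).mean()
    raise ValueError(f"unknown keep_option: {keep_option!r}")


# Hypothetical data with timestamp 1 duplicated.
readings = pd.DataFrame({"timestamp": [1, 1, 2], "value": [10.0, 20.0, 30.0]})
kept_first = resolve_duplicates(readings, "first")
averaged = resolve_duplicates(readings, "average")
print(kept_first)
print(averaged)
```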

tods.data_processing.TimeIntervalTransform module

class tods.data_processing.TimeIntervalTransform.TimeIntervalTransform(*args, **kwds)

Bases: d3m.primitive_interfaces.transformer.TransformerPrimitiveBase

A primitive which configures the time interval of the dataframe. It resamples the timestamps based on the time_interval passed as a hyperparameter.

metadata

Primitive’s metadata. Available as a class attribute.

logger

Primitive’s logger. Available as a class attribute.

hyperparams

Hyperparams passed to the constructor.

random_seed

Random seed passed to the constructor.

docker_containers

A dict mapping Docker image keys from primitive’s metadata to (named) tuples containing container’s address under which the container is accessible by the primitive, and a dict mapping exposed ports to ports on that address.

volumes

A dict mapping volume keys from primitive’s metadata to file and directory paths where downloaded and extracted files are available to the primitive.

temporary_directory

An absolute path to a temporary directory a primitive can use to store any files for the duration of the current pipeline run phase. Directory is automatically cleaned up after the current pipeline run phase finishes.

produce(*, inputs: d3m.container.pandas.DataFrame, timeout: float = None, iterations: int = None) → d3m.primitive_interfaces.base.CallResult[d3m.container.pandas.DataFrame]

Produce primitive’s best choice of the output for each of the inputs.

The output value should be wrapped inside CallResult object before returning.

In many cases producing an output is a quick operation in comparison with fit, but not all cases are like that. For example, a primitive can start a potentially long optimization process to compute outputs. timeout and iterations can serve as a way for a caller to guide the length of this process.

Ideally, a primitive should adapt its call to try to produce the best outputs possible inside the time allocated. If this is not possible and the primitive reaches the timeout before producing outputs, it should raise a TimeoutError exception to signal that the call was unsuccessful in the given time. The state of the primitive after the exception should be as if the method call had never happened, and the primitive should continue to operate normally. The purpose of timeout is to give the primitive an opportunity to cleanly manage its state instead of having execution interrupted from outside. Maintaining stable internal state should take precedence over respecting the timeout (the caller can terminate a misbehaving primitive from outside anyway). If a longer timeout would produce different outputs, then CallResult’s has_finished should be set to False.

Some primitives have internal iterations (for example, optimization iterations). For those, the caller can specify how many of the primitive’s internal iterations it should do before returning outputs. Primitives should make iterations as small as reasonable. If iterations is None, then there is no limit on how many iterations the primitive should do, and the primitive should choose the best number of iterations on its own (potentially controlled through hyper-parameters). If iterations is a number, the primitive has to do that number of iterations, if possible. timeout should still be respected, and potentially fewer iterations can be done because of that. Primitives with internal iterations should make CallResult contain correct values.

For primitives which do not have internal iterations, any value of iterations means that they should run fully, respecting only timeout.

If primitive should have been fitted before calling this method, but it has not been, primitive should raise a PrimitiveNotFittedError exception.
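The timeout and iterations contract described above can be sketched in plain Python. This is an illustration only: SketchCallResult is a hypothetical stand-in for CallResult (its field names value, has_finished and iterations_done are assumed here), and produce_with_limits and halve are made-up helpers, not part of d3m or TODS.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class SketchCallResult:
    """Hypothetical stand-in for CallResult; not the d3m class."""
    value: Any
    has_finished: bool = True
    iterations_done: Optional[int] = None


def produce_with_limits(
    step: Callable[[Any], Tuple[Any, bool]],
    initial: Any,
    *,
    timeout: Optional[float] = None,
    iterations: Optional[int] = None,
) -> SketchCallResult:
    """Run step repeatedly until it converges, honouring timeout and iterations."""
    started = time.monotonic()
    current, done = initial, 0
    # iterations=None means no limit: run until the step reports convergence.
    while iterations is None or done < iterations:
        if timeout is not None and time.monotonic() - started > timeout:
            # Only local variables were touched, so the caller's state is
            # exactly as if this call had never happened.
            raise TimeoutError("produce did not finish within the timeout")
        current, converged = step(current)
        done += 1
        if converged:
            return SketchCallResult(current, True, done)
    # The iteration budget ran out first; more iterations could change the output.
    return SketchCallResult(current, False, done)


def halve(x: float) -> Tuple[float, bool]:
    """Toy iteration: halve the value and report convergence once it drops below 1."""
    half = x / 2
    return half, half < 1


full = produce_with_limits(halve, 16.0)
partial = produce_with_limits(halve, 16.0, iterations=2)
print(full)
print(partial)
```

With no limit the toy iteration runs until it converges; with iterations=2 it stops early and reports has_finished as False.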

Parameters
  • inputs – The inputs of shape [num_inputs, …].

  • timeout – A maximum time this primitive should take to produce outputs during this method call, in seconds.

  • iterations – How many internal iterations the primitive should do.

Returns

The outputs of shape [num_inputs, …] wrapped inside CallResult.

Return type

d3m.primitive_interfaces.base.CallResult[d3m.container.pandas.DataFrame]
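Resampling a dataframe to a new time interval can be sketched with pandas as follows. This is not the primitive's implementation: the helper name resample_interval, the "timestamp" column name, the "5min" interval string, and the choice of the mean as the aggregation are all assumptions made for illustration.

```python
import pandas as pd


def resample_interval(
    df: pd.DataFrame, time_interval: str = "5min", time_col: str = "timestamp"
) -> pd.DataFrame:
    """Aggregate rows into fixed-width time bins of length time_interval."""
    # Resampling needs a datetime index, so move the timestamp column there.
    indexed = df.set_index(pd.to_datetime(df[time_col])).drop(columns=time_col)
    # Averaging within each bin is an assumption; other aggregations are possible.
    return indexed.resample(time_interval).mean().reset_index()


# Hypothetical readings taken at irregular minutes.
readings = pd.DataFrame(
    {
        "timestamp": [
            "2021-01-01 00:00",
            "2021-01-01 00:02",
            "2021-01-01 00:05",
            "2021-01-01 00:07",
        ],
        "value": [1.0, 3.0, 5.0, 7.0],
    }
)
resampled = resample_interval(readings, "5min")
print(resampled)
```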

tods.data_processing.TimeStampValidation module

Module contents