ReadSparse

Efficient library for reading and writing labelled sparse matrices in delimited text format (a.k.a. SVMLight or LibSVM format).

For more information, see the project’s GitHub page:

https://www.github.com/david-cortes/readsparse/

Read Sparse Matrices

class readsparse.read_sparse(file, multilabel=False, has_qid=False, integer_labels=False, index1=True, sort_indices=True, ignore_zeros=True, min_cols=0, min_classes=0, limit_nrows=0, no_trailing_ws=False, use_int64=False, use_double=True, use_cpp=True, from_string=False)

Read Sparse Matrix from Text File

Read a labelled sparse CSR matrix in text format as used by libraries such as SVMLight, LibSVM, ThunderSVM, LibFM, xLearn, XGBoost, LightGBM, and more.

The format is as follows:

<label(s)> <column>:<value> <column>:<value> …

Example line (row):

1 1:1.234 3:20

This line denotes a row with label (target variable) equal to 1, a value for the first column of 1.234, a value of zero for the second column (which is missing), and a value of 20 for the third column.

The labels might be decimal (for regression), and each row might contain more than one label (which must be integers in this case), separated by commas without spaces in between, e.g.:

1,5,10 1:1.234 3:20

This line indicates a row with labels 1, 5, and 10 (for multi-label classification). If the line has no labels, it should still include a space before the features.

The rows might additionally contain a ‘qid’ parameter as used in ranking algorithms, which must be an integer and should always lie between the labels and the features, e.g.:

1 qid:2 1:1.234 3:20
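To make the row layout concrete, here is a minimal pure-Python sketch that splits one row into labels, an optional ‘qid’, and features. The function name parse_row is hypothetical and this is not how the library parses files; it ignores headers, comments, and malformed input.

```python
def parse_row(line):
    """Split one text row into (labels, qid, features).

    Illustrative only: no header, comment, or error handling.
    """
    # Everything before the first space is the (possibly empty) label field.
    label_part, _, rest = line.partition(" ")
    labels = [float(tok) for tok in label_part.split(",")] if label_part else []

    qid = None
    features = {}
    for token in rest.split():
        key, _, value = token.partition(":")
        if key == "qid":
            qid = int(value)
        else:
            # Column numbers are kept exactly as written in the text.
            features[int(key)] = float(value)
    return labels, qid, features


print(parse_row("1 1:1.234 3:20"))
print(parse_row("1,5,10 1:1.234 3:20"))
print(parse_row("1 qid:2 1:1.234 3:20"))
```

The second column is simply absent from the features of each row, which is how the format represents a zero.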

The file might optionally contain a header as the first line with metadata (number of rows, number of columns, number of classes). The presence of a header is detected automatically, and including one is recommended for speed. Datasets from the extreme classification repository (see references) usually include such a header.

Lines might include comments, which start after a ‘#’ character. Lines consisting of only a ‘#’ will be ignored. When reading from a file, the file might have a BOM (a marker with information about the encoding, used in Windows systems), which will be automatically skipped.

Some extra notes - this function:
  • Will not make any checks for negative column indices.

  • Will be able to read numeric values in scientific notation only if the E is capitalized.

  • Will fill missing labels with NAs when passing multilabel=False.

  • Will fill with zeros (empty values) the lines that are empty (that is, they generate a row in the data), but will ignore (that is, will not generate a row in the data) the lines that start with ‘#’.
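The rules above about comments, empty lines, and the BOM can be sketched in pure Python as follows. The helper rows_from_text is hypothetical and only mirrors the behaviour described in this section; it does not handle headers and is not the library's implementation.

```python
def rows_from_text(text):
    """Return the line contents that would each produce one matrix row."""
    text = text.lstrip("\ufeff")  # drop a leading byte-order mark, if any
    rows = []
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # comment-only line: produces no row
        content = line.split("#", 1)[0].rstrip()  # strip a trailing comment
        rows.append(content)  # an empty string stands for an all-zero row
    return rows


sample = "\ufeff1 1:1.5 # note\n# skipped\n\n0 2:3\n"
print(rows_from_text(sample))
```

In this sample, the comment-only line yields no row, while the empty line yields an empty row.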

On 64-bit Windows systems, if compiling the library with a compiler other than MinGW or MSVC, it will not be able to read files larger than 2GB and might crash the system if the file is larger.

If the file contains a header, and this header denotes a larger number of columns or of labels than the largest index in the data, the resulting object will have this dimension set according to the header. The third entry in the header (number of classes/labels) will be ignored when passing multilabel=False.

The function uses different code paths when reading from a file or from a string, and there might be slight differences between the results obtained from them. For example, reading from a file might produce the desired output if the file uses tabs as separators instead of spaces (not supported by most other software and not standard), whereas reading from a string will not. If any such difference is encountered, please submit a bug report on the package’s GitHub page.

Parameters
  • file (str, None, or file connection) – Either a file path from which the data will be read, or a string containing the text from which the data will be read. In the latter case, must pass from_string=True.

  • multilabel (bool) – Whether the input file can have multiple labels per observation. If passing multilabel=False and it turns out to have multiple labels, will only take the first one for each row. If the labels are non-integers or have decimal point, the results will be invalid.

  • has_qid (bool) – Whether the input file has a ‘qid’ field (used for ranking). If passing False and the file does turn out to have ‘qid’, the features will not be read for any observations.

  • integer_labels (bool) – Whether to output the observation labels as integers.

  • index1 (bool) – Whether the input file uses numeration starting at 1 for the column numbers (and for the label numbers when passing multilabel=True). This is usually the case for files downloaded from the repositories in the references. The function will check for whether any of the column indices is zero, and will ignore this option if so (i.e. will assume it is False).

  • sort_indices (bool) – Whether to sort the indices of the columns after reading the data. These should already be sorted in the files from the repositories in the references.

  • ignore_zeros (bool) – Whether to avoid adding features which have a value of zero. If the zeros are caused due to numerical rounding in the software that wrote the input file, they can be post-processed by passing ignore_zeros=False and then something like ‘X.data[X.data == 0] = 1e-8’.

  • min_cols (int) – Minimum number of columns that the output X object should have, in case some columns are all missing in the input data.

  • min_classes (int) – Minimum number of columns that the output y object should have, in case some columns are all missing in the input data. Only used when passing multilabel=True.

  • limit_nrows (int) – Maximum number of rows to read from the data. If there are more than this number of rows, it will only read the first ‘limit_nrows’ rows. If passing zero (the default), there will be no row limit.

  • no_trailing_ws (bool) – Whether to assume that lines in the file will never have extra whitespace at the end before a new line. Parsing large files with this option set to ‘True’ can be 1.5x faster, but if the file does turn out to have e.g. extra spaces at the end of lines, the results will be incorrect.

  • use_int64 (bool) – Whether to use 64-bit integers for column and label indices (when passing multilabel=True). If passing False, will use the machine’s ‘int’ type (typically np.int32 but this could differ in non-standard CPU architectures). Using ‘int’ is faster and uses less memory, but cannot store values higher than the machine’s ‘INT_MAX’ (typically 2^31-1, or around 2.1 billion).

  • use_double (bool) – Whether to use C ‘double’ type (typically np.float64) for numeric values. If passing False, will use C ‘float’ type (typically np.float32), which uses less memory but might be very slightly slower to parse. Most machine learning software for Python works with ‘double’ data.

  • use_cpp (bool) – Whether to use C++ functions directly for file IO. If passing False, will read the contents using Python’s own functions into a string variable, from which the data will then be read. If passing True, will parse from the file directly, which is faster and uses less memory. Using the C++ engine can have issues on Windows if the file is larger than 2GB and the library was compiled with something other than MSVC or MinGW.

  • from_string (bool) – Whether to read the data from a string variable instead of a file. If passing from_string=True, then file is assumed to be a variable with the data contents on it.
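The index1 behaviour described above (1-based numbering unless a zero column index is found) can be illustrated with a small pure-Python sketch. The helper to_zero_based is hypothetical and assumes it receives every column number found in the input; it is not the library's code.

```python
def to_zero_based(columns, index1=True):
    """Convert column numbers as written in the text to 0-based indices."""
    if index1 and 0 in columns:
        # A zero can only occur with 0-based numbering, so the option is ignored.
        index1 = False
    offset = 1 if index1 else 0
    return [c - offset for c in columns]


print(to_zero_based([1, 3]))                # treated as 1-based
print(to_zero_based([0, 3]))                # zero present: treated as 0-based
print(to_zero_based([1, 3], index1=False))  # explicitly 0-based
```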

Returns

data

A dict with the following entries:
  • ’X’ : the features, as a CSR Matrix from SciPy, with data of type np.float64 or np.float32 depending on argument use_double.

  • ’y’ : the labels. If passing multilabel=False (the default), will be a vector (NumPy 1-d array, with dtype float64 when passing integer_labels=False, or dtype equal to the indices of ‘X’ when passing integer_labels=True), otherwise will be a binary CSR Matrix (same dtype as the values of ‘X’).

  • ’qid’ : the query IDs used for ranking, as an integer vector. This entry will only be present when passing has_qid=True.

Return type

dict

References

Datasets in this format can be found here:

The format is also described at the SVMLight webpage: http://svmlight.joachims.org

Write Sparse Matrices

class readsparse.write_sparse(file, X, y, qid=None, integer_labels=True, index1=True, sort_indices=True, ignore_zeros=True, add_header=False, decimal_places=8, use_cpp=True, append=False, to_string=False)

Write Sparse Matrix in Text Format

Write a labelled sparse matrix into text format as used by software such as SVMLight, LibSVM, ThunderSVM, LibFM, xLearn, XGBoost, LightGBM, and others - i.e.:

<label(s)> <column>:<value> <column>:<value> …

For more information about the format and usage examples, see the documentation for ‘read_sparse’.

Can write labels for regression, classification (binary, multi-class, and multi-label), and ranking (with ‘qid’), but note that most software that makes use of this data format supports only regression and binary classification.

Note

Be aware that writing sparse matrices to text is not a lossless operation - that is, some information might be lost due to numeric precision, and metadata such as row and column names will not be saved. It is recommended to use save_npz or similar for saving data between Python sessions, or to use binary formats for passing between different software such as Python->R.

Note

The option ignore_zeros is implemented heuristically, by comparing ‘abs(x) >= 10^(-decimal_places)/2’, which might not match exactly with the rounding that is done implicitly in string conversions in the libc/libc++ functions - thus there might still be some corner cases of all-zeros written into features if the (absolute) values are very close to the rounding threshold.
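The comparison quoted in the note above can be written out directly. The helper is_written below is hypothetical and only evaluates the stated inequality next to Python's own fixed-point formatting; the library's actual string conversion might differ near the threshold.

```python
def is_written(value, decimal_places=8):
    """Heuristic from the note: keep values that should round to non-zero."""
    return abs(value) >= 10 ** (-decimal_places) / 2


for value in (0.001, 1e-10, -3e-9, -0.5):
    formatted = f"{value:.8f}"  # fixed number of decimal places
    print(value, is_written(value), formatted)
```

Values well below the threshold, such as 1e-10 with eight decimal places, would be printed as all zeros and are therefore skipped under this heuristic.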

Note

The function uses different code paths when writing to a file or to a string, and there might be slight differences between the generated texts from them. If any such difference is encountered, please submit a bug report on the package’s GitHub page.

Parameters
  • file (str or None) – Output file path into which to write the data. Will be ignored when passing to_string=True.

  • X (CSR(n_samples, n_features)) – Sparse data to write. Can be a sparse matrix from SciPy, a dense array from NumPy, or a DataFrame from Pandas, but will be converted to a CSR matrix if it isn’t already.

    If X is a vector (1-d NumPy array), will be assumed to be a row vector and will thus write one row only.

  • y (array(n_samples,) or CSR(n_samples, n_labels)) – Labels for the data. Can be passed as a vector if each observation has one label, or as a sparse or dense matrix (same format as X) if each observation can have more than 1 label. In the latter case, only the non-missing column indices will be written, while the values are ignored.

  • qid (None or array(n_samples,)) – Secondary label information used for ranking algorithms. Must be an integer vector if passed. Note that not all software supports this.

  • integer_labels (bool) – Whether to write the labels as integers. If passing False, they will have a decimal point regardless of whether they are integers or not. If the file is meant to be used for a classification algorithm, one should pass True here (the default).

    For multilabel classification, the labels will always be written as integers.

  • index1 (bool) – Whether the column and label indices (if multi-label) should have numeration starting at 1. Most software assumes this is True.

  • sort_indices (bool) – Whether to sort the indices of X (and of y if multi-label) before writing the data. Note that this will cause in-place modifications if either X or y are passed as CSR matrices.

  • ignore_zeros (bool) – Whether to ignore (not write) features with a value of zero after rounding to the specified decimal places.

  • add_header (bool) – Whether to add a header with metadata as the first line (number of rows, number of columns, number of classes). If passing integer_labels=False and y is a vector, will write zero as the number of labels. This is not supported by most software.

  • decimal_places (int) – Number of decimal places to use for numeric values. All values will have exactly this number of places after the decimal point. Be aware that values are rounded and might turn to zeros (will be skipped by default) if they are too small (one can do something like ‘X.data = np.where(X.data >= 0, X.data.clip(min=1e-8), X.data.clip(max=-1e-8))’ to avoid this).

  • use_cpp (bool) – Whether to use C++ functions directly for IO. If passing False, will first write the output text into a Python string and then use Python’s IO functions to write it to a file. If passing True, will write to the file directly bypassing Python. Passing True is faster, but less resilient to errors.

  • append (bool) – Whether to append text at the end of the file instead of overwriting or creating a new file. Ignored when passing to_string=True.

  • to_string (bool) – Whether to write the result into a string (which will be returned from the function) instead of into a file.
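The clipping expression mentioned under decimal_places can be tried on a plain NumPy array. In this sketch the array named values stands in for X.data, and the numbers are made up for illustration.

```python
import numpy as np

# Stand-in for X.data: two values too small to survive 8 decimal places.
values = np.array([1e-12, -1e-12, 0.5, -2.0])

# Push tiny magnitudes away from zero while keeping their sign.
clipped = np.where(values >= 0, values.clip(min=1e-8), values.clip(max=-1e-8))

for before, after in zip(values, clipped):
    print(before, "->", f"{after:.8f}")
```

After clipping, every entry has a magnitude of at least 1e-8, so none of them is formatted as all zeros with eight decimal places.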

Returns

output – If passing to_string=False (the default), will return True if it completes successfully, or raise an error if it doesn’t. If passing to_string=True, will return the encoded matrix in text as a string variable.

Return type

bool or str
