About

SciLib is a laboratory for semantic analysis of scientific data. It brings together researchers from FRC CSC RAS (Dorodnicyn Computing Centre) and MIPT FAIE. We build the tools we use ourselves: ontological description of scientific objects, knowledge graphs over formalized corpora, multimodal vectorization, and contextual search systems on top of this infrastructure.

Most artifacts are released under open licenses with attribution. Some models and training corpora are distributed by request to keep quality control and joint authorship clear.

Modern science produces knowledge faster than it can organize it. The same scientific object exists in several parallel forms — a natural-language statement, a formula, a fragment of code in a proof assistant, an image in a paper, an entry in a dataset. These forms are rarely connected to each other semantically: text search does not find a formula, a formula is not linked to code, code does not refer to a journal publication. The shared meaning is what gets lost.

SciLib makes those connections explicit. At the core is a modular OWL/DL ontology that strictly separates three levels of description for any scientific object: interpretation — what the object means; representation — what form it is expressed in; resource — where and how it is stored. This separation provides addressee-invariance: the same knowledge is equally accessible to a researcher and a program, without conflating semantics with implementation details.

On top of the ontology there is a knowledge graph over Mathlib (formalized mathematics, Lean4): about 213 000 statements with typed dependency edges (usesInType, usesInValue) and a domain taxonomy of 660+ subclasses with provenance. The vector space is built on top of the same ontology as a semantic contract: formal entities and their embeddings refer to the same identifiers.

This is infrastructure, not a standalone product. Already running on it: premise retrieval for Lean4 goals without hallucinations, graph navigation for researchers, cross-modal search across five representations of an object (English text, Russian text, Lean4 code, LaTeX formula, image). Next: extending the ontology to adjacent fields (theoretical physics, formal logic, information theory) and linking formalized results to preprints and journal publications through ORCID and DOI.

The lab leans on FRC CSC RAS expertise in scientific ontologies and digital libraries, and on MIPT's long experience in formalization of mathematics and applying machine learning to symbolic reasoning. We are open to collaboration with journals, universities and research groups under transparent licenses and joint authorship.