Word embedding-based representation of small molecules (Master’s thesis)

Date:

Because small-molecule compounds are so numerous and structurally diverse, comparing their similarity and classifying them are difficult tasks. A better representation method is the foundation for solving and optimizing these tasks. Inspired by natural language processing (NLP), this project uses a “fragment vector” and a “molecular vector” to represent fragments and molecules, respectively. Molecules are branched structures with no inherent orientation, so the relations between pairs of fragments are hard to represent as a linear sequence. This project presents two methods that address this problem: TandemFragment and ParallelFragment. It also systematically compares different methods and hyper-parameters for training fragment vectors.
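
To make the idea of fragment and molecular vectors concrete, here is a minimal sketch in Python. It is not the TandemFragment or ParallelFragment method. The fragment strings, the three-dimensional toy vectors, and the choice to sum fragment vectors into a molecular vector are all assumptions made for illustration. In practice the fragment vectors would be learned by a word-embedding model from a corpus of fragment "sentences".

```python
import math

# Hypothetical toy fragment vectors (assumption: real vectors would be
# learned by a word-embedding model and have many more dimensions).
FRAGMENT_VECTORS = {
    "c1ccccc1": [1.0, 0.0, 0.0],  # benzene ring
    "C(=O)O": [0.0, 1.0, 0.0],    # carboxyl group
    "N": [0.0, 0.0, 1.0],         # amine group
}


def molecular_vector(fragments):
    """Build a molecular vector by summing its fragment vectors.

    Summation is an assumed aggregation rule chosen for simplicity.
    """
    dim = len(next(iter(FRAGMENT_VECTORS.values())))
    total = [0.0] * dim
    for fragment in fragments:
        for i, value in enumerate(FRAGMENT_VECTORS[fragment]):
            total[i] += value
    return total


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Two molecules described as bags of fragments (illustrative only).
benzoic_acid = molecular_vector(["c1ccccc1", "C(=O)O"])
aniline = molecular_vector(["c1ccccc1", "N"])
similarity = cosine_similarity(benzoic_acid, aniline)
```

With these toy vectors the two molecules share one of their two fragments, so their cosine similarity is about 0.5. A bag-of-fragments sum like this ignores how fragments are connected, which is the kind of structural information the ordering problem described above is concerned with.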

Fig. project3: the workflow of the TandemFragment algorithm

GitHub: https://github.com/OnlyBelter/fragParallel2vecX