Computers in Biology and Medicine
Informatics in Medicine Unlocked
Annals of Robotics and Automation
Technical reviewer of grant applicant projects at Zanjan Science and Technology Park
Tabriz Journal of Electrical Engineering (TJEE)
Journal of the American Medical Informatics Association (JAMIA)
Engineering Applications of Artificial Intelligence
Graph embedding techniques have gained increasing attention for their ability to encode the complex structural information of networks into low-dimensional vectors. Existing graph embedding methods have achieved considerable success in various applications. However, these methods have limitations in capturing global graph topology and fail to provide insights into the underlying mechanisms of network function. In this paper, we propose IsoGloVe, a count-based method that encodes graph topology into vectors using the co-occurrence statistics of fixed-size routes in random walks. IsoGloVe calculates the final embeddings based on the geodesic distances of a node's neighbors on a manifold. This representation in geodesic space allows for the analysis of node interactions and contributes to a better understanding of complex network structure and function. The performance of IsoGloVe is evaluated on several protein-protein interaction (PPI) networks using graph reconstruction, node classification, and visualization. The findings reveal that IsoGloVe surpasses comparable methods, with a 30% increase in MAP for graph reconstruction and a 25% increase in model scores for node classification on the Yeast PPI network. In addition, IsoGloVe demonstrates a 6.9% increase in MAP for graph reconstruction on the Human PPI network.
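A minimal sketch of the count-based statistic IsoGloVe builds on: how often pairs of nodes co-occur within a fixed-size window ("route") of the same random walk, with GloVe-style distance weighting. The walk length, window size, and toy graph are illustrative assumptions, not the paper's settings.

```python
# Count distance-weighted co-occurrences of node pairs within a fixed
# window of random walks over an adjacency-list graph.
import random
from collections import defaultdict

def random_walks(adj, num_walks=10, walk_len=8, seed=0):
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            while len(walk) < walk_len:
                nbrs = adj[walk[-1]]
                if not nbrs:
                    break
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks

def cooccurrence(walks, window=3):
    counts = defaultdict(float)
    for walk in walks:
        for i, u in enumerate(walk):
            for j in range(i + 1, min(i + window + 1, len(walk))):
                counts[(u, walk[j])] += 1.0 / (j - i)  # closer pairs weigh more
                counts[(walk[j], u)] += 1.0 / (j - i)
    return counts

# Toy protein-interaction-like graph.
adj = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
counts = cooccurrence(random_walks(adj))
print(sorted(counts.items())[:4])
```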
https://caiac.pubpub.org/pub/pee1dnv0/release/1
Graph embeddings, Geodesic distance, GloVe.
Agents often need a long time to explore the state-action space in order to learn how to act as expected in Partially Observable Markov Decision Processes (POMDPs). Reward shaping can guide real-time POMDP planning in terms of both reliability and speed. In this paper, we propose Low Dimensional Policy Graph (LDPG), a new reward shaping method that reduces the dimension of the value function to extract the best state-action pairs. The reward function is then shaped using these key pairs. To accelerate learning, we analyze the transition function graph to discover significant paths to the learning agent's goal. Direct comparison on five standard testbeds indicates that LDPG leads to deterministically finding optimal actions faster, regardless of the task type. Our method reaches the goals more quickly (a 41.48% improvement) and performs 61.57% better in receiving rewards in the 4×5×2 domain.
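A minimal sketch of potential-based reward shaping, the general mechanism that LDPG instantiates: the shaped reward adds the discounted difference of a potential function over states, which preserves the optimal policy. The potential used here (negative graph distance to the goal) is an illustrative stand-in for the value-function-derived key pairs in the paper.

```python
# Potential-based reward shaping over a toy state graph.
from collections import deque

def bfs_dist(adj, goal):
    # Breadth-first distances from every state to the goal.
    dist = {goal: 0}
    q = deque([goal])
    while q:
        s = q.popleft()
        for t in adj[s]:
            if t not in dist:
                dist[t] = dist[s] + 1
                q.append(t)
    return dist

def shaped_reward(r, s, s_next, phi, gamma=0.99):
    # F(s, s') = gamma * phi(s') - phi(s); policy-invariant shaping term.
    return r + gamma * phi[s_next] - phi[s]

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
phi = {s: -d for s, d in bfs_dist(adj, goal=3).items()}
print(shaped_reward(0.0, s=0, s_next=1, phi=phi))  # positive: moved toward goal
```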
Dynamic reward shaping, Markov decision making, Planning, Dimension reduction, Reinforcement learning
CRISPR/Cas9 is a new genome-editing technology used in biomedical applications. To make genome editing with CRISPR far more precise and practical, we must concentrate on predicting CRISPR off-target effects and try to decrease them. Although numerous computational models have been developed to predict off-target activities, the existing methods suffer from low precision for gene editing at the clinical level. In addition, the inputs of most of these algorithms are gRNA sequences in one-hot vector encoding form. However, recent research illustrated that both the gRNA and the DNA strongly impact the prediction precision of off-target activity. To address these problems, we propose a novel encoding scheme for gRNA-DNA sequence pairs and deploy it in several neural network-based architectures, including Convolutional Neural Networks (CNNs) and Fully-connected Neural Networks (FNNs), to predict off-target effects. Comparison of off-target prediction results based on our proposed gRNA-DNA encoding scheme with the state of the art on the popular gene-editing dataset CRISPOR reveals the superiority of our approach based on different criteria.
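A hedged sketch of what encoding a gRNA-DNA pair jointly (rather than the gRNA alone) can look like: one-hot channels for both strands stacked with a per-position mismatch flag. This is one plausible pairwise scheme; the paper's exact encoding may differ.

```python
# Joint gRNA-DNA encoding: 4 + 4 one-hot channels plus a mismatch flag.
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_pair(grna, dna):
    assert len(grna) == len(dna)
    x = np.zeros((len(grna), 9), dtype=np.float32)  # 9 channels per position
    for i, (g, d) in enumerate(zip(grna, dna)):
        x[i, BASES[g]] = 1.0          # gRNA base
        x[i, 4 + BASES[d]] = 1.0      # DNA base
        x[i, 8] = float(g != d)       # position-wise mismatch indicator
    return x

print(encode_pair("GACT", "GACA").shape)  # (4, 9), ready for a 1D CNN
```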
CRISPR/Cas, gRNA Design, Off-target, Encoding, Machine learning
This paper reports on modern approaches in Information Extraction (IE) and its two main sub-tasks of Named Entity Recognition (NER) and Relation Extraction (RE). Basic concepts and the most recent approaches in this area are reviewed; these mainly include Machine Learning (ML) based approaches and the more recent trend toward Deep Learning (DL) based methods.
Information Extraction, Machine Learning, Deep Learning, Named Entity Recognition, Relation Extraction
Abbreviations and acronyms are widely used in clinical documents. This paper describes the use of a machine learner to automatically extract spans of abbreviations and acronyms from clinical notes and map them to UMLS (Unified Medical Language System) CUIs (Concept Unique Identifiers).
A Conditional Random Field (CRF) machine learner was used to identify abbreviations and acronyms. Firstly, the training data was converted to the CRF format, and different feature sets were applied with 10-fold cross validation to find the best feature set for the machine learning model (an illustrative feature sketch follows this abstract). Secondly, the identified abbreviation/acronym spans were mapped to UMLS (Unified Medical Language System) CUIs. Thirdly, a rule-based engine was applied for disambiguation of terms with multiple abbreviations or acronyms.
Approach: A novel supervised learning model was developed that incorporates a machine learning algorithm and a rule-based engine. Evaluation of each step included precision, recall and F-score metrics for span detection and accuracy for CUI mapping.
Several tools created in our laboratory were used, including a Text to SNOMED CT (TTSCT) service, a Lexical Management System (LMS) and a Ring-fencing approach. A set of gazetteers created from the training data was also employed.
A 10-fold cross validation on the training data showed a precision of 0.911, a recall of 0.887 and an F-score of 0.899 for detecting the boundaries of abbreviations/acronyms, and an accuracy of 0.760 for CUI mapping, while the official results on the test data showed a strict accuracy of 0.447 and a relaxed accuracy of 0.488, placing our team third out of the five participating teams. A supervised machine learning method with mixed computational strategies and a rule-based method for disambiguation of expansions seems to provide a near-optimal strategy for automated extraction of abbreviations/acronyms.
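As referenced above, a minimal sketch of CRF-based span tagging with BIO labels. The token features (case, length, punctuation) are assumptions standing in for the paper's tuned feature sets, and the toy notes are invented; requires the sklearn-crfsuite package.

```python
# Train a toy CRF to tag abbreviation spans in clinical-note tokens.
import sklearn_crfsuite

def token_features(sent, i):
    w = sent[i]
    return {
        "lower": w.lower(),
        "is_upper": w.isupper(),   # acronyms are often all-caps ("SOB")
        "has_period": "." in w,    # dotted abbreviations ("q.d.")
        "short": len(w) <= 4,
        "prev": sent[i - 1].lower() if i > 0 else "<s>",
    }

sents = [["Pt", "denies", "SOB", "."], ["Hx", "of", "MI", "."]]
labels = [["B-ABBR", "O", "B-ABBR", "O"], ["B-ABBR", "O", "B-ABBR", "O"]]
X = [[token_features(s, i) for i in range(len(s))] for s in sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X)[0])
```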
Normalization of acronyms, NER, CRF, SVM, SNOMED CT
There are abundant mentions of clinical conditions, anatomical sites, medications and procedures in clinical documents. This paper describes the use of a cascade of machine learners to automatically extract mentions of named entities about disorders from clinical notes.
A Conditional Random Field (CRF) machine learner was used for named entity recognition, and to capture more complex (multi-word) named entities we used Support Vector Machines (SVM). Firstly, the training data was converted to the CRF format, and different feature sets were applied using 10-fold cross validation to find the best feature set for the machine learning model. Secondly, the identified named entities were passed to the SVM to find any relation among the identified disorder mentions and decide whether they are part of a complex disorder (a toy sketch of this pairing step follows the abstract).
Our approach was based on a novel supervised learning model which incorporates two machine learning algorithms (CRF and SVM). Evaluation of each step included precision, recall and F-score metrics.
We used several tools created in our lab, including a TTSCT (Text to SNOMED CT) service, a Lexical Management System (LMS) and a Ring-fencing approach. A set of gazetteers was created from the training data and employed in the analysis as well.
Evaluation results produced a precision of 0.766, recall of 0.726 and F-score of 0.746 for named entity recognition based on 10-fold cross validation, and a precision, recall and F-measure of 0.927 for relation extraction based on 5-fold cross validation on the training data. On the official test data in strict mode, a precision of 0.686, recall of 0.539 and F-score of 0.604 were achieved, placing our team 11th out of the 25 participating teams. In relaxed mode, a precision of 0.912, recall of 0.701 and F-score of 0.793 were recorded, and our team was 12th. A multi-stage supervised machine learning method with mixed computational strategies seems to provide a reasonable strategy for automated extraction of disorders.
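As referenced above, a minimal sketch of the second (SVM) stage: pair the disorder mentions found by the CRF and classify whether a pair belongs to one multi-word disorder. The features (token gap, shared words) and the toy mentions are illustrative assumptions; requires scikit-learn.

```python
# Classify whether two CRF-detected mentions form one complex disorder.
from sklearn.svm import SVC

def pair_features(m1, m2):
    # Each mention m = (start_token, end_token, text).
    gap = m2[0] - m1[1]                                   # tokens between mentions
    shared = len(set(m1[2].split()) & set(m2[2].split())) # overlapping words
    return [gap, shared]

pairs = [((0, 1, "colon mass"), (3, 4, "mass lesion")),
         ((0, 1, "fever"), (9, 10, "rash"))]
y = [1, 0]  # 1 = parts of one complex disorder, 0 = unrelated
X = [pair_features(a, b) for a, b in pairs]

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict(X))
```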
Named entity recognition, CRF, SVM, SNOMED-CT
The proposal of a special purpose language for Clinical Data Analytics (CliniDAL) is presented along with a general model for expressing temporal events in the language. A survey of temporal modeling in the ontology and Natural Language Processing (NLP) disciplines is presented as a framework for the design. The temporal dimension of clinical data needs to be addressed from at least five different points of view: firstly, how to attach knowledge of time-based constraints to queries; secondly, how to mine temporal data in different CISs with various data models, such as Entity Relationship (ER) or Document (Form) design models; thirdly, how to deal with both relative time and absolute time in the query language; fourthly, how to tackle internal time-event dependencies in queries; and finally, how to manage historical time events preserved in the patient's narrative. The temporal elements of the language are defined in BNF along with a UML schema. Its use in a designed taxonomy of a five-class hierarchy of data analytics tasks shows the solution to problems of time-event dependencies in the highly complex cascades of queries needed to evaluate scientific experiments. The issues in using the model in a practical way with three different database schema representations, Relational, EAV and XML, are discussed.
Clinical Data Analytics, temporal model, EAV, patient’s narrative
This paper reports on the issues in mapping the terms of a query to the field names of the schema of an Entity Relationship (ER) model, or to the data part of the Entity Attribute Value (EAV) model, using a similarity-based Top-k algorithm in a clinical information system, together with an extension of EAV mapping for medication names. In addition, the details of the mapping algorithm and the required pre-processing, including NLP (Natural Language Processing) tasks to prepare resources for mapping, are explained. The experimental results on an example clinical information system demonstrate more than 84 per cent accuracy in mapping. The results will be integrated into our proposed Clinical Data Analytics Language (CliniDAL) to automate the mapping process.
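A hedged sketch of similarity-based Top-k mapping from a query term to candidate schema field names or EAV entries. difflib's ratio stands in for the paper's similarity measure, which is an assumption; the field names are invented.

```python
# Rank schema fields by string similarity to a query term, keep top k.
from difflib import SequenceMatcher

def top_k(term, candidates, k=3):
    scored = [(SequenceMatcher(None, term.lower(), c.lower()).ratio(), c)
              for c in candidates]
    return sorted(scored, reverse=True)[:k]

fields = ["patient_dob", "admission_date", "discharge_date", "medication_name"]
print(top_k("date of admission", fields))
```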
Data models, Dictionaries, Relational databases, Keyword search, Clinical diagnosis
Extracting knowledge from data is essential in clinical research, decision making and hypothesis testing, so providing a general solution for creating analytical tools is of prime importance. The objective of this paper is to introduce a special purpose query language, Clinical Data Analytics Language (CliniDAL), based on features of an earlier version of CliniDAL, in which a user can express, and compute answers to, any question that is answerable from a CIS. Question and answer categories include point-of-care retrieval queries, descriptive statistics, statistical hypothesis testing, complex scientific-experiment hypotheses and semantic record retrieval. In addition, due to the importance of time in the clinical domain, a temporal model is proposed and integrated into CliniDAL. The experimental results reflect the capability of the language in creating the desired queries via restricted natural language. Also, integrating clinical ontologies such as SNOMED helps unify the terminologies of various CISs.
Data representation and visualization, Health care information systems, Data analytics
The ability to share knowledge is a necessity for agents in order to achieve both group and individual goals. To grant this ability, many researchers have assumed not only a common language among agents but also a complete common understanding of all the concepts the agents communicate about. These assumptions are often too strong or unrealistic. In this paper we present a comprehensive study of the performance of agents learning ontology concepts from peer agents. Our methodology allows agents that do not share a common ontology to establish common ground on concepts known only to some of them, when this common ground is needed, by learning the concepts. Although the concepts learned by an agent are only compromises among the views of the other agents, the method nevertheless substantially enhances the autonomy of the agents using it. The experimental evaluation shows that the learner agent performs better than or close to the teacher agents when tested against objects from the whole world.
Ontology, Concept learning, Multi-agent communication
With the increase of interactions on the web and on social media such as Twitter, a large amount of content is produced by users. As a result, online harassment, insults, and attacks, collectively called cyberbullying or internet harassment, have continuously expanded. Identifying texts containing internet harassment has become a challenging natural language processing task, so designing efficient methods to automatically identify such content has become integral to most social media platforms. In this study, we put forward several ALBERT-based models to classify cyberbullying. To that end, we obtained data from Twitter and proposed the base model BPM, which solely utilizes the textual content of a tweet for categorization. Afterwards, we integrated social network relationships quantified by the number of friends and followers, the number of likes, and the number of retweets. We investigated the effect of individual features and their combinations on the performance of the principal model. Our findings demonstrate that incorporating user communication attributes can enhance the accuracy of the baseline model. Specifically, the BPM_LC_RC_FC model, which involves tweet content and all suggested features, achieved the best overall accuracy and F1-score of 98.80 in comparison to previous methods. This promising outcome is noteworthy as it represents the first multimodal approach to cyberbullying classification.
Cyberbullying, Internet Harassment, Social Media, Transformer, ALBERT
The CRISPR system, as a gene editing method, has revolutionized the field of biology, provided that the exact target sites (on-targets) for gene editing are determined accurately in order to avoid unintended side effects that could potentially harm cellular function. To address this issue, computational methods have been developed to accurately predict off-target locations. In this research, a hybrid deep learning model incorporating two neural networks, BiLSTM and CNN, is proposed for identifying off-target sites in the CRISPR system. Due to the length and complexity of DNA sequences, a specialized encoding method is suggested for feeding information into the model. Utilizing k-mer sequence embeddings of various sizes using DNAtoVec, and calculating sequence mismatches at both the nucleotide and k-mer levels, the model is capable of identifying specific patterns and features in sequences. Furthermore, the use of data augmentation and under-sampling techniques has provided a balanced dataset to address the issue of data imbalance in this research. The evaluation results indicate that the proposed model surpasses the baseline models, achieving accuracy and F1-measure values above 0.98 for predicting off-target sites.
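A sketch of the two input views the abstract describes: overlapping k-mers (here fed to a plain vocabulary rather than DNAtoVec embeddings), plus mismatch counts between a guide-target pair at both the nucleotide and k-mer levels. k=3 and the toy sequences are illustrative assumptions.

```python
# Extract overlapping k-mers and count nucleotide- and k-mer-level
# mismatches between two aligned sequences.
def kmers(seq, k=3):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def mismatches(a, b, k=3):
    nt = sum(x != y for x, y in zip(a, b))                  # per-nucleotide
    km = sum(x != y for x, y in zip(kmers(a, k), kmers(b, k)))  # per-k-mer
    return nt, km

print(kmers("GATTACA"))               # ['GAT', 'ATT', 'TTA', 'TAC', 'ACA']
print(mismatches("GATTACA", "GATCACA"))  # (1, 3): one SNP touches three 3-mers
```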
CRISPR, Off-target positions, Data augmentation, Deep learning
In: Kashima, H., Ide, T., Peng, W.-C. (eds).
Dynamic reward shaping, Markov decision making, Planning, Dimension reduction, Reinforcement learning
Clustered regularly interspaced short palindromic repeats (CRISPR)-based gene editing has been widely used in various cell types and organisms. To make genome editing with CRISPR far more precise and practical, we must concentrate on the design of optimal gRNAs and the selection of appropriate Cas enzymes. Numerous computational tools have been created in recent years to help researchers design the best gRNA for CRISPR research. There are two approaches for designing an appropriate gRNA sequence (one that targets the desired sites with high precision): experimental and prediction-based approaches. It is essential to reduce off-target sites when designing an optimal gRNA. Here we review both traditional and machine learning-based approaches for designing an appropriate gRNA sequence and predicting off-target sites. In this review, we summarize the key characteristics of all available tools (as far as possible) and compare them. Machine learning-based tools and web servers are believed to be becoming the most effective and reliable methods for predicting the on-target and off-target activities of CRISPR in the future. However, these predictions are not yet precise, and the performance of these algorithms, especially deep learning ones, depends on the amount of data used during the training phase. As more features are discovered and incorporated into these models, predictions will become more in line with experimental observations.
CRISPR/Cas, gRNA design, On-target, Off-target, Computational approach, Machine learning.
With the expansion of the Internet and attractive social media infrastructures, people prefer to follow the news through these media. Despite the many advantages of these media in the news field, the lack of control and verification mechanisms has led to the spread of fake news as one of the most critical threats to democracy, economy, journalism, health, and freedom of expression. So, designing and using efficient automated methods to detect fake news on social media has become a significant challenge. One of the most relevant entities in determining the authenticity of a news statement on social media is its publishers. This paper examines publishers' features in detecting fake news on social media, including Credibility, Influence, Sociality, Validity, and Lifetime. In this regard, we propose an algorithm, namely CreditRank, for evaluating publishers' credibility on social networks. We also suggest a highly accurate multi-modal framework, namely FR-Detect, for fake news detection using user-related and content-related features. Furthermore, a sentence-level convolutional neural network is provided to properly combine publishers' features with latent textual content features. Experimental results show that the publishers' features can improve the performance of content-based models by up to 16% and 31% in accuracy and F1, respectively. Also, the behavior of publishers in different news domains has been statistically studied and analyzed.
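CreditRank's exact update rule is not given in the abstract. As a conceptual stand-in under that caveat, the sketch below propagates credibility over a toy follower graph with a personalized PageRank, seeded by each publisher's share of true posts; the graph, priors, and damping factor are all assumptions. Requires networkx.

```python
# PageRank-style credibility propagation over a follower graph,
# personalized by each publisher's prior truthfulness.
import networkx as nx

G = nx.DiGraph([("u1", "u2"), ("u3", "u2"), ("u2", "u4")])  # follower edges
true_ratio = {"u1": 0.9, "u2": 0.5, "u3": 0.2, "u4": 0.7}   # toy priors

scores = nx.pagerank(G, alpha=0.85, personalization=true_ratio)
print(sorted(scores.items(), key=lambda kv: -kv[1]))  # most credible first
```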
Fake news detection, CreditRank algorithm, Social media, Deep neural network, Machine learning, Text classification.
Protecting the information access pattern, which means preventing the disclosure of data and structural details of databases, is very important when working with data, especially in the cases of outsourced databases and databases with Internet access. Protecting the access pattern indicates that mere data confidentiality is not sufficient and that the privacy of queries and accesses must also be ensured. This is because, by observing users' queries, attackers can extract the relationships between the queries and obtain knowledge of the database that lets them decrypt details of the database structure. In this paper, for the outsourcing model, storage methods that are appropriate for providing confidentiality and protecting the access pattern are described. Finally, a segmentation-based approach is presented to protect the access pattern for outsourced data. Compared to previous methods, the experimental results indicate that our proposed method provides an acceptable level of information-disclosure prevention without imposing large storage, computation, and communication overheads.
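The segmentation idea can be illustrated with a toy sketch, assuming fixed-size segments and a client-side filter: every lookup fetches a whole (conceptually encrypted) segment, so the server observes segment-level accesses instead of record-level ones. Segment size and the store layout are assumptions; the paper's actual scheme is not reproduced here.

```python
# Fetch whole segments so the server cannot see which record was read.
SEGMENT_SIZE = 4

def segment_of(record_id):
    return record_id // SEGMENT_SIZE

def fetch(store, record_id):
    seg = segment_of(record_id)
    block = store[seg]                       # server sees only this segment id
    return block[record_id % SEGMENT_SIZE]   # client filters locally

store = {0: ["r0", "r1", "r2", "r3"], 1: ["r4", "r5", "r6", "r7"]}
print(fetch(store, 5))  # server learns segment 1 was read, not which record
```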
Data outsourcing, Data confidentiality, Access pattern protection, Information disclosure prevention, Database security.
Much of the important patient information can only be found in patient narratives or in the free-text fields of the structural schema of a Clinical Information System (CIS), so integrating free-text search facilities will improve question answering on CISs. This paper describes a method for integrating a free-text search facility into the proposed Clinical Data Analytics Language (CliniDAL) to improve its capability to answer more common clinical questions. The proposed language constructs in CliniDAL's grammar enable its parser to recognize the part of a Restricted Natural Language Query (RNLQ), entered through the CliniDAL interface, which needs a free-text resolution mechanism. Then the Natural Language Processing (NLP) approach of the CliniSearch tool finds the correct matches for the query. The search result is integrated into the translated CliniDAL query, which can be executed to return a more comprehensive answer to the initial text query. 160 queries were tested in the current work to investigate the improvements in answering more common questions from a CIS, resulting in a simple taxonomy of four query categories: unanswerable queries, queries that require more evidence to be answered, queries requiring user interpretation, and queries with suitable answers. The compatibility of query results between the structural schema and patient progress notes was examined, which showed the usability of the approach in answering queries, confirming the results from different sources and finding any inconsistencies in the data stored in the CIS. The proposed solution provides a simple mechanism for extracting knowledge from CISs.
Knowledge discovery and reuse, Question answering, Free text search, Clinical information systems.
This paper reports on a generic framework to provide clinicians with the ability to conduct complex analyses on elaborate research topics using cascaded queries to resolve internal time-event dependencies in the research questions, as an extension to the proposed Clinical Data Analytics Language (CliniDAL).
A cascaded query model is proposed to resolve internal time-event dependencies in queries. The cascade can have up to five levels of criteria, starting with a query to define the subjects to be admitted into a study, followed by a query to define the time span of the experiment. Three more cascaded queries can be required to define control groups, control variables and output variables, which together simulate a real scientific experiment (a schematic sketch follows this abstract). Depending on the complexity of the research question, the cascaded query model has the flexibility of merging some lower-level queries for simple research questions, or adding a nested query to each level to compose more complex queries. Three different scenarios (one of which contains two studies) are described and used for evaluation of the proposed solution.
CliniDAL's complex analyses solution enables answering complex queries with time-event dependencies in at most a few hours, which would take many days manually.
An evaluation of the results of the research studies, based on a comparison between the CliniDAL and SQL solutions, reveals the high usability and efficiency of CliniDAL's solution.
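The schematic rendering referenced above, in plain Python: each cascade level narrows the previous level's result set (cohort, time span, control group). The record fields and predicates are toy assumptions; the real levels compile to database queries over the underlying CIS.

```python
# A cascade of filters, each level consuming the previous level's output.
patients = [
    {"id": 1, "dx": "sepsis", "admit": 2010, "icu_days": 4},
    {"id": 2, "dx": "sepsis", "admit": 2005, "icu_days": 9},
    {"id": 3, "dx": "asthma", "admit": 2011, "icu_days": 1},
]

cascade = [
    lambda r: r["dx"] == "sepsis",   # level 1: subjects admitted to the study
    lambda r: r["admit"] >= 2008,    # level 2: experiment time span
    lambda r: r["icu_days"] < 7,     # level 3: control-group criterion
]

cohort = patients
for level in cascade:
    cohort = [r for r in cohort if level(r)]
print([r["id"] for r in cohort])  # [1]
```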
Data analytics, Time-event dependency, Scientific experiment.
To elevate the level of care in the community, it is essential to provide usable tools for healthcare professionals to extract knowledge from clinical data. In this paper a generic translation algorithm is proposed to translate a restricted natural language query (RNLQ) to a standard query language like SQL (Structured Query Language).
A special purpose clinical data analytics language (CliniDAL) has been introduced which provides a scheme of six classes of clinical questioning templates. A translation algorithm is proposed to translate a user's RNLQ to SQL queries, based on the similarity-based Top-k algorithm used in CliniDAL's mapping process. Also, a two-layer rule-based method is used to interpret the temporal expressions of the query, based on the proposed temporal model (a small sketch of this rule-based layer follows the abstract). The mapping and translation algorithms are generic and thus able to work with clinical databases in three data design models, including Entity-Relationship (ER), Entity-Attribute-Value (EAV) and XML; however, only the ER and EAV design models are implemented in the current work.
It is easy to compose an RNLQ via CliniDAL's interface, in which query terms are automatically mapped to the underlying data models of a Clinical Information System (CIS) with an accuracy of more than 84%, and the temporal expressions of the query, comprising absolute times, relative times or relative events, can be automatically mapped to time entities of the underlying CIS and to normalized temporal comparative values.
The proposed CliniDAL solution, using the generic mapping and translation algorithms enhanced by a temporal analyzer component, provides a simple mechanism for composing RNLQs for extracting knowledge from CISs with different data design models for analytics purposes.
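The sketch of the rule-based temporal layer referenced above: one pattern maps a relative expression to an anchor event plus a signed offset, which can then be resolved against the CIS's time entities. The regex and event names are illustrative assumptions, not CliniDAL's actual grammar.

```python
# Normalize a relative temporal expression to (anchor event, offset, unit).
import re

REL = re.compile(r"(\d+)\s+(day|week|month)s?\s+(before|after)\s+(\w+)")

def parse_relative(expr):
    m = REL.search(expr)
    if not m:
        return None
    n, unit, direction, event = m.groups()
    sign = -1 if direction == "before" else 1
    return {"anchor_event": event, "offset": sign * int(n), "unit": unit}

print(parse_relative("3 days before discharge"))
# {'anchor_event': 'discharge', 'offset': -3, 'unit': 'day'}
```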
Restricted natural language querying, Knowledge discovery and reuse, Data analytics
Today, the use of electronic services in taxis seems necessary. A system that lets the passenger pay the taxi fare electronically by card can decrease the need to carry cash. We describe the design and construction of such a system, based on a special taximeter with new features and one or more card readers that can be fixed into the taxi. Payment is made with a smart credit card from the client's account. The information about the services provided is stored in the system and recorded as offered services. The driver can electronically access the taxi organization, or special terminals, to view his account and receive a receipt.
Electronic Smart Card, Automation of Payment, Intelligent System.
Suggestion mining has become a popular subject in the field of natural language processing (NLP) and is useful in areas like service/product improvement. The purpose of this study is to provide an automated machine learning (ML) based approach to extract suggestions from Persian text. In this research, a novel two-step semi-supervised method is first proposed to generate a Persian dataset called ParsSugg, which is then used in the automatic classification of users' suggestions. The first step is manual labeling of data based on a proposed guideline, followed by a data augmentation phase. In the second step, using the pre-trained Persian Bidirectional Encoder Representations from Transformers (ParsBERT) model as a classifier and the data from the previous step, more data were labeled. The performance of various ML models, including Support Vector Machine (SVM), Random Forest (RF), Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and the ParsBERT language model, was examined on the generated dataset. An F-score of 97.27 for ParsBERT and about 94.5 for the SVM and CNN classifiers was obtained for the suggestion class, which is a promising result for the first research on suggestion classification for Persian texts. The proposed guideline can also be used for other NLP tasks, and the generated dataset can be used in other suggestion classification tasks.
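A minimal classification skeleton for the ParsBERT classifier stage, assuming the public HooshvareLab/bert-base-parsbert-uncased checkpoint and a binary suggestion/non-suggestion label set; the paper's fine-tuning data and hyperparameters are not reproduced. Requires the transformers and torch packages.

```python
# Score a Persian sentence with a ParsBERT sequence classifier.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "HooshvareLab/bert-base-parsbert-uncased"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

# "Please add a search feature" -- a suggestion-like example sentence.
batch = tok(["لطفا امکان جستجو اضافه شود"], return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**batch).logits
print(logits.softmax(-1))  # class probabilities (untrained head, pre-fine-tuning)
```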
Automatic classification of suggestions, Annotator, Neural networks, Pre-trained language model, Transformers.
In this study, we introduce StructmRNA, a new BERT-based model designed for the detailed analysis of mRNA sequences and structures. The success of DNABERT in understanding the intricate language of non-coding DNA with bidirectional encoder representations is extended to mRNA with StructmRNA. This new model uses a special dual-level masking technique that covers both sequence and structure, along with conditional masking. This enables StructmRNA to adeptly generate meaningful embeddings for mRNA sequences, even in the absence of explicit structural data, by capitalizing on the intricate sequence-structure correlations learned during extensive pre-training on vast datasets. Compared to well-known models like those in the Stanford OpenVaccine project, StructmRNA performs better in important tasks such as predicting RNA degradation. Thus, StructmRNA can inform better RNA-based treatments by predicting the secondary structures and biological functions of unseen mRNA sequences. The proficiency of this model is further confirmed by rigorous evaluations, revealing its unprecedented ability to generalize across various organisms and conditions, thereby marking a significant advance in the predictive analysis of mRNA for therapeutic design. With this work, we aim to set a new standard for mRNA analysis, contributing to the broader field of genomics and therapeutic development.
Prime Editors (PEs) are CRISPR-based genome engineering tools with significant potential for rectifying patient mutations. However, their usage requires experimental optimization of the prime editing guide RNA (PegRNA) to achieve high editing efficiency. This paper introduces the Deep Transformer-based Model for Predicting Prime Editing Efficiency (DTMP-Prime), a tool specifically designed to predict PegRNA activity and PE efficiency. DTMP-Prime facilitates the design of appropriate PegRNAs and ngRNAs. A transformer-based model was constructed to scrutinize a wide-ranging set of PE data, enabling the extraction of effective features of PegRNAs and target DNA sequences. The integration of these features with the proposed encoding strategy and DNABERT-based embedding has notably improved the predictive capabilities of DTMP-Prime for off-target sites. Moreover, DTMP-Prime is a promising tool for precisely predicting off-target sites in CRISPR experiments. The integration of a multi-head attention framework has additionally improved the precision and generalizability of DTMP-Prime across various PE models and cell lines. Evaluation results based on the Pearson and Spearman correlation coefficients demonstrate that DTMP-Prime outperforms other state-of-the-art models in predicting the efficiency and outcomes of PE experiments.
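DTMP-Prime's DNABERT-based embedding implies DNABERT's input convention: a sentence of overlapping k-mers. The sketch below shows only that tokenization step, assuming k=6 (as in DNABERT-6); the transformer stack, multi-head attention, and PegRNA feature channels are not reproduced here.

```python
# Turn a DNA sequence into the space-separated k-mer "sentence"
# that DNABERT-style tokenizers consume.
def dnabert_tokens(seq, k=6):
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

print(dnabert_tokens("ATGGCCATTGTAATG"))
```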
CRISPR; Deep learning; Prime editing; PegRNA; Off-target; Transfer learning; DNABERT.