Classifying Gene Mutations

This competition was a Kaggle challenge to develop classification models that analyze medical articles and, based on their content, accurately determine the oncogenicity (4 classes) and the mutation effect (9 classes) of the genes described. See the NIPS Competition Track for further submissions.
Approach
I applied Natural Language Processing and Understanding methods to transform each article into features for several machine learning classifiers and compared their performance.
First, I computed term frequency–inverse document frequency (TF-IDF) scores and inspected the output as word clouds, hoping this would build an intuition for the articles and facilitate my understanding of the subject matter. In the process I realized that this was not the case, and my hope of being able to classify an article myself simply by looking at its word cloud faded quickly.
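A minimal sketch of this TF-IDF / word-cloud inspection step, assuming the articles are already loaded as a list of strings; the `wordcloud` package and the sample texts below are illustrative assumptions, not the original pipeline:

```python
# TF-IDF weights rendered as a word cloud; `articles` is an illustrative stand-in.
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud
import matplotlib.pyplot as plt

articles = [
    "BRCA1 truncating mutations are predicted to cause loss of function",
    "Activating kinase-domain mutations in EGFR confer gain of function",
]

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf = vectorizer.fit_transform(articles)

# Average TF-IDF weight per term across the corpus, then draw the cloud.
weights = dict(zip(vectorizer.get_feature_names_out(), tfidf.mean(axis=0).A1))
cloud = WordCloud(width=800, height=400).generate_from_frequencies(weights)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```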
Models tested and evaluated:
- RandomForestClassifier
- XGBoost
NLP methods:
- Continuous Bag-of-Words (CBOW) models
- N-gram language-modelling word embeddings
- Bi-LSTM models with a Conditional Random Field (CRF) layer
Exploring various frameworks, including Keras, TensorFlow, and Theano, was enlightening, but PyTorch's comprehensive documentation and its exemplary natural language processing examples resonated with me most. The elegance and simplicity of its pythonic architecture, especially the autograd tensor operations, left a lasting impression and encouraged me to use them in future projects.
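To illustrate the Continuous Bag-of-Words approach listed above, here is a minimal PyTorch sketch of the idea; the vocabulary size, window width, and dummy batch are assumptions, not the actual training setup:

```python
# CBOW: predict a target word from the averaged embeddings of its context window.
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim=100):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, context_ids):              # (batch, 2 * window)
        embedded = self.embeddings(context_ids).mean(dim=1)
        return self.linear(embedded)              # logits over the vocabulary

model = CBOW(vocab_size=20_000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

context = torch.randint(0, 20_000, (32, 4))      # dummy batch: 32 windows of 4 context words
target = torch.randint(0, 20_000, (32,))

optimizer.zero_grad()
loss = loss_fn(model(context), target)
loss.backward()                                   # autograd computes the gradients
optimizer.step()
```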
My initial experiments also included a comparison between the StanfordCoreNLP library and NLTK. While StanfordCoreNLP demonstrated superior performance in native Java, integrating it with a large 4GB dataset in a Python environment proved challenging. This experience led me to rely more on NLTK for lemmatizing and stemming, underscoring the importance of practicality and efficiency in data processing.
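A short sketch of the kind of NLTK preprocessing this refers to (tokenizing, stemming, lemmatizing), with a purely illustrative sample sentence:

```python
# Tokenize, stem, and lemmatize one sentence with NLTK.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")    # plus "punkt_tab" on newer NLTK versions
nltk.download("wordnet")

text = "Truncating mutations inactivated the tumor suppressor genes."
tokens = word_tokenize(text.lower())

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stems = [stemmer.stem(t) for t in tokens]
lemmas = [lemmatizer.lemmatize(t) for t in tokens]
print(stems)
print(lemmas)
```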
Stack used:
- NumPy, Pandas, sklearn
- PyTorch
- NLTK Corpora
- Stanford CoreNLP
Methodology:
- Aggregate new features with NLP methods to prepare the articles for the classifiers.
- XGBoost Classification
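A hedged sketch of this XGBoost classification step, assuming a dense feature matrix produced by the NLP pipeline and integer class labels; both are randomly generated below for illustration only:

```python
# Multi-class XGBoost on NLP-derived features, scored with log loss.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

X = np.random.rand(500, 300)              # stand-in for TF-IDF / embedding features
y = np.random.randint(0, 9, size=500)     # 9 mutation-effect classes

X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=42)

clf = xgb.XGBClassifier(
    objective="multi:softprob",
    max_depth=6,
    n_estimators=300,
    learning_rate=0.1,
)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_val)
print("validation log loss:", log_loss(y_val, proba, labels=clf.classes_))
```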
Notes to Self
Advanced Machine Learning Techniques:
Explore more advanced NLP techniques such as transformer-based models (e.g., BERT, GPT), which have shown great promise in understanding context and semantics in text.
Consider using attention mechanisms in your models, especially when dealing with long sequences in medical texts.
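As a minimal PyTorch sketch of the scaled dot-product attention at the core of such models; the shapes and tensors below are illustrative only:

```python
# Scaled dot-product self-attention over a batch of token representations.
import math
import torch

def scaled_dot_product_attention(query, key, value):
    # query/key/value: (batch, seq_len, d_model)
    scores = query @ key.transpose(-2, -1) / math.sqrt(query.size(-1))
    weights = torch.softmax(scores, dim=-1)     # attention distribution over the sequence
    return weights @ value

x = torch.randn(2, 128, 64)                     # e.g. token representations of an article chunk
out = scaled_dot_product_attention(x, x, x)     # self-attention
```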
Data Preprocessing and Feature Engineering:
Experiment with more sophisticated text preprocessing techniques. Given the specialized language of medical texts, custom stopwords lists and domain-specific tokenization might improve model performance.
Explore feature selection methods to identify the most informative features, potentially reducing model complexity and improving performance.
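One possible sketch of these two ideas combined: a custom, domain-flavored stop-word list fed to the vectorizer, followed by chi-squared feature selection. All names and sample data here are assumptions, not the original pipeline:

```python
# Custom stop words + chi-squared feature selection on TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

medical_stopwords = ["patient", "study", "figure", "table"]   # illustrative list
articles = [
    "BRCA1 truncating mutation predicted loss of function",
    "EGFR kinase domain mutation confers gain of function",
]
labels = [0, 1]

tfidf = TfidfVectorizer(stop_words=medical_stopwords)
X = tfidf.fit_transform(articles)

# Keep the most informative terms according to the chi-squared statistic.
selector = SelectKBest(chi2, k=min(1000, X.shape[1]))
X_selected = selector.fit_transform(X, labels)
```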
Model Interpretability:
Given the critical nature of the domain, focus on model interpretability. Tools like SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) can be used to understand feature importance and model decisions.
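A sketch of how SHAP could be applied here, assuming the fitted XGBoost classifier `clf` and validation features `X_val` from the methodology sketch above:

```python
# Tree-based SHAP values for the fitted XGBoost model.
import shap

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_val)   # per-class feature contributions
shap.summary_plot(shap_values, X_val)        # global view of feature importance
```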
Performance Optimization:
Utilize skills in PySpark for distributed computing to handle large datasets more efficiently.
Experiment with optimization techniques when training models, such as hyperparameter tuning with tools like Hyperopt or Bayesian optimization.
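A hedged sketch of Hyperopt-based tuning for the XGBoost classifier; the search space and the random data are illustrative assumptions:

```python
# Tree-structured Parzen Estimator search over a few XGBoost hyperparameters.
import numpy as np
import xgboost as xgb
from hyperopt import fmin, tpe, hp, STATUS_OK
from sklearn.model_selection import cross_val_score

X = np.random.rand(500, 50)
y = np.random.randint(0, 9, size=500)

space = {
    "max_depth": hp.choice("max_depth", [3, 5, 7, 9]),
    "learning_rate": hp.loguniform("learning_rate", np.log(0.01), np.log(0.3)),
    "n_estimators": hp.choice("n_estimators", [50, 100, 200]),
}

def objective(params):
    clf = xgb.XGBClassifier(**params)
    score = cross_val_score(clf, X, y, cv=3, scoring="neg_log_loss").mean()
    return {"loss": -score, "status": STATUS_OK}   # Hyperopt minimizes the loss

best = fmin(objective, space, algo=tpe.suggest, max_evals=10)
print(best)
```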
Cross-validation and Model Evaluation:
Implement rigorous cross-validation techniques to ensure the model's robustness and generalizability.
Use a range of evaluation metrics (beyond log loss) to assess model performance, especially focusing on metrics important in medical diagnostics like sensitivity, specificity, and area under the ROC curve.
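A sketch of such an evaluation with stratified cross-validation and several scorers; the model choice and the random data are assumptions carried over from the earlier examples:

```python
# Stratified 5-fold cross-validation with multiple multi-class metrics.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(500, 50)
y = np.random.randint(0, 9, size=500)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    RandomForestClassifier(n_estimators=200, random_state=42),
    X, y, cv=cv,
    scoring={
        "log_loss": "neg_log_loss",
        "recall_macro": "recall_macro",   # macro-averaged sensitivity
        "roc_auc_ovr": "roc_auc_ovr",     # one-vs-rest AUC for the multi-class setting
    },
)
for name, values in scores.items():
    print(name, values.mean())
```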
Kaggle: MSK Redefining Cancer Treatment
See notebooks for reference >> here