Notes on Using Data Science and Machine Learning
I am a principal applied scientist at Spectrum Labs. I have recently transitioned from particle physics research at CERN to machine learning research.
This site is a collection of notes that I keep for quick reference to commonly used code snippets and to document some of the concepts I am learning.
Python
References
Basics
- CLI arguments and config with click
- Combinatorics
- Creating a python package
- Functools:reduce()
- Handling exceptions: ignore
- Hashing strings
- id() of an object
- Inspect module and get traceback
- Itertools:groupby()
- Jupyter kernel versions
- Jupyter Watermark
- Logging
- Manipulate maps
- Manipulation of directories and files
- map, zip, eval, ord, dir, pow functions
- Membership test
- Most frequent element in a list
- Print statements
- Print string at a fixed width
- Python built-in keywords
- Remove everything after a character in a string
- Substring key match in a dictionary
- Time and profile your code
- tqdm cool progress meter
- Tuple and namedtuple
- Usage of underscores
- Using pipenv and saving python environment
- Using pyenv and upgrading python
Numpy
- Calculating cosine between two vectors
- Copy, shallow copy, deep copy in Numpy
- Covariance, Correlation, and eigenvalues
- Cumulative sum `cumsum`
- Dot product
- Find and count unique elements in array
- Find index of element with where
- Majority vote: argmax, bincount, average
- Searching
- Select array entries with another array
Machine Learning
NLP
- Bag of words
- Clean text with Regex
- Embedding an ML model into a Web Application
- Online algorithms and out-of-core learning
- Processing documents into tokens (tokenization and stop words)
- Regular expressions
- Sentiment analysis in text
- Setting up a SQLite database for data storage
- Topic modeling with Latent Dirichlet Allocation
Preprocessing Voice
Preprocessing Text
Scikit-Learn
- Bagging in ensemble methods
- Boosting in ensemble methods
- Clustering: K-means, agglomerative with dendrograms, and DBSCAN
- Combining classifiers via majority vote
- Confusion matrix, ROC curve, evaluation metrics
- Decision Boundary
- Decision tree and random forest
- Feature scaling
- K-fold cross-validation
- K-nearest neighbors (knn)
- Learning and validation curves
- Linear regression and its scores: MSE and R^2
- Linear scikit-learn classifiers
- Logistic regression with L1 norm
- Non-linear SVM kernels
- Optimizing the precision and recall of a classification model
- PCA
- Polynomial regression
- Resampling for class imbalance
- ROC curve for multiclass problem
- Tuning hyperparameters via grid search
Linux
Basics
- Access and download files via sftp
- Choose random files from a file
- Copy random file from subfolder to new subfolders
- Details about a process and time it was running
- Downgrade Java version in macOS
- Find Files
- How many files in the folder?
- Linux system info
- Measure internet speed
- Mount external hard drive
- Playing sound from a Linux server on a macOS client
- Search for a substring inside all files
- Setup a Deep Learning Server
- Setup port forwarding and ip addresses
- ssh keys in remote servers
- Working with remote files via ssh
- Zip Files and send over ssh
This site is inspired by this repo.