From 79de82482222e3d04028c6c8e5ee664a973a46c2 Mon Sep 17 00:00:00 2001
From: Benjamin Elizalde <26778834+bmartin1@users.noreply.github.com>
Date: Tue, 26 Sep 2023 14:40:08 -0700
Subject: [PATCH] Update README.md
---
README.md | 83 ++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 54 insertions(+), 29 deletions(-)
diff --git a/README.md b/README.md
index 2264c9f..f3d1380 100644
--- a/README.md
+++ b/README.md
@@ -1,74 +1,99 @@
+###### [Overview](#CLAP) | [Setup](#Setup) | [CLAP weights](#CLAP-weights) | [Usage](#Usage) | [Examples](#Examples) | [Citation](#Citation)
+
# CLAP
-CLAP (Contrastive Language-Audio Pretraining) is a neural network model that learns acoustic concepts from natural language supervision. It achieves SoTA in “Zero-Shot” classification, Audio-Text & Text-Audio Retrieval, and in some datasets when finetuned.
+CLAP (Contrastive Language-Audio Pretraining) is a model that learns acoustic concepts from natural language supervision and enables “Zero-Shot” inference. The model has been extensively evaluated on 26 downstream audio tasks, achieving SoTA in several of them, including classification, retrieval, and captioning.
-
-
-## Updates
-- A new CLAP version [[paper]](https://arxiv.org/abs/2309.05767) trained on 4.6M pairs will be released here soon.
+
## Setup
-You are required to just install the dependencies: `pip install -r requirements.txt` using Python 3 to get started.
+To get started, install the dependencies using Python 3: `pip install -r requirements.txt`.
If you have [conda](https://www.anaconda.com) installed, you can run the following:
```shell
git clone https://github.com/microsoft/CLAP.git && \
cd CLAP && \
-conda create -n clap python=3.8 && \
+conda create -n clap python=3.10 && \
conda activate clap && \
pip install -r requirements.txt
```
## CLAP weights
-Download CLAP weights: [Pretrained Model \[Zenodo\]](https://zenodo.org/record/7312125#.Y22vecvMIQ9)
+Download the CLAP weights (versions _2022_, _2023_, and _clapcap_): [Pretrained Model \[Zenodo\]](https://zenodo.org/record/7312125#.Y22vecvMIQ9)
+_clapcap_ is the audio captioning model that uses the 2023 encoders.
## Usage
-Please take a look at `src/examples` for usage examples.
-
-- Load model
+- Zero-Shot Classification and Retrieval
```python
+# Load model (Choose between versions '2022' or '2023')
from src import CLAP
-clap_model = CLAP("", use_cuda=False)
-```
+clap_model = CLAP("", version='2023', use_cuda=False)
-- Extract text embeddings
-```python
+# Extract text embeddings
text_embeddings = clap_model.get_text_embeddings(class_labels: List[str])
-```
-- Extract audio embeddings
-```python
+# Extract audio embeddings
audio_embeddings = clap_model.get_audio_embeddings(file_paths: List[str])
+
+# Compute similarity between audio and text embeddings
+similarities = clap_model.compute_similarity(audio_embeddings, text_embeddings)
```
-- Compute similarity
+- Audio Captioning
```python
-sim = clap_model.compute_similarity(audio_embeddings, text_embeddings)
+# Load model (Choose version 'clapcap')
+from src import CLAP
+
+clap_model = CLAP("", version='clapcap', use_cuda=False)
+
+# Generate audio captions
+captions = clap_model.generate_caption(file_paths: List[str])
```
## Examples
-To run zero-shot evaluation on the ESC50 dataset or a single audio file from ESC50, check `CLAP\src\`. For zero-shot evaluation on the ESC50 dataset:
+Take a look at `CLAP\src\` for usage examples.
+
+To run Zero-Shot Classification on the ESC50 dataset, try the following:
+
```bash
> cd src && python zero_shot_classification.py
```
-Output
+Output (version 2023)
```bash
-ESC50 Accuracy: 82.6%
+ESC50 Accuracy: 93.9%
```
## Citation
-https://arxiv.org/pdf/2206.04769.pdf
+
+Kindly cite our work if you find it useful.
+
+[CLAP: Learning Audio Concepts from Natural Language Supervision](https://ieeexplore.ieee.org/abstract/document/10095889)
```
-@article{elizalde2022clap,
- title={Clap: Learning audio concepts from natural language supervision},
- author={Elizalde, Benjamin and Deshmukh, Soham and Ismail, Mahmoud Al and Wang, Huaming},
- journal={arXiv preprint arXiv:2206.04769},
- year={2022}
+@inproceedings{CLAP2022,
+ title={CLAP: Learning Audio Concepts from Natural Language Supervision},
+ author={Elizalde, Benjamin and Deshmukh, Soham and Al Ismail, Mahmoud and Wang, Huaming},
+ booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+ pages={1--5},
+ year={2023},
+ organization={IEEE}
+}
+```
+
+[Natural Language Supervision for General-Purpose Audio Representations](https://arxiv.org/abs/2309.05767)
+```
+@misc{CLAP2023,
+ title={Natural Language Supervision for General-Purpose Audio Representations},
+ author={Benjamin Elizalde and Soham Deshmukh and Huaming Wang},
+ year={2023},
+ eprint={2309.05767},
+ archivePrefix={arXiv},
+ primaryClass={cs.SD},
+ url={https://arxiv.org/abs/2309.05767}
}
```