CSC111/assignments/A3/a3.tex

\documentclass[11pt]{article}
\usepackage{amsmath}
\usepackage[utf8]{inputenc}
\usepackage[margin=0.75in]{geometry}

\title{CSC111 Assignment 3: Graphs, Recommender Systems, and Clustering}
\author{Azalea Gui \& Peter Lin}
\date{\today}

\newcommand{\N}{\mathbb{N}}
\newcommand{\Z}{\mathbb{Z}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\cO}{\mathcal{O}}
\newcommand{\floor}[1]{\left\lfloor #1 \right\rfloor}
\newcommand{\code}[1]{\texttt{#1}}

\begin{document}
\maketitle

\section*{Part 1: The book review graph and simple recommendations}

\begin{enumerate}

\item[1.]
Complete this part in the provided \texttt{a3\_part1.py} starter file.
Do \textbf{not} include your solution in this file.

\item[2.]
Running Time Analysis for \texttt{load\_review\_graph}:

Let $n$ be the number of lines in \texttt{book\_names\_file}, let $m$ be the number of lines in \texttt{reviews\_file}.

There are two operations that involves iteration in the function, one reads the book names file and creates the \texttt{mp} dictionary, and the other one reads the reviews file and adds vertices to the graph.

In creating $mp$, the program first opened the file and created a \texttt{csv.reader}, which are both constant-time operations. Then, the dictionary comprehension statement loops through all $n$ lines, running only constant-time operations in each iteration for adding the book id and name pair into the dictionary, resulting in a running time of $\Theta(n)$. Summing up all the operations for creating $mp$ and ignoring constant-time operations, the running time would be $\in \Theta(n)$.

For adding the vertices, it also opened the file and created a \texttt{csv.reader} in constant time. Then, the loop iterates through all $m$ lines. In each iteration, two vertices and one edge are added, and it also accessed $mp$ to retrieve the book name, which are all constant time operations. Therefore, the total running time would be $\in \Theta(m)$.

Since there are only constant-time operations outside the two iterating operations, the total running time of the function would be $\in \Theta(m + n)$

\item[3.]
Complete this part in the provided \texttt{a3\_part1.py} starter file.
Do \textbf{not} include your solution in this file.

\item[4.]
Complete this part in the provided \texttt{a3\_part1.py} starter file.
Do \textbf{not} include your solution in this file.

\end{enumerate}

\section*{Part 2: Weighted graphs, recommendations, review prediction}

Complete this part in the provided \texttt{a3\_part2\_recommendations.py} and \texttt{a3\_part2\_predictions.py} starter files.
Do \textbf{not} include your solution in this file.

\newpage

\section*{Part 3: Finding book clusters}

\begin{enumerate}

\item[1.]
Complete this part in the provided \texttt{a3\_part3.py} starter file.
Do \textbf{not} include your solution in this file.

\item[2.]
Complete this part in the provided \texttt{a3\_part3.py} starter file.
Do \textbf{not} include your solution in this file.

\item[3.]

\begin{enumerate}
\item[(a)]
Running Time Analysis for \texttt{cross\_cluster\_weight}:

Let $m_1$ be the size of \texttt{cluster1}, and let $m_2$ be the size of \texttt{cluster2}.

There is one nested loop in the function. The inner loop iterates $m_2$ times through all values in \texttt{cluster2}, and the outer loop iterates $m_1$ times through all values in \texttt{cluster1}. Inside the inner loop, there is only one function call \texttt{get\_weight}, which is constant-time because it only involves constant-time operations like dictionary accessing and variable assignment. After the function call, the returned value is added to \texttt{sw}, which is one constant-time operation. Let $c$ be a constant representing the number of constant time operations inside the inner loop. The nested loop will run $m_1 * m_2 * c$ operations in total, which is $\in \Theta(m_1 * m_2)$

Since there are only constant-time operations such as \texttt{len(set)}, variable assignment, and multiplication or division outside the nested loop, the running time of the function will be $\in \Theta(m_1 * m_2)$


\item[(b)]
Running Time Analysis for the inner loop in \texttt{find\_clusters\_random}:

Let $n$ be the number of vertices of \texttt{graph}.

There is one nested loop in the function.

For iteration $i$ of the outer loop, the inner loop iterates through all books in the \texttt{clusters} array, which has $n - i$ entires since each iteration of the outer loop removes an entry at the end. Inside the inner loop, the \texttt{if}, \texttt{is not}, comparisons, and variable assignment operations are all constant-time. For iteration $j$ of the inner loop, the \texttt{cross\_cluster\_weight} function call inside the inner loop has a running time of $\Theta(m_1 * m_j)$ as previously analyzed, where $m_1$ is the size of cluster $c_1$, and $m_j$ is the size of cluster $c_j$ (or $c_2$ for iteration $j$). Therefore, the running time of the function will be:


\begin{align}
RT_{\text{inner\_loop}} &= \sum_j^{n-i} m_1 * m_j \\
&= m_1 * \sum_j^{n-i} m_j
\end{align}

Since the if statement inside the inner loop ensures $m_1 \neq m_j$, and since we know that $\sum_c |c| = n$, the previously stated $\sum_j^{n-i} m_j$ will be equal to $n - |c_1| = n - m_1$. Therefore, the running time of the function is $m_1 * (n - m_1)$

Since every cluster is initialized to have one element, we know that $m_1 > 0$, and $n - m_1 < n$. Then, since $m_1$ is also $< n$, the running time $m_1 * (n - m_1)$ is bounded by $\cO(n^2)$.

\item[(c)]
Running Time Analysis for \texttt{find\_clusters\_random}:

Let $n$ be the number of vertices of graph, and let $k$ be the value of \texttt{num\_clusters}.

In the worst case scenario: Since the outer loop iterates over a range from $0$ to $n - k$, it iterates $n - k$ times. For each iteration, the inner loop have a running time of $\cO(n^2)$ as previously analyzed. Besides from the inner loop, other statements inside the outer loop includes constant-time operations print, f-string creation, \texttt{random.choice}, and varaible assignment. After that, all elements in the set $c_1$ are added to \texttt{best\_c2}, which has a worst-case running time of $\cO(n)$ because $|c_1| <= n$. Also, $c_1$ is removed from the list \texttt{clusters}, which also has a worst-case running time of $\cO(n)$ because \texttt{len(clusters)} $<= n$. Combining all of these operations, the running time of one iteration of the outer loop will be $\in \cO(n^2)$, and the runninng time of $n - k$ iterations will be $\in \cO(n^2 \cdot (n - k))$.

Outside the outer loop, there is only a return statement (constant-time) and a list comprehension that initializes the \texttt{clusters} array, which goes through $n$ iterations, executing a constant-time operation of creating a set for each iteration.

Therefore, the running time of the entire function would be $\in \cO(n^2 \cdot (n - k))$.

\item[(d)]
\emph{Not to be handed in.}
\end{enumerate}

\end{enumerate}
\end{document}