Road Labelling - Section 1¶

Intention¶

This is a personal project intended to let me experiment with GitHub Copilot, learn PyTorch, and have a chance to explore after a few drier take-home projects from job applications.
I want to take my set of frames and see if I can label them. As much as possible I want to automate this process of labelling without providing input labels to the system.

approach¶

I intend to do as much of the coding as possible using Copilot [in this first section all of the code was written that way, with only small tweaks]. I do not believe this is the correct approach for producing production code, and I won't do this for later sections, but I want to learn as much as possible by making mistakes and seeing what mistakes the LLM makes. Similarly, I am sure some of the things I'm trying to do with PyTorch and NNs are well-trodden ground, but I want to try to figure out my own solutions before I go looking for what other people have done, again because I want to learn what doesn't work and to have a chance to flex my creative thinking.

What is this doing?¶

I knew I wanted to work on some PyTorch programs, but one of the awkward things is coming up with a good dataset: I didn't want to spend hours collating data from somewhere online, or take a well-trodden dataset where I wouldn't feel there was any novelty. My solution was to capture an ~24 hour movie of the road outside my apartment at a rate of one frame per second (I did this before starting this section). I also didn't want to label those frames myself or with another model, so I decided to see what I could do with unlabelled data. If this sounds like laziness it probably was, but it was the kind of fun laziness that took me in an interesting direction.
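The capture script itself isn't part of this notebook. Purely as a hypothetical sketch of what it could look like (the camera index, paths and naive one-second pacing are placeholders, not the script I actually used), something like this would produce the per-second TIFF frames and the data.npy index of paths that the rest of the notebook loads:

# Hypothetical capture sketch, not the actual script: grab roughly one frame
# per second from a camera, save each as a numbered TIFF, and write a
# data.npy index of file paths in the format loaded below.
import time
import cv2
import numpy as np

cap = cv2.VideoCapture(0)               # assumed camera index
paths = []
for i in range(24 * 60 * 60):           # ~24 hours at ~1 frame per second
    ok, frame = cap.read()
    if not ok:
        break
    path = f"data/{i:06d}.tiff"
    cv2.imwrite(path, frame)
    paths.append(path)
    time.sleep(1)                       # naive pacing, good enough for a sketch
cap.release()

# Structured array with a 'data' field of file paths, matching the dtype printed later.
index = np.array([(p,) for p in paths], dtype=[('data', '<U256')])
np.save('data.npy', index)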

I've included as much of the initial investigation process as was sensible below, with thoughts/explanations in markdown where appropriate, and some final thoughts on this section at the bottom. The code, however, is entirely produced by Copilot.

Initial data preparation¶

In [ ]:
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import cv2

# Load every 32nd frame from the TIFF files listed in data.npy
data_arr = np.load('data.npy', allow_pickle=True)  # structured array with a 'data' field of file paths
all_paths = data_arr['data']
selected_paths = all_paths[::32]

frames = []
for fpath in selected_paths:
    frame = cv2.imread(fpath, cv2.IMREAD_GRAYSCALE)
    if frame is None:  # skip any paths cv2 fails to read
        continue
    frames.append(frame)
movie = np.stack(frames, axis=0)
print('Movie shape:', movie.shape)  # (num_frames, height, width)

# Reshape the movie: each frame is a data point, each pixel is a feature
num_frames = movie.shape[0]
num_pixels = np.prod(movie.shape[1:])
X = movie.reshape(num_frames, num_pixels)
print('Reshaped data shape:', X.shape)

data investigation¶

I wanted to see if there was indeed variation here that I might be able to explain. I also tend to think that PCA is a good baseline: if you can't provide more information than PCA, you probably haven't done much of use.

In [11]:
# Perform PCA and plot explained variance ratio for top 20 components
pca = PCA(n_components=20)
pca.fit(X)
explained_var = pca.explained_variance_ratio_ * 100  # percent

plt.figure(figsize=(8, 5))
plt.bar(range(1, 21), explained_var)
plt.xlabel('Principal Component')
plt.ylabel('% Variance Explained')
plt.title('Top 20 PCA Components: % Variance Explained')
plt.show()
[Figure: bar chart of the % variance explained by each of the top 20 principal components]
In [12]:
# Show 2 images with high and 2 with low principal component values for each of the top 5 PCs

# Project data onto principal components
X_pca = pca.transform(X)

num_pcs = 5
num_high = 2
num_low = 2

fig, axes = plt.subplots(num_pcs, num_high + num_low, figsize=(12, 2.5 * num_pcs))
if num_pcs == 1:
    axes = axes[np.newaxis, :]  # Ensure axes is 2D

for pc in range(num_pcs):
    # Get the projection values for this PC
    pc_values = X_pca[:, pc]
    # Indices of lowest and highest values
    low_idx = np.argsort(pc_values)[:num_low]
    high_idx = np.argsort(pc_values)[-num_high:][::-1]
    selected_idx = np.concatenate([low_idx, high_idx])
    for j, idx in enumerate(selected_idx):
        ax = axes[pc, j]
        ax.imshow(movie[idx], cmap='gray')
        if j < num_low:
            ax.set_title(f'PC{pc+1} Low #{j+1}')
        else:
            ax.set_title(f'PC{pc+1} High #{j-num_low+1}')
        ax.axis('off')
plt.tight_layout()
plt.show()
[Figure: for each of the top 5 PCs, the two frames with the lowest and the two with the highest projection values]

PCA results¶

These results are cooler than I expected. Even though a lot of the dimensionality is lost by doing PCA, we are still seeing some clear demarcations between different 'types' of images that can be picked out by eye from the above (night time vs. daytime, morning vs. afternoon, cars on the near side of the street vs. cars on the far side, etc.).

As I write this, it occurs to me it would be interesting to go deeper down into the more minor principal components and see how long it takes for the differences to stop being clear. However, I am writing this at the end of this section of work, so sadly I did not do that. All past me wanted to know was whether there were features/labels that could be extracted from and applied to these images, and the result above told me that yes, there were.
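One quick way to poke at that later would be to look at the component loadings themselves as images, reusing the pca and movie variables from the cells above. A sketch only, not something I ran as part of this section:

# Sketch only: view the PCA loadings as images, to probe how far down the
# component list they stay interpretable by eye. Reuses `pca` and `movie`.
h, w = movie.shape[1], movie.shape[2]

fig, axes = plt.subplots(4, 5, figsize=(15, 10))
for k, ax in enumerate(axes.ravel()):
    ax.imshow(pca.components_[k].reshape(h, w), cmap='coolwarm')
    ax.set_title(f'PC{k+1} loadings')
    ax.axis('off')
plt.tight_layout()
plt.show()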

performing learning¶

A lot of the code for the actual learning was done elsewhere, mostly because it was much more iterative trial and error, and I wanted to be building functions that I could re-use for future sections of the project.

concept¶

The concept here was that rather than labelling the data, I could feed random labels into a NN to learn, use that NN to label the data, then feed a subset of that labelled data into a new NN and use that to label the data again. By iterating in this way, I reasoned it would learn the most prominent features of the data. Once labels were produced, a user could then hand-label what each label means, but in a much faster way (i.e. I would have label1, and a glance at it might tell me it was stating whether or not there was a car in the image; now I have a labelled dataset with very little interaction). In an ideal world an approach such as this might also be able to tell me how many labels I could get out of the data. I limited myself to binary labels (i.e. car = yes/no) as I hoped it would simplify things. I see a lot of ways this could go wrong, and I'm sure this is an approach that has either been done and has a standard set of approaches, or has been shown not to work, but I believe the attempt on my own will be interesting.

the code¶

the learning is split into a couple of functions:

train_random_labels(data_npy_path, label1=None, label2=None, random_seed=42, start_offset=None, epochs=7)

This takes a list of data files and two sets of labels (if these are empty it will assign them randomly, hence the name), trains a NN classifier on every 32nd image to predict those 2 labels, and then returns the results of labelling the entire dataset.
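The real implementation lives in the project files rather than this notebook; the following is only a rough sketch of the shape described above, assuming (as I found out later) a simple fully-connected model on flattened frames. Layer sizes, full-batch training and the other details are placeholders, not the actual code:

import numpy as np
import torch
import torch.nn as nn
import cv2

def train_random_labels(data_npy_path, label1=None, label2=None,
                        random_seed=42, start_offset=None, epochs=7):
    """Sketch only -- not the real implementation in the project files."""
    torch.manual_seed(random_seed)
    rng = np.random.default_rng(random_seed)

    data_arr = np.load(data_npy_path, allow_pickle=True)
    paths = data_arr['data']
    offset = 0 if start_offset is None else start_offset
    train_paths = paths[offset::32]                      # every 32nd frame

    # Random starting labels if none are supplied (hence the name),
    # otherwise take the training subset of the labels passed in.
    n = len(train_paths)
    y1 = rng.integers(0, 2, n) if label1 is None else np.asarray(label1)[offset::32]
    y2 = rng.integers(0, 2, n) if label2 is None else np.asarray(label2)[offset::32]

    X = np.stack([cv2.imread(p, cv2.IMREAD_GRAYSCALE).ravel() / 255.0
                  for p in train_paths]).astype(np.float32)
    Xt = torch.from_numpy(X)
    yt = torch.from_numpy(np.stack([y1, y2], axis=1).astype(np.float32))

    # Tiny MLP with two independent binary outputs (one per label).
    model = nn.Sequential(nn.Linear(X.shape[1], 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()

    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(Xt), yt)
        loss.backward()
        opt.step()

    # Label the *entire* dataset with the trained model
    # (loading every frame at once is where memory can blow up).
    X_all = np.stack([cv2.imread(p, cv2.IMREAD_GRAYSCALE).ravel() / 255.0
                      for p in paths]).astype(np.float32)
    with torch.no_grad():
        preds = (torch.sigmoid(model(torch.from_numpy(X_all))) > 0.5).int().numpy()
    return preds[:, 0], preds[:, 1]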

repeat_training(data_npy_path, num_rounds=10, epochs=7, random_seed=42)

This takes a dataset and runs train_random_labels the defined number of times, feeding the labels from each iteration into the next and storing all of the labels back into the initial data structure.
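Again only as a sketch (building on the train_random_labels sketch above, and assuming the labels are stored as extra columns in data.npy, which matches the dtype printed below):

import numpy as np
import numpy.lib.recfunctions as rfn

def repeat_training(data_npy_path, num_rounds=10, epochs=7, random_seed=42):
    """Sketch only: iterate train_random_labels, feeding labels forward."""
    data_arr = np.load(data_npy_path, allow_pickle=True)
    label1, label2 = None, None                      # round 1 starts from random labels
    for r in range(1, num_rounds + 1):
        label1, label2 = train_random_labels(
            data_npy_path, label1=label1, label2=label2,
            random_seed=random_seed, epochs=epochs)
        # Store this round's labels as new columns, e.g. 'label1_training3'.
        data_arr = rfn.append_fields(
            data_arr, [f'label1_training{r}', f'label2_training{r}'],
            [label1.astype('<i4'), label2.astype('<i4')], usemask=False)
    np.save(data_npy_path, data_arr)
    return data_arr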

some issues¶

Most of my thoughts are at the end of this file, but a few quick ones here.

  • There is a lot here that I did quickly without much thought, e.g. why every 32nd image for training? (I'm not sure that's egregiously wrong, but it was just a choice made in a rush.)
  • Similarly I don't think having it assign 2 labels totally made sense, but this was initial testing and I have much more of an idea about how to approach this now (again, see later).
  • I did very little to check the working that Copilot did on the actual NN set-up and training. I knew at the time this was likely a mistake, but I wanted to see how it would break/fail/need more guidance, and this was a definite point of failure.
In [3]:
# Load data.npy and display the head, plus percentage of '1' for each label column
import numpy as np

data_arr = np.load('data.npy', allow_pickle=True)
print("data.npy dtype:", data_arr.dtype)
print("First 5 entries:")
print(data_arr[:5])

# Print percentage of '1' for each label column
label_cols = [col for col in data_arr.dtype.names if col.startswith('label')]
for col in label_cols:
    values = data_arr[col]
    pct_ones = 100.0 * np.sum(values == 1) / len(values)
    print(f"{col}: {pct_ones:.2f}% are 1")
data.npy dtype: [('data', '<U256'), ('label1_training1', '<i4'), ('label2_training1', '<i4'), ('label1_training2', '<i4'), ('label2_training2', '<i4'), ('label1_training3', '<i4'), ('label2_training3', '<i4'), ('label1_training4', '<i4'), ('label2_training4', '<i4'), ('label1_training5', '<i4'), ('label2_training5', '<i4'), ('label1_training6', '<i4'), ('label2_training6', '<i4'), ('label1_training7', '<i4'), ('label2_training7', '<i4'), ('label1_training8', '<i4'), ('label2_training8', '<i4'), ('label1_training9', '<i4'), ('label2_training9', '<i4'), ('label1_training10', '<i4'), ('label2_training10', '<i4')]
First 5 entries:
[('data\\000000.tiff', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
 ('data\\000001.tiff', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
 ('data\\000002.tiff', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
 ('data\\000003.tiff', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
 ('data\\000004.tiff', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)]
label1_training1: 0.06% are 1
label2_training1: 1.04% are 1
label1_training2: 0.00% are 1
label2_training2: 10.41% are 1
label1_training3: 0.00% are 1
label2_training3: 8.01% are 1
label1_training4: 0.00% are 1
label2_training4: 15.91% are 1
label1_training5: 0.00% are 1
label2_training5: 15.53% are 1
label1_training6: 0.00% are 1
label2_training6: 24.86% are 1
label1_training7: 0.00% are 1
label2_training7: 29.37% are 1
label1_training8: 0.00% are 1
label2_training8: 26.50% are 1
label1_training9: 0.00% are 1
label2_training9: 29.27% are 1
label1_training10: 0.00% are 1
label2_training10: 30.95% are 1

initial results¶

This was my first set of checks of the labelling done through this training (looking for 2 labels, starting from random assignments and iterating for 10 rounds of training). The output shows what % of each label in each round is 1, i.e. how it divides the data for that label. Clearly label1 found nothing (or rather, it found something that it believes is the same in every image, so not very useful). label2, however, found something that seems real, and it clearly becomes more confident about finding examples over subsequent iterations.

In [4]:
# Compare each subsequent round of 'label2_training' and display the final round composite image

import importlib
import label_utils
importlib.reload(label_utils)  # pick up any edits to label_utils.py without restarting the kernel
compare_labels = label_utils.compare_labels
show_labels = label_utils.show_labels
import matplotlib.pyplot as plt

# Load data.npy (already loaded as data_arr)
label_cols = [col for col in data_arr.dtype.names if col.startswith('label2_training')]
label_cols_sorted = sorted(label_cols, key=lambda x: int(''.join(filter(str.isdigit, x))))

# Print percentage agreement between each subsequent round
for i in range(len(label_cols_sorted) - 1):
    l1 = data_arr[label_cols_sorted[i]]
    l2 = data_arr[label_cols_sorted[i+1]]
    pct = compare_labels(l1, l2)
    print(f"Agreement between {label_cols_sorted[i]} and {label_cols_sorted[i+1]}: {pct:.2f}%")

# Display composite image for the final round of label2_training
final_label_col = label_cols_sorted[-1]
composite_img = show_labels(data_arr['data'], data_arr[final_label_col], label_value_1=0, label_value_2=1, max_per_group=25)

# Add a vertical gap between the two sets of images
import cv2
gap_width = 30
h = composite_img.shape[0]
w = composite_img.shape[1]
half = w // 2
left = composite_img[:, :half, :]
right = composite_img[:, half:, :]
gap = np.ones((h, gap_width, 3), dtype=np.uint8) * 128
composite_with_gap = np.hstack([left, gap, right])

# Display with larger title, and labels above left and right sides
plt.figure(figsize=(composite_with_gap.shape[1] / 100, composite_with_gap.shape[0] / 100 + 2.5))
plt.imshow(composite_with_gap[..., ::-1])
plt.axis('off')
plt.title(f"Composite for {final_label_col}", fontsize=42, pad=40)

# Add labels above left and right sides
plt.text(0, -20, "label2=0", fontsize=22, fontweight='bold', color='green', va='bottom', ha='left')
plt.text(composite_with_gap.shape[1], -20, "label2=1", fontsize=22, fontweight='bold', color='blue', va='bottom', ha='right')
plt.show()
Agreement between label2_training1 and label2_training2: 90.37%
Agreement between label2_training2 and label2_training3: 96.63%
Agreement between label2_training3 and label2_training4: 92.07%
Agreement between label2_training4 and label2_training5: 97.31%
Agreement between label2_training5 and label2_training6: 90.57%
Agreement between label2_training6 and label2_training7: 94.83%
Agreement between label2_training7 and label2_training8: 96.36%
Agreement between label2_training8 and label2_training9: 96.95%
Agreement between label2_training9 and label2_training10: 97.70%
[Figure: composite image for the final round of label2_training, frames with label2=0 on the left and label2=1 on the right]

detailed results¶

Since label2 was more successful, this looks into it in more detail. First, the text output shows the agreement of the label between consecutive rounds, that is, how it is converging between rounds. I'm pleasantly surprised by this, as I wasn't sure there would be convergence.

The second output is the image, which shows the results for label2 in the final round of labelling. In a way this is a nice result: it is clear there is a somewhat reliable difference between the left and right sides. However, I am not sure I could confidently say what the difference is, so overall I am not happy with this.

[both results use code I wrote for providing feedback on this and future rounds, found in label_utils.py]

Final thoughts and summary¶

results¶

In terms of the actual results of my approach, I'm obviously not overjoyed; however, there does seem to be some evidence of potential merit in the approach. As far as experimentation goes, this has been super useful: I'm learning a lot and having a good time doing it.

thoughts on copilot¶

As this was my first time using an LLM specifically tailored towards writing code, I've spent some time getting down my thoughts about it, included here.

  • it is unquestionably powerful; it easily spits out a function that would take me 5-10 minutes (more if I don't know that area of coding very well, for instance PyTorch here)
  • it reminds me at times of the 'write instructions for making a PB+J sandwich' problem, as it will very happily misunderstand you
  • it is definitely possible to unspool a lot of code in not much time; as a result, I think a lot of discipline will be required to keep things clean and refactor
  • I absolutely see the possibility for this to make writing code less about line-to-line coding and more about design and thinking about overall structure. I like this, as I love that part of the process.
  • there is a big danger in moving into unknown areas: I was able to train a model in PyTorch to label images without looking up a single thing. This requires further discipline to make sure you gain understanding, though it does mean that understanding isn't an immediate roadblock (so I could delay doing that learning and still progress). Since finishing this section I spent a few hours checking what it did, understanding it and re-planning. In general it did a bad job of what I intended; most glaringly, it completely removed spatial information from the training.
  • the ease of use makes it very easy to run with ideas before they are fully formed or have been thought through (for instance allowing some idiot like yours truly to completely max out his memory, or try training in a way that doesn't totally make sense)
  • for visual changes it's often better to just take control yourself; it doesn't do a good job of improving visualisations based on direction, presumably because it can't see/understand some of the concepts behind them

next steps¶

I'm excited to continue with this. I have a whole list of next steps, but some of the things I want to include immediately in section 2 are:

  • update the NN model with what I've learned: keep spatial information, do a better job training, and better utilise the probabilities of my classifier (a rough sketch of that direction follows this list)
  • make the process more iterative (attempt to learn one label at a time)
  • keep using Copilot, but interact with its output code more directly (and hopefully continue to learn its limitations/how best to use it)
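Purely as a sketch of that first point (nothing here is implemented yet, and the layer sizes and class name are placeholders), a classifier that keeps spatial information and exposes a per-frame probability for a single label might look something like:

import torch
import torch.nn as nn

# Sketch of a small CNN for section 2: keeps spatial structure instead of
# flattening the frame into a bag of pixels, and outputs a probability for
# one binary label at a time.
class FrameClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.AdaptiveAvgPool2d(1),                  # works for any frame size
        )
        self.head = nn.Linear(16, 1)

    def forward(self, x):                             # x: (batch, 1, H, W)
        logits = self.head(self.features(x).flatten(1))
        return torch.sigmoid(logits).squeeze(1)       # per-frame probability

# e.g. FrameClassifier()(torch.randn(4, 1, 240, 320)) -> tensor of shape (4,)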