Hey,
I take tuition outside of school to aid in my studies. They provide me with videos, notes, problems, and solutions after each session, but viewing them on my tuition’s website is a hassle. The UI/UX is terrible, making the site infuriating to navigate, and I can’t blame them: their job isn’t webdev but teaching, which they’re good at.
So I decided to do something about it.
They don’t have download buttons for their videos on their website, because they don’t want me to download their videos, which is understandable.
But I easily circumvented this by using XDMan, paired with its Firefox browser extension.
Their videos occupy a significant amount of storage despite their terrible quality.
I noticed the videos were using the H.264 codec, and thought that transcoding them to H.265 might take up less space. Since each one is mostly static slides over a 3-hour lecture, I didn’t expect much of a quality drop. Transcoding is a CPU-heavy process, but luckily, I happen to have a decent computer I already use as my server to store my media.
I came up with the idea of writing a systemd service to transcode the videos as soon as I upload them to my server.
Here are the unit file and the script I wrote for the process:
# /etc/systemd/system/transcoding.service
[Unit]
Description=Transcode tuition videos

[Service]
ExecStart=/home/agastya/Videos/Tuition/transcode.sh
WorkingDirectory=/home/agastya/Videos/Tuition/
Restart=always
User=user

[Install]
WantedBy=multi-user.target
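If you want to run something similar, the unit has to be loaded and enabled once. A sketch of the one-time setup, assuming the same paths and unit name as above (run as root, and adjust for your own setup):

```shell
# Make the worker script executable
chmod +x /home/agastya/Videos/Tuition/transcode.sh

# Tell systemd about the new unit, then start it now and on every boot
systemctl daemon-reload
systemctl enable --now transcoding.service

# Follow the service's output to watch transcodes happen
journalctl -u transcoding.service -f
```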
#!/bin/bash
# Remove leftovers from transcodes that were interrupted mid-run
find . -name '*.temptrans*' -delete
folder_path="."
IFS=$'\n'
while true; do
    files=$(find "$folder_path" -type f \( -name "*.mp4" -o -name "*.mkv" -o -name "*.ts" \))
    for file in $files; do
        if lsof -w "$file" > /dev/null; then
            # Still open by another process (e.g. mid-upload); retry later
            echo "$file busy"
            sleep 2
        else
            if [[ $(ffprobe -v error -select_streams v:0 -show_entries stream=codec_name -of default=noprint_wrappers=1:nokey=1 "$file") != "hevc" ]]; then
                ffmpeg -hide_banner -i "$file" -c:v libx265 -crf 28 -c:a aac -b:a 128k "${file%.*}.h265.temptrans.mp4"
                if [ $? -eq 0 ]; then
                    # Only replace the original once the transcode succeeded
                    mv "${file%.*}.h265.temptrans.mp4" "${file%.*}.h265.mp4"
                    rm "$file"
                fi
            fi
        fi
    done
    sleep 5
done
All it does is loop over the root directory of the videos, checking each video file’s codec with ffprobe, and if it isn’t H.265, it starts transcoding. I’ve also made it so that if a file is currently being written to, maybe while it’s still being uploaded from my laptop to the server, the script skips it. The script transcodes to a temporary file, and after finishing, replaces the original with the transcoded video file.
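The naming trick behind that replacement is bash parameter expansion: `${file%.*}` drops the shortest suffix matching `.*`, i.e. the extension. A quick sketch of the names the script produces, with a made-up filename:

```shell
file="Lecture_01.ts"

# ffmpeg writes to the temporary name first
tmpfile="${file%.*}.h265.temptrans.mp4"
# renamed to the final name only if the transcode succeeded
final="${file%.*}.h265.mp4"

echo "$tmpfile"   # Lecture_01.h265.temptrans.mp4
echo "$final"     # Lecture_01.h265.mp4
```

Because the temporary name contains `temptrans`, the cleanup `find` at the top of the script can safely delete half-finished files after a crash without touching anything else.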
This, paired with my Jellyfin server, gave me a beautiful and unparalleled studying experience. I might do a post on setting up a media server with Jellyfin.
The gratifying sense of accomplishment from automating this sparked a desire to automate more, even if it meant chasing something insignificant.
Then I remembered my tuition supplies us with PDF files of problems, which have their text embedded as images. That’s frustrating, because it forces me to type out an entire question whenever I need to google it.
So I created another script to automatically OCR every document in my media folder.
#!/bin/bash
folder_path="."
ocr_script="./ocrpdf.sh"
while true; do
    files=$(find "$folder_path" -type f -name "*Module.pdf")
    IFS=$'\n'
    for file in $files; do
        # The OCRed copy gets a _t suffix; skip files that already have one
        t_file="${file%Module.pdf}Module_t.pdf"
        if [ ! -f "$t_file" ]; then
            if lsof -w "$file" > /dev/null; then
                echo "File '$file' is being written to. Skipping OCR."
            else
                echo "Running OCR for file: '$file'"
                "$ocr_script" "$file"
            fi
        fi
    done
    sleep 5
done
#!/bin/bash
if ! command -v ocrmypdf &>/dev/null; then
    echo "Can't find ocrmypdf"
    exit 1
fi
if [[ $# -ne 1 ]]; then
    echo "Usage: $0 <filename.pdf>"
    exit 1
fi
filename="$1"
if [[ ! -f "$filename" ]]; then
    echo "File not found."
    exit 1
fi
new_filename="${filename%.*}_t.pdf"
ocrmypdf "$filename" "$new_filename"
echo "OCR completed"
Well, two scripts, technically.
The second one is a helper script whose job is to actually scan the PDF and OCR it, using the ocrmypdf utility.
The first script is the actual service: it loops over all PDF files ending in “Module.pdf”, since those are the only ones that need OCRing, and invokes the second script on each. And like last time, it skips files that are currently being written to by another program.
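The “already OCRed?” check works because the output name is derived purely from the input name, so it can be recomputed on every pass with no state to track. A minimal sketch of that logic with made-up filenames in a throwaway directory:

```shell
tmp=$(mktemp -d)
# Two source PDFs; Physics already has an OCRed _t copy
touch "$tmp/Maths_Module.pdf" "$tmp/Physics_Module.pdf" "$tmp/Physics_Module_t.pdf"

pending=()
for f in "$tmp"/*Module.pdf; do        # _t.pdf copies don't match this glob
    t="${f%Module.pdf}Module_t.pdf"    # recompute the expected output name
    [ -f "$t" ] || pending+=("$f")     # keep only files not yet OCRed
done
printf '%s\n' "${pending[@]}"          # only Maths_Module.pdf remains pending
rm -rf "$tmp"
```

On each 5-second pass the service re-runs this check, so a freshly uploaded module gets picked up automatically and an already-processed one is skipped for free.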
So there you have it: scripts that took me an hour to write and save me a couple dozen minutes of manual work. At least they give me satisfaction.