A module for extracting exam marks from official PDFs, for the Faculty of Information Technology Engineering at Damascus University.
Students' exam marks at the Faculty of Information Technology Engineering at Damascus University are published as PDF documents of Excel tables.
These PDF documents are made only for display, so the marks in them can't be used in Excel sheets or other programs.
That's why the OurMarks module was created: it extracts the marks records from the PDF documents into structured data items that can be exported as CSV tables and used for any computational purpose.
This opens the opportunity for:
import * as fs from 'fs';
import * as path from 'path';
import { getDocument } from 'pdfjs-dist/legacy/build/pdf';
import { extractMarksFromDocument } from 'ourmarks';
// Read the document's data
const TARGET_DOCUMENT = path.resolve(__dirname, './documents/1617010032_programming 3 -2-f1-2021.pdf');
const documentData = fs.readFileSync(TARGET_DOCUMENT);
// Parse the marks
async function main() {
    const document = await getDocument(documentData).promise;
    const marksRecords = await extractMarksFromDocument(document);
    document.destroy();
    console.log(marksRecords);
}
// Run the asynchronous function
main().catch(console.error);
npm install ourmarks pdfjs-dist
or
yarn add ourmarks pdfjs-dist
The module provides 2 top-level asynchronous functions for extracting marks from PDF documents.
The document has to be loaded using PDF.js first, which is simple:
import { getDocument } from 'pdfjs-dist';
// Inside your main asynchronous function
async function main() {
    const document = await getDocument(rawPDFBinaryData).promise;
    // ...
    // Don't forget to destroy the document in order to free the allocated resources.
    document.destroy();
}
// Run the asynchronous function
main().catch(console.error);
On Node.js you have to import `pdfjs-dist/legacy/build/pdf` instead, due to compatibility reasons.
`rawPDFBinaryData` can be a Node.js `Buffer` object, a URL to the document, a `Uint8Array`, or one of the other options provided by PDF.js.
Then the whole document can be processed at once using `extractMarksFromDocument`:
import { extractMarksFromDocument } from 'ourmarks';
// Inside the main() function defined earlier:
const marksRecords = await extractMarksFromDocument(document);
Or it can be processed page by page using `extractMarksFromPage`:
import { extractMarksFromPage, MarkRecord } from 'ourmarks';
const wholeRecords: MarkRecord[] = [];
// Inside the main() function defined earlier:
for (let i = 1; i <= document.numPages; i++) {
    const page = await document.getPage(i);
    const pageRecords = await extractMarksFromPage(page);
    wholeRecords.push(...pageRecords);
}
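Since the extracted records are plain objects, exporting them as a CSV table takes only a few lines. The following is a minimal sketch and not part of the module: the `toCSV` and `escapeCell` helpers are hypothetical names, and the local `MarkRecord` interface simply mirrors the record fields documented further below (in real code the type would be imported from `ourmarks`).

```typescript
// Local mirror of the documented record shape (assumption: in real code,
// import the MarkRecord type from 'ourmarks' instead of redefining it).
interface MarkRecord {
    studentId: number;
    studentName: string | null;
    studentFatherName: string | null;
    practicalMark: number | null;
    theoreticalMark: number | null;
    examMark: number | null;
}

const COLUMNS: (keyof MarkRecord)[] = [
    'studentId', 'studentName', 'studentFatherName',
    'practicalMark', 'theoreticalMark', 'examMark',
];

// Quote a cell only when it contains a comma, a double quote or a line break.
function escapeCell(value: string | number | null): string {
    if (value === null) return '';
    const text = String(value);
    return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Hypothetical helper: one header line followed by one line per record.
function toCSV(records: MarkRecord[]): string {
    const lines = records.map((record) =>
        COLUMNS.map((column) => escapeCell(record[column])).join(','));
    return [COLUMNS.join(','), ...lines].join('\n');
}
```

The resulting string can then be written to disk, for example with `fs.writeFileSync`.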
In addition to the top-level `extractMarksFromDocument` and `extractMarksFromPage` functions, there are several lower-level functions for advanced users.
Using them is completely unnecessary, but if you want to explore how the module works internally, you can check the API documentation and read the 'how it works' section below.
The marks extractor works through a list of 7 steps:
The PDF document is loaded using the PDF.js library so it can be parsed.
Once the document has been loaded, it's possible to load each of its pages.
Each page in the document is loaded.
Once a page is loaded, it's possible to read its content for processing.
For each page, a list of all the text items in it is created.
Each text item has the following data structure:
Field Name | Type | Description |
---|---|---|
`string` | `string` | The content of the item |
`direction` | `'ttb'`, `'ltr'` or `'rtl'` | The direction of the item's content |
`width` | `number` | The width of the item, in document units |
`height` | `number` | The height of the item, in document units |
`transform` | `number[]` | The 3x3 transformation matrix of the item, with only 6 values stored |
`transform[0]` | `number` | The (0,0) value in the item's transformation matrix, represents scale x |
`transform[1]` | `number` | The (1,0) value in the item's transformation matrix, represents skew |
`transform[2]` | `number` | The (0,1) value in the item's transformation matrix, represents skew |
`transform[3]` | `number` | The (1,1) value in the item's transformation matrix, represents scale y |
`transform[4]` | `number` | The (0,2) value in the item's transformation matrix, represents translate x |
`transform[5]` | `number` | The (1,2) value in the item's transformation matrix, represents translate y |
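To make the six stored values concrete, the sketch below expands them into the full 3x3 matrix and applies it to a point. The `toMatrix` and `applyTransform` helpers are hypothetical names used only for this illustration; the layout follows the (row, column) positions listed in the table.

```typescript
type Transform = [number, number, number, number, number, number];

// Expand the 6 stored values into the full 3x3 matrix.
// The last row is always [0, 0, 1], which is why it isn't stored.
function toMatrix(t: Transform): number[][] {
    return [
        [t[0], t[2], t[4]],
        [t[1], t[3], t[5]],
        [0, 0, 1],
    ];
}

// Map a point from the item's own space into page space.
function applyTransform(t: Transform, x: number, y: number): [number, number] {
    return [t[0] * x + t[2] * y + t[4], t[1] * x + t[3] * y + t[5]];
}

// An unrotated item scaled by 12 and placed at (100, 700):
const example: Transform = [12, 0, 0, 12, 100, 700];
console.log(applyTransform(example, 0, 0)); // the item's origin: [ 100, 700 ]
```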
With the text items stored in a list, the loaded PDF document can be discarded safely as it's no longer needed.
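Loading each page and collecting its text items can be sketched roughly as follows. `numPages`, `getPage()` and `getTextContent()` are the PDF.js document and page members; the `PageLike` and `DocumentLike` interfaces and the `collectTextItems` helper are assumptions introduced here so the sketch stays self-contained, and the module's own internals may differ.

```typescript
// Minimal structural types standing in for the PDF.js document and page proxies.
interface PageLike {
    getTextContent(): Promise<{ items: unknown[] }>;
}

interface DocumentLike {
    numPages: number;
    getPage(pageNumber: number): Promise<PageLike>;
}

// Hypothetical helper: returns one list of text items per page.
async function collectTextItems(doc: DocumentLike): Promise<unknown[][]> {
    const pages: unknown[][] = [];
    // PDF.js page numbers start at 1.
    for (let i = 1; i <= doc.numPages; i++) {
        const page = await doc.getPage(i);
        const content = await page.getTextContent();
        pages.push(content.items);
    }
    return pages;
}
```

Once the lists are returned, the document can be destroyed with `document.destroy()` as shown in the usage examples.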
The following items are filtered out of the list:
- Items with `ttb` direction, as we're only interested in English and Arabic items.
- Items with non-zero `transform[1]` or `transform[2]`, as we're not interested in items with any rotation/skewing.
- Items with empty `''` content.
- Items with negative `transform[4]` or `transform[5]`, as they are invisible/invalid.
Then each item is mapped into a simpler data structure, in which an item is marked as Arabic if it has the `rtl` direction:
Field Name | Type | Description |
---|---|---|
`value` | `string` | The content of the simplified item |
`arabic` | `true` or `false` | Whether the item contains any Arabic characters or not |
`x` | `number` | The X coordinate of the item, equal to `transform[4]` |
`y` | `number` | The Y coordinate of the item, equal to `transform[5]` |
`width` | `number` | The width of the item |
`height` | `number` | The height of the item |
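The filtering and mapping described in this step can be sketched as below. This is an illustration under stated assumptions rather than the module's actual code: the `simplifyTextItems` name is hypothetical, the field names follow the tables in this document (PDF.js itself exposes the content and direction as `str` and `dir`), and "invisible/invalid" positions are read as negative translate values.

```typescript
interface TextItem {
    string: string;
    direction: 'ttb' | 'ltr' | 'rtl';
    width: number;
    height: number;
    transform: number[];
}

interface SimpleTextItem {
    value: string;
    arabic: boolean;
    x: number;
    y: number;
    width: number;
    height: number;
}

// Hypothetical helper combining the filtering and the mapping of this step.
function simplifyTextItems(items: TextItem[]): SimpleTextItem[] {
    return items
        // Drop vertical text.
        .filter((item) => item.direction !== 'ttb')
        // Drop rotated or skewed items.
        .filter((item) => item.transform[1] === 0 && item.transform[2] === 0)
        // Drop items with empty content.
        .filter((item) => item.string !== '')
        // Assumption: an invisible/invalid item is one with a negative position.
        .filter((item) => item.transform[4] >= 0 && item.transform[5] >= 0)
        .map((item) => ({
            value: item.string,
            arabic: item.direction === 'rtl',
            x: item.transform[4],
            y: item.transform[5],
            width: item.width,
            height: item.height,
        }));
}
```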
Update (2022-09-21): newer versions of PDF.js no longer produce this issue!
As of OurMarks 3.0.0 this step has been disabled by default, but it is still available behind an option.
It was found that Arabic content is stored as an independent text item for each character, so the characters have to be merged back into proper items.
A simple algorithm was created to solve that, here's an overview:
Please note that the coordinates in the PDF documents are bottom-left corner based.
- The error tolerance is `errorTolerance = currentItem.height / 10`.
- For the current item to be merged into the previous one, the condition `currentItem.x <= previousItem.x + previousItem.width + errorTolerance` should be met.
- The merged item's width becomes `currentItem.x + currentItem.width - previousItem.x`.
Please note that the previous item is the item on the left, and the next item is the item on the right. That's because of how the items list was sorted.
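The merging rule can be sketched as follows, using the tolerance, condition and width expressions given above. Several details are assumptions made for this illustration and are not confirmed by the text: the `mergeCloseItems` name, that all items passed in sit on the same line, that Arabic and non-Arabic items are never merged together, that Arabic content is joined right to left, and that the merged height is the larger of the two.

```typescript
interface SimpleTextItem {
    value: string;
    arabic: boolean;
    x: number;
    y: number;
    width: number;
    height: number;
}

// Hypothetical helper: merges neighbouring items of a single line.
function mergeCloseItems(items: SimpleTextItem[]): SimpleTextItem[] {
    // Sort left to right, so the previous item is always the one on the left.
    const sorted = [...items].sort((a, b) => a.x - b.x);
    const merged: SimpleTextItem[] = [];

    for (const currentItem of sorted) {
        const previousItem = merged[merged.length - 1];
        const errorTolerance = currentItem.height / 10;

        if (
            previousItem === undefined
            || previousItem.arabic !== currentItem.arabic // assumption: never mix scripts
            || currentItem.x > previousItem.x + previousItem.width + errorTolerance
        ) {
            // Too far apart (or nothing to merge with): start a new item.
            merged.push({ ...currentItem });
            continue;
        }

        // Assumption: Arabic reads right to left, so the right-hand item comes first.
        previousItem.value = previousItem.arabic
            ? currentItem.value + previousItem.value
            : previousItem.value + currentItem.value;
        previousItem.width = currentItem.x + currentItem.width - previousItem.x;
        previousItem.height = Math.max(previousItem.height, currentItem.height);
    }

    return merged;
}
```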
The text items can be now shaped into a table structure, which is a 2-dimensional list of the items.
The first dimension is for the rows, and the second dimension is for the cells.
Two items belong to the same row when their vertical extents overlap, that is when `itemA.y >= itemB.y && itemA.y <= itemB.y + itemB.height` or `itemB.y >= itemA.y && itemB.y <= itemA.y + itemA.height`.
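Grouping the items into rows with this condition can be sketched as below. The `inSameRow` and `groupIntoRows` names are hypothetical, and the ordering (rows from top to bottom, cells from left to right) is an assumption made for the illustration.

```typescript
interface SimpleTextItem {
    value: string;
    arabic: boolean;
    x: number;
    y: number;
    width: number;
    height: number;
}

// The same-row condition, checked in both directions.
function inSameRow(itemA: SimpleTextItem, itemB: SimpleTextItem): boolean {
    return (itemA.y >= itemB.y && itemA.y <= itemB.y + itemB.height)
        || (itemB.y >= itemA.y && itemB.y <= itemA.y + itemA.height);
}

// Hypothetical helper: builds a 2-dimensional list of rows and cells.
function groupIntoRows(items: SimpleTextItem[]): SimpleTextItem[][] {
    const rows: SimpleTextItem[][] = [];
    // Coordinates start at the bottom-left corner, so a larger y is higher on the page.
    const sorted = [...items].sort((a, b) => b.y - a.y);

    for (const item of sorted) {
        const row = rows.find((candidate) => candidate.some((other) => inSameRow(item, other)));
        if (row) row.push(item);
        else rows.push([item]);
    }

    // Assumption: cells are ordered from left to right inside each row.
    for (const row of rows) row.sort((a, b) => a.x - b.x);
    return rows;
}
```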
Now that the simplified text items have been stored in a table structure, it's possible to iterate over its rows and extract marks records.
Mark records have the following data structure:
Field Name | Type | Description |
---|---|---|
`studentId` | `number` | The exam ID of the student, a 5-digit number |
`studentName` | `string` / `null` | The full name of the student, may contain his father's name in some situations |
`studentFatherName` | `string` / `null` | The name of the student's father when not included in the full name |
`practicalMark` | `number` / `null` | The practical mark of the exam, usually out of 20 or 30 |
`theoreticalMark` | `number` / `null` | The theoretical mark of the exam, usually out of 80 or 70 |
`examMark` | `number` / `null` | The total mark of the exam, should be out of 100 |
All the fields (except `studentId`) can be `null` because they might be missing from the table, or malformed with other values.
The marks records are extracted using the following algorithm:
- […] `studentId` is that item.
- […] if `studentName` is `null`, then set it to the item.
- […] if `studentFatherName` is `null`, then set it to the item.
- […] `examMark`.
- […] `practicalMark`, `theoreticalMark` and `examMark`, in this order.
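As a rough illustration of turning one table row into a record, here is a sketch built on explicit assumptions, since it is not the module's actual implementation: the `extractRecordFromRow` name is hypothetical, the cells are assumed to arrive in reading order, the first 5-digit number is taken as the `studentId`, the Arabic items that follow fill `studentName` and then `studentFatherName`, and a single remaining number is treated as the `examMark` while several are read as practical, theoretical and total marks.

```typescript
interface RowCell {
    value: string;
    arabic: boolean;
}

interface MarkRecord {
    studentId: number;
    studentName: string | null;
    studentFatherName: string | null;
    practicalMark: number | null;
    theoreticalMark: number | null;
    examMark: number | null;
}

// Hypothetical helper: returns null when the row holds no student ID.
function extractRecordFromRow(cells: RowCell[]): MarkRecord | null {
    let studentId: number | null = null;
    let studentName: string | null = null;
    let studentFatherName: string | null = null;
    const marks: number[] = [];

    for (const cell of cells) {
        const text = cell.value.trim();

        if (studentId === null) {
            // Assumption: the first 5-digit number in the row is the exam ID.
            if (/^\d{5}$/.test(text)) studentId = Number(text);
            continue;
        }

        if (cell.arabic) {
            if (studentName === null) studentName = text;
            else if (studentFatherName === null) studentFatherName = text;
        } else if (/^\d+$/.test(text)) {
            marks.push(Number(text));
        }
    }

    if (studentId === null) return null;

    // Assumption: a lone number is the total mark of the exam.
    const single = marks.length === 1;
    return {
        studentId,
        studentName,
        studentFatherName,
        practicalMark: single ? null : (marks[0] ?? null),
        theoreticalMark: single ? null : (marks[1] ?? null),
        examMark: single ? (marks[0] ?? null) : (marks[2] ?? null),
    };
}
```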