Sorry for the clickbait title, but I was searching for a rhyme for the library I will present to you. 😆
Introduction
There are many use cases where we need to know the correct type of a file:
When returning/rendering a file in a web application or API, knowing the mime type can help the browser display it correctly.
In webmail or drives such as Dropbox or Google Drive, knowing the file type can help display the correct image, giving users a clue as to what the file contains.
Again, in mail clients, some malicious users may want to pass executable scripts using innocent file type extensions such as txt, so knowing the correct type of file is crucial to preventing some attacks.
So far, the best tool to detect the type of a file has been the file/libmagic project. It relies on file format heuristics. This approach can be error-prone because malicious users can confuse detection with adversarially crafted payloads. This is why a team at Google decided to tackle the problem using deep learning techniques with the Keras library. They developed a small, highly accurate model for detecting file types and integrated it into a library called Magika (now you know why I chose the title for this article). Its usage is straightforward, as we will see in the next sections.
Installation
You will need Python 3.8 or higher to install Magika.
$ pip install magika
# with poetry
$ poetry add magika
# with uv
$ uv pip install magika
If you don’t know poetry or uv, I have introduction articles on both of them.
Usage
Let’s start with the classic Hello World example.
from dataclasses import asdict
from magika import Magika

m = Magika()
result = m.identify_bytes(b'hello world')
print('path:', result.path)
print('\n== model output fields ==')
for key, value in asdict(result.dl).items():
    print(key, ':', value)
print('\n== magika output fields ==')
for key, value in asdict(result.output).items():
    print(key, ':', value)
Here is what the output looks like.
path: -
== model output fields ==
ct_label : None
score : None
group : None
mime_type : None
magic : None
description : None
== magika output fields ==
ct_label : txt
score : 1.0
group : text
mime_type : text/plain
magic : ASCII text
description : Generic text document
In the previous code, we instantiate a Magika object and use one of the three methods it provides to identify files, identify_bytes. It returns a MagikaResult instance with three pieces of information:
path: the file that was analyzed. In this case, it is - because we are passing bytes directly to the function.
dl: an instance of the ModelOutputFields dataclass. It holds the prediction made by the deep learning model.
output: an instance of the MagikaOutputFields dataclass. This is probably the object you are looking for. It is the final result, based on the prediction of the deep learning model (dl) and taking other information into account. It gives you the file type, the mime type, the confidence score, and some general information about the file format.
In the previous case, since the content to analyze was too small, the deep learning model was not called at all (at the time of writing, it is only used for content larger than 15 bytes). This is why we see None for every field under the dl object. When this is the case, depending on the input, the description will be Generic text document or Unknown binary data.
You can try another byte string, such as b'print("hello world")', and see the result for yourself. 🙃
Another important point is that the team behind this project insists that we should map file types using the ct_label field instead of the mime type, which can be error-prone. For more information, I recommend reading this section of their FAQ.
To know all the labels and therefore all the file formats supported, you can read this page. The list will probably increase over time.
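As an illustration of working with ct_label rather than the mime type, here is a minimal sketch that checks whether a file's extension matches the label Magika reported, which is the kind of check that helps with the "script disguised as txt" scenario from the introduction. The helper name extension_matches_label and the EXPECTED_EXTENSIONS table are hypothetical, and only the txt and python labels appear in this article; the pdf label is an assumption to verify against the list of supported labels.

```python
from pathlib import Path

# Hypothetical mapping from Magika content-type labels to the extensions we
# expect for them. Only "txt" and "python" are shown in this article; "pdf"
# is an assumption to check against the supported-labels list.
EXPECTED_EXTENSIONS = {
    "txt": {".txt"},
    "python": {".py"},
    "pdf": {".pdf"},
}


def extension_matches_label(filename: str, ct_label: str) -> bool:
    """Return True when the file extension is one we expect for ct_label."""
    expected = EXPECTED_EXTENSIONS.get(ct_label)
    if expected is None:
        # Label not in our table: be conservative and report a mismatch.
        return False
    return Path(filename).suffix.lower() in expected


# A file named notes.txt whose content was labeled "python" is suspicious.
print(extension_matches_label("notes.txt", "python"))  # False
print(extension_matches_label("add.py", "python"))     # True
```

With a real result, the second argument would be result.output.ct_label.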
The two other methods to identify files take Path objects as input. I use the following sample for testing: write this content to a file called add.py, for example.
def add(a: int, b: int) -> int:
    return a + b
The testing code is almost identical to the first one: apart from the extra Path import, only the identification call changes, to use identify_path. The add.py module is in the same folder as the test code.
from pathlib import Path
from dataclasses import asdict
from magika import Magika

m = Magika()
result = m.identify_path(Path('add.py'))
print('path:', result.path)
print('\n== model output fields ==')
for key, value in asdict(result.dl).items():
    print(key, ':', value)
print('\n== magika output fields ==')
for key, value in asdict(result.output).items():
    print(key, ':', value)
Interestingly, dl and output don't give the same result; dl is more accurate this time. This is because the file is too small, so the analysis is not perfect. Try adding a single comment at the top of the file, like # add.py on the first line (this is exactly what I do), and you will see that it correctly recognizes the Python file format. Remember that it is a young library, so it will improve with time. 😉
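Since dl and output can disagree, and dl can be empty when the model is skipped, it may be handy to flag those cases explicitly. Here is a minimal sketch of that idea. The helper name compare_predictions is hypothetical, and the example uses stand-in objects instead of a real MagikaResult so it runs without the library; it only relies on the dl.ct_label and output.ct_label attributes shown earlier. The label values in the stand-ins are made up for illustration.

```python
from types import SimpleNamespace


def compare_predictions(result) -> str:
    """Describe how the model prediction relates to the final output."""
    dl_label = result.dl.ct_label
    output_label = result.output.ct_label
    if dl_label is None:
        # The deep learning model was not called (content too small).
        return "model not used"
    if dl_label == output_label:
        return "agree"
    return f"disagree: dl={dl_label}, output={output_label}"


# Stand-ins that mimic the shape of a result; the values are illustrative.
small_input = SimpleNamespace(
    dl=SimpleNamespace(ct_label=None),
    output=SimpleNamespace(ct_label="txt"),
)
mismatch = SimpleNamespace(
    dl=SimpleNamespace(ct_label="python"),
    output=SimpleNamespace(ct_label="txt"),
)

print(compare_predictions(small_input))  # model not used
print(compare_predictions(mismatch))     # disagree: dl=python, output=txt
```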
The last method takes a list of paths instead of a single path, so we can slightly change the previous example.
from pathlib import Path
from dataclasses import asdict
from magika import Magika

m = Magika()
results = m.identify_paths([Path('add.py')])
for result in results:
    print('path:', result.path)
    print('\n== model output fields ==')
    for key, value in asdict(result.dl).items():
        print(key, ':', value)
    print('\n== magika output fields ==')
    for key, value in asdict(result.output).items():
        print(key, ':', value)
These two methods are recommended over identify_bytes if you need to analyze big files. They are optimized to read only a subset of the file and avoid running out of memory. Also, if you want to iterate over a folder, I recommend an approach like the following, which uses the yield keyword in a generator to consume one file at a time.
from pathlib import Path
from typing import Iterator
from magika import Magika

def iterate_dir(path: Path) -> Iterator[Path]:
    # walk the folder recursively and yield regular files one at a time
    for file in path.rglob('*'):
        if file.is_file():
            yield file

m = Magika()
# you can replace the path with one you want to test
for path in iterate_dir(Path.home() / 'Downloads'):
    result = m.identify_path(path)
    print(result.path, '--', result.output)
I tried the previous code on my Downloads folder, which has 86 files, and it was correct on 85 of them. 😁 The only one it could not guess was a PDF with a big image inside it. I guess it was confused by the image.
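To review a whole folder without reading every line of output, one option is to tally the labels. The sketch below is only an illustration: count_labels and fake_identify are hypothetical names, and the identification step is passed in as a function so the example runs without Magika installed.

```python
from collections import Counter
from pathlib import Path
from typing import Callable, Iterable


def count_labels(paths: Iterable[Path], identify: Callable[[Path], str]) -> Counter:
    """Count how many files were assigned each label."""
    return Counter(identify(path) for path in paths)


def fake_identify(path: Path) -> str:
    # Stand-in for a real detector: guesses from the extension only.
    return "python" if path.suffix == ".py" else "txt"


counts = count_labels([Path("a.py"), Path("b.py"), Path("notes.md")], fake_identify)
print(counts.most_common())  # [('python', 2), ('txt', 1)]
```

With Magika, the identify argument could be something like lambda p: m.identify_path(p).output.ct_label, and paths could come from the iterate_dir generator above.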
Magika also comes with a command-line tool you can use.
# I use my test file, use whatever you want :)
$ magika main.py
main.py: Python source (code)
You can have an output in JSON or JSONL format if you want.
$ magika main.py --json
[
    {
        "path": "main.py",
        "dl": {
            "ct_label": "python",
            "score": 0.9999998807907104,
            "group": "code",
            "mime_type": "text/x-python",
            "magic": "Python script",
            "description": "Python source"
        },
        "output": {
            "ct_label": "python",
            "score": 0.9999998807907104,
            "group": "code",
            "mime_type": "text/x-python",
            "magic": "Python script",
            "description": "Python source"
        }
    }
]
$ magika main.py --jsonl
{"path": "main.py", "dl": {"ct_label": "python", "score": 0.9999998807907104, "group": "code", "mime_type": "text/x-python", "magic": "Python script", "description": "Python source"}, "output": {"ct_label": "python", "score": 0.9999998807907104, "group": "code", "mime_type": "text/x-python", "magic": "Python script", "description": "Python source"}}
The JSONL format is more useful in automated workflows because each line is a standalone JSON object that is easy to parse.
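For instance, here is a minimal sketch of how the JSONL output could be consumed from Python using only the standard library. The function name labels_from_jsonl is hypothetical, and the sample line is a trimmed version of the output shown above; in practice, the text would come from the captured output of the magika command.

```python
import json


def labels_from_jsonl(text: str) -> dict:
    """Map each path to its final ct_label, reading one JSON object per line."""
    labels = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        labels[record["path"]] = record["output"]["ct_label"]
    return labels


# Trimmed version of the JSONL line shown above.
sample = '{"path": "main.py", "output": {"ct_label": "python", "score": 0.9999998807907104}}'
print(labels_from_jsonl(sample))  # {'main.py': 'python'}
```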
Of course, you can analyze a folder.
$ magika -r /path/to/folder
Overall it is a shiny library in the file format space. It is still young and may have some bugs, but since it is already used in production in Google products, I think you can use it to complement other techniques you already use to detect file formats such as the file utility.
That is all for this article. I hope you enjoyed reading it. Take care and see you soon. 🙃