
I recently tried to understand how the socket module works in Python. I opened the source code, and by tracing the socket class I found that it uses something like _socket.socket. I scrolled up, found import _socket, traced it, and found that the module is located in another folder named DLLs. (I have no idea how this works. I do know that you can import files from the Python install location no matter where your own file is located, but how? It would be great if you could answer this too.) Opening the file with Notepad (it had no default extension association) shows what looks like an awkward encoding. Here are the first few lines of _socket.pyd:

MZ       ÿÿ  ¸       @                                     º ´  Í!¸LÍ!This program cannot be run in DOS mode.

    $       jâò.ƒã¡.ƒã¡.ƒã¡'ûp¡(ƒã¡B÷â ,ƒã¡B÷æ "ƒã¡B÷ç &ƒã¡B÷à -ƒã¡÷÷â ,ƒã¡uëâ )ƒã¡.ƒâ¡”ƒã¡÷÷î /ƒã¡÷÷ã /ƒã¡÷÷¡/ƒã¡÷÷á /ƒã¡Rich.ƒã¡                PE  d† ;3`        ð "  z   ¤      ¨(        €                        `    ð=  `                                   põ  P   Àõ  ´    @       0 €
        °   P ¸  ô¡  T                           P¢  8                                       .text   ny      z                    `.rdata  ¬y      z   ~              @  @.data   (        ø              @  À.pdata  €
       0                  @  @.rsrc       @  
                    @  @.reloc  ¸   P                  @  B                                                                                                                                                                                                                                                                H‰\$H‰t$UWATAVAWH¬$ÿÿÿHìð  H‹àÿ  H3ÄH‰…è   ¹  HT$ ÿf  …À…>,  H
      ÿ9„  3ÿƒ=è ÿDg…„   3ÒÇD$   A¸   H‰|$,HL$4è5(  3Àf‰}6A°‰E8W3Éÿ0€  A°A‹ÔH‹Èÿ!€  A°W H‹Èÿ€  W#ÇD$$   L‹ÀD‰d$(HL$ fD‰e4ÿß  ‹Ï…À•Á‰
    Z H‹£ƒ  H

Does anyone have any idea how to decode this into plain Python code? (I only know that .pyd files are DLL files in a Python-specific format.) After hours of googling I also found that DLL and EXE files share the same encoding, so it would be great if anyone could link a decoding tool, or at least give me a table of this encoding's characters so I can decode it on my own.

  • It's a binary file that was the result of compiling C code. You can't convert it to Python code. – mkrieger1 Jun 16 '21 at 08:56
  • If you want to understand how the socket module works, you can read the original C source code, since Python is open source software. – mkrieger1 Jun 16 '21 at 08:56
  • So how was compiled C code imported into Python? –  Jun 16 '21 at 08:57
  • By using the import mechanism of Python. – mkrieger1 Jun 16 '21 at 08:57
  • @mkrieger1 I can't find the original source code. Can you give a link if you know where it is? –  Jun 16 '21 at 08:58
  • Search here: https://github.com/python/cpython/ – Seems like this is the file: https://github.com/python/cpython/blob/bb3e0c240bc60fe08d332ff5955d54197f79751c/Modules/socketmodule.c – mkrieger1 Jun 16 '21 at 08:59
  • Are you sure this is the file? (The code at the start makes me feel that this is for Mac, not Windows 10.) –  Jun 16 '21 at 09:02

2 Answers


DLL and EXE files are both "binary" formats; on Windows, this is the PE format. They contain compiled machine code that cannot be converted back to Python code (and was never Python code to begin with). Python supports extension modules that are written in C but called from Python. The low-level part of Python's socket library, the _socket module, is written in C, and Python knows how to call into it.
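As a rough illustration of what "a format" means here, the sketch below recognizes a PE file by its signature bytes rather than by any text encoding. The function name `identify_binary` and the synthetic header are made up for this example; the layout it relies on (the "MZ" marker at the start, and a 4-byte offset at 0x3C pointing to the "PE\0\0" signature) is the standard PE layout, which is also why the dump in the question begins with MZ.

```python
import struct


def identify_binary(data: bytes) -> str:
    """Guess the container format of a compiled file from its leading bytes."""
    if data[:4] == b"\x7fELF":
        return "ELF"  # the format used for executables and .so files on Linux
    if data[:2] == b"MZ" and len(data) >= 0x40:
        # Offset 0x3C of the DOS header stores the file offset of the PE signature.
        (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
        if data[pe_offset:pe_offset + 4] == b"PE\x00\x00":
            return "PE"
    return "unknown"


# Synthetic header for demonstration only: "MZ", zero padding,
# a pointer at 0x3C, and the "PE\0\0" signature at offset 0x80.
fake = bytearray(0x84)
fake[0:2] = b"MZ"
struct.pack_into("<I", fake, 0x3C, 0x80)
fake[0x80:0x84] = b"PE\x00\x00"

print(identify_binary(bytes(fake)))  # PE
```

To try it on a real file, you could pass in the first few kilobytes read in binary mode, for example `open(path, "rb").read(4096)`, where `path` points at the .pyd file.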

To look at the socket code, you'll need to find the corresponding C source file in the CPython repository. Alternatively, you can use a disassembler like IDA Pro or Ghidra to get an assembly representation, though if you don't yet understand binary formats, this may not be of much use. Ghidra (and Hex-Rays for IDA Pro) will also attempt to decompile the assembly, giving you an approximation of the original source, but without the original variable names and with types that are only inferred.

But if you are looking for the Python code that sits behind _socket, none exists.
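You can ask the import system itself about this. The following is a minimal sketch using `importlib.util.find_spec` from the standard library; the exact path it prints depends on your installation, and on some builds `_socket` may be compiled into the interpreter and reported as "built-in" instead of as a file.

```python
import importlib.util

# Locate the compiled module without relying on its file name.
spec = importlib.util.find_spec("_socket")
print(spec.origin)  # a .pyd/.so path, or "built-in" on some builds
print(spec.loader)  # the loader responsible for compiled modules

# Loaders for compiled modules have no Python source to hand back.
compiled_source = spec.loader.get_source("_socket")
print(compiled_source)  # None

# The pure-Python wrapper, socket.py, usually does have source available.
wrapper_spec = importlib.util.find_spec("socket")
wrapper_source = wrapper_spec.loader.get_source("socket")
print(wrapper_source is not None)
```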

Program Differences

  • Compiled Languages

    • These languages take source code (C, C++, etc.) and turn it into machine code, which is most directly represented by an assembly language. The program runs natively on the host machine, meaning it doesn't need any sort of interpreter; it's in a format the OS understands. The original source code is lost in that there is no direct mapping back to it. Advanced decompilers can make inferences, but these are often imperfect and give only general guesses as to what the original source code looked like. There is no encoding that lets you parse the source code out of the binary format.
  • Interpreted Languages

    • These languages run an interpreter (which is a native program in a format the OS understands, i.e. PE) that interprets source code and dynamically turns it into machine code the processor understands. This is how Python works, and it is why the source code ships with the program you run. But you can only run the Python code through a Python interpreter.
  • Managed Languages

    • These are a bit of a hybrid. They have a compilation step that takes source code and converts it into byte code. This byte code is then run through an interpreter that converts it down into Machine code. So you still need an interpreter (or VM is the more common term) that can run the byte-code, but the source code itself does not have to be present. Many of these can also be decompiled and may give better output than compiled languages, but it's simply inferences made from the underlying code, and not the actual source code that was used to build the binary.

Python can also behave like a managed language, in that its interpreter first compiles the source into a byte code representation and then acts like a VM, executing that byte code. This is what the .pyc files are: the byte code representations of their corresponding .py files.
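The compile-then-execute step can be seen from within Python using the built-in `compile` function and the standard `dis` module. This is only a small sketch; the `add` function and the `"<example>"` file name are placeholders invented for the demonstration.

```python
import dis

source = "def add(a, b):\n    return a + b\n"

# Step 1: compile source text into a code object holding byte code.
module_code = compile(source, "<example>", "exec")

# Step 2: the interpreter's VM executes that byte code.
namespace = {}
exec(module_code, namespace)
add = namespace["add"]

print(add(2, 3))                   # 5
print(type(add.__code__.co_code))  # <class 'bytes'>

# Human-readable listing of the byte code; opcode names vary by Python version.
dis.dis(add)
```

A .pyd file contains nothing like this byte code. It holds machine instructions for the CPU, so `dis` has nothing to show for functions defined in it.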

saquintes
  • OK, so I just want to be sure of one thing: this is coded in C (not C# or C++), and reverting it will give me C code, not any binary stuff, right? –  Jun 16 '21 at 09:05
  • C code _is_ compiled to "binary stuff" (machine language opcodes). A DLL or PYD file does not contain code in any programming language as you understand it. Some very clever disassemblers may be able to map it back to equivalent C code, but it will not be what the programmer wrote. – alexis Jun 16 '21 at 09:15

Python itself is a program, and it runs on your computer. Python code is read and interpreted, and ultimately executed by executing code in your computer's native instruction set. Well, Python is set up so that it can import and run code that is already in the form of "native" instructions (in this case, written in C and compiled to machine code).
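A few standard-library calls make this visible from the Python side. The sketch below is illustrative only, and its printed lists differ between platforms and installations. It also touches on the side question of how files in the install location get found: `import` searches the directories in `sys.path`, which on a typical Windows install include the DLLs folder.

```python
import importlib.machinery
import inspect
import math
import sys

# File name endings the import system accepts for compiled extension modules.
print(importlib.machinery.EXTENSION_SUFFIXES)

# Directories searched by `import`, regardless of where your script lives.
for entry in sys.path:
    print(entry)

# A function implemented in C is reported as a built-in, not a Python function.
print(inspect.isbuiltin(math.sqrt))   # True
print(inspect.isfunction(math.sqrt))  # False
```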

To get a feel for how this works, take a look at the official python.org documentation: Extending Python with C or C++. Enjoy!

alexis