Yes, I'm dead serious about this question. How does searching with pip work?

The documentation of the keyword search refers to a "pip search reference" at https://pip.pypa.io/en/stable/user_guide/#searching-for-packages which is anything but a reference.

I can't work out from my search attempts how searching works. For example, if I search for "exec" I get a variety of results such as exec-pypeline (0.4.2) - an incredible python package. I even get results whose package names have nothing to do with "exec", as long as the term "exec" appears in the description.

But strangely, I don't see one of my own packages in the list, even though it contains "exec" in its name. That alone would lead us to conclude that pip (at least) searches for complete search terms in the package description (which my package's description doesn't contain).

But building on that assumption, if I search for other terms that do appear in the package description, I don't get my package listed either. And that applies to other packages as well: for example, if I search for "projects" I don't get flask-macros in the result set, though the term "projects" clearly exists in the description of flask-macros. Since this contradicts the assumption above, this is clearly not how searching works.

And interestingly, I can search for "macro" and get "flask-macros" as a result, but if I search for "macr", "flask-macros" is not found.

So how exactly is searching performed by pip? Where can a suitable reference be found for this?

khelwood
Regis May
  • One (of probably more) aspects related to this: I just found out that the web page lists at least 9300 entries for the term "logging". Interestingly `pip` returns just about 100. Could it be that `pip` simply leaves out most of the packages provided by a search? – Regis May Jul 10 '18 at 15:48
  • Not your question, but [qypi](https://github.com/jwodder/qypi) looks like a reasonable alternative to `pip search`: a real query language, json output, well-written doc. – denis Feb 19 '20 at 16:49

1 Answer


pip search looks for the search term as a substring of the distribution name or the distribution summary. I cannot see this documented anywhere; I found it by following the command through the source code directly.
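
Plain substring matching does not by itself account for the question's observation that "macro" finds flask-macros while "macr" does not. One possible explanation is that the search backend compares whole, normalized words rather than raw substrings. The following is only a toy sketch of that difference: the helper names (`tokens`, `substring_match`, `token_match`) and the crude plural stripping are assumptions made for illustration, not what PyPI actually runs.

```python
import re


def tokens(text):
    """Split text into lowercase words and crudely strip a plural "s".

    Hypothetical stand-in for a search analyzer. Note that \\w+ treats an
    underscore as part of a word, so "simple_exec" stays one token.
    """
    words = re.findall(r"\w+", text.lower())
    return {w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words}


def substring_match(query, name):
    """True if the raw query occurs anywhere inside the name."""
    return query.lower() in name.lower()


def token_match(query, name):
    """True if every normalized query word is a normalized word of the name."""
    return tokens(query) <= tokens(name)


for query in ("macro", "macr"):
    print(
        query,
        "substring:", substring_match(query, "flask-macros"),
        "token:", token_match(query, "flask-macros"),
    )
```

Under these assumptions both queries match as substrings, but only "macro" matches as a whole word, which is the behaviour described in the question.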

The code for the search feature, which dates from Feb 2010, is still using an old xmlrpc_client. There is issue395 to change this, open since 2011, since the XML-RPC API is now considered legacy and should not be used. Somewhat surprisingly, the endpoint was not deprecated in the pypi-legacy to warehouse move, as the legacy routes are still there.

flask-macros did not show up in a search for "project" because this is too common a search term. Only 100 results are returned; this is a hardcoded limit in the elasticsearch view which handles the requests to those PyPI search routes. Note that this was reduced from 1000 fairly recently in PR3827.
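
To see why a hard cap hides matching packages for a common term, here is a minimal local sketch. Only the limit of 100 comes from the behaviour described above; the function `capped_search`, the made-up package index, and the use of simple substring matching are assumptions for illustration and not the Warehouse implementation.

```python
RESULT_LIMIT = 100  # the cap described above; everything else here is made up


def capped_search(packages, query, limit=RESULT_LIMIT):
    """Return (first `limit` hits, total hit count) for a name-or-summary match."""
    query = query.lower()
    hits = [
        pkg
        for pkg in packages
        if query in pkg["name"].lower() or query in pkg["summary"].lower()
    ]
    return hits[:limit], len(hits)


# Hypothetical index: 500 packages whose summaries all mention "project",
# followed by the one package we are actually looking for.
packages = [{"name": f"pkg-{i}", "summary": "a project helper"} for i in range(500)]
packages.append({"name": "flask-macros", "summary": "macros for Flask projects"})

shown, total = capped_search(packages, "project")
print(f"{len(shown)} of {total} matches returned")
print("flask-macros shown:", any(p["name"] == "flask-macros" for p in shown))
```

In this sketch flask-macros matches the query but falls outside the first 100 hits, so it never reaches the caller. A more specific query returns fewer hits and is less likely to be truncated.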

Code to do a search with an API client directly:

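import xmlrpc.client

# Talk to PyPI's legacy XML-RPC endpoint
client = xmlrpc.client.ServerProxy('https://pypi.org/pypi')
query = 'project'
# Match the query against the name OR the summary
results = client.search({'name': query, 'summary': query}, 'or')
print(len(results), 'results returned')
# Print the results sorted case-insensitively by name
for result in sorted(results, key=lambda data: data['name'].lower()):
    print(result)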

edit: The 100 result limit is now documented here.

wim
  • If that were the case, the package `flask-macros` should be found if I search for `project`. It is not. And if that were the case, a search for `simpleexec` would return one of my packages. It does not. And if I search for `macr`, the package `flask-macros` is not returned as it should be. See my question above. That's why I'm asking. – Regis May Jul 10 '18 at 15:41
  • Take your time. – Regis May Jul 10 '18 at 15:49
  • That only 100 results are returned is NOT documented. – Regis May Jul 10 '18 at 16:13
  • I know it is not documented. But it is tested: https://github.com/pypa/warehouse/blob/43ff0a7696fe47ab42d94ca7dea7f234ed2911d9/tests/unit/legacy/api/xmlrpc/test_xmlrpc.py#L78 – wim Jul 10 '18 at 16:24
  • @RegisMay pinged the project maintainers on github and they've [just updated the docs](https://github.com/pypa/warehouse/pull/4281/files#diff-edd1e4a1cc697e4c036d8fc2a91cf9ea). – wim Jul 10 '18 at 17:10
  • Thank you for doing that. From your action I conclude that my confusion was justified. But a search for `simpleexec` still does not return one of my packages. Same for `macr`. That still contradicts your explanation. But as Elasticsearch seems to be used, this would explain why some kind of lemmatization is performed on search terms. Returning only 100 results is extremely limited. On top of that, all this is quite inconvenient, as apparently an underscore is not configured to be a word delimiter. Taking all this together, I consider the search in `pip` to be completely unusable at present. :-( – Regis May Jul 13 '18 at 21:36
  • But thank you for your information. It was quite helpful and shed some light on the details of the implementation. Thank you! – Regis May Jul 13 '18 at 21:41
  • Yeah, I agree it's not particularly usable. But warehouse is open source, and nothing stops you from contributing to the project. I'm sure a pull request moving the search views away from the XML-RPC API and/or addressing those issues would be welcome. Are you interested in fixing it? – wim Jul 13 '18 at 22:04
  • I originally wanted to give you a quite sarcastic answer as a) python is around for 27 years now, b) I tend to not get involved in projects where I feel there is not a solid technical foundation after such a long time and c) I'm involved in three other important open source projects besides my regular job and family consuming very much time. But you were friendly in helping with my question so therefore I give you a different answer: Let me think about it. – Regis May Jul 15 '18 at 18:13