//! Management of the index of a registry source
//!
//! This module contains management of the index and various operations, such as
//! actually parsing the index, looking for crates, etc. This is intended to be
//! abstract over remote indices (downloaded via git) and local registry indices
//! (which are all just present on the filesystem).
//!
//! ## Index Performance
//!
//! One important aspect of the index is that we want to optimize the "happy
//! path" as much as possible. Whenever you type `cargo build` Cargo will
//! *always* reparse the registry and learn about dependency information. This
//! is done because Cargo needs to learn about the upstream crates.io crates
//! that you're using and ensure that the preexisting `Cargo.lock` still matches
//! the current state of the world.
//!
//! Consequently, Cargo "null builds" (the index that Cargo adds to each build
//! itself) need to be fast when accessing the index. The primary performance
//! optimization here is to avoid parsing JSON blobs from the registry if we
//! don't need them. Most secondary optimizations are centered around removing
//! allocations and such, but avoiding parsing JSON is the #1 optimization.
//!
//! When we get queries from the resolver we're given a `Dependency`. This
//! dependency in turn has a version requirement, and with lock files that
//! already exist these version requirements are exact version requirements
//! `=a.b.c`. This means that we in theory only need to parse one line of JSON
//! per query in the registry, the one that matches version `a.b.c`.
//!
//! The crates.io index, however, is not amenable to this form of query. Instead
//! the crates.io index simply is a file where each line is a JSON blob. To
//! learn about the versions in each JSON blob we would need to parse the JSON,
//! defeating the purpose of trying to parse as little as possible.
//!
//! > Note that as a small aside even *loading* the JSON from the registry is
//! > actually pretty slow. For crates.io and remote registries we don't
//! > actually check out the git index on disk because that takes quite some
//! > time and is quite large. Instead we use `libgit2` to read the JSON from
//! > the raw git objects. This in turn can be slow (aka show up high in
//! > profiles) because libgit2 has to do deflate decompression and such.
//!
//! To solve all these issues a strategy is employed here where Cargo basically
//! creates an index into the index. The first time a package is queried about
//! (first time being for an entire computer) Cargo will load the contents
//! (slowly via libgit2) from the registry. It will then (slowly) parse every
//! single line to learn about its versions. Afterwards, however, Cargo will
//! emit a new file (a cache) which is amenable for speedily parsing in future
//! invocations.
//!
//! This cache file is currently organized by having the semver version
//! extracted from each JSON blob. That way Cargo can quickly and easily learn
//! about all versions contained and which JSON blob each is associated with.
//! The JSON blob then doesn't actually need to get parsed unless that version
//! is requested.
//!
//! Altogether the initial measurements of this show a massive improvement for
//! Cargo null build performance. It's expected that the improvements earned
//! here will continue to grow over time in the sense that the previous
//! implementation (parse all lines each time) actually continues to slow down
//! over time as new versions of a crate are published. In any case when first
//! implemented a null build of Cargo itself would parse 3700 JSON blobs from
//! the registry and load 150 blobs from git. Afterwards it parses 150 JSON
//! blobs and loads 0 files from git. Removing 200ms or more from Cargo's
//! startup time is certainly nothing to sneeze at!
//!
//! Note that this is just a high-level overview; there are of course lots of
//! details like invalidating caches and whatnot which are handled below, but
//! hopefully those are more obvious inline in the code itself.

use crate::core::dependency::Dependency;
use crate::core::{PackageId, SourceId, Summary};
use crate::sources::registry::{RegistryData, RegistryPackage, INDEX_V_MAX};
use crate::util::interning::InternedString;
use crate::util::{internal, CargoResult, Config, Filesystem, OptVersionReq, ToSemver};
use anyhow::bail;
use cargo_util::{paths, registry::make_dep_path};
use log::{debug, info};
use semver::Version;
use std::collections::{HashMap, HashSet};
use std::convert::TryInto;
use std::fs;
use std::path::Path;
use std::str;

/// Crates.io treats hyphens and underscores as interchangeable, but the index and old Cargo do not.
/// Therefore, the index must store the uncanonicalized version of the name so old Cargos can find it.
/// This iterator tries all possible combinations of switching hyphens and underscores to find the
/// uncanonicalized one. As all stored inputs have the correct spelling, we start with the spelling
/// as-provided.
struct UncanonicalizedIter<'s> {
    input: &'s str,
    num_hyphen_underscore: u32,
    hyphen_combination_num: u16,
}

impl<'s> UncanonicalizedIter<'s> {
    fn new(input: &'s str) -> Self {
        let num_hyphen_underscore = input.chars().filter(|&c| c == '_' || c == '-').count() as u32;
        UncanonicalizedIter {
            input,
            num_hyphen_underscore,
            hyphen_combination_num: 0,
        }
    }
}

impl<'s> Iterator for UncanonicalizedIter<'s> {
    type Item = String;

    fn next(&mut self) -> Option<Self::Item> {
        if self.hyphen_combination_num > 0
            && self.hyphen_combination_num.trailing_zeros() >= self.num_hyphen_underscore
        {
            return None;
        }

        let ret = Some(
            self.input
                .chars()
                .scan(0u16, |s, c| {
                    // the check against 15 here is to prevent
                    // shift overflow on inputs with more than 15 hyphens
                    if (c == '_' || c == '-') && *s <= 15 {
                        let switch = (self.hyphen_combination_num & (1u16 << *s)) > 0;
                        let out = if (c == '_') ^ switch { '_' } else { '-' };
                        *s += 1;
                        Some(out)
                    } else {
                        Some(c)
                    }
                })
                .collect(),
        );
        self.hyphen_combination_num += 1;
        ret
    }
}

#[test]
fn no_hyphen() {
    assert_eq!(
        UncanonicalizedIter::new("test").collect::<Vec<_>>(),
        vec!["test".to_string()]
    )
}

#[test]
fn two_hyphen() {
    assert_eq!(
        UncanonicalizedIter::new("te-_st").collect::<Vec<_>>(),
        vec![
            "te-_st".to_string(),
            "te__st".to_string(),
            "te--st".to_string(),
            "te_-st".to_string()
        ]
    )
}

#[test]
fn overflow_hyphen() {
    assert_eq!(
        UncanonicalizedIter::new("te-_-_-_-_-_-_-_-_-st")
            .take(100)
            .count(),
        100
    )
}

/// Manager for handling the on-disk index.
///
/// Note that local and remote registries store the index differently. Local
/// is a simple on-disk tree of files of the raw index. Remote registries are
/// stored as a raw git repository. The different means of access are handled
/// via the [`RegistryData`] trait abstraction.
///
/// This transparently handles caching of the index in a more efficient format.
pub struct RegistryIndex<'cfg> {
    source_id: SourceId,
    /// Root directory of the index for the registry.
    path: Filesystem,
    /// Cache of summary data.
    ///
    /// This is keyed off the package name. The [`Summaries`] value handles
    /// loading the summary data. It keeps an optimized on-disk representation
    /// of the JSON files, which is created in an as-needed fashion. If it
    /// hasn't been cached already, it uses [`RegistryData::load`] to access
    /// the JSON files from the index, and then creates the optimized on-disk
    /// summary cache.
    summaries_cache: HashMap<InternedString, Summaries>,
    /// [`Config`] reference for convenience.
    config: &'cfg Config,
}

/// An internal cache of summaries for a particular package.
///
/// A list of summaries is loaded from disk via one of two methods:
///
/// 1. Primarily Cargo will parse the corresponding file for a crate in the
///    upstream crates.io registry. That's just a JSON blob per line which we
///    can parse, extract the version, and then store here.
///
/// 2. Alternatively, if Cargo has previously run, we'll have a cached index of
///    dependencies for the upstream index. This is a file that Cargo maintains
///    lazily on the local filesystem and is much faster to parse since it
///    doesn't involve parsing all of the JSON.
///
/// To the outward-facing interface it doesn't matter too much where the data
/// is loaded from, but it's important when reading the implementation to note
/// that we try to parse as little as possible!
#[derive(Default)]
struct Summaries {
    /// A raw vector of uninterpreted bytes. This is what `Unparsed` start/end
    /// fields are indexes into. If a `Summaries` is loaded from the crates.io
    /// index then this field will be empty since nothing is `Unparsed`.
    raw_data: Vec<u8>,

    /// All known versions of a crate, keyed from their `Version` to the
    /// possibly parsed or unparsed version of the full summary.
    versions: HashMap<Version, MaybeIndexSummary>,
}

/// A lazily parsed `IndexSummary`.
enum MaybeIndexSummary {
    /// A summary which has not been parsed. The `start` and `end` are pointers
    /// into `Summaries::raw_data` which this is an entry of.
    Unparsed { start: usize, end: usize },

    /// An actually parsed summary.
    Parsed(IndexSummary),
}

/// A parsed representation of a summary from the index.
///
/// In addition to a full `Summary` we have information on whether it is `yanked`.
pub struct IndexSummary {
    pub summary: Summary,
    pub yanked: bool,
    /// Schema version, see [`RegistryPackage`].
    v: u32,
}

/// A representation of the cache on disk that Cargo maintains of summaries.
/// Cargo will initially parse all summaries in the registry and will then
/// serialize that into this form and place it in a new location on disk,
/// ensuring that access in the future is much speedier.
#[derive(Default)]
struct SummariesCache<'a> {
    versions: Vec<(Version, &'a [u8])>,
}

impl<'cfg> RegistryIndex<'cfg> {
    pub fn new(
        source_id: SourceId,
        path: &Filesystem,
        config: &'cfg Config,
    ) -> RegistryIndex<'cfg> {
        RegistryIndex {
            source_id,
            path: path.clone(),
            summaries_cache: HashMap::new(),
            config,
        }
    }

    /// Returns the hash listed for a specified `PackageId`.
    pub fn hash(&mut self, pkg: PackageId, load: &mut dyn RegistryData) -> CargoResult<&str> {
        let req = OptVersionReq::exact(pkg.version());
        let summary = self
            .summaries(pkg.name(), &req, load)?
            .next()
            .ok_or_else(|| internal(format!("no hash listed for {}", pkg)))?;
        summary
            .summary
            .checksum()
            .ok_or_else(|| internal(format!("no hash listed for {}", pkg)))
    }

    /// Loads a list of summaries for the `name` package in this registry which
    /// match `req`.
    ///
    /// This function will semantically parse the on-disk index, match all
    /// versions, and then return an iterator over all summaries which matched.
    /// Internally there are quite a few layers of caching to amortize this
    /// cost, though, since this method is called quite a lot on null builds in
    /// Cargo.
    pub fn summaries<'a, 'b>(
        &'a mut self,
        name: InternedString,
        req: &'b OptVersionReq,
        load: &mut dyn RegistryData,
    ) -> CargoResult<impl Iterator<Item = &'a IndexSummary> + 'b>
    where
        'a: 'b,
    {
        let source_id = self.source_id;
        let config = self.config;
        let namespaced_features = self.config.cli_unstable().namespaced_features;
        let weak_dep_features = self.config.cli_unstable().weak_dep_features;

        // First up actually parse what summaries we have available. If Cargo
        // has run previously this will parse a Cargo-specific cache file rather
        // than the registry itself. In effect this is intended to be a quite
        // cheap operation.
        let summaries = self.load_summaries(name, load)?;

        // Iterate over our summaries, extract all relevant ones which match our
        // version requirement, and then parse all corresponding rows in the
        // registry. As a reminder this `summaries` method is called for each
        // entry in a lock file on every build, so we want to absolutely
        // minimize the amount of work being done here and parse as little as
        // necessary.
        let raw_data = &summaries.raw_data;
        let max_version = if namespaced_features || weak_dep_features {
            INDEX_V_MAX
        } else {
            1
        };
        Ok(summaries
            .versions
            .iter_mut()
            .filter_map(move |(k, v)| if req.matches(k) { Some(v) } else { None })
            .filter_map(
                move |maybe| match maybe.parse(config, raw_data, source_id) {
                    Ok(summary) => Some(summary),
                    Err(e) => {
                        info!("failed to parse `{}` registry package: {}", name, e);
                        None
                    }
                },
            )
            .filter(move |is| {
                if is.v > max_version {
                    debug!(
                        "unsupported schema version {} ({} {})",
                        is.v,
                        is.summary.name(),
                        is.summary.version()
                    );
                    false
                } else {
                    true
                }
            })
            .filter(move |is| {
                is.summary
                    .unstable_gate(namespaced_features, weak_dep_features)
                    .is_ok()
            }))
    }

    fn load_summaries(
        &mut self,
        name: InternedString,
        load: &mut dyn RegistryData,
    ) -> CargoResult<&mut Summaries> {
        // If we've previously loaded what versions are present for `name`, just
        // return that since our cache should still be valid.
        if self.summaries_cache.contains_key(&name) {
            return Ok(self.summaries_cache.get_mut(&name).unwrap());
        }

        // Prepare the `RegistryData` which will lazily initialize internal data
        // structures.
        load.prepare()?;

        // let root = self.config.assert_package_cache_locked(&self.path);
        let root = load.assert_index_locked(&self.path);
        let cache_root = root.join(".cache");
        let index_version = load.current_version();

        // See module comment in `registry/mod.rs` for why this is structured
        // the way it is.
        let fs_name = name
            .chars()
            .flat_map(|c| c.to_lowercase())
            .collect::<String>();
        let raw_path = make_dep_path(&fs_name, false);

        // Attempt to handle misspellings by searching for a chain of related
        // names to the original `raw_path` name. Only return summaries
        // associated with the first hit, however. The resolver will later
        // reject any candidates that have the wrong name, and along the way
        // this produces helpful "did you mean?" suggestions.
        for path in UncanonicalizedIter::new(&raw_path).take(1024) {
            let summaries = Summaries::parse(
                index_version.as_deref(),
                root,
                &cache_root,
                path.as_ref(),
                self.source_id,
                load,
                self.config,
            )?;
            if let Some(summaries) = summaries {
                self.summaries_cache.insert(name, summaries);
                return Ok(self.summaries_cache.get_mut(&name).unwrap());
            }
        }

        // If nothing was found then this crate doesn't exist, so just use an
        // empty `Summaries` list.
        self.summaries_cache.insert(name, Summaries::default());
        Ok(self.summaries_cache.get_mut(&name).unwrap())
    }

    pub fn query_inner(
        &mut self,
        dep: &Dependency,
        load: &mut dyn RegistryData,
        yanked_whitelist: &HashSet<PackageId>,
        f: &mut dyn FnMut(Summary),
    ) -> CargoResult<()> {
        if self.config.offline()
            && self.query_inner_with_online(dep, load, yanked_whitelist, f, false)? != 0
        {
            return Ok(());
            // If offline, and there are no matches, try again with online.
            // This is necessary for dependencies that are not used (such as
            // target-cfg or optional), but are not downloaded. Normally the
            // build should succeed if they are not downloaded and not used,
            // but they still need to resolve. If they are actually needed
            // then cargo will fail to download and an error message
            // indicating that the required dependency is unavailable while
            // offline will be displayed.
        }
        self.query_inner_with_online(dep, load, yanked_whitelist, f, true)?;
        Ok(())
    }

    fn query_inner_with_online(
        &mut self,
        dep: &Dependency,
        load: &mut dyn RegistryData,
        yanked_whitelist: &HashSet<PackageId>,
        f: &mut dyn FnMut(Summary),
        online: bool,
    ) -> CargoResult<usize> {
        let source_id = self.source_id;
        let summaries = self
            .summaries(dep.package_name(), dep.version_req(), load)?
            // First filter summaries for `--offline`. If we're online then
            // everything is a candidate, otherwise if we're offline we're only
            // going to consider candidates which are actually present on disk.
            //
            // Note: This particular logic can cause problems with
            // optional dependencies when offline. If at least 1 version
            // of an optional dependency is downloaded, but that version
            // does not satisfy the requirements, then resolution will
            // fail. Unfortunately, whether or not something is optional
            // is not known here.
            .filter(|s| (online || load.is_crate_downloaded(s.summary.package_id())))
            // Next filter out all yanked packages. Some yanked packages may
            // leak through if they're in a whitelist (aka if they were
            // previously in `Cargo.lock`).
            .filter(|s| !s.yanked || yanked_whitelist.contains(&s.summary.package_id()))
            .map(|s| s.summary.clone());

        // Handle `cargo update --precise` here. If specified, our own source
        // will have a precise version listed of the form
        // `<pkg>=<p_req>-><f_req>` where `<pkg>` is the name of a crate on
        // this source, `<p_req>` is the version installed and `<f_req>` is the
        // version requested (argument to `--precise`).
        let name = dep.package_name().as_str();
        let precise = match source_id.precise() {
            Some(p) if p.starts_with(name) && p[name.len()..].starts_with('=') => {
                let mut vers = p[name.len() + 1..].splitn(2, "->");
                let current_vers = vers.next().unwrap().to_semver().unwrap();
                let requested_vers = vers.next().unwrap().to_semver().unwrap();
                Some((current_vers, requested_vers))
            }
            _ => None,
        };
        let summaries = summaries.filter(|s| match &precise {
            Some((current, requested)) => {
                if dep.version_req().matches(current) {
                    // Unfortunately crates.io allows versions to differ only
                    // by build metadata. This shouldn't be allowed, but since
                    // it is, this will honor it if requested. However, if not
                    // specified, then ignore it.
                    let s_vers = s.version();
                    match (s_vers.build.is_empty(), requested.build.is_empty()) {
                        (true, true) => s_vers == requested,
                        (true, false) => false,
                        (false, true) => {
                            // Strip out the metadata.
                            s_vers.major == requested.major
                                && s_vers.minor == requested.minor
                                && s_vers.patch == requested.patch
                                && s_vers.pre == requested.pre
                        }
                        (false, false) => s_vers == requested,
                    }
                } else {
                    true
                }
            }
            None => true,
        });

        let mut count = 0;
        for summary in summaries {
            f(summary);
            count += 1;
        }
        Ok(count)
    }

    pub fn is_yanked(&mut self, pkg: PackageId, load: &mut dyn RegistryData) -> CargoResult<bool> {
        let req = OptVersionReq::exact(pkg.version());
        let found = self
            .summaries(pkg.name(), &req, load)?
            .any(|summary| summary.yanked);
        Ok(found)
    }
}

impl Summaries {
    /// Parse out a `Summaries` instance from on-disk state.
    ///
    /// This will attempt to prefer parsing a previous cache file that already
    /// exists from a previous invocation of Cargo (aka you're typing `cargo
    /// build` again after typing it previously). If parsing fails or the cache
    /// isn't found, then we take a slower path which loads the full descriptor
    /// for `relative` from the underlying index (aka typically libgit2 with
    /// crates.io) and then parse everything in there.
    ///
    /// * `index_version` - a version string to describe the current state of
    ///   the index which for remote registries is the current git sha and
    ///   for local registries is not available.
    /// * `root` - this is the root argument passed to `load`
    /// * `cache_root` - this is the root on the filesystem itself of where to
    ///   store cache files.
    /// * `relative` - this is the file we're loading from cache or the index
    ///   data
    /// * `source_id` - the registry's `SourceId` used when parsing JSON blobs
    ///   to create summaries.
    /// * `load` - the actual index implementation which may be very slow to
    ///   call. We avoid this if we can.
    pub fn parse(
        index_version: Option<&str>,
        root: &Path,
        cache_root: &Path,
        relative: &Path,
        source_id: SourceId,
        load: &mut dyn RegistryData,
        config: &Config,
    ) -> CargoResult<Option<Summaries>> {
        // First up, attempt to load the cache. This could fail for all manner
        // of reasons, but consider all of them non-fatal and just log their
        // occurrence in case anyone is debugging anything.
        let cache_path = cache_root.join(relative);
        let mut cache_contents = None;
        if let Some(index_version) = index_version {
            match fs::read(&cache_path) {
                Ok(contents) => match Summaries::parse_cache(contents, index_version) {
                    Ok(s) => {
                        log::debug!("fast path for registry cache of {:?}", relative);
                        if cfg!(debug_assertions) {
                            cache_contents = Some(s.raw_data);
                        } else {
                            return Ok(Some(s));
                        }
                    }
                    Err(e) => {
                        log::debug!("failed to parse {:?} cache: {}", relative, e);
                    }
                },
                Err(e) => log::debug!("cache missing for {:?} error: {}", relative, e),
            }
        }

        // This is the fallback path where we actually talk to libgit2 to load
        // information. Here we parse every single line in the index (as we need
        // to find the versions)
        log::debug!("slow path for {:?}", relative);
        let mut ret = Summaries::default();
        let mut hit_closure = false;
        let mut cache_bytes = None;
        let err = load.load(root, relative, &mut |contents| {
            ret.raw_data = contents.to_vec();
            let mut cache = SummariesCache::default();
            hit_closure = true;
            for line in split(contents, b'\n') {
                // Attempt forwards-compatibility on the index by ignoring
                // everything that we ourselves don't understand, that should
                // allow future cargo implementations to break the
                // interpretation of each line here and older cargo will simply
                // ignore the new lines.
                let summary = match IndexSummary::parse(config, line, source_id) {
                    Ok(summary) => summary,
                    Err(e) => {
                        // This should only happen when there is an index
                        // entry from a future version of cargo that this
                        // version doesn't understand. Hopefully, those future
                        // versions of cargo correctly set INDEX_V_MAX and
                        // CURRENT_CACHE_VERSION, otherwise this will skip
                        // entries in the cache preventing those newer
                        // versions from reading them (that is, until the
                        // cache is rebuilt).
                        log::info!("failed to parse {:?} registry package: {}", relative, e);
                        continue;
                    }
                };
                let version = summary.summary.package_id().version().clone();
                cache.versions.push((version.clone(), line));
                ret.versions.insert(version, summary.into());
            }
            if let Some(index_version) = index_version {
                cache_bytes = Some(cache.serialize(index_version));
            }
            Ok(())
        });

        // We ignore lookup failures as those are just crates which don't exist
        // or we haven't updated the registry yet. If we actually ran the
        // closure though then we care about those errors.
        if !hit_closure {
            debug_assert!(cache_contents.is_none());
            return Ok(None);
        }
        err?;

        // If we've got debug assertions enabled and the cache was previously
        // present and considered fresh, this is where the debug assertion
        // actually happens: verify that our cache is indeed fresh and computes
        // exactly the same value as before.
        if cfg!(debug_assertions) && cache_contents.is_some() && cache_bytes != cache_contents {
            panic!(
                "original cache contents:\n{:?}\n\
                 does not equal new cache contents:\n{:?}\n",
                cache_contents.as_ref().map(|s| String::from_utf8_lossy(s)),
                cache_bytes.as_ref().map(|s| String::from_utf8_lossy(s)),
            );
        }

        // Once we have our `cache_bytes` which represents the `Summaries` we're
        // about to return, write that back out to disk so future Cargo
        // invocations can use it.
        //
        // This is opportunistic so we ignore failure here but are sure to log
        // something in case of error.
        if let Some(cache_bytes) = cache_bytes {
            if paths::create_dir_all(cache_path.parent().unwrap()).is_ok() {
                let path = Filesystem::new(cache_path.clone());
                config.assert_package_cache_locked(&path);
                if let Err(e) = fs::write(cache_path, cache_bytes) {
                    log::info!("failed to write cache: {}", e);
                }
            }
        }

        Ok(Some(ret))
    }

    /// Parses an open `File` which represents information previously cached by
    /// Cargo.
    pub fn parse_cache(contents: Vec<u8>, last_index_update: &str) -> CargoResult<Summaries> {
        let cache = SummariesCache::parse(&contents, last_index_update)?;
        let mut ret = Summaries::default();
        for (version, summary) in cache.versions {
            let (start, end) = subslice_bounds(&contents, summary);
            ret.versions
                .insert(version, MaybeIndexSummary::Unparsed { start, end });
        }
        ret.raw_data = contents;
        return Ok(ret);

        // Returns the start/end offsets of `inner` within `outer`. Asserts
        // that `inner` is a subslice of `outer`.
        fn subslice_bounds(outer: &[u8], inner: &[u8]) -> (usize, usize) {
            let outer_start = outer.as_ptr() as usize;
            let outer_end = outer_start + outer.len();
            let inner_start = inner.as_ptr() as usize;
            let inner_end = inner_start + inner.len();
            assert!(inner_start >= outer_start);
            assert!(inner_end <= outer_end);
            (inner_start - outer_start, inner_end - outer_start)
        }
    }
}
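// To illustrate the offset trick above in isolation: the following is a
// hypothetical, stand-alone copy of `subslice_bounds` (not part of this
// module) showing that the recorded `(start, end)` pair recovers the original
// subslice from the parent buffer. This is what lets `Summaries` keep one
// owned `raw_data` buffer plus plain offsets instead of per-version copies.

```rust
// Stand-alone copy of the pointer-arithmetic helper. The asserts enforce the
// precondition that `inner` really is a subslice of `outer`.
fn subslice_bounds(outer: &[u8], inner: &[u8]) -> (usize, usize) {
    let outer_start = outer.as_ptr() as usize;
    let outer_end = outer_start + outer.len();
    let inner_start = inner.as_ptr() as usize;
    let inner_end = inner_start + inner.len();
    assert!(inner_start >= outer_start);
    assert!(inner_end <= outer_end);
    (inner_start - outer_start, inner_end - outer_start)
}
```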

// Implementation of serializing/deserializing the cache of summaries on disk.
// Currently the format looks like:
//
// +--------------------+----------------------+-------------+---+
// | cache version byte | index format version | git sha rev | 0 |
// +--------------------+----------------------+-------------+---+
//
// followed by...
//
// +----------------+---+------------+---+
// | semver version | 0 |  JSON blob | 0 | ...
// +----------------+---+------------+---+
//
// The idea is that this is a very easy file for Cargo to parse in future
// invocations. The read from disk should be quite fast, and afterwards all we
// need to know is which versions correspond to which JSON blob.
//
// The leading version byte is intended to ensure that there's some level of
// future compatibility against changes to this cache format, so that if
// different versions of Cargo share the same cache they don't get too
// confused. The git sha lets us know when the file needs to be regenerated
// (it needs regeneration whenever the index itself updates).
//
// Cache versions:
// * `1`: The original version.
// * `2`: Added the "index format version" field so that if the index format
//   changes, different versions of cargo won't get confused reading each
//   other's caches.
// * `3`: Bumped the version to work around an issue where multiple versions of
//   a package were published that differ only by semver metadata. For
//   example, openssl-src 110.0.0 and 110.0.0+1.1.0f. Previously, the cache
//   would be incorrectly populated with two entries, both 110.0.0. After
//   this, the metadata will be correctly included. This isn't really a format
//   change, just a version bump to clear the incorrect cache entries. Note:
//   the index shouldn't allow these, but unfortunately crates.io doesn't
//   check it.

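// As a self-contained sketch of the layout above (hypothetical helper names
// and toy version constants, plain strings rather than Cargo's actual types),
// serialization and parsing look roughly like this:

```rust
// Toy constants standing in for CURRENT_CACHE_VERSION / INDEX_V_MAX.
const CACHE_VERSION: u8 = 3;
const INDEX_VERSION: u32 = 2;

/// Write a git sha plus (semver, JSON blob) pairs in the cache layout:
/// version byte, little-endian index version, then 0-separated fields.
fn serialize_cache(sha: &str, entries: &[(&str, &str)]) -> Vec<u8> {
    let mut out = vec![CACHE_VERSION];
    out.extend_from_slice(&INDEX_VERSION.to_le_bytes());
    out.extend_from_slice(sha.as_bytes());
    out.push(0);
    for (version, blob) in entries {
        out.extend_from_slice(version.as_bytes());
        out.push(0);
        out.extend_from_slice(blob.as_bytes());
        out.push(0);
    }
    out
}

/// Read the layout back, returning None on any version mismatch.
fn parse_cache(data: &[u8]) -> Option<(String, Vec<(String, String)>)> {
    let (&version, rest) = data.split_first()?;
    if version != CACHE_VERSION {
        return None;
    }
    let index_v = u32::from_le_bytes(rest.get(..4)?.try_into().ok()?);
    if index_v != INDEX_VERSION {
        return None;
    }
    let mut parts = rest[4..].split(|&b| b == 0);
    let sha = String::from_utf8(parts.next()?.to_vec()).ok()?;
    let mut entries = Vec::new();
    while let (Some(version), Some(blob)) = (parts.next(), parts.next()) {
        entries.push((
            String::from_utf8(version.to_vec()).ok()?,
            String::from_utf8(blob.to_vec()).ok()?,
        ));
    }
    Some((sha, entries))
}
```

// A round trip through these two functions is exactly the fast path the real
// cache aims for: one read, one split pass, and no JSON parsing up front.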
const CURRENT_CACHE_VERSION: u8 = 3;

impl<'a> SummariesCache<'a> {
    fn parse(data: &'a [u8], last_index_update: &str) -> CargoResult<SummariesCache<'a>> {
        // NB: keep this method in sync with `serialize` below
        let (first_byte, rest) = data
            .split_first()
            .ok_or_else(|| anyhow::format_err!("malformed cache"))?;
        if *first_byte != CURRENT_CACHE_VERSION {
            bail!("looks like a different Cargo's cache, bailing out");
        }
        let index_v_bytes = rest
            .get(..4)
            .ok_or_else(|| anyhow::anyhow!("cache expected 4 bytes for index version"))?;
        let index_v = u32::from_le_bytes(index_v_bytes.try_into().unwrap());
        if index_v != INDEX_V_MAX {
            bail!(
                "index format version {} doesn't match the version I know ({})",
                index_v,
                INDEX_V_MAX
            );
        }
        let rest = &rest[4..];

        let mut iter = split(rest, 0);
        if let Some(update) = iter.next() {
            if update != last_index_update.as_bytes() {
                bail!(
                    "cache out of date: current index ({}) != cache ({})",
                    last_index_update,
                    str::from_utf8(update)?,
                )
            }
        } else {
            bail!("malformed file");
        }
        let mut ret = SummariesCache::default();
        while let Some(version) = iter.next() {
            let version = str::from_utf8(version)?;
            let version = Version::parse(version)?;
            let summary = iter.next().unwrap();
            ret.versions.push((version, summary));
        }
        Ok(ret)
    }

    fn serialize(&self, index_version: &str) -> Vec<u8> {
        // NB: keep this method in sync with `parse` above
        let size = self
            .versions
            .iter()
            .map(|(_version, data)| (10 + data.len()))
            .sum();
        let mut contents = Vec::with_capacity(size);
        contents.push(CURRENT_CACHE_VERSION);
        contents.extend(&u32::to_le_bytes(INDEX_V_MAX));
        contents.extend_from_slice(index_version.as_bytes());
        contents.push(0);
        for (version, data) in self.versions.iter() {
            contents.extend_from_slice(version.to_string().as_bytes());
            contents.push(0);
            contents.extend_from_slice(data);
            contents.push(0);
        }
        contents
    }
}

impl MaybeIndexSummary {
    /// Parses this "maybe a summary" into a definitely `Parsed` variant.
    ///
    /// Does nothing if this is already `Parsed`; otherwise the `raw_data`
    /// passed in is sliced with the bounds in `Unparsed` and then actually
    /// parsed.
    fn parse(
        &mut self,
        config: &Config,
        raw_data: &[u8],
        source_id: SourceId,
    ) -> CargoResult<&IndexSummary> {
        let (start, end) = match self {
            MaybeIndexSummary::Unparsed { start, end } => (*start, *end),
            MaybeIndexSummary::Parsed(summary) => return Ok(summary),
        };
        let summary = IndexSummary::parse(config, &raw_data[start..end], source_id)?;
        *self = MaybeIndexSummary::Parsed(summary);
        match self {
            MaybeIndexSummary::Unparsed { .. } => unreachable!(),
            MaybeIndexSummary::Parsed(summary) => Ok(summary),
        }
    }
}

impl From<IndexSummary> for MaybeIndexSummary {
    fn from(summary: IndexSummary) -> MaybeIndexSummary {
        MaybeIndexSummary::Parsed(summary)
    }
}

impl IndexSummary {
    /// Parses a line from the registry's index file into an `IndexSummary` for
    /// a package.
    ///
    /// The `line` provided is expected to be valid JSON.
    fn parse(config: &Config, line: &[u8], source_id: SourceId) -> CargoResult<IndexSummary> {
        // ****CAUTION**** Please be extremely careful with returning errors
        // from this function. Entries that error are not included in the
        // index cache, and can cause cargo to get confused when switching
        // between different versions that understand the index differently.
        // Make sure to consider the INDEX_V_MAX and CURRENT_CACHE_VERSION
        // values carefully when making changes here.
        let RegistryPackage {
            name,
            vers,
            cksum,
            deps,
            mut features,
            features2,
            yanked,
            links,
            v,
        } = serde_json::from_slice(line)?;
        let v = v.unwrap_or(1);
        log::trace!("json parsed registry {}/{}", name, vers);
        let pkgid = PackageId::new(name, &vers, source_id)?;
        let deps = deps
            .into_iter()
            .map(|dep| dep.into_dep(source_id))
            .collect::<CargoResult<Vec<_>>>()?;
        if let Some(features2) = features2 {
            for (name, values) in features2 {
                features.entry(name).or_default().extend(values);
            }
        }
        let mut summary = Summary::new(config, pkgid, deps, &features, links)?;
        summary.set_checksum(cksum);
        Ok(IndexSummary {
            summary,
            yanked: yanked.unwrap_or(false),
            v,
        })
    }
}

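// The `features2` fold in `IndexSummary::parse` above can be shown in
// isolation. This is a hypothetical, self-contained helper (not part of this
// module) using the same `entry().or_default().extend()` pattern:

```rust
use std::collections::BTreeMap;

// Merge a second feature map into the first: values for a feature name
// present in both maps are appended, and names only in `features2` are
// inserted fresh.
fn merge_features(
    features: &mut BTreeMap<String, Vec<String>>,
    features2: BTreeMap<String, Vec<String>>,
) {
    for (name, values) in features2 {
        features.entry(name).or_default().extend(values);
    }
}
```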
fn split(haystack: &[u8], needle: u8) -> impl Iterator<Item = &[u8]> {
    struct Split<'a> {
        haystack: &'a [u8],
        needle: u8,
    }

    impl<'a> Iterator for Split<'a> {
        type Item = &'a [u8];

        fn next(&mut self) -> Option<&'a [u8]> {
            if self.haystack.is_empty() {
                return None;
            }
            let (ret, remaining) = match memchr::memchr(self.needle, self.haystack) {
                Some(pos) => (&self.haystack[..pos], &self.haystack[pos + 1..]),
                None => (self.haystack, &[][..]),
            };
            self.haystack = remaining;
            Some(ret)
        }
    }

    Split { haystack, needle }
}
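// For comparison, a hypothetical dependency-free version of `split` with the
// same separator semantics, eagerly collected into a `Vec` and using
// `Iterator::position` in place of the `memchr` crate:

```rust
// Splits `haystack` on `needle`: each separator ends a piece, a trailing
// unterminated piece is still yielded, and an empty input yields nothing.
fn split_eager(haystack: &[u8], needle: u8) -> Vec<&[u8]> {
    let mut pieces = Vec::new();
    let mut rest = haystack;
    while !rest.is_empty() {
        match rest.iter().position(|&b| b == needle) {
            Some(pos) => {
                pieces.push(&rest[..pos]);
                rest = &rest[pos + 1..];
            }
            None => {
                pieces.push(rest);
                rest = &[];
            }
        }
    }
    pieces
}
```

// The real `split` above is lazy for the same reason the cache avoids JSON
// parsing: a query usually needs only a few pieces of the file.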