I have the following simple unit test:

  1. Create base.dll assembly in memory - get its byte array.
  2. Create main.dll assembly depending on base.dll in memory - get its byte array.
  3. Create CSharpCompilation object from both dlls

I will post the complete unit test at the end, but for now here is the relevant fragment:

const string BASE_CODE = "public interface I {}";
const string MAIN_CODE = "public class T: I {}";

MetadataReference systemDllRef = MetadataReference.CreateFromFile(typeof(object).Assembly.Location);

var baseDllBytes = GetDllBytes("base", BASE_CODE);
MetadataReference baseDllRef = MetadataReference.CreateFromStream(new MemoryStream(baseDllBytes), filePath: "base.dll");

var mainDllBytes = GetDllBytes("main", MAIN_CODE, systemDllRef, baseDllRef);
MetadataReference mainDllRef = MetadataReference.CreateFromStream(new MemoryStream(mainDllBytes), filePath: "main.dll");

var compilation = CSharpCompilation.Create("temp", null, new[] { systemDllRef, baseDllRef, mainDllRef });

As you can see main.dll defines a single type T which implements interface I defined in base.dll.

Next I would like to obtain the type symbol for T and answer the following questions:

  1. What assembly owns it?
  2. What assembly owns its interface?
  3. Does it have any declaring syntax references?
  4. Does its interface have any declaring syntax references?

Here is the code:

var mainTypeSymbol = compilation.GetTypeByMetadataName("T");
Assert.AreEqual("main", mainTypeSymbol.ContainingAssembly.Name);                    // GOOD
Assert.AreEqual("base", mainTypeSymbol.Interfaces[0].ContainingAssembly.Name);      // GOOD
CollectionAssert.IsEmpty(mainTypeSymbol.DeclaringSyntaxReferences);                 // BAD !!!
CollectionAssert.IsEmpty(mainTypeSymbol.Interfaces[0].DeclaringSyntaxReferences);   // BAD !!!

Of course, it is expected that the declaring syntax references are empty, after all the compilation object contains no syntax trees at all. I am going to fix it now:

compilation = compilation.AddSyntaxTrees(CSharpSyntaxTree.ParseText(MAIN_CODE));
mainTypeSymbol = compilation.GetTypeByMetadataName("T");
Assert.AreEqual("temp", mainTypeSymbol.ContainingAssembly.Name);                    // BAD !!!
Assert.AreEqual("base", mainTypeSymbol.Interfaces[0].ContainingAssembly.Name);      // GOOD
CollectionAssert.IsNotEmpty(mainTypeSymbol.DeclaringSyntaxReferences);              // GOOD
CollectionAssert.IsEmpty(mainTypeSymbol.Interfaces[0].DeclaringSyntaxReferences);   // BAD !!!

IBYP? Now I have the declaring syntax reference for T, but its assembly is reported as temp, not main !!! Now if I add the syntax tree for I:

compilation = compilation.AddSyntaxTrees(CSharpSyntaxTree.ParseText(BASE_CODE));
mainTypeSymbol = compilation.GetTypeByMetadataName("T");
Assert.AreEqual("temp", mainTypeSymbol.ContainingAssembly.Name);                        // BAD !!!
Assert.AreEqual("temp", mainTypeSymbol.Interfaces[0].ContainingAssembly.Name);          // BAD !!!
CollectionAssert.IsNotEmpty(mainTypeSymbol.DeclaringSyntaxReferences);                  // GOOD
CollectionAssert.IsNotEmpty(mainTypeSymbol.Interfaces[0].DeclaringSyntaxReferences);    // GOOD

All the assembly results are now botched, but the declaring syntax references are returned.

The complete unit test code is:

[Test]
public void SymbolAssembly()
{
    const string BASE_CODE = "public interface I {}";
    const string MAIN_CODE = "public class T: I {}";

    MetadataReference systemDllRef = MetadataReference.CreateFromFile(typeof(object).Assembly.Location);
    
    var baseDllBytes = GetDllBytes("base", BASE_CODE);
    MetadataReference baseDllRef = MetadataReference.CreateFromStream(new MemoryStream(baseDllBytes), filePath: "base.dll");

    var mainDllBytes = GetDllBytes("main", MAIN_CODE, systemDllRef, baseDllRef);
    MetadataReference mainDllRef = MetadataReference.CreateFromStream(new MemoryStream(mainDllBytes), filePath: "main.dll");

    var compilation = CSharpCompilation.Create("temp", null, new[] { systemDllRef, baseDllRef, mainDllRef });
    var mainTypeSymbol = compilation.GetTypeByMetadataName("T");
    Assert.AreEqual("main", mainTypeSymbol.ContainingAssembly.Name);                    // GOOD
    Assert.AreEqual("base", mainTypeSymbol.Interfaces[0].ContainingAssembly.Name);      // GOOD
    CollectionAssert.IsEmpty(mainTypeSymbol.DeclaringSyntaxReferences);                 // BAD !!!
    CollectionAssert.IsEmpty(mainTypeSymbol.Interfaces[0].DeclaringSyntaxReferences);   // BAD !!!

    compilation = compilation.AddSyntaxTrees(CSharpSyntaxTree.ParseText(MAIN_CODE));
    mainTypeSymbol = compilation.GetTypeByMetadataName("T");
    Assert.AreEqual("temp", mainTypeSymbol.ContainingAssembly.Name);                    // BAD !!!
    Assert.AreEqual("base", mainTypeSymbol.Interfaces[0].ContainingAssembly.Name);      // GOOD
    CollectionAssert.IsNotEmpty(mainTypeSymbol.DeclaringSyntaxReferences);              // GOOD
    CollectionAssert.IsEmpty(mainTypeSymbol.Interfaces[0].DeclaringSyntaxReferences);   // BAD !!!

    compilation = compilation.AddSyntaxTrees(CSharpSyntaxTree.ParseText(BASE_CODE));
    mainTypeSymbol = compilation.GetTypeByMetadataName("T");
    Assert.AreEqual("temp", mainTypeSymbol.ContainingAssembly.Name);                        // BAD !!!
    Assert.AreEqual("temp", mainTypeSymbol.Interfaces[0].ContainingAssembly.Name);          // BAD !!!
    CollectionAssert.IsNotEmpty(mainTypeSymbol.DeclaringSyntaxReferences);                  // GOOD
    CollectionAssert.IsNotEmpty(mainTypeSymbol.Interfaces[0].DeclaringSyntaxReferences);    // GOOD
}

private static byte[] GetDllBytes(string name, string code, params MetadataReference[] metadataReferences)
{
    var syntaxTree = CSharpSyntaxTree.ParseText(code);
    var c = CSharpCompilation.Create(name, new[] { syntaxTree }, metadataReferences, new CSharpCompilationOptions(OutputKind.DynamicallyLinkedLibrary));
    var stream = new MemoryStream();
    var res = c.Emit(stream);
    Assert.IsTrue(res.Success);
    // ToArray copies only the bytes actually written, unlike GetBuffer.
    return stream.ToArray();
}

I can explain the results like this:

  • When there is no matching syntax tree:
    • There are no declaring syntax references. Understandably so.
    • The ISymbol.ContainingAssembly property returns the actual assembly represented by the respective MetadataReference object.
  • When there is a matching syntax tree:
    • There are declaring syntax references. Makes sense too.
    • For some reason, finding the matching syntax tree in the Compilation object changes the result of the ISymbol.ContainingAssembly property - it is now the name of the Compilation object.

Now my question - how can I get both the containing assembly and the respective declaring syntax references from a Compilation object containing all the right MetadataReference and SyntaxTree objects?

Rationale

We are in the process of decomposing our monolithic application. This includes a lot of "dumb" refactoring. By "dumb" I mean refactorings that can be reasonably automated. For example, suppose there are two dependency-injected interfaces that are used very frequently and I want to move a method from one to the other. There are a lot of similar changes to be made in all the places where the moved method is used. 95% of them can be automated, so I wrote a tool that does it. But instead of trying to guess all the places where the code must be adjusted, it compiles the code and then resolves the build errors automatically. Maybe this is the wrong approach, but this is what I am currently doing:

  • I map all the types in the code across all the solutions (we have many and refactoring is across all of them) including the source file paths and the types that use and are used by the type in question. This is a preliminary operation before refactoring starts. It is quite smart as it knows to deal with "good" dynamic calls and known constants. The generated map (~100MB in size) is used subsequently.
  • The code moves the method (it so happens there are very few dependencies to be moved in this particular case).
  • The code starts a build-fix loop, where each error is parsed and the code is fixed accordingly.

The fix involves creating a Compilation object from all the relevant MetadataReference objects and adding SyntaxTree objects as deemed necessary for fixing the error. Right now the name of the Compilation object matches the name of the assembly being built and so as long as the fix is limited to that same assembly all is working well. But, if in order to fix project X I need to go back and update project Y it means the Compilation object now has SyntaxTree objects both from X and Y and that is no good, because it changes the ContainingAssembly property. So, right now I only have one Compilation object per error fixing session, but it seems I cannot use this model anymore.

Maybe this is all a stupid idea, but it does work nicely and produces good results, again, as long as I do not have to reach back to other projects while fixing an error in the current project. The build-fix loop allows for manual intervention if it is unable to fix the code (because it does not know how), and it is able to make about 95% of the changes automatically.

Clarification 1

When a compilation error occurs I create a Compilation object from the following pieces:

  1. The DLL (i.e. MetadataReference created from it) from the last successful build of the project.
  2. All the DLLs referenced by the DLL from (1)
  3. The syntax tree of the file mentioned in the error.

Then I start working from there adding syntax trees as needed. So I never add all the syntax trees for all the source files. Only a few as needed and sometimes some of them would correspond to symbols from dependency projects. This is how it happens that in order to fix an error in the project associated with that error I need to go back and change something in a source file owned by some dependency project. During this process some syntax trees from that other project are added to the Compilation object and this is how I end up with syntax trees from different projects in the same Compilation object.

mark

1 Answer

finding the matching syntax tree in the Compilation object changes the result of the ISymbol.ContainingAssembly property - it is now the name of the Compilation object.

This is implementing the C# behavior that if you have a metadata reference defining a type and also a source file defining the same type, the source file wins. So once you add the source definition of T, that's hiding the metadata implementation. Similarly, once you added the definition of I, that was also from source and hiding the metadata definition of I.
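
If the metadata definition is still needed while the source definition hides it, one option is to ask the compilation for the assembly symbol behind a specific reference and look the type up there. The following is only a sketch built on the question's setup: `MetadataOwnerSketch`, `EmitToReference` and `Run` are names made up for illustration, while `Compilation.GetAssemblyOrModuleSymbol` and `IAssemblySymbol.GetTypeByMetadataName` are existing Roslyn APIs.

```csharp
using System;
using System.IO;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

public static class MetadataOwnerSketch // hypothetical name, for illustration only
{
    // Compiles `code` into an in-memory DLL and wraps it as a metadata reference.
    private static MetadataReference EmitToReference(string name, string code, params MetadataReference[] refs)
    {
        var c = CSharpCompilation.Create(name, new[] { CSharpSyntaxTree.ParseText(code) }, refs,
            new CSharpCompilationOptions(OutputKind.DynamicallyLinkedLibrary));
        var stream = new MemoryStream();
        if (!c.Emit(stream).Success)
            throw new InvalidOperationException("Emit failed for " + name);
        stream.Position = 0; // rewind so the PE image is read from the start
        return MetadataReference.CreateFromStream(stream, filePath: name + ".dll");
    }

    public static (string SourceOwner, string MetadataOwner) Run()
    {
        const string baseCode = "public interface I {}";
        const string mainCode = "public class T: I {}";

        MetadataReference systemRef = MetadataReference.CreateFromFile(typeof(object).Assembly.Location);
        var baseRef = EmitToReference("base", baseCode, systemRef);
        var mainRef = EmitToReference("main", mainCode, systemRef, baseRef);

        // T is defined both in source and in main.dll.
        var compilation = CSharpCompilation.Create("temp",
            new[] { CSharpSyntaxTree.ParseText(mainCode) },
            new[] { systemRef, baseRef, mainRef });

        // The compilation-wide lookup returns the source definition.
        var sourceT = compilation.GetTypeByMetadataName("T");

        // Looking up through the assembly symbol of main.dll returns the metadata definition.
        var mainAssembly = (IAssemblySymbol)compilation.GetAssemblyOrModuleSymbol(mainRef);
        var metadataT = mainAssembly.GetTypeByMetadataName("T");

        return (sourceT.ContainingAssembly.Name, metadataT.ContainingAssembly.Name);
    }
}
```

With this setup `Run()` should report `temp` as the owner of the source symbol and `main` as the owner of the metadata symbol. The two are distinct symbols, so the syntax references come from the first and the assembly from the second.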

It's not clear to me what you're trying to achieve by adding source files and the metadata references at the same time; you might want to update your question to clarify your ultimate goal here.

Jason Malinowski
  • I posted the rationale. A bit long, but I hope it makes sense. – mark Nov 09 '21 at 02:07
  • I guess I can use my type map to deduce the assembly from the full type name. There are test projects that link to the source files thus producing identical full type name in different assemblies, but these never occur in the same Compilation object, of course. So the match would be unique. Still, I am curious if it is possible to somehow figure out which MetadataReference owns which symbol, even if the source tree is given. – mark Nov 09 '21 at 02:42
  • So reading your rationale, I'm still not understanding why you're adding syntax nodes that mirror metadata references like this. Specifically: "But, if in order to fix project X I need to go back and update project Y it means the Compilation object now has SyntaxTree objects both from X and Y and that is no good, because it changes the ContainingAssembly property." It feels like in this case I'm not sure if you should be editing Y's contents and then updating the reference to X. If managing all the objects is tricky we have the Workspaces layer for managing cross-project refs. – Jason Malinowski Nov 09 '21 at 18:39
  • In your case are the references coming from DLLs that are built elsewhere or also built in your application too? Because generally if it's from your same application and you're making Compilation objects for both, don't be emitting to memory -- we allow Compilation-to-Compilation references directly that skip the emit step and are also a lot faster too. And in those cases symbols actually appear from source because there's not a step in the middle. – Jason Malinowski Nov 09 '21 at 18:43
  • I need to process your input. – mark Nov 11 '21 at 03:01
  • I had a brief look at the `Workspace` concept, but I think it does not operate across multiple Solutions. In our case we have multiple solutions. I need to play with it again. – mark Nov 11 '21 at 03:21
  • Not quite as cleanly as you might want, but nothing stops a "Solution" encompassing multiple projects. Or, it should still hopefully be easier to do than managing your Compilations manually. – Jason Malinowski Nov 22 '21 at 16:48
  • I understand that `Solution` encompasses multiple projects. We have a dozen of such solutions, each encompassing multiple projects. Moreover, we do not allow cross solution project references, so when a project in solution X references a project in a previously built solution Y it is necessarily a DLL reference. The code that processes them all must know that this dll reference is not a third party dll. We have a solution list file (a YAML file also driving the PR/CI builds) which gives the solutions and their build order. – mark Nov 23 '21 at 00:15
  • You shouldn't have too much trouble making a "merged" solution object for the purposes of your tool here that stitches the multiple solutions together, by your description. If you need to teach the in-memory representation that the DLLs are really a reference to a project you can also use https://learn.microsoft.com/en-us/dotnet/api/microsoft.codeanalysis.workspace.updatereferencesafteradd to do so, but that would by default only impact the in-memory representation. – Jason Malinowski Nov 23 '21 at 19:23
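
The Compilation-to-Compilation references mentioned in the comments can be sketched roughly as follows, reusing the two snippets from the question. This is an illustration under assumptions, not a drop-in fix for the tool: `CompilationReferenceSketch` and `Run` are made-up names, and it assumes the sources of both projects are available to build one `Compilation` per project. `Compilation.ToMetadataReference` is the existing Roslyn API that turns a compilation into a reference without emitting.

```csharp
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

public static class CompilationReferenceSketch // hypothetical name, for illustration only
{
    public static (string TOwner, string IOwner, int TDecls, int IDecls) Run()
    {
        MetadataReference systemRef = MetadataReference.CreateFromFile(typeof(object).Assembly.Location);
        var dll = new CSharpCompilationOptions(OutputKind.DynamicallyLinkedLibrary);

        // One Compilation per project, named after the assembly it represents.
        var baseCompilation = CSharpCompilation.Create("base",
            new[] { CSharpSyntaxTree.ParseText("public interface I {}") },
            new[] { systemRef }, dll);
        MetadataReference baseRef = baseCompilation.ToMetadataReference(); // no Emit involved

        var mainCompilation = CSharpCompilation.Create("main",
            new[] { CSharpSyntaxTree.ParseText("public class T: I {}") },
            new[] { systemRef, baseRef }, dll);
        MetadataReference mainRef = mainCompilation.ToMetadataReference();

        // The consuming compilation has no syntax trees of its own.
        var compilation = CSharpCompilation.Create("temp", null,
            new[] { systemRef, baseRef, mainRef });

        var t = compilation.GetTypeByMetadataName("T");
        var i = t.Interfaces[0];

        // Symbols reached through a compilation reference keep both their
        // owning assembly and their declaring syntax references.
        return (t.ContainingAssembly.Name, i.ContainingAssembly.Name,
                t.DeclaringSyntaxReferences.Length, i.DeclaringSyntaxReferences.Length);
    }
}
```

In this arrangement `T` should be reported as owned by `main` and `I` by `base`, each with one declaring syntax reference, because no syntax tree is added to a compilation that does not own it.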